Digitally re-remastered anniversary collection

## Coding Interview 14: Counting Characters

Post by: Blake
Posted on: 11 Vigeo 14:0 - 1.51.18

## Count the number of occurences of a given character in a file.

If they ask this question, it's either an easy job you're interviewing for, or it's a lead-in to one of the next questions.
public void int CountOccurences(string path, char character)
{
int count = 0;
for (int i = 0; i < contents.Length; ++i)
{
if (contents[i] == character)
{
++count;
}
}
return count;
}

The complexity of this is linear.

## Find the most common character in a file.

public char FindMostCommonCharacter(string path)
{
Dictionary<charint> counts = new Dictionary<charint>();
char maxChar = '\0';
int maxCharCount = 0;
for (int i = 0; i < contents.Length; ++i)
{
char c = contents[i];
int count = counts.Contains(c) ? counts[c] + 1 : 1;
counts[i] = count;
if (count > maxCharCount)
{
maxChar = c;
}
}
return maxChar;
}

The complexity of this is also linear.

## Find the n most common characters in a file.

public char[] FindMostCommonCharacters(string path, int n)
{
Dictionary<charint> counts = new Dictionary<charint>();
for (int i = 0; i < contents.Length; ++i)
{
char c = contents[i];
int count = counts.Contains(c) ? counts[c] + 1 : 1;
counts[i] = count;
}
char[] characters = new List<char>(counts.Keys).ToArray();
char[] buffer = new char[characters.Length];
characters = MergeSort(characters, counts,
buffer, 0, characters.Length);
n = Math.Min(characters.Length, n);
char[] output = new char[n];
characters.CopyTo(output, 0, n);
}

private void MergeSort(
char[] characters,
Dictionary<charint> counts,
char[] buffer,
int begin, int length)
{
if (length <= 1)
{
return;
}

int middle = begin + length / 2;
int leftLength = middle - begin;
int end = begin + length;
int rightLength = length - leftLength
MergeSort(characters, counts, buffer, begin, leftLength);
MergeSort(characters, counts, buffer, middle, rightLength);

int i = begin;
int left_i = begin;
int right_i = middle;
while (left_i < middle && right_i < end)
{
if (count[characters[left_i]] < count[characters[right_i]])
{
buffer[i++] = characters[left_i++];
}
else
{
buffer[i++] = characters[right_i++];
}
}
while (left_i < middle) buffer[i++] = characters[left_i++];
while (right_i < end) buffer[i++] = characters[right_i++];
buffer.CopyTo(characters, begin, length);
}

It's always good to be able to jot down merge sort without thinking too hard. It's also more impressive if you can write a version that allocates memory linearly rather than n log n. Here, 1 buffer with the same length of the list is created before MergeSort is called and is re-used for all recursive calls.

Also, if n is expected to always be small, then it is better to keep a running total of the top n values while you're counting them rather than sort them at the end. In an interview, state that you are aware of this, but code the sorting method as your final answer anyway.

## Find the 10 most common characters in a freakishly massive amount of data as fast as possible.

Assume "freakishly massive" means several terabytes or more. At this point, it should be obvious that you're writing a distributed system of some sort and they're not expecting you to write a function on a board, but more likely to generally describe such a system.

One example solution is to write a network program where a worker can spawn other worker processes from itself to subdivide data. And those spawned processes can also spawn more helpers until all your computers in your pool are allocated. Each worker runs a program similar to the ones above, but instead serializes the WHOLE dictionary of characters and sends it to its parent that originally spawned the worker. The parents will take the dictionaries of all its children and sum like keys together to create a new dictionary. And then the process continues until the root worker finally has a total sum. Once you have this ultimate dictionary of counts, you can again sort, and then find the top n characters.

If you really want to squeeze out some efficiency, rather than returning dictionaries to its parent, have the workers divide themselves out to handle different ranges of unicode. For example, if you have 256 machines, then bitwise and each key with 255 and that will determine which machine that key will get sent to for tallying. This guarantees that no machine is tallying characters that are also being tallied on another machine. After this is complete, each machine filters its tallies down to its top n characters and one worker machine will union these 256 * n values and sort them (a very small amount of data to sort), and return the top n.