Login | Register

Nerd Paradise

Programming is sexy

Coding Interview 14: Counting Characters

Post by: Blake
Posted on: 11 Vigeo 14:0 - 1.51.18

Count the number of occurences of a given character in a file.


If they ask this question, it's either an easy job you're interviewing for, or it's a lead-in to one of the next questions.
public void int CountOccurences(string path, char character)
{
    string contents = File.ReadAllText(path);
    int count = 0;
    for (int i = 0; i < contents.Length; ++i)
    {
        if (contents[i] == character)
        {
            ++count;
        }
    }
    return count;
}

The complexity of this is linear.

Find the most common character in a file.

public char FindMostCommonCharacter(string path)
{
    string contents = File.ReadAllText(path);
    Dictionary<charint> counts = new Dictionary<charint>();
    char maxChar = '\0';
    int maxCharCount = 0;
    for (int i = 0; i < contents.Length; ++i)
    {
        char c = contents[i];
        int count = counts.Contains(c) ? counts[c] + 1 : 1;
        counts[i] = count;
        if (count > maxCharCount)
        {
            maxChar = c;
        }
    }
    return maxChar;
}

The complexity of this is also linear.

Find the n most common characters in a file.

public char[] FindMostCommonCharacters(string path, int n)
{
    string contents = File.ReadAllText(path);
    Dictionary<charint> counts = new Dictionary<charint>();
    for (int i = 0; i < contents.Length; ++i)
    {
        char c = contents[i];
        int count = counts.Contains(c) ? counts[c] + 1 : 1;
        counts[i] = count;
    }
    char[] characters = new List<char>(counts.Keys).ToArray();
    char[] buffer = new char[characters.Length];
    characters = MergeSort(characters, counts,
        buffer, 0, characters.Length);
    n = Math.Min(characters.Length, n);
    char[] output = new char[n];
    characters.CopyTo(output, 0, n);
}

private void MergeSort(
    char[] characters,
    Dictionary<charint> counts,
    char[] buffer,
    int begin, int length)
{
    if (length <= 1)
    {
        return;
    }

    int middle = begin + length / 2;
    int leftLength = middle - begin;
    int end = begin + length;
    int rightLength = length - leftLength
    MergeSort(characters, counts, buffer, begin, leftLength);
    MergeSort(characters, counts, buffer, middle, rightLength);

    int i = begin;
    int left_i = begin;
    int right_i = middle;
    while (left_i < middle && right_i < end)
    {
        if (count[characters[left_i]] < count[characters[right_i]])
        {
            buffer[i++] = characters[left_i++];
        }
        else
        {
            buffer[i++] = characters[right_i++];
        }
    }
    while (left_i < middle) buffer[i++] = characters[left_i++];
    while (right_i < end) buffer[i++] = characters[right_i++];
    buffer.CopyTo(characters, begin, length);
}


It's always good to be able to jot down merge sort without thinking too hard. It's also more impressive if you can write a version that allocates memory linearly rather than n log n. Here, 1 buffer with the same length of the list is created before MergeSort is called and is re-used for all recursive calls.

Also, if n is expected to always be small, then it is better to keep a running total of the top n values while you're counting them rather than sort them at the end. In an interview, state that you are aware of this, but code the sorting method as your final answer anyway.

Find the 10 most common characters in a freakishly massive amount of data as fast as possible.

Assume "freakishly massive" means several terabytes or more. At this point, it should be obvious that you're writing a distributed system of some sort and they're not expecting you to write a function on a board, but more likely to generally describe such a system.

One example solution is to write a network program where a worker can spawn other worker processes from itself to subdivide data. And those spawned processes can also spawn more helpers until all your computers in your pool are allocated. Each worker runs a program similar to the ones above, but instead serializes the WHOLE dictionary of characters and sends it to its parent that originally spawned the worker. The parents will take the dictionaries of all its children and sum like keys together to create a new dictionary. And then the process continues until the root worker finally has a total sum. Once you have this ultimate dictionary of counts, you can again sort, and then find the top n characters.

If you really want to squeeze out some efficiency, rather than returning dictionaries to its parent, have the workers divide themselves out to handle different ranges of unicode. For example, if you have 256 machines, then bitwise and each key with 255 and that will determine which machine that key will get sent to for tallying. This guarantees that no machine is tallying characters that are also being tallied on another machine. After this is complete, each machine filters its tallies down to its top n characters and one worker machine will union these 256 * n values and sort them (a very small amount of data to sort), and return the top n.
facebook twitter Stumbleupon Reddit del.icio.us Digg
User Comments: 2
lucb1e
Post by lucb1e on 11 Vigeo 14:1 - 0.77.25
hmm, I would have expected that first script to count the characters while reading the file, not copy all the file contents to the RAM and then loop through it. Doing that is probably much slower in high(er) level languages than just loading it into the memory, but this way is sort of very easy...
And can't you actually just return content.length?

Edit: Wait, that last question was stupid. Still though, it should be possible to do this more efficiently (without first reading the entire file into the memory).
deckmaster
Post by Deckmaster on 12 Vigeo 11:4 - 13.35.46
I know that this came out a long time ago, but no matter:
I would like to point out that there is a way to squeeze out some efficiency without going for the full "split up unicode" approach. One could use a technique similar to the structure of any sane MPI reduce implementation: treat the network like a binary tree and aggregate in steps. The original method described above takes O(n*m) steps (n parents, m children for each parent). An intelligent reduce method takes O(log(n*m)) steps. There is, of course, still the matter of sorting the resulting dictionary and producing the top N from that. Since communication is generally the most costly operation in a distributed problem, it's entirely possible that this will actually beat splitting up unicode and unioning the 256 n-best-lists, due to the extra communication.

If one desires to further squeeze out efficiency, a similar optimization could be applied to the method of splitting up unicode. In this example, each of the 256 machines would first sort its results and produce the top N. Then a reduce function would be applied to the top-N for each machine: two machines communicate, the one mergesorts the inputs and outputs the top-N from the result up the tree. Now the root doesn't even need to sort a result at the end! The union method will end up taking O(n*log(256)) (communication *and* sorting/trimming) instead of O(256*n) (just communication) plus the sorting of 256*n values. Unfortunately, due to the extra communication I'm not actually sure that this will net any real benefit. I'm almost tempted to do some benchmarks, but I don't actually have access to a cluster of that size.
You must be logged in to add a comment
Current Date: 13 Ineo 10:0Current Time: 17.87.6Join us in IRC...
Server: irc.esper.net
Channel: #nerdparadise
Your IP: 107.22.156.205Browser: UnknownBrowser Version: 0