nelhage debugs shit

lightly-edited tales of debugging systems

Surveying various languages’ string-search algorithms

When Stripe ran our CTF 3.0, I wrote most of level 3, which was a full-text search challenge inspired in part by my own livegrep.

I wrote a naïve implementation, which just looped over the files, read them into memory, and used java.lang.String.contains to check if each file contained the “needle”, and we released that implementation as the baseline implementation that contestants needed to improve on.

I also wrote a solution that used a simple trigram index, which was the solution you had to meet or beat to win.

But in the interests of experimentation, I also wrote another solution, which, during the “indexing” phase, slurped all the files’ text into a big in-memory structure, and used java.lang.String.indexOf to find the matching index, and worked from there. I strongly suspected that, given how comparatively small the corpus was (about 100MB) that this solution should be fast enough to win, and wanted to confirm or deny that hunch.

To my surprise and disappointment, however, it turned out that this approach was barely faster than the one that read each file off the disk each time! Obviously, at least after the first try, the files would be in disk cache, so there wouldn’t be much actual disk I/O in either case, but I had expected that the I/O and filesystem overhead would at least be significant!

After a bit of profiling and experimentation, I began to unhappily form a hypothesis: Maybe java.lang.String’s string-search algorithm was just really inefficient! So, I went to the source, and, after an hour of desperately digging through openjdk.java.net, I found the source code for java.lang.String and discovered the horrible truth:

/**
 * Code shared by String and StringBuffer to do searches. The
 * source is the character array being searched, and the target
 * is the string being searched for.
 *
 * @param   source       the characters being searched.
 * @param   sourceOffset offset of the source string.
 * @param   sourceCount  count of the source string.
 * @param   target       the characters being searched for.
 * @param   targetOffset offset of the target string.
 * @param   targetCount  count of the target string.
 * @param   fromIndex    the index to begin searching from.
 */
static int indexOf(char[] source, int sourceOffset, int sourceCount,
        char[] target, int targetOffset, int targetCount,
        int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    char first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j]
                    == target[k]; j++, k++);

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

in other words, all the substring-search methods on java.lang.String use essentially the most naïve string-match algorithm imaginable, which will run to O(n*m) runtime in the worst case!

It turns out that virtually every other language I know of uses an optimized string-search by default, which had the upshot that simply rewriting our Scala code in Ruby(!) would actually make the code dramatically faster and pass our benchmarks! “Oops”.

But inspired by this, I decided to go do a brief survey of common language/VM string-search algorithms, and here’s the results:

Lang/runtimeAlgorithmreference
JavaNaïve[source]
glibc strstrtwo-way, with a Boyer-Moore shift table for long needles[source]
golangRabin-Karp for strings.Index, naïve for bytes.Index, Boyer-Moore for strings.Replacerstrings.Index bytes.Index
PythonTweaked Boyer-Moore-Horspool[writeup] [source]
Ruby 1.8Rabin-Karp[source]
Ruby 1.9+Something homegrown with a shift table.1[source]
v8Boyer-Moore-Horspool, with some special-cases for short needles.[source]

So it definitely looks like Boyer-Moore or Boyer-Moore-Horspool are winners among the runtimes that have put serious thought into their implementation, although unsurprisingly you need to tweak the pure algorithm to get performance in practice. Rabin-Karp is super-simple to implement and is probably a performance win over the naïve approach (but can’t really compete with anything that uses a good shift table), so it doesn’t surprise me that it shows up a few times, but not in any of heavier-weight implementations. And Java really is unique in having given no algorithmic effort to their basic string-search – surprisingly to me, given the incredibly sophistication of the JVM and Java standard libraries in general.

One question I haven’t answered and probably won’t bother to (but would love to see a good writeup of!) is how well all the optimized algorithms stack up against each other, across a wide range of needle and haystack sizes, with various types of partial matches, and so on.

1 I don’t recognize Ruby 1.9’s string-search algorithm as having a standard name. I initially thought it was a tweaked Knuth-Morris-Pratt, but critically, unlike KMP, the outer loop doesn’t take into account where or how the memcmp fails (thanks to Greg Price for pointing this out to me). The difference is key: While KMP is guaranteed O(n+m) time, you can pretty easily drive this algorithm to be quadratic:

$ pry
[1] pry(main)> 12.upto(22).map do |i|
[1] pry(main)*   needle='a'*(2**(i-1)) + 'b'
[1] pry(main)*   haystack = 'a' * (2**i)
[1] pry(main)*   a = Time.now; haystack.include?(needle); b=Time.now
[1] pry(main)*   t = b-a
[1] pry(main)*   t, t/haystack.length**2
[1] pry(main)* end
=> [[0.000150217, 8.953630924224853e-12],
 [0.000434998, 6.481975317001343e-12],
 [0.001405982, 5.237691104412079e-12],
 [0.00521789, 4.859538748860359e-12],
 [0.02278149, 5.304228980094195e-12],
 [0.091578843, 5.330590240191668e-12],
 [0.397198244, 5.779995175544172e-12],
 [1.651885885, 6.0095258413639385e-12],
 [8.639180912, 7.857289267121815e-12],
 [44.239053054, 1.005879609101612e-11],
 [265.964429048, 1.5118327442451118e-11]]

(p.s. if you’re trying to demonstrate quadratic behavior, you know you’re onto something as soon as you have to go get coffee waiting on your microbenchmark)