On the Origins and Applications of Substring Matching

by | Apr 18, 2016

Today I’m going to be looking into substring matching algorithms. The benchmark appears to be the Boyer-Moore algorithm. It actually seems to be pretty commonly used for search and replace in text editors. In fact it is said to be the most efficient string-matching algorithm for natural language and for fixed search strings.

So just how did it appear? Two guys at the University of Texas at Austin,

Boyer and Moore, published a paper in 1977 titled, “A Fast String Searching Algorithm”. it showed and proved that searching could come in at sub-linear efficiency at best O( N/M ) , and at worst linear, the same as the brute-force method.

One example of the speed of Boyer-Moore is seen in the GNU implementation of grep vs. the FreeBSD implementation.

“GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.”(4).

Check out the code below where the Grep source uses Boyer-Moore.

It’s located in the kwset.c file and supposed to search for any of a set of keywords.

 

/* Check if we can use the simple boyer-moore algorithm, instead of the hairy commentz-walter algorithm. */

if (kwset->words == 1 && kwset->trans == NULL)

{

char c;

/* Looking for just one string. Extract it from the trie. */

kwset->target = obstack_alloc(&kwset->obstack, kwset->mind);

if (!kwset->target)

return _("memory exhausted");

for (i = kwset->mind - 1, curr = kwset->trie; i >= 0; --i)

{

kwset->target[i] = curr->links->label;

curr = curr->links->trie;

}

/* Build the Boyer Moore delta. Boy that's easy compared to CW. */

for (i = 0; i < kwset->mind; ++i)

delta[U(kwset->target[i])] = kwset->mind - (i + 1);

/* Find the minimal delta2 shift that we might make after

a backwards match has failed. */

c = kwset->target[kwset->mind - 1];

for (i = kwset->mind - 2; i >= 0; --i)

if (kwset->target[i] == c)

break;

kwset->mind2 = kwset->mind - (i + 1);

}

 

Basic Steps:

  • Match the last character of the pattern to the text, from the left to the right.
  • If a letter doesn’t match, move the pattern by the number of characters you haven’t checked.
  • There are two heuristics to follow when implementing BM.
  • The Bad character rule, identifies the rightmost character in the text that doesn’t match. When a bad character is encountered then the proposed shift increment is equal to the distance to the next character in the pattern that does match. Of course, this utilizes a table of the pattern’s character distances.
  • The Good Suffix rule is where we build a table to store the matched suffixes of multiple character matches and when a character doesn’t match then we increment by the length required to achieve a match of the sub-pattern.
  • After the heuristics have returned a shift increment, the larger one is chosen and the matching continues.

 

So part of what makes the algorithm faster is that it takes advantage of information that the naive string matching ignores. It utilizes the information from the shift operation to jump characters which don’t match.

An interesting side note is that after describing the fast string search algorithm Boyer and Moore used their their automated theorem prover to verify the Boyer-Moore efficient string search algorithm and other investigators have used the system for many other important proofs. Some interesting ones are the inevitability of RSA encryption, the incompleteness theorem, and the halting problem for Lisp.

It’s important to keep in mind the performance characteristics of BM, one, is that with English Language texts it operates fast, but with repetitive patterns such as DNA sequences it slows down. The other main feature is the speed up that occurs with longer patterns, which allows larger skipping of characters.

There is a multitude of variants to the basic BM. Many of which take letter frequency into account when checking the next letter. So instead of just looking at each letter from right to left, you would check each letter of the pattern by lowest frequency first. Boyer-Moore-Horspool is another variant which only uses the bad character rule. The purpose was to show how to take advantage of the Boyer-Moore algorithm even though for short substrings below 6 characters, the built-in search instructions in the hardware are fastest. A threshold check is performed to determine when to use the Boyer-Moore algorithm, and in computers without the built-in search, the simplified Boyer-Moore algorithm is recommended.

Below is a table that Horspool produced to compare the machine methods vs Boyer-Moore and Simplified Boyer-Moore, it is particularly revealing about the characteristics of the Boyer-Moore vs the built in scan first character methods.

data-table

Recent Posts