Wednesday, August 31, 2011

Suffix tree: an illustration of the algorithm

I recently did some study on suffix tree, and found Ukkonen's linear time online algorithm quite hard to understand. Mark Nelson has a good article about this, however, in my opinion the writing is not so great and some part is confusing.

For example, the active point and end point concepts are very important. It is defined as 'the first non-leaf node'. However it is not actually a node when you are looking at the tree, because there may be some 'implicit nodes', which you cannot find anywhere when you draw a tree. Indeed, it should be viewed from the suffix's point.

I have to read his source code to completely understand how it works. To better demonstrate, I designed the following example to illustrate the algorithm.
This is how the algorithm works step by step when the input string is 'cacao'. Hope this helps.


20110831-esemmwq7yh2ks751s5atwuw39i.png

Other useful resource include this tutorial and online suffix tree computation form. Here's an excellent java applet that can generate the suffix tree on the fly. Finally, another detailed tutorial is given by Prof Sartaj Sahni, but I found the presentation introduced lots of symbols and quite hard to follow. A lecture note from Prof. Shaojie Zhang is written clearly, though not many details given.

Here is some points that I want to clarify about the algorithm and suffix tree, assuming you know how a suffix tree looks like:
- Intuitively, suffix tree is a compressed trie. You can construct it trivially by inserting each suffix one by one into a trie tree, then compress it. However, this takes quadratic time.
- To speed up, people proposed different approach. A little history from wikipedia:

The concept was first introduced as a position tree by Weiner in 1973, which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976, and also by Ukkonen in 1995. Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms. 
Ukkonen is the easiest to understand and it is on-line. Most people should be using this one.

- The basic idea of the algorithm is, given an string S[1..n], we process the characters one by one. We extend the previously inserted suffix by adding the current processing character to them. For example, suppose we are given 'cacao'. When we are processing 'o', the existing suffix is 'caca', 'aca', 'ca' and 'a'. We want to append 'o' to each of them. 
- For character S[i+1], the current suffices are S[j..i], where j=1..i. In the current tree we trace a path that contains substring S[j..i]. Note that this path may stop at an edge (implicit node), an explicit node or leaf node. There are basically three case when you want to insert a character for suffix S[j..i],
 1. There is already a path from the end of S[j..i] that starts with S[i+1], or S[j..i+1] corresponds to a path. We do nothing. This in fact introduce an implicit node.
 2. S[j..i] ends at a leaf, simply append S[i+1]. 
 3. No path from the end of S[j..i] starts with S[i+1]. Split the edge at the branching position, and add a new leaf with S[i+1]. Note that this is the ONLY case that will introduce new leaf.
- However, the above method has to go through all suffices at each iteration, still quadratic time.
- The speed up and memory reduction come from the following important observations and techniques:
 1. We only have to consider the suffices between the Active Point and End point at each iteration. It can be shown that suffices outside this range are already at leaf nodes, and hence require do nothing (See Case 1). Furthermore, the end point will be the active point at the next iteration.
 2. We introduce a suffix pointer for each internal node (non-leaf non-root) node, which points to a node that contains the Longest Proper Suffix of the current suffix. For example, the longest proper suffix of "cacao" is "acao". With this technique, we don't need to search for the node that contains the next suffix from root every time, instead, we jump from the just processed one to the next one, e.g., from "cacao" to "acao".
 3. We don't need to actually store the substrings in the nodes, we just need to know the index of the starting and ending characters. Hence, we actually do nothing when encounters Rule #2.
 4. We can use hash table to store the edges, which can help inserting, removing and finding edges in constant time. The details can be found in Mark's algorithm implementation, and the comparison in the wiki.
- Intuitively, the active point is the first suffix that requires you to do something, that is, when encounters Rule #3. The end point is the first suffix that you should do nothing again, i.e., when we encounters Rule #1 or Rule #2 again when processing the active point during the iteration.
- To avoid a suffix being a prefix of another suffix, the last character in the string must be unique. In general, we can just append a special character that is not in the vocabulary, e.g., the '$' used in the example. For example, 'book' does not need to append since 'k' is unique, but 'banana' needs since 'a' is not unique. In this case, after the algorithm terminates, there will be exactly |S| + 1 leaves in the tree, one for each suffix, or, all the suffices end at leaf!

Important application of suffix tree includes:
- Find a substring P in S in O(|P|) time. Note that other linear algorithm such as KMP requires O(|P|+|S|). When there are multiple Ps, suffix tree is more efficient.
- Find the longest common substring (LCS) between P and S. We can construct a suffix tree for string P#S\$, where '#' and '\$' are two characters that are not in the vocabulary. To find the LCS, we traverse the suffix tree and sum up the number of characters visited for each internal node. Then, an internal node that has the largest sum and has at least one leaf node that represents the suffix of P and S, respectively, is the desired node. This can be done in O(|P|+|S|) time. Note that the typical solution, i.e., dynamic programming, requires O(|P||S|) time. This can be extended to a more general case where there are multiple strings S1, S2, ..., Sn. And the runtime is the sum of the lengths of the strings.
- Find the longest palindrome in S. Here is one excellent article in Chinese. The basic idea is, let P be the reverse of S, construct suffix tree for S#P$. Then for each S[i], find the largest Least Common Ancestor of S[i], S[n-i+1] (even length) and S[i+1], S[n-i+1] (odd length). LCA can be done in linear time using Tarjan's algorithm.
- Given a set of strings S1, S2, ..., Sn, find all the Si such that Si contains pattern P. This is a typical application in computational biology. The set of strings are gene banks, the pattern is a gene segment. Let T=S1#S2#...#Sn$. This is called a multiple string suffix tree. To find all the Si that contains P, we walk down from root according to P. If we cannot complete a path that represents P, no string contains P. Otherwise, we will stop at a node N, then from its leaves we can know what strings Si contains P. A special case is that N is a leaf node, which means only one string contains P. To speed up, we can maintain a list of indices of Si for each node, which lists all the Si that are a start point of the suffices in the leaves.
- Refer to this and this for more detail.