Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

The Longest Common Substring (LCS) is the longest string that is a substring of two or more strings.
LCS could be found with using of Generalized Suffix Tree (GST).

Steps to find LCS:
  1. build generalized suffix tree:
    1. concatenate original strings;
    2. build suffix tree.
  2. calculate special status for each node of the generalized suffix tree:
    • original strings labels;
    • tree height or suffix length.
  3. find longest common substring - traverse the tree to find a node with maximal tree height and labeled with all original strings.

Download source code of Longest Common Substring and Diff Implementation.

Class diagram:

Helper Inner classes:
  • LCSNodeStatus - Suffix Tree Node Status for Longest Common Substring (LCS)
  • CommonSubstr - Result of searching Longest Common Substring. Also represents partial diff result.
Nodes statuses:

Node statuses could be calculated:
Map<Node, LCSNodeStatus> statuses = new HashMap<Node, LCSNodeStatus>();
getLCSNodeStatus(suffixTree.getRootNode(), 0, statuses);

where
private LCSNodeStatus getLCSNodeStatus(Node node, int height, Map<Node, LCSNodeStatus> statuses) {
LCSNodeStatus nodeStatus = new LCSNodeStatus(node, height);
if (node.getEdges().size() == 0) {
return nodeStatus;
}

for (Edge edge : node.getEdges()) {
LCSNodeStatus status = getLCSNodeStatus(edge.getEndNode(),
height + getEdgeHeightForNodeStatus(edge), statuses);

status.addString(getEdgeStringNumber(edge));
nodeStatus.mergeStatus(status);
}
statuses.put(node, nodeStatus);
return nodeStatus;
}

Find LCS:
private CommonSubstr getLcs(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
int max = 0;
int foundBeginIndexes[] = null;

for (LCSNodeStatus status : statuses.values()) {
if ((status.getHeight() > 0) && (status.isAllStrings()) && (max <= status.getHeight())) {
Node node = status.getNode();
int workingBeginIndexes[] = initFoundBeginIndexes();

updateFoundBeginIndexes(beginIndexes, endIndexes, node, status.getHeight(),
statuses, workingBeginIndexes);

if (verifyFoundBeginIndexes(workingBeginIndexes)) {
foundBeginIndexes = workingBeginIndexes;
max = status.getHeight();
}
}
}

if (foundBeginIndexes == null)
return null;

return new CommonSubstr(foundBeginIndexes, max);
}

private int[] initFoundBeginIndexes() {
int beginIndexes[] = new int[texts.length];
for(int i=0; i<texts.length; i++)
beginIndexes[i] = Integer.MAX_VALUE;
return beginIndexes;
}

private void updateFoundBeginIndexes(int beginIndexes[], int endIndexes[], Node node, int height,
Map<Node, LCSNodeStatus> statuses, int[] foundBeginIndexes) {
for (Edge edge : node.getEdges()) {
LCSNodeStatus nodeStatus = statuses.get(edge.getEndNode());
if ((nodeStatus != null) && nodeStatus.isAllStrings()) {
updateFoundBeginIndexes(beginIndexes, endIndexes, edge.getEndNode(),
height + getEdgeHeight(edge), statuses, foundBeginIndexes);
} else {
int stringNumber = getEdgeStringNumber(edge);
int beginIndex = edge.getBeginIndex() - height;

if ((beginIndex < endIndexes[stringNumber]) &&
(beginIndex >= beginIndexes[stringNumber]) &&
(foundBeginIndexes[stringNumber] > beginIndex)) {

foundBeginIndexes[stringNumber] = beginIndex;
}
}
}
}

private boolean verifyFoundBeginIndexes(int[] beginIndexes) {
for(int i=0; i<texts.length; i++)
if (beginIndexes[i] == Integer.MAX_VALUE)
return false;
return true;
}

Diff Implementation:
  • find LCS;
  • call diff for left part of strings (before LCS);
  • call diff for right part of string (after LCS);
  • join results.

To define strings search parts 2 arrays are used: beginIndexes[], endIndexes[].
private List<CommonSubstr> diff(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
CommonSubstr commonSubstr = getLcs(beginIndexes, endIndexes, statuses);
if (commonSubstr == null)
return new LinkedList<CommonSubstr>();

List<CommonSubstr> prev = diff(beginIndexes, commonSubstr.beginIndexes, statuses);
List<CommonSubstr> next = diff(incIndexes(commonSubstr.endIndexes), endIndexes, statuses);

prev.add(commonSubstr);
if (next != null)
prev.addAll(next);
return prev;
}

See also: my other Suffix Trees posts

4 comments:

Bobi said...

Longest substring doesent seem to work as expected.
Example:
String[] fourChars={"a","b","c","d"};
String [] genes= new String[100];
for(int j=0;j<100;j++){
for(int i=0;i<1000;i++){
genes[j]+=fourChars[(int)(Math.random()*4)];
}
System.out.println(genes[j]);
}
GeneralizedSuffixTree gst = new GeneralizedSuffixTree(genes);
System.out.println(gst.getLcsAsString());

rakesh said...

once suffix tree is created, how much is the time complexity to get the LCS. I read somewhere it is O(n). But I do not know how to achieve that. Can you please help me?

Pixar said...

This code also throws exception:
String[] fourChars={"a","b","c","d"};
String[] genes= {"", "", "", "", "", "", "", "", "", ""};
for(int j = 0; j < genes.length; j++) {
for(int i = 0; i < 2 ; i++){
genes[j] += fourChars[(int)(Math.random() * 4)];
}
System.out.print(j);
System.out.println(": " + genes[j]);
}
char[] terminators = {'#', '~', '@', '%', '^', '&', '$', '*', '(', '_'};
GeneralizedSuffixTree gst = new GeneralizedSuffixTree(genes, terminators);
System.out.println(gst.getLcsAsString());

Akip Maulana said...

I can't download source code LCS now. could you reupload the file attachment in another online repository, please ?