Sunday 8 July 2012

Levenshtein Distance: How big is the social network for a word?


Last week I came across this problem on codeeval.com 

Levenshtein Distance 

Description:

Two words are friends if they have a Levenshtein distance of 1 (For details see http://en.wikipedia.org/wiki/Levenshtein_distance). That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word’s social network consists of all of its friends, plus all of their friends, and all of their friends’ friends, and so on. Write a program to tell us how big the social network for the word 'hello' is, using this word list https://raw.github.com/codeeval/Levenshtein-Distance-Challenge/master/input_levenshtein_distance.txt

Input sample:

Your program should accept as its first argument a path to a filename. The input file contains the word list. This list is also available at https://raw.github.com/codeeval/Levenshtein-Distance-Challenge/master/input_levenshtein_distance.txt.

Output sample:

Print out how big the social network for the word 'hello' is. e.g. The social network for the word 'abcde' is 4846.
There are many ways of solving this, as you will find on the internet. Most of them find the neighbors of a word by computing its Levenshtein distance to the other words in the list. The problem I found was that computing the Levenshtein distance (edit distance using DP) between 2 words of length n takes O(n^2) time, and that cost is paid again for every candidate word a word is compared against. After that it all comes down to doing a DFS starting with 'hello'. This approach takes a lot of time to run.
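For reference, the DP edit distance that those solutions rely on looks roughly like the sketch below. It is not taken from any particular solution; the class name EditDistance and the helper areFriends are made up for illustration.

```java
// Hypothetical helper for illustration; not part of the original program.
final class EditDistance {
    // Classic dynamic-programming Levenshtein distance.
    // Runs in O(a.length() * b.length()) time and keeps only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn a's prefix into the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Two words are friends when exactly one edit separates them.
    static boolean areFriends(String a, String b) {
        return distance(a, b) == 1;
    }
}
```

Finding the friends of one word this way means calling areFriends against every word in the list, which is where the running time goes.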
My approach to finding the neighbors faster is as follows. We only need the words that are at a distance of 1, and these are exactly the ones that can be obtained by deleting, adding, or replacing a single character. So I simply simulated this with for loops, generating all possible words formed by deleting, adding, or changing a letter and looking each one up in the dictionary. To store the dictionary and look words up efficiently I used 2 dimensions, i.e. an array of HashSet where the HashSet at index i stores only the words of length i.
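The dictionary bucketed by length, as described above, could be built along these lines. This is only a sketch: it uses a List of Sets instead of a raw array to avoid generic-array warnings, and the names LengthIndex and build are assumptions, not the ones used in the full program.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; the original program uses an array of HashSet.
final class LengthIndex {
    // Returns buckets where buckets.get(i) holds exactly the words of length i.
    // One extra empty bucket is kept past the longest word so that a lookup
    // at length + 1 (for insertions) never runs off the end of the list.
    static List<Set<String>> build(Iterable<String> words) {
        List<Set<String>> buckets = new ArrayList<>();
        for (String word : words) {
            while (buckets.size() <= word.length() + 1) {
                buckets.add(new HashSet<>());
            }
            buckets.get(word.length()).add(word);
        }
        return buckets;
    }
}
```

A candidate produced by deleting a letter from a word of length n is then looked up only in bucket n - 1, one produced by adding a letter only in bucket n + 1, and one produced by a substitution only in bucket n.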
You can find the full program here
Here is the logic for getting all words at distance 1:

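// h and st are static fields defined in the full program.
// h[i] is the HashSet of dictionary words of length i; a word is removed
// from it as soon as it is found, so it is never added twice.
// st receives every friend of wrd that is found.
public static void addwords(String wrd){
	int i,len=wrd.length();
	char z;
	StringBuilder sb;
	String b;
	// Deletions: drop the character at each position and look the result
	// up among the words that are one letter shorter.
	for(i=0;i<len;i++){
		sb=new StringBuilder(wrd);
		sb.deleteCharAt(i);
		b=sb.toString();
		if(h[len-1].contains(b)){
			st.add(b);
			h[len-1].remove(b);
		}
	}
	// Insertions: put each letter a-z at every position (including the end)
	// and look the result up among the words that are one letter longer.
	for(i=0;i<=len;i++){
		for(z='a';z<='z';z++){
			sb=new StringBuilder(wrd);
			sb.insert(i,z);
			b=sb.toString();
			if(h[len+1].contains(b)){
				st.add(b);
				h[len+1].remove(b);
			}
		}
	}
	// Substitutions: replace the character at each position with each
	// letter a-z and look the result up among the words of the same length.
	for(i=0;i<len;i++){
		for(z='a';z<='z';z++){
			sb=new StringBuilder(wrd);
			sb.setCharAt(i,z);
			b=sb.toString();
			if(h[len].contains(b)){
				st.add(b);
				h[len].remove(b);
			}
		}
	}
}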
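
To show how the neighbor generation fits into the DFS, here is a small self-contained sketch. It is not the full program: for brevity it keeps the dictionary in a single HashSet rather than one set per length, the names SocialNetwork, size and edits are made up, and it assumes the start word itself is not counted (add 1 if the challenge expects it to be).

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch for illustration; not the author's full program.
final class SocialNetwork {
    // Counts the dictionary words reachable from start through chains of
    // single-letter edits. Assumption: the start word itself is not counted.
    static int size(String start, Collection<String> dictionary) {
        Set<String> unvisited = new HashSet<>(dictionary);
        unvisited.remove(start);
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        int count = 0;
        // Iterative DFS: pop a word, generate its candidate edits, and keep
        // the ones that are real, not-yet-visited dictionary words.
        while (!stack.isEmpty()) {
            String word = stack.pop();
            for (String candidate : edits(word)) {
                if (unvisited.remove(candidate)) {
                    count++;
                    stack.push(candidate);
                }
            }
        }
        return count;
    }

    // All strings at edit distance exactly 1 from word, using letters a-z.
    static Set<String> edits(String word) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i < word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i + 1);
            out.add(head + tail); // deletion
            for (char c = 'a'; c <= 'z'; c++) {
                if (c != word.charAt(i)) {
                    out.add(head + c + tail); // substitution
                }
            }
        }
        for (int i = 0; i <= word.length(); i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                out.add(word.substring(0, i) + c + word.substring(i)); // insertion
            }
        }
        return out;
    }
}
```

Each word generates roughly 53 candidates per letter, each checked with a hash lookup, so the work per word depends on its length and not on the size of the dictionary.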

1 comments:

Anonymous said...

I think you should check out http://wordnet.princeton.edu/
