Similar to Levenshtein, Damerau-Levenshtein also calculates the distances between two strings. It based around comparing two string and counting the number of insertions, deletions, and substitution of single characters, and transposition of two characters.
This was, originally, aimed at spell checkers, it is also used for DNA sequences.
Wikipedia entry found be here:
public class DamerauLevenshtein { private String compOne; private String compTwo; private int[][] matrix; private Boolean calculated = false; public DamerauLevenshtein(String a, String b) { if ((a.length() > 0 || !a.isEmpty()) || (b.length() > 0 || !b.isEmpty())) { compOne = a; compTwo = b; } } public int[][] getMatrix() { setupMatrix(); return matrix; } public int getSimilarity() { if (!calculated) setupMatrix(); return matrix[compOne.length()][compTwo.length()]; } private void setupMatrix() { int cost = -1; int del, sub, ins; matrix = new int[compOne.length()+1][compTwo.length()+1]; for (int i = 0; i <= compOne.length(); i++) { matrix[i][0] = i; } for (int i = 0; i <= compTwo.length(); i++) { matrix[0][i] = i; } for (int i = 1; i <= compOne.length(); i++) { for (int j = 1; j <= compTwo.length(); j++) { if (compOne.charAt(i-1) == compTwo.charAt(j-1)) { cost = 0; } else { cost = 1; } del = matrix[i-1][j]+1; ins = matrix[i][j-1]+1; sub = matrix[i-1][j-1]+cost; matrix[i][j] = minimum(del,ins,sub); if ((i > 1) && (j > 1) && (compOne.charAt(i-1) == compTwo.charAt(j-2)) && (compOne.charAt(i-2) == compTwo.charAt(j-1))) { matrix[i][j] = minimum(matrix[i][j], matrix[i-2][j-2]+cost); } } } calculated = true; displayMatrix(); } private void displayMatrix() { System.out.println(" "+compOne); for (int y = 0; y <= compTwo.length(); y++) { if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1)); for (int x = 0; x <= compOne.length(); x++) { System.out.print(matrix[x][y]); } System.out.println(); } } private int minimum(int d, int i, int s) { int m = Integer.MAX_VALUE; if (d < m) m = d; if (i < m) m = i; if (s < m) m = s; return m; } private int minimum(int d, int t) { int m = Integer.MAX_VALUE; if (d < m) m = d; if (t < m) m = t; return m; } }
Further to the comments and observations of @zooz (see comments below), I have to apologise and advise that the code above is actually the Optimal String Alignment Distance Algorithm rather than Damerau Levenshtein. Here is the Damerau Levenshtein code in Java:
public int getDHSimilarity() { int res = -1; int INF = compOne.length() + compTwo.length(); matrix = new int[compOne.length()+1][compTwo.length()+1]; for (int i = 0; i < compOne.length(); i++) { matrix[i+1][1] = i; matrix[i+1][0] = INF; } for (int i = 0; i < compTwo.length(); i++) { matrix[1][i+1] = i; matrix[0][i+1] = INF; } int[] DA = new int[24]; for (int i = 0; i < 24; i++) { DA[i] = 0; } for (int i = 1; i < compOne.length(); i++) { int db = 0; for (int j = 1; j < compTwo.length(); j++) { int i1 = DA[compTwo.indexOf(compTwo.charAt(j-1))]; int j1 = db; int d = ((compOne.charAt(i-1)==compTwo.charAt(j-1))?0:1); if (d == 0) db = j; matrix[i+1][j+1] = Math.min(Math.min(matrix[i][j]+d, matrix[i+1][j]+1),Math.min(matrix[i][j+1]+1,matrix[i1][j1]+(i - i1-1)+1+(j-j1-1))); } DA[compOne.indexOf(compOne.charAt(i-1))] = i; } return matrix[compOne.length()][compTwo.length()]; }
Tags: Damerau, distance, Java, Levenshtein
June 28, 2010 at 09:23 |
[…] Damerau-Levenshtein Hamming Jaro Winkler N-Gram Markov […]
April 18, 2011 at 03:37 |
Thanks! This really helped me a lot!
May 16, 2011 at 14:03 |
Hi, I’m doing a subject on similarity algorithms comparing this with cosine similarity and testing which area they are best.
I copied this into eclipse and tried to run it but couldn’t. I tried to make a main method but that didn’t help. I don’t even know how to get the program to see what two documents(wboth with few lines) it should check or what the start method is.
Is there a way to make the algorithm work right? I hope you can help me make it work for my project. ๐
PS: I’m a newbeginner at Java since programming isn’t my major. Though I still need to make a demo and thought I could use this code since it’s in Java which should be easy?.
May 17, 2011 at 19:33 |
Hi there,
The code is for the class, you should create a new class (call it DamerauLevenshtein) in your IDE, paste in my code and then in Main instantiate the class using the default constructor and passing in two strings. Then call the method getSimilarity() from the object. I am on the road at the moment so I don’t have an exact example to hand, but something along these lines:
Try providing the details as defined in the wikipedia example.
Hope this helps?
What are you studying and where?
May 18, 2011 at 10:27
Correction: PS: Iโm a newbeginner at Java though programming should be my major. I need to make a demo and thought I could use this code since itโs in Java which should be easy?.
Sorry, I wonder how tired I was when I wrote this. I’m from a university in Denmark and trying to become an engineer in software technology. Had a bad start with bad book and teacher programming so using this project to learn it. Hoping to reach a level where most good programmers are one day ๐
Thanks for the details they made perfect sense and it worked easily.
I tried to implement the code from wiki too. Found it after I found your blog and posted that comment. Funny how I didn’t think of checking wiki first.
Anyway, you went wrong like me. We both implemented the “OptimalStringAlignmentDistance” version and not the “damerauLevenshteinDistance”. I would try to jump on that one but I’m totally not getting it (pseudocode easy, actionscript not so much). LD(CA,ABC) should 2 because CA -> AC -> ABC but since we are using OSA we both get 3 because of CA -> A -> AB -> ABC.
Of course it’s totally embarassing if I got it wrong but I find it idiotic that they rather post the pseudocode for the wrong algorithm instead of deleting it and putting up a right one.
Still thanks for replying, I’ll keep looking at this blog since it seems interesting xD
On the road? Vacation time or work?
May 30, 2011 at 15:48
Hey Zooz,
I completely missed that, big mistake from my part! Thank you for pointing it out, much appreciated. You will see I have added a method for the Damerau Levenshtein, I have run some basic tests based on the wikipedia entry. I have recently started a new job so my time is really precious at the moment, hence it has taken me so long to respond. Feel free to test, and feed back with any thoughts / comments / or new code? (I don’t mind if that is public or private). Keep going with the studies, you are doing well ๐
May 17, 2011 at 16:04 |
What license is this code released under, public domain?
May 17, 2011 at 19:34 |
As my major source is wikipedia, it is the same license: http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
Feel free to reference my site ๐
July 12, 2011 at 12:47 |
Hello,
In getDHSimilarity() you wrote:
int[] DA = new int[24];
Why 24? is that because the answer is 42 ๐ ?
For a string longer than 24 this throws a ArrayIndexOutOfBoundsException.
July 12, 2011 at 20:06 |
42 is the answer to everything ๐
The method should take a int to size DA, this is something I left in from some testing I was doing.
April 24, 2012 at 18:26 |
Something’s not right with the code. Why do “still” and “chill” have an edit distance of zero?
That would imply they are the same word, which btw is a case that should have been caught early in the algorithm.
April 24, 2012 at 18:35 |
EDIT: woops, try the words “hurt” and “hurl”.
April 24, 2012 at 18:42 |
Line 06 should read:
matrix = new int[compOne.length() + 2][compTwo.length() + 2];
July 5, 2012 at 13:41 |
Changed the array size from 24 to 255, it does not work for me when I compare sets of characters which are >20 in total. For the search “VICARLI” having “VICARLI TRANSPORTE” in the DB doesn’t appear in top results!