Java example of Damerau Levenshtein distance

Similar to Levenshtein, Damerau-Levenshtein also calculates the distances between two strings. It based around comparing two string and counting the number of insertions, deletions, and substitution of single characters, and transposition of two characters.

This was, originally, aimed at spell checkers, it is also used for DNA sequences.

Wikipedia entry found be here:


public class DamerauLevenshtein
{
    private String compOne;
    private String compTwo;
    private int[][] matrix;
    private Boolean calculated = false;

    public DamerauLevenshtein(String a, String b)
    {
        if ((a.length() > 0 || !a.isEmpty())  || (b.length() > 0 || !b.isEmpty()))
        {
            compOne = a;
            compTwo = b;
        }        
    }

    public int[][] getMatrix()
    {
        setupMatrix();
        return matrix;
    }

    public int getSimilarity()
    {
        if (!calculated) setupMatrix();

        return matrix[compOne.length()][compTwo.length()];
    }

    private void setupMatrix()
    {
        int cost = -1;
        int del, sub, ins;
        
        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i <= compOne.length(); i++)
        {
            matrix[i][0] = i;
        }

        for (int i = 0; i <= compTwo.length(); i++)
        {
            matrix[0][i] = i;
        }

        for (int i = 1; i <= compOne.length(); i++)
        {
            for (int j = 1; j <= compTwo.length(); j++)
            {
                if (compOne.charAt(i-1) == compTwo.charAt(j-1))
                {
                    cost = 0;
                }
                else
                {
                    cost = 1;
                }

                del = matrix[i-1][j]+1;
                ins = matrix[i][j-1]+1;
                sub = matrix[i-1][j-1]+cost;

                matrix[i][j] = minimum(del,ins,sub);

                if ((i > 1) && (j > 1) && (compOne.charAt(i-1) == compTwo.charAt(j-2)) && (compOne.charAt(i-2) == compTwo.charAt(j-1)))
                {
                    matrix[i][j] = minimum(matrix[i][j], matrix[i-2][j-2]+cost);
                }
            }
        }

        calculated = true;
        displayMatrix();
    }
    
    private void displayMatrix()
    {
        System.out.println("  "+compOne);
        for (int y = 0; y <= compTwo.length(); y++)
        {
            if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
            for (int x = 0; x <= compOne.length(); x++)
            {
                System.out.print(matrix[x][y]);
            }
            System.out.println();
        }
    }

    private int minimum(int d, int i, int s)
    {
        int m = Integer.MAX_VALUE;

        if (d < m) m = d;
        if (i < m) m = i;
        if (s < m) m = s;

        return m;
    }

    private int minimum(int d, int t)
    {
        int m = Integer.MAX_VALUE;

        if (d < m) m = d;
        if (t < m) m = t;

        return m;
    }
}

Further to the comments and observations of @zooz (see comments below), I have to apologise and advise that the code above is actually the Optimal String Alignment Distance Algorithm rather than Damerau Levenshtein. Here is the Damerau Levenshtein code in Java:

public int getDHSimilarity()
{
        int res = -1;
        int INF = compOne.length() + compTwo.length();

        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i < compOne.length(); i++)
        {
            matrix[i+1][1] = i;
            matrix[i+1][0] = INF;
        }

        for (int i = 0; i < compTwo.length(); i++)
        {
            matrix[1][i+1] = i;
            matrix[0][i+1] = INF;
        }

        int[] DA = new int[24];

        for (int i = 0; i < 24; i++)
        {
            DA[i] = 0;
        }

        for (int i = 1; i < compOne.length(); i++)
        {
            int db = 0;

            for (int j = 1; j < compTwo.length(); j++)
            {

                int i1 = DA[compTwo.indexOf(compTwo.charAt(j-1))];
                int j1 = db;
                int d = ((compOne.charAt(i-1)==compTwo.charAt(j-1))?0:1);
                if (d == 0) db = j;

                matrix[i+1][j+1] = Math.min(Math.min(matrix[i][j]+d, matrix[i+1][j]+1),Math.min(matrix[i][j+1]+1,matrix[i1][j1]+(i - i1-1)+1+(j-j1-1)));
            }
            DA[compOne.indexOf(compOne.charAt(i-1))] = i;
        }
        
        return matrix[compOne.length()][compTwo.length()];
}
Advertisements

Tags: , , ,

14 Responses to “Java example of Damerau Levenshtein distance”

  1. Similarity algorithms « It's Smee Blog Says:

    […] Damerau-Levenshtein Hamming Jaro Winkler N-Gram Markov […]

  2. Victor Says:

    Thanks! This really helped me a lot!

  3. zooz Says:

    Hi, I’m doing a subject on similarity algorithms comparing this with cosine similarity and testing which area they are best.

    I copied this into eclipse and tried to run it but couldn’t. I tried to make a main method but that didn’t help. I don’t even know how to get the program to see what two documents(wboth with few lines) it should check or what the start method is.

    Is there a way to make the algorithm work right? I hope you can help me make it work for my project. ๐Ÿ™‚

    PS: I’m a newbeginner at Java since programming isn’t my major. Though I still need to make a demo and thought I could use this code since it’s in Java which should be easy?.

    • itssmee Says:

      Hi there,

      The code is for the class, you should create a new class (call it DamerauLevenshtein) in your IDE, paste in my code and then in Main instantiate the class using the default constructor and passing in two strings. Then call the method getSimilarity() from the object. I am on the road at the moment so I don’t have an exact example to hand, but something along these lines:

      Main
      {
            DamerauLevenshtein t = new DamerauLevenshtein("string1","string2");
      
            int res = t.getSimilarity();
      
           //res should hold an int value of the similarity.
      }
      

      Try providing the details as defined in the wikipedia example.

      Hope this helps?

      What are you studying and where?

      • zooz Says:

        Correction: PS: Iโ€™m a newbeginner at Java though programming should be my major. I need to make a demo and thought I could use this code since itโ€™s in Java which should be easy?.

        Sorry, I wonder how tired I was when I wrote this. I’m from a university in Denmark and trying to become an engineer in software technology. Had a bad start with bad book and teacher programming so using this project to learn it. Hoping to reach a level where most good programmers are one day ๐Ÿ˜€

        Thanks for the details they made perfect sense and it worked easily.

        I tried to implement the code from wiki too. Found it after I found your blog and posted that comment. Funny how I didn’t think of checking wiki first.

        Anyway, you went wrong like me. We both implemented the “OptimalStringAlignmentDistance” version and not the “damerauLevenshteinDistance”. I would try to jump on that one but I’m totally not getting it (pseudocode easy, actionscript not so much). LD(CA,ABC) should 2 because CA -> AC -> ABC but since we are using OSA we both get 3 because of CA -> A -> AB -> ABC.

        Of course it’s totally embarassing if I got it wrong but I find it idiotic that they rather post the pseudocode for the wrong algorithm instead of deleting it and putting up a right one.

        Still thanks for replying, I’ll keep looking at this blog since it seems interesting xD

        On the road? Vacation time or work?

      • itssmee Says:

        Hey Zooz,

        I completely missed that, big mistake from my part! Thank you for pointing it out, much appreciated. You will see I have added a method for the Damerau Levenshtein, I have run some basic tests based on the wikipedia entry. I have recently started a new job so my time is really precious at the moment, hence it has taken me so long to respond. Feel free to test, and feed back with any thoughts / comments / or new code? (I don’t mind if that is public or private). Keep going with the studies, you are doing well ๐Ÿ™‚

  4. Josh Says:

    What license is this code released under, public domain?

  5. mat Says:

    Hello,
    In getDHSimilarity() you wrote:
    int[] DA = new int[24];

    Why 24? is that because the answer is 42 ๐Ÿ˜‰ ?

    For a string longer than 24 this throws a ArrayIndexOutOfBoundsException.

    • itssmee Says:

      42 is the answer to everything ๐Ÿ™‚

      The method should take a int to size DA, this is something I left in from some testing I was doing.

  6. tanis.control Says:

    Something’s not right with the code. Why do “still” and “chill” have an edit distance of zero?

    That would imply they are the same word, which btw is a case that should have been caught early in the algorithm.

  7. tanis.control Says:

    Line 06 should read:

    matrix = new int[compOne.length() + 2][compTwo.length() + 2];

  8. Aritz Says:

    Changed the array size from 24 to 255, it does not work for me when I compare sets of characters which are >20 in total. For the search “VICARLI” having “VICARLI TRANSPORTE” in the DB doesn’t appear in top results!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: