Posts Tagged ‘Levenshtein’

Similarity algorithms

June 28, 2010

I have recently been researching a record linkage techniques, and part of this process I have been reminding myself of certain algorithms, and in other case learning these for the first time. As is my way, I typically try and turn the algorithm into code to allow me to understand and learn it. I have coded up examples of the following algorithms in Java:

Levenshtein
Damerau-Levenshtein
Hamming
Jaro Winkler
N-Gram
Markov Chain

Out of curiosity I decided to publish each one separately so I can see from the stats which is the most popular. In a month or so I will add a histogram of the hits for each one (if there are any!).

I hope this is useful.

Be warned, I offer no warranty or guarantee on this code, any changes / enhancements / corrections / or use of this code should be attributed to itssmee, CHIME, UCL (where I am doing my research) and shared with the community.

Java example of Damerau Levenshtein distance

June 28, 2010

Similar to Levenshtein, Damerau-Levenshtein also calculates the distances between two strings. It based around comparing two string and counting the number of insertions, deletions, and substitution of single characters, and transposition of two characters.

This was, originally, aimed at spell checkers, it is also used for DNA sequences.

Wikipedia entry found be here:


public class DamerauLevenshtein
{
    private String compOne;
    private String compTwo;
    private int[][] matrix;
    private Boolean calculated = false;

    public DamerauLevenshtein(String a, String b)
    {
        if ((a.length() > 0 || !a.isEmpty())  || (b.length() > 0 || !b.isEmpty()))
        {
            compOne = a;
            compTwo = b;
        }        
    }

    public int[][] getMatrix()
    {
        setupMatrix();
        return matrix;
    }

    public int getSimilarity()
    {
        if (!calculated) setupMatrix();

        return matrix[compOne.length()][compTwo.length()];
    }

    private void setupMatrix()
    {
        int cost = -1;
        int del, sub, ins;
        
        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i <= compOne.length(); i++)
        {
            matrix[i][0] = i;
        }

        for (int i = 0; i <= compTwo.length(); i++)
        {
            matrix[0][i] = i;
        }

        for (int i = 1; i <= compOne.length(); i++)
        {
            for (int j = 1; j <= compTwo.length(); j++)
            {
                if (compOne.charAt(i-1) == compTwo.charAt(j-1))
                {
                    cost = 0;
                }
                else
                {
                    cost = 1;
                }

                del = matrix[i-1][j]+1;
                ins = matrix[i][j-1]+1;
                sub = matrix[i-1][j-1]+cost;

                matrix[i][j] = minimum(del,ins,sub);

                if ((i > 1) && (j > 1) && (compOne.charAt(i-1) == compTwo.charAt(j-2)) && (compOne.charAt(i-2) == compTwo.charAt(j-1)))
                {
                    matrix[i][j] = minimum(matrix[i][j], matrix[i-2][j-2]+cost);
                }
            }
        }

        calculated = true;
        displayMatrix();
    }
    
    private void displayMatrix()
    {
        System.out.println("  "+compOne);
        for (int y = 0; y <= compTwo.length(); y++)
        {
            if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
            for (int x = 0; x <= compOne.length(); x++)
            {
                System.out.print(matrix[x][y]);
            }
            System.out.println();
        }
    }

    private int minimum(int d, int i, int s)
    {
        int m = Integer.MAX_VALUE;

        if (d < m) m = d;
        if (i < m) m = i;
        if (s < m) m = s;

        return m;
    }

    private int minimum(int d, int t)
    {
        int m = Integer.MAX_VALUE;

        if (d < m) m = d;
        if (t < m) m = t;

        return m;
    }
}

Further to the comments and observations of @zooz (see comments below), I have to apologise and advise that the code above is actually the Optimal String Alignment Distance Algorithm rather than Damerau Levenshtein. Here is the Damerau Levenshtein code in Java:

public int getDHSimilarity()
{
        int res = -1;
        int INF = compOne.length() + compTwo.length();

        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i < compOne.length(); i++)
        {
            matrix[i+1][1] = i;
            matrix[i+1][0] = INF;
        }

        for (int i = 0; i < compTwo.length(); i++)
        {
            matrix[1][i+1] = i;
            matrix[0][i+1] = INF;
        }

        int[] DA = new int[24];

        for (int i = 0; i < 24; i++)
        {
            DA[i] = 0;
        }

        for (int i = 1; i < compOne.length(); i++)
        {
            int db = 0;

            for (int j = 1; j < compTwo.length(); j++)
            {

                int i1 = DA[compTwo.indexOf(compTwo.charAt(j-1))];
                int j1 = db;
                int d = ((compOne.charAt(i-1)==compTwo.charAt(j-1))?0:1);
                if (d == 0) db = j;

                matrix[i+1][j+1] = Math.min(Math.min(matrix[i][j]+d, matrix[i+1][j]+1),Math.min(matrix[i][j+1]+1,matrix[i1][j1]+(i - i1-1)+1+(j-j1-1)));
            }
            DA[compOne.indexOf(compOne.charAt(i-1))] = i;
        }
        
        return matrix[compOne.length()][compTwo.length()];
}

Java example of Levenshtein’s distance algorithm

June 28, 2010

The purpose of this algorithm is to measure the difference between two sequences/strings. It is based around the number of changes required to make one string equal to the other.

It is aimed at short strings, it usage is spell checkers, optical character recognition, etc.

Wikipedia entry can be found here:


public class Levenshtein
{
    private String compOne;
    private String compTwo;
    private int[][] matrix;
    private Boolean calculated = false;

    public Levenshtein(String one, String two)
    {
        compOne = one;
        compTwo = two;
    }

    public int getSimilarity()
    {
        if (!calculated)
        {
            setupMatrix();
        }
        return matrix[compOne.length()][compTwo.length()];
    }
    
    public int[][] getMatrix()
    {
        setupMatrix();
        return matrix;
    }

    private void setupMatrix()
    {
        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i <= compOne.length(); i++)
        {
            matrix[i][0] = i;
        }

        for (int j = 0; j <= compTwo.length(); j++)
        {
            matrix[0][j] = j;
        }

        for (int i = 1; i < matrix.length; i++)
        {
            for (int j = 1; j < matrix[i].length; j++)
            {
                if (compOne.charAt(i-1) == compTwo.charAt(j-1))
                {
                    matrix[i][j] = matrix[i-1][j-1];
                }
                else
                {
                    int minimum = Integer.MAX_VALUE;
                    if ((matrix[i-1][j])+1 < minimum)
                    {
                        minimum = (matrix[i-1][j])+1;
                    }

                    if ((matrix[i][j-1])+1 < minimum)
                    {
                        minimum = (matrix[i][j-1])+1;
                    }

                    if ((matrix[i-1][j-1])+1 < minimum)
                    {
                        minimum = (matrix[i-1][j-1])+1;
                    }

                    matrix[i][j] = minimum;
                }
            }
        }
        calculated = true;
        displayMatrix();
    }

    private void displayMatrix()
    {
        System.out.println("  "+compOne);
        for (int y = 0; y <= compTwo.length(); y++)
        {
            if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
            for (int x = 0; x <= compOne.length(); x++)
            {
                System.out.print(matrix[x][y]);
            }
            System.out.println();
        }
    }
}