Posts Tagged ‘distance’

Statistics on the similarity algorithms

July 27, 2010

So, just under a month ago I posted the following on similarity algorithms.

I have found it quite interesting to see which have been viewed, and the percentage share. Initially Damerau Levenshtein was the hot favourite, but then N-gram started coming through the ranks and is a clear leader with a 39% share of the hits!

I thought I would put together a simple histogram of the percentage share of hit for each similarity posting:

Disappointed my favourite, Markov chains, is languishing so far back…

Java example of N-gram

June 28, 2010

An algorithm to calculate the probability of the next term based on the previous n terms.

They are used in speech recognition, phonemes, language recognition etc.

Wikipedia entry can be found here. Wikipedia, in this case, is a little limited on either pseudo code or an algorithm. In this case I referenced the following presentation (Slides 27/28) from CRULP talking about different algorithms used in spell checkering. Also, further reading can be found in Chapter 6 of Christopher D Manning and Hinrich Schutze’s Foundations of Statistical Natural Language Processing.

Here is the code:


public class Ngram
{
    private class result
    {
        private String theWord;
        private int theCount;

        public result(String w, int c)
        {
            theWord = w;
            theCount = c;
        }

        public void setTheCount(int c)
        {
            theCount = c;
        }

        public String getTheWord()
        {
            return theWord;
        }

        public int getTheCount()
        {
            return theCount;
        }
    }

    private List<result> results;

    public Ngram()
    {
        results = new ArrayList<result>();
    }
    public Ngram(String str, int n)
    {

    }

    public double getSimilarity(String wordOne, String wordTwo, int n)
    {
        List<result> res1 = processString(wordOne, n);
        //displayResult(res1);
        List<result> res2 = processString(wordTwo, n);
        //displayResult(res2);
        int c = common(res1,res2);
        int u = union(res1,res2);
        double sim = (double)c/(double)u;

        return sim;
    }

    private int common(List<result> One, List<result> Two)
    {
        int res = 0;

        for (int i = 0; i < One.size(); i++)
        {
            for (int j = 0; j < Two.size(); j++)
            {
                if (One.get(i).theWord.equalsIgnoreCase(Two.get(j).theWord)) res++;
            }
        }

        return res;
    }

    private int union(List<result> One, List<result> Two)
    {
        List<result> t = One;

        for (int i = 0; i < Two.size(); i++)
        {
            int pos = -1;
            boolean found = false;
            for (int j = 0; j < t.size() && !found; j++)
            {
                if (Two.get(i).theWord.equalsIgnoreCase(t.get(j).theWord))
                {
                    found = true;
                }
                pos = j;
            }

            if (!found)
            {
                result r = Two.get(i);
                t.add(r);
            }
        }

        return t.size();
    }

    private List<result> processString(String c, int n)
    {
        List<result> t = new ArrayList<result>();

        String spacer = "";
        for (int i = 0; i < n-1; i++)
        {
            spacer = spacer + "%";
        }
        c = spacer + c + spacer;
        
        for (int i = 0; i < c.length(); i++)
        {
            if (i <= (c.length() - n))
            {
                if (contains(c.substring(i, n+i)) > 0)
                {
                    t.get(i).setTheCount(results.get(i).getTheCount()+1);
                }
                else
                {
                    t.add(new result(c.substring(i,n+i),1));
                }
            }
        }
        return t;
    }

    private int contains(String c)
    {
        for (int i = 0; i < results.size(); i++)
        {
            if (results.get(i).theWord.equalsIgnoreCase(c))
                return i;
        }
        return 0;
    }

    private void displayResult(List<result> d)
    {
        for (int i = 0; i < d.size(); i++)
        {
            System.out.println(d.get(i).theWord+" occurred "+d.get(i).theCount+" times");
        }
    }
}

Java example of Hamming distance

June 28, 2010

This algorithm calculates the distance between two strings, however they have to be of equal length.

It measures the minimum number of substitutions for the two strings to be equal.

It is used in telecommunication (also know as signal distance), it is also used in systematics as a measure of genetic distance.

Wikipedia entry can be found here:


public class Hamming
{
    private String compOne;
    private String compTwo;

    public Hamming(String one, String two)
    {
        compOne = one;
        compTwo = two;
    }

    ///
    //  Calculating the Hamming Distance for two strings requires the string to be of the same length.
    ///
    public int getHammingDistance()
    {
        if (compOne.length() != compTwo.length())
        {
            return -1;
        }

        int counter = 0;

        for (int i = 0; i < compOne.length(); i++)
        {
            if (compOne.charAt(i) != compTwo.charAt(i)) counter++;
        }

        return counter;
    }

    ///
    //  Hamming distance works best with binary comparisons, this function takes a string arrary of binary
    //  values and returns the minimum distance value
    ///
    public int minDistance(String[] numbers)
    {
        int minDistance = Integer.MAX_VALUE;

        if (checkConstraints(numbers))
        {
            for (int i = 1; i < numbers.length; i++)
            {
                int counter = 0;
                for (int j = 1; j <= numbers[i].length(); j++)
                {
                    if (numbers[i-1].charAt(j-1) != numbers[i].charAt(j-1))
                    {
                        counter++;
                    }
                }

                if (counter == 0) return counter;
                if (counter < minDistance) minDistance = counter;
            }
        }
        else
        {
            return -1;
        }

        return minDistance;
    }

    private Boolean checkConstraints(String[] numbers)
    {
        if (numbers.length > 1 && numbers.length <=50)
        {
            int prevLength = -1;
            for (int i = 0; i < numbers.length; i++)
            {
                if (numbers[i].length() > 0 && numbers[i].length() <= 50)
                {
                    if (prevLength == -1)
                    {
                        prevLength = numbers[i].length();
                    }
                    else
                    {
                        if (prevLength != numbers[i].length())
                        {
                            return false;
                        }
                    }

                    for (int j = 0; j < numbers[i].length(); j++)
                    {
                        if (numbers[i].charAt(j) != '0' && numbers[i].charAt(j) != '1')
                        {
                            return false;
                        }
                    }
                }
                else
                {
                    return false;
                }
            }
        }
        else
        {
            return false;
        }

        return true;
    }
}

Java example of Damerau Levenshtein distance

June 28, 2010

Similar to Levenshtein, Damerau-Levenshtein also calculates the distances between two strings. It based around comparing two string and counting the number of insertions, deletions, and substitution of single characters, and transposition of two characters.

This was, originally, aimed at spell checkers, it is also used for DNA sequences.

Wikipedia entry found be here:


public class DamerauLevenshtein
{
    private String compOne;
    private String compTwo;
    private int[][] matrix;
    private Boolean calculated = false;

    public DamerauLevenshtein(String a, String b)
    {
        if ((a.length() > 0 || !a.isEmpty())  || (b.length() > 0 || !b.isEmpty()))
        {
            compOne = a;
            compTwo = b;
        }        
    }

    public int[][] getMatrix()
    {
        setupMatrix();
        return matrix;
    }

    public int getSimilarity()
    {
        if (!calculated) setupMatrix();

        return matrix[compOne.length()][compTwo.length()];
    }

    private void setupMatrix()
    {
        int cost = -1;
        int del, sub, ins;
        
        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i <= compOne.length(); i++)
        {
            matrix[i][0] = i;
        }

        for (int i = 0; i <= compTwo.length(); i++)
        {
            matrix[0][i] = i;
        }

        for (int i = 1; i <= compOne.length(); i++)
        {
            for (int j = 1; j <= compTwo.length(); j++)
            {
                if (compOne.charAt(i-1) == compTwo.charAt(j-1))
                {
                    cost = 0;
                }
                else
                {
                    cost = 1;
                }

                del = matrix[i-1][j]+1;
                ins = matrix[i][j-1]+1;
                sub = matrix[i-1][j-1]+cost;

                matrix[i][j] = minimum(del,ins,sub);

                if ((i > 1) && (j > 1) && (compOne.charAt(i-1) == compTwo.charAt(j-2)) && (compOne.charAt(i-2) == compTwo.charAt(j-1)))
                {
                    matrix[i][j] = minimum(matrix[i][j], matrix[i-2][j-2]+cost);
                }
            }
        }

        calculated = true;
        displayMatrix();
    }
    
    private void displayMatrix()
    {
        System.out.println("  "+compOne);
        for (int y = 0; y <= compTwo.length(); y++)
        {
            if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
            for (int x = 0; x <= compOne.length(); x++)
            {
                System.out.print(matrix[x][y]);
            }
            System.out.println();
        }
    }

    private int minimum(int d, int i, int s)
    {
        int m = Integer.MAX_VALUE;

        if (d < m) m = d;
        if (i < m) m = i;
        if (s < m) m = s;

        return m;
    }

    private int minimum(int d, int t)
    {
        int m = Integer.MAX_VALUE;

        if (d < m) m = d;
        if (t < m) m = t;

        return m;
    }
}

Further to the comments and observations of @zooz (see comments below), I have to apologise and advise that the code above is actually the Optimal String Alignment Distance Algorithm rather than Damerau Levenshtein. Here is the Damerau Levenshtein code in Java:

public int getDHSimilarity()
{
        int res = -1;
        int INF = compOne.length() + compTwo.length();

        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i < compOne.length(); i++)
        {
            matrix[i+1][1] = i;
            matrix[i+1][0] = INF;
        }

        for (int i = 0; i < compTwo.length(); i++)
        {
            matrix[1][i+1] = i;
            matrix[0][i+1] = INF;
        }

        int[] DA = new int[24];

        for (int i = 0; i < 24; i++)
        {
            DA[i] = 0;
        }

        for (int i = 1; i < compOne.length(); i++)
        {
            int db = 0;

            for (int j = 1; j < compTwo.length(); j++)
            {

                int i1 = DA[compTwo.indexOf(compTwo.charAt(j-1))];
                int j1 = db;
                int d = ((compOne.charAt(i-1)==compTwo.charAt(j-1))?0:1);
                if (d == 0) db = j;

                matrix[i+1][j+1] = Math.min(Math.min(matrix[i][j]+d, matrix[i+1][j]+1),Math.min(matrix[i][j+1]+1,matrix[i1][j1]+(i - i1-1)+1+(j-j1-1)));
            }
            DA[compOne.indexOf(compOne.charAt(i-1))] = i;
        }
        
        return matrix[compOne.length()][compTwo.length()];
}

Java example of Levenshtein’s distance algorithm

June 28, 2010

The purpose of this algorithm is to measure the difference between two sequences/strings. It is based around the number of changes required to make one string equal to the other.

It is aimed at short strings, it usage is spell checkers, optical character recognition, etc.

Wikipedia entry can be found here:


public class Levenshtein
{
    private String compOne;
    private String compTwo;
    private int[][] matrix;
    private Boolean calculated = false;

    public Levenshtein(String one, String two)
    {
        compOne = one;
        compTwo = two;
    }

    public int getSimilarity()
    {
        if (!calculated)
        {
            setupMatrix();
        }
        return matrix[compOne.length()][compTwo.length()];
    }
    
    public int[][] getMatrix()
    {
        setupMatrix();
        return matrix;
    }

    private void setupMatrix()
    {
        matrix = new int[compOne.length()+1][compTwo.length()+1];

        for (int i = 0; i <= compOne.length(); i++)
        {
            matrix[i][0] = i;
        }

        for (int j = 0; j <= compTwo.length(); j++)
        {
            matrix[0][j] = j;
        }

        for (int i = 1; i < matrix.length; i++)
        {
            for (int j = 1; j < matrix[i].length; j++)
            {
                if (compOne.charAt(i-1) == compTwo.charAt(j-1))
                {
                    matrix[i][j] = matrix[i-1][j-1];
                }
                else
                {
                    int minimum = Integer.MAX_VALUE;
                    if ((matrix[i-1][j])+1 < minimum)
                    {
                        minimum = (matrix[i-1][j])+1;
                    }

                    if ((matrix[i][j-1])+1 < minimum)
                    {
                        minimum = (matrix[i][j-1])+1;
                    }

                    if ((matrix[i-1][j-1])+1 < minimum)
                    {
                        minimum = (matrix[i-1][j-1])+1;
                    }

                    matrix[i][j] = minimum;
                }
            }
        }
        calculated = true;
        displayMatrix();
    }

    private void displayMatrix()
    {
        System.out.println("  "+compOne);
        for (int y = 0; y <= compTwo.length(); y++)
        {
            if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
            for (int x = 0; x <= compOne.length(); x++)
            {
                System.out.print(matrix[x][y]);
            }
            System.out.println();
        }
    }
}