Posts Tagged ‘example code’

Similarity algorithms

June 28, 2010

I have recently been researching a record linkage techniques, and part of this process I have been reminding myself of certain algorithms, and in other case learning these for the first time. As is my way, I typically try and turn the algorithm into code to allow me to understand and learn it. I have coded up examples of the following algorithms in Java:

Levenshtein
Damerau-Levenshtein
Hamming
Jaro Winkler
N-Gram
Markov Chain

Out of curiosity I decided to publish each one separately so I can see from the stats which is the most popular. In a month or so I will add a histogram of the hits for each one (if there are any!).

I hope this is useful.

Be warned, I offer no warranty or guarantee on this code, any changes / enhancements / corrections / or use of this code should be attributed to itssmee, CHIME, UCL (where I am doing my research) and shared with the community.

Java example of N-gram

June 28, 2010

An algorithm to calculate the probability of the next term based on the previous n terms.

They are used in speech recognition, phonemes, language recognition etc.

Wikipedia entry can be found here. Wikipedia, in this case, is a little limited on either pseudo code or an algorithm. In this case I referenced the following presentation (Slides 27/28) from CRULP talking about different algorithms used in spell checkering. Also, further reading can be found in Chapter 6 of Christopher D Manning and Hinrich Schutze’s Foundations of Statistical Natural Language Processing.

Here is the code:


public class Ngram
{
    private class result
    {
        private String theWord;
        private int theCount;

        public result(String w, int c)
        {
            theWord = w;
            theCount = c;
        }

        public void setTheCount(int c)
        {
            theCount = c;
        }

        public String getTheWord()
        {
            return theWord;
        }

        public int getTheCount()
        {
            return theCount;
        }
    }

    private List<result> results;

    public Ngram()
    {
        results = new ArrayList<result>();
    }
    public Ngram(String str, int n)
    {

    }

    public double getSimilarity(String wordOne, String wordTwo, int n)
    {
        List<result> res1 = processString(wordOne, n);
        //displayResult(res1);
        List<result> res2 = processString(wordTwo, n);
        //displayResult(res2);
        int c = common(res1,res2);
        int u = union(res1,res2);
        double sim = (double)c/(double)u;

        return sim;
    }

    private int common(List<result> One, List<result> Two)
    {
        int res = 0;

        for (int i = 0; i < One.size(); i++)
        {
            for (int j = 0; j < Two.size(); j++)
            {
                if (One.get(i).theWord.equalsIgnoreCase(Two.get(j).theWord)) res++;
            }
        }

        return res;
    }

    private int union(List<result> One, List<result> Two)
    {
        List<result> t = One;

        for (int i = 0; i < Two.size(); i++)
        {
            int pos = -1;
            boolean found = false;
            for (int j = 0; j < t.size() && !found; j++)
            {
                if (Two.get(i).theWord.equalsIgnoreCase(t.get(j).theWord))
                {
                    found = true;
                }
                pos = j;
            }

            if (!found)
            {
                result r = Two.get(i);
                t.add(r);
            }
        }

        return t.size();
    }

    private List<result> processString(String c, int n)
    {
        List<result> t = new ArrayList<result>();

        String spacer = "";
        for (int i = 0; i < n-1; i++)
        {
            spacer = spacer + "%";
        }
        c = spacer + c + spacer;
        
        for (int i = 0; i < c.length(); i++)
        {
            if (i <= (c.length() - n))
            {
                if (contains(c.substring(i, n+i)) > 0)
                {
                    t.get(i).setTheCount(results.get(i).getTheCount()+1);
                }
                else
                {
                    t.add(new result(c.substring(i,n+i),1));
                }
            }
        }
        return t;
    }

    private int contains(String c)
    {
        for (int i = 0; i < results.size(); i++)
        {
            if (results.get(i).theWord.equalsIgnoreCase(c))
                return i;
        }
        return 0;
    }

    private void displayResult(List<result> d)
    {
        for (int i = 0; i < d.size(); i++)
        {
            System.out.println(d.get(i).theWord+" occurred "+d.get(i).theCount+" times");
        }
    }
}

Java example of Markov Chain

June 28, 2010

The Markov Chain model calculates the probability of the next letter occurring in a sequence based on the previous n characters.

They are used in a multitude of areas, including data compression, entropy encoding, and pattern recognition.

The wikipedia entry can be found here. I based my algorithm off of an excellent explanation found here, provided by Rutgers as part of their course Discrete and Probabilistic Models in Biology. Warning, I think this document has a couple of mistakes in the matrix on page 10 (the counting of A,C,G,T), or I may be mistaken.

For this code, the final part which is needed is a method which gives the probability of a sequence, that is you pass in a sequence (e.g. AACGT) and it returns the probability of this sequence based on the matrix it has calculated.

The code:


public class Markov
{

    private class similarities
    {
        private char c;
        private int count;
        
        public similarities(char a, int b)
        {
            c = a;
            count = b;
        }
        
        public char getC()
        {
            return c;
        }
        
        public int getCount()
        {
            return count;
        }
        
        public void setCount(int a)
        {
            count = a;
        }
        
        public void increaseCount()
        {
            count++;
        }
    }

    private List<similarities> entries;
    private List<String> axies;
    private double[][] matrix;

    public Markov()
    {
        entries = new ArrayList();
        axies = new ArrayList();
    }

    public void getSimilarity(List<String> s)
    {
        String temp = new String();
        for (int i = 0; i < s.size(); i++)
        {
            if (temp.length() > 0)
            {
                temp = temp + " " + s.get(i);
            }
            else temp = s.get(i);

        }
        setupAxis(temp);
        compareCharsAgainstAxis(temp);

        display();

        for (int i = 0; i < matrix.length; i++)
        {
            double line = getLineValue(i);
            for (int j = 0; j < matrix[i].length; j++)
            {
                double t = matrix[i][j];

                matrix[i][j] = t / line;
            }
        }

        display();
    }

    private void compareCharsAgainstAxis(String s)
    {
        matrix = new double[axies.size()][axies.size()];
        String comparison = "";
        String test = s;

        for (int h = 0; h < test.length(); h++)
        {
            int end = test.indexOf(" ");
            if (end > 0)
            {
                comparison = test.substring(0,end);
            }
            else 
            {
                comparison = test.substring(0,test.length());
                end = test.length()-1;
            }
            test = test.substring(end+1,test.length());

            for (int i = 1; i < comparison.length(); i++)
            {
                for (int j = 0; j < axies.size(); j++)
                {
                    for (int k = 0; k < axies.size(); k++)
                    {
                        if (String.valueOf(comparison.charAt(i-1)).equalsIgnoreCase(axies.get(j)) && String.valueOf(comparison.charAt(i)).equalsIgnoreCase(axies.get(k)))
                        {
                            matrix[j][k]++;
                        }
                    }
                }
            }
        }

    }
    private void setupAxis(String temp)
    {
        for (int j = 0; j < temp.length(); j++)
        {
            boolean found = false;

            for (int k = 0; k < axies.size() && !found; k++)
            {
                if (String.valueOf(temp.charAt(j)).equalsIgnoreCase(axies.get(k)))
                {
                    found = true;
                }
            }

            if (!found)
            {
                if (temp.charAt(j) > 32)
                {
                    axies.add(String.valueOf(temp.charAt(j)));
                }
            }
        }
    }

    private double getTotalValue()
    {
        double sum = 0;
        for (int i = 0; i < matrix.length; i++)
        {
            for (int j = 0; j < matrix[i].length; j++)
            {
                double t = matrix[i][j];
                sum = sum + t;
            }
        }
        return sum;
    }

    private double getLineValue(int line)
    {
        double sum = 0;
        for (int i = 0; i < matrix[line].length; i++)
        {
            double t = matrix[line][i];
            sum = sum + t;
        }
        return sum;
    }

    private void display()
    {
        System.out.println("Sum of matrix is "+getTotalValue());
        System.out.println("Sum of line 1 = "+getLineValue(3));

        System.out.print("  ");
        for (int j = 0; j < axies.size(); j++)
        {
            System.out.print(axies.get(j)+" ");
        }
        System.out.println();
        for (int i = 0; i < matrix.length; i++)
        {
            System.out.print(axies.get(i)+" ");
            for (int j = 0; j < matrix[i].length; j++)
            {
                System.out.print(matrix[i][j]+" ");
            }
            System.out.println();
        }
    }
}