Java example of Jaro Winkler

This algorithm is purposely designed for record linkage, it was designed for linking short strings. It calculates a normalised score on the similarity between two strings.

The calculation is based on the number of matching characters held within the string and the number of transpositions.

The wikipedia entry can be found here, the other source was lingpipe:


public class JaroWinkler
{
    private String compOne;
    private String compTwo;

    private String theMatchA = "";
    private String theMatchB = "";
    private int mRange = -1;
    
    public JaroWinkler()
    {
    }
    
    public JaroWinkler(String s1, String s2)
    {
        compOne = s1;
        compTwo = s2;
    }

    public double getSimilarity(String s1, String s2)
    {
        compOne = s1;
        compTwo = s2;

        mRange = Math.max(compOne.length(), compTwo.length()) / 2 - 1;
        
        double res = -1;

        int m = getMatch();
        int t = 0;
        if (getMissMatch(compTwo,compOne) > 0)
        {
            t = (getMissMatch(compOne,compTwo) / getMissMatch(compTwo,compOne));
        }

        int l1 = compOne.length();
        int l2 = compTwo.length();

        double f = 0.3333;
        double mt = (double)(m-t)/m;
        double jw = f * ((double)m/l1+(double)m/l2+(double)mt);
        res = jw + getCommonPrefix(compOne,compTwo) * (0.1*(1.0 - jw));
        
        return res;
    }

    private int getMatch()
    {
        
        theMatchA = "";
        theMatchB = "";
        
        int matches = 0;

        for (int i = 0; i < compOne.length(); i++)
        {
            //Look backward
            int counter = 0;
            while(counter <= mRange && i >= 0 && counter <= i)
            {
                if (compOne.charAt(i) == compTwo.charAt(i - counter))
                {
                    matches++;
                    theMatchA = theMatchA + compOne.charAt(i);
                    theMatchB = theMatchB + compTwo.charAt(i);
                }
                counter++;                
            }

            //Look forward
            counter = 1;
            while(counter <= mRange && i < compTwo.length() && counter + i < compTwo.length())
            {
                if (compOne.charAt(i) == compTwo.charAt(i + counter))
                {
                    matches++;
                    theMatchA = theMatchA + compOne.charAt(i);
                    theMatchB = theMatchB + compTwo.charAt(i);
                }
                counter++;
            }
        }
        return matches;
    }

    private int getMissMatch(String s1, String s2)
    {
        int transPositions = 0;

        for (int i = 0; i < theMatchA.length(); i++)
        {
            //Look Backward
            int counter = 0;
            while(counter <= mRange && i >= 0 && counter <= i)
            {
                if (theMatchA.charAt(i) == theMatchB.charAt(i - counter) && counter > 0)
                {
                    transPositions++;
                }
                counter++;
            }

            //Look forward
            counter = 1;
            while(counter <= mRange && i < theMatchB.length() && (counter + i) < theMatchB.length())
            {
                if (theMatchA.charAt(i) == theMatchB.charAt(i + counter) && counter > 0)
                {
                    transPositions++;
                }
                counter++;
            }
        }
        return transPositions;
    }

    private int getCommonPrefix(String compOne, String compTwo)
    {
        int cp = 0;
        for (int i = 0; i < 4; i++)
        {
            if (compOne.charAt(i) == compTwo.charAt(i)) cp++;
        }
        return cp;
    }
}

Advertisements

Tags: , , , ,

7 Responses to “Java example of Jaro Winkler”

  1. Similarity algorithms « It's Smee Blog Says:

    […] Damerau-Levenshtein Hamming Jaro Winkler N-Gram Markov […]

  2. Dick Says:

    I guess you must have refactored your code without cleaning up all of the leftover vestiges of prior code, and I think you end up with an unintentional result.

    Your getMisMatch(String s1, String s2) never uses s1 or s2. Consequently, the value of t in getSimilarity(…) will always be either 0 or 1.

    I haven’t yet studied the code rigorously to discern what you really intended, but I thought it worth bringing that to your attention.

  3. Dick Says:

    As I’ve studied the code as well as similar code from another source, it seems to me that s1 should be used in place of theMatchA and s2 in place of theMatchB in your getMissMatch method.

    Do you agree?

    • itssmee Says:

      The problem here is, getMissMatch (being a private function) does not need to take any parameters as theMatchA and theMatchB have both been set in the private function getMatch() which is called before getMisMatch in the function getSimilarity(String, String).

      The variables theMatchA and theMatchB are made via a comparison of compOne to compTwo in getMatch() (where compOne and compTwo are equal to s1 and s2 and set in getSimilarity).

      This is basically an over-sight when I refactored the code.

      Feel free to suggest your version of the code.

      It has been a while since I wrote this.

  4. Rysing Ding Says:

    Is a jork? similarity of 0000001 and 0000001 is 2.02.

  5. DIOP Says:

    i think you need this, i tried to execute your code :

    package j;

    public class Jaro
    {
    private String compOne;
    private String compTwo;

    private String theMatchA = “”;
    private String theMatchB = “”;
    private int mRange = -1;
    private int transPositions=0;
    private int matches=0;

    public Jaro()
    {
    }

    public int getTransPositions() {this.getMissMatch();
    return transPositions;
    }

    public int getMatches() {
    return matches;
    }

    public Jaro(String s1, String s2)
    {
    compOne = s1;
    compTwo = s2;
    }

    public double JaroIndice()
    {

    mRange = Math.max(compOne.length(), compTwo.length()) / 2 – 1;

    double res = -1;

    int m = getMatch();
    int t = 0;
    if (getMissMatch() > 0)
    {
    t = (getMissMatch() / getMissMatch());
    t = (getMissMatch() / 2);
    }

    int l1 = compOne.length();
    int l2 = compTwo.length();

    double f = (double)1/3;
    double mt = (double)(m-t)/m;
    double jw = f * ((double)m/l1+(double)m/l2+(double)mt);
    // res = jw + getCommonPrefix(compOne,compTwo) * (0.1*(1.0 – jw));

    // return jw;

    if(this.compOne.equals(this.compTwo))
    {return 1;}
    else
    {
    if(this.getMatch()==0)
    return 0;
    else
    return jw;
    }
    }

    public int getMatch()
    {
    mRange = Math.max(compOne.length(), compTwo.length()) / 2 – 1;
    theMatchA = “”;
    theMatchB = “”;

    int matches = 0;

    for (int i = 0; i < compOne.length(); i++)
    {
    //Look backward
    int counter = 0;
    while(counter = 0 && counter <= i && i-counter <compTwo.length())
    {
    if (compOne.charAt(i) == compTwo.charAt(i – counter))
    {
    matches++;
    theMatchA = theMatchA + compOne.charAt(i);
    theMatchB = theMatchB + compTwo.charAt(i);
    }
    counter++;
    }

    //Look forward
    counter = 1;
    while(counter <= mRange && i < compTwo.length() && counter + i < compTwo.length() && i-counter <compOne.length())
    {
    if (compOne.charAt(i) == compTwo.charAt(i + counter))
    {
    matches++;
    theMatchA = theMatchA + compOne.charAt(i);
    theMatchB = theMatchB + compTwo.charAt(i);
    }
    counter++;
    }
    }
    return matches;
    }

    public int getMissMatch()
    { mRange = Math.max(compOne.length(), compTwo.length()) / 2 – 1;
    int transPositions = 0;

    for (int i = 0; i < theMatchA.length(); i++)
    {
    //Look Backward
    int counter = 0;
    while(counter = 0 && counter 0)
    {
    transPositions++;
    }
    counter++;
    }

    //Look forward
    counter = 1;
    while(counter <= mRange && i < theMatchB.length() && (counter + i) 0)
    {
    transPositions++;
    }
    counter++;
    }
    }
    return transPositions;
    }

    }

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: