Fuzzy Text Search using Python

Fuzzy string matching is like comparing strings with regex or the equals operator, except that the two strings don’t have to be identical. In simple words, fuzzy string matching is a technique for locating approximate or partial matches to a search query in data, based on how similar the characters of two strings are. In this post, we’re going to look at fuzzy text search in Python using the Levenshtein distance method. Levenshtein distance, also known as edit distance, is a technique used to measure how similar two strings are. So how does it work?

First, we calculate the Levenshtein distance between the two strings. The Levenshtein distance is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) needed to change one string into the other.

Next, we compare the distance and determine if it’s less than a certain threshold. If it is, we assume that the strings are similar and we treat them as the same entity.
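The two steps above can be sketched in plain Python. This is only a minimal illustration using the standard library: the function names levenshtein and is_similar and the default threshold of 3 are made up for this example and are not part of thefuzz.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # Keep only the previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def is_similar(a: str, b: str, threshold: int = 3) -> bool:
    """Treat two strings as the same entity when their distance is
    below the threshold (a hypothetical value chosen for illustration)."""
    return levenshtein(a, b) < threshold


print(levenshtein("kitten", "sitting"))          # 3
print(is_similar("Hello World", "hello world"))  # True (distance is 2)
```

A fixed threshold like this is more forgiving for short strings than for long ones, which is one reason libraries usually report a normalized score instead of a raw distance.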

Let’s look at an example of why and where fuzzy text search is useful in Python:

s1 = "Hello World"
s2 = "hello world"

These two strings differ only in letter case, but Python does not treat them as the same, as you can see if you compare them like this:

print(s1 is s2)
print(s1 == s2)

Both comparisons print False, and that’s where fuzzy matching comes in: it lets us measure how similar two strings are instead of requiring them to be exactly equal.
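To see the limits of exact comparison, here is a small sketch using only built-in string methods. Lowercasing with casefold() handles the case difference, but a single typo still defeats it. The string s3 is a hypothetical variant added for illustration and is not part of the example above.

```python
s1 = "Hello World"
s2 = "hello world"
s3 = "helo world"  # hypothetical typo variant, added for illustration

# Exact comparison is case-sensitive.
print(s1 == s2)                        # False

# Normalizing case makes the first pair match...
print(s1.casefold() == s2.casefold())  # True

# ...but it cannot cope with a missing character.
print(s1.casefold() == s3.casefold())  # False
```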

Prerequisites:

thefuzz:

Thefuzz is the renamed and updated successor of the FuzzyWuzzy library. It’s a widely used open-source string matching library for fuzzy text search in Python and was first developed by SeatGeek to help decipher whether or not two similarly named ticket listings were for the same event. The library evaluates the Levenshtein distance (an edit distance that counts character insertions, deletions, and substitutions) to make this possible. In addition, it contains functionality for evaluating string similarity in other circumstances that we’ll get into below.

You can install thefuzz using the following commands in your terminal:

pip install thefuzz

pip install python-Levenshtein

Code

In this tutorial, we will go through examples to better understand how it works, so that you can apply the process yourself.

Example: simple ratio

from thefuzz import fuzz
from thefuzz import process
s1 = "i am a coder"
s2 = "I am a coder"
print(fuzz.ratio(s1, s2))

Here again, we have two strings, and this time we are going to see how similar they are using fuzz.ratio method.

Output

As you can see, they are 92% similar. This is a very basic similarity measure.
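To get a feel for where a score like this can come from, here is a rough standard-library approximation. It is only a sketch: simple_ratio is a made-up helper built on difflib.SequenceMatcher, and thefuzz’s own scoring may differ in detail depending on the version and backend installed.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Scale difflib's 0.0-1.0 similarity to a 0-100 integer score."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Only the first character differs, so 11 of the 12 characters line up.
print(simple_ratio("i am a coder", "I am a coder"))  # 92
print(simple_ratio("i am a coder", "i am a coder"))  # 100
```

The score is twice the number of matching characters divided by the combined length of both strings, here 2 * 11 / 24, which rounds to 92.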

Example: partial ratio

from thefuzz import fuzz
from thefuzz import process
s1 = "i am a coder"
s2 = "i am a coder and a very good one"
# using a partial ratio method
print(fuzz.partial_ratio(s1, s2))

Here we use the fuzz.partial_ratio method to see how well the shorter string matches the best-matching part of the longer one. If the shorter string appears, in the same order, somewhere inside the longer string, the two are considered partially similar.

Output

Here the strings are different from each other, but we still get a score of 100 because partial_ratio compares the shorter string against parts of the longer one and finds a part that matches it exactly.
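The idea behind a partial ratio can be sketched with the standard library by sliding a window the size of the shorter string over the longer one and keeping the best score. This is a simplified illustration under that assumption; partial_ratio_sketch is a hypothetical name, and thefuzz may choose its candidate windows differently.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any equally long
    slice of the longer string."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = 0.0
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


s1 = "i am a coder"
s2 = "i am a coder and a very good one"
print(partial_ratio_sketch(s1, s2))  # 100
```

The very first window of s2 is identical to s1, so the best score is 100 no matter what follows it.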

Example: token sort ratio

from thefuzz import fuzz
from thefuzz import process
s1 = "how are you I am a coder"
s2 = "I am a coder how are you"
print(fuzz.partial_ratio(s1, s2))
print(fuzz.ratio(s1, s2))
print(fuzz.token_sort_ratio(s1,s2))

output:

50
50
100

Here we have an output of 100 using the token_sort_ratio method because it’s the same sentence in a different order. Token sort doesn’t care what order the words are in: it sorts the words in both strings before comparing them.
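A minimal sketch of the token sort idea, assuming that normalizing means lowercasing and splitting on whitespace. The helper name token_sort_ratio_sketch is made up, and difflib stands in for the library’s own scorer.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Lowercase, split into words, sort the words, then compare."""
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))

    matcher = SequenceMatcher(None, normalize(a), normalize(b))
    return round(100 * matcher.ratio())


s1 = "how are you I am a coder"
s2 = "I am a coder how are you"
print(token_sort_ratio_sketch(s1, s2))  # 100
```

Both strings normalize to "a am are coder how i you", so the comparison sees two identical strings.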

Example: token set ratio

from thefuzz import fuzz
from thefuzz import process

s1 = "hello how are you, i am a coder coder coder"
s2 = "hello how are you, i am a coder"

print(fuzz.token_set_ratio(s1,s2))

Output

Here we have an output of 100 using the token_set_ratio method because a set contains each token just once, so it doesn’t matter how many times a word occurs.
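The set-based comparison can be sketched as follows. This is an illustrative approximation, not the library’s code: token_set_ratio_sketch and _ratio are hypothetical helpers, words are extracted with a simple regular expression, and difflib stands in for the real scorer.

```python
import re
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a: str, b: str) -> int:
    """Compare the shared words against each string's leftover words."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))

    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    # Shared words alone, and shared words plus each side's extras.
    combined_a = f"{shared} {only_a}".strip()
    combined_b = f"{shared} {only_b}".strip()

    return max(
        _ratio(shared, combined_a),
        _ratio(shared, combined_b),
        _ratio(combined_a, combined_b),
    )


s1 = "hello how are you, i am a coder coder coder"
s2 = "hello how are you, i am a coder"
print(token_set_ratio_sketch(s1, s2))  # 100
```

Because the repeated "coder" collapses to a single token, both strings produce the same set of words and neither has any leftovers.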

Some other commonly used methods:

1. partial_token_set_ratio

2. partial_token_sort_ratio

Example: The process

The process module is used to extract the best fuzzy matches for a query from a collection of strings.

from thefuzz import fuzz
from thefuzz import process

# a list of strings, some of which are similar to each other
things = ["programming language", "complete language", "home policy", "your left", "my left",
            "government policy", "good hell", "good heaven"]

# pick the two best matches for the query "policy"
print(process.extract("policy", things, limit=2))

Output

[('home policy', 90), ('government policy', 90)]

In the output, we get the two strings that most closely match the word "policy". Let’s look at one more example:
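Conceptually, extracting the best matches means scoring every candidate against the query and keeping the top few. The sketch below shows that idea with the standard library only. The names partial_score and extract_sketch are hypothetical, and the scorer is a simplified partial ratio rather than thefuzz’s default, so the scores it reports differ from the 90 shown above.

```python
from difflib import SequenceMatcher


def partial_score(query: str, choice: str) -> int:
    """Best 0-100 score of the shorter string against any equally long
    slice of the longer one (a simplified partial ratio)."""
    shorter, longer = sorted((query.lower(), choice.lower()), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + width]).ratio()
        for i in range(len(longer) - width + 1)
    )
    return round(100 * best)


def extract_sketch(query, choices, limit=5, scorer=partial_score):
    """Score every choice against the query and return the top `limit`
    (choice, score) pairs, best first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


things = ["programming language", "complete language", "home policy", "your left",
          "my left", "government policy", "good hell", "good heaven"]

print(extract_sketch("policy", things, limit=2))
# [('home policy', 100), ('government policy', 100)]
```

Passing the scorer as an argument mirrors the way a different similarity measure can be swapped in without touching the ranking logic.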

Example: fuzzy string matching in a pandas DataFrame

Import dependencies

import pandas as pd
from thefuzz import fuzz
from thefuzz import process

Creating the dictionaries for matching the strings.

dict1 = {'name': ["policy", "language", "home", "left", "good"]}
dict2 = {'name': ["complete language", "complete language", "home policy", "your left", "my left", "government policy", "good heaven"]}

Converting the dictionaries to pandas data frames.

dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)

Create empty lists for storing the matches later.

mat1 = []
mat2 = []
p = []

Print the pandas dataframes.

print("First dataframe:\n", dframe1, "\nSecond dataframe:\n", dframe2)

Converting data frame column to list to do fuzzy matching.

list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()

We set the threshold to 85, but feel free to change it to suit your data.

threshold = 85

Going through list1 to extract the two closest matches for each entry from list2.

for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1

Going through the closest matches and keeping only those whose score is at or above the threshold.

for j in dframe1['matches']:
    for k in j:
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []

Storing the matching results back to dframe1.

dframe1['matches'] = mat2
print("\n This is the dataframe we get after fuzzy matching:")
print(dframe1)

Output

In the output, each name in dframe1 is listed alongside the matches from dframe2 that scored at or above the threshold.
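The two loops above can also be folded into one helper. The sketch below is a plain-Python illustration, not the code used in this post: matches_above_threshold and word_score are hypothetical names, and the scorer is a simple difflib-based stand-in for a thefuzz scorer, so its numbers will not be the same as those from process.extract.

```python
from difflib import SequenceMatcher


def word_score(query: str, choice: str) -> int:
    """Stand-in scorer: 0-100 similarity between the query and the
    best-matching word of the choice."""
    return max(
        (round(100 * SequenceMatcher(None, query.lower(), word).ratio())
         for word in choice.lower().split()),
        default=0,
    )


def matches_above_threshold(queries, choices, threshold=85, limit=2,
                            scorer=word_score):
    """Map each query to a comma-joined string of its best matches,
    keeping at most `limit` choices that score at least `threshold`."""
    result = {}
    for query in queries:
        scored = sorted(((c, scorer(query, c)) for c in choices),
                        key=lambda pair: pair[1], reverse=True)[:limit]
        result[query] = ",".join(c for c, score in scored if score >= threshold)
    return result


list1 = ["policy", "language", "home", "left", "good"]
list2 = ["complete language", "complete language", "home policy", "your left",
         "my left", "government policy", "good heaven"]

for name, found in matches_above_threshold(list1, list2).items():
    print(f"{name}: {found}")
```

Returning a plain dictionary keeps the sketch independent of pandas; the values could be assigned to a DataFrame column in the same way as mat2 above.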

Final Words

We’ve all been there: you are looking for a piece of information in a large text file and you don’t remember the exact text you are looking for. This is a common problem when working with log files and other text files. In this blog, we gave a basic introduction to fuzzy string matching and fuzzy text search in Python through simple examples, and described some of the basic methods involved in fuzzy string matching. I hope you liked the blog. If you have any queries, feel free to comment in the comment section below.

Vyom Srivastava
