Fuzzy string matching sits somewhere between using regex and using the equals operator: it compares strings in a fuzzy way, so they don’t have to be exactly the same string. In simple words, fuzzy string matching is a technique for locating approximate matches to a search query in data, based on how similar the characters of two strings are. In this post, we’re going to look at fuzzy text search in Python using the Levenshtein distance method. Levenshtein distance, also known as edit distance, is a technique used to measure how similar two strings are. So how does it work?
First, we calculate the Levenshtein distance between the two strings. The Levenshtein distance is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) needed to change one string into the other.
Next, we compare the distance and determine if it’s less than a certain threshold. If it is, we assume that the strings are similar and we treat them as the same entity.
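The two steps above can be sketched in plain Python. This is a minimal illustration, not the code thefuzz runs internally; the names `levenshtein` and `is_similar` and the threshold value of 3 are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the prefix of a handled so far
    # (initially the empty prefix) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


def is_similar(a: str, b: str, threshold: int = 3) -> bool:
    """Treat two strings as the same entity if their distance is below the threshold."""
    return levenshtein(a, b) < threshold


print(levenshtein("kitten", "sitting"))          # 3
print(is_similar("Hello World", "hello world"))  # True (only 2 substitutions)
```

A lower threshold makes the match stricter, and a sensible value usually depends on how long the strings are.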
Let’s look at an example of why and where we need fuzzy text search in Python:
s1 = "Hello World"
s2 = "hello world"
These two strings look almost identical, but Python does not treat them as the same because they differ in case. If you compare them like this, neither check succeeds.
print(s1 is s2)
print(s1 == s2)
You will get False for both in the output, and that’s where fuzzy matching comes in: instead of only asking whether two strings are identical, it lets us measure how close they are.
Prerequisites:
thefuzz:
Thefuzz is the renamed and updated version of the FuzzyWuzzy library. It’s a widely used open-source string matching library for fuzzy text search in Python and was first developed by SeatGeek to help decide whether or not two similarly named ticket listings were for the same event. It uses the Levenshtein distance (an edit distance that counts character insertions, deletions, and substitutions) to make this possible. In addition, thefuzz contains functionality for evaluating string similarity in other circumstances that we’ll get into below.
You can install thefuzz using the following commands in your terminal:
pip install thefuzz
pip install python-Levenshtein
Code
In this tutorial, we will go through examples so that you can better understand how it works and apply the process yourself.
Example: simple ratio
from thefuzz import fuzz
from thefuzz import process
s1 = "i am a coder"
s2 = "I am a coder"
print(fuzz.ratio(s1, s2))
Here again, we have two strings, and this time we are going to see how similar they are using the fuzz.ratio method.
Output
92
As you can see, they are 92% similar. This is a very basic similarity measure.
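To see where a score like 92 can come from, here is a rough standard-library approximation. It is a sketch, not thefuzz’s actual implementation: `simple_ratio` is a hypothetical helper built on `difflib.SequenceMatcher`, which scores two strings as 2 * M / T, where M is the number of matching characters and T is the combined length of both strings.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Approximate a 0-100 similarity score using only the standard library."""
    # ratio() returns 2 * M / T as a float between 0.0 and 1.0
    return round(100 * SequenceMatcher(None, a, b).ratio())


s1 = "i am a coder"
s2 = "I am a coder"
# 11 of the 12 characters match, so the score is 2 * 11 / 24, about 0.917
print(simple_ratio(s1, s2))  # 92
```

For this pair the approximation agrees with the score above, but the two may differ on other inputs.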
Example: partial ratio
from thefuzz import fuzz
from thefuzz import process
s1 = "i am a coder"
s2 = "i am a coder and a very good one"
# using a partial ratio method
print(fuzz.partial_ratio(s1, s2))
Here we are using the fuzz.partial_ratio method to see how partially similar the strings are: if the shorter string closely matches some part of the longer one, they count as partially similar.
Output
100
Here the strings are different from each other, but we still get a score of 100 because partial_ratio looks at individual parts of the longer string and finds that one part is identical to the shorter string.
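The idea of looking at individual parts can be sketched as a sliding window: compare the shorter string against every same-length slice of the longer one and keep the best score. This is a simplified illustration under that assumption, not how thefuzz selects its windows; `partial_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any same-length slice of the longer."""
    shorter, longer = sorted((a, b), key=len)
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        piece = longer[start:start + window]
        score = round(100 * SequenceMatcher(None, shorter, piece).ratio())
        best = max(best, score)
    return best


s1 = "i am a coder"
s2 = "i am a coder and a very good one"
# the first 12 characters of s2 are exactly s1
print(partial_ratio_sketch(s1, s2))  # 100
```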
Example: token sort ratio
from thefuzz import fuzz
from thefuzz import process
s1 = "how are you I am a coder"
s2 = "I am a coder how are you"
print(fuzz.partial_ratio(s1, s2))
print(fuzz.ratio(s1, s2))
print(fuzz.token_sort_ratio(s1,s2))
Output
50
50
100
Here we have an output of 100 using the token_sort_ratio method because it’s the same sentence in a different order. token_sort_ratio doesn’t care what order the words are in: it sorts the words in both strings before comparing them, so the same words in any order give the same result.
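A minimal sketch of that idea, using only the standard library: split each string into words, sort them, join them back, and compare the results. The helper name `token_sort_ratio_sketch` is an assumption for this example, and the real method also cleans up punctuation, which this sketch skips.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Compare two strings after lowercasing and sorting their words."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


s1 = "how are you I am a coder"
s2 = "I am a coder how are you"
# both strings become "a am are coder how i you" after sorting
print(token_sort_ratio_sketch(s1, s2))  # 100
```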
Example: token set ratio
from thefuzz import fuzz
from thefuzz import process
s1 = "hello how are you, i am a coder coder coder"
s2 = "hello how are you, i am a coder"
print(fuzz.token_set_ratio(s1,s2))
Output
100
Here we have an output of 100 using the token_set_ratio method because a set contains each token just once, so it doesn’t matter how many times a token occurs.
Some more related methods:
1. partial_token_set_ratio
2. partial_token_sort_ratio
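The set idea can be sketched as follows: build a set of words for each string, then compare the shared words against the shared words plus each string’s leftovers and keep the best score. This is a simplified approximation written for illustration; `token_set_ratio_sketch` and `score` are hypothetical helpers, and punctuation handling is left out.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """0-100 similarity score from the standard library."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a: str, b: str) -> int:
    tokens_a = set(a.lower().split())  # duplicates disappear here
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return max(
        score(common, combined_a),
        score(common, combined_b),
        score(combined_a, combined_b),
    )


s1 = "hello how are you, i am a coder coder coder"
s2 = "hello how are you, i am a coder"
# both strings reduce to the same set of words
print(token_set_ratio_sketch(s1, s2))  # 100
```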
Example: The process
The process module is used to extract the best-matching text from a collection using this fuzzy matching.
from thefuzz import fuzz
from thefuzz import process
# a list of strings, some of which are similar to each other
things = ["programming language", "complete language", "home policy", "your left", "my left",
          "government policy", "good hell", "good heaven"]
# now let's pick the best matches for a query
print(process.extract("policy", things, limit=2))
Output
[('home policy', 90), ('government policy', 90)]
Here in the output, it gave us the two strings that most closely match the word "policy", each with its score. Let’s look at one more example:
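Before that, the idea behind extracting from a collection can be sketched with the standard library: score every choice against the query and keep the highest-scoring ones. This is only an illustration. `extract_sketch` and `partial_score` are hypothetical helpers, and because the scoring here is a simple sliding-window comparison rather than thefuzz’s default scorer, the numbers differ from the 90 shown above.

```python
import heapq
from difflib import SequenceMatcher


def partial_score(query: str, choice: str) -> int:
    """Best 0-100 score of the shorter string against any same-length slice of the longer."""
    shorter, longer = sorted((query, choice), key=len)
    window = len(shorter)
    return max(
        round(100 * SequenceMatcher(None, shorter, longer[i:i + window]).ratio())
        for i in range(len(longer) - window + 1)
    )


def extract_sketch(query: str, choices: list, limit: int = 2) -> list:
    """Return the `limit` best (choice, score) pairs for the query."""
    scored = ((choice, partial_score(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


things = ["programming language", "complete language", "home policy", "your left", "my left",
          "government policy", "good hell", "good heaven"]
# both entries containing "policy" get the top score of 100 in this sketch
print(extract_sketch("policy", things, limit=2))
```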
Example: fuzzy string matching in a pandas DataFrame
Import dependencies
import pandas as pd
from thefuzz import fuzz
from thefuzz import process
Creating the dictionaries for matching the strings.
dict1 = {'name': ["policy", "language", "home", "left", "good"]}
dict2 = {'name': ["complete language", "complete language", "home policy", "your left", "my left", "government policy", "good heaven"]}
Converting the dictionaries to pandas data frames.
dframe1 = pd.DataFrame(dict1)
dframe2 = pd.DataFrame(dict2)
Create empty lists for storing the matches later.
mat1 = []
mat2 = []
p = []
Print the pandas dataframes.
print("First dataframe:\n", dframe1, "\nSecond dataframe:\n", dframe2)
Converting data frame column to list to do fuzzy matching.
list1 = dframe1['name'].tolist()
list2 = dframe2['name'].tolist()
We set the threshold at 85, but feel free to change it as much as you like.
threshold = 85
Going through list1 to extract each item’s two closest matches from list2.
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))
dframe1['matches'] = mat1
Going through the closest matches and keeping only those whose score is at or above the threshold.
for j in dframe1['matches']:
    for k in j:
        # k is a (matched string, score) pair
        if k[1] >= threshold:
            p.append(k[0])
    mat2.append(",".join(p))
    p = []
Storing the resulting matches back to dframe1.
dframe1['matches'] = mat2
print("\n This is the dataframe we get after fuzzy matching:")
print(dframe1)
Output
In the output, we can see each name from the first data frame together with the matches from the second data frame that scored at or above the threshold.
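The threshold filtering step can also be written as a small stand-alone helper, which makes it easier to try on its own. This is a sketch: `filter_matches` is a hypothetical name, and the last pair in the sample data is made up for illustration.

```python
def filter_matches(matches, threshold=85):
    """Join the names of all (name, score) pairs that score at least `threshold`."""
    return ",".join(name for name, score in matches if score >= threshold)


# the first two pairs come from the process.extract example; the third is hypothetical
sample = [("home policy", 90), ("government policy", 90), ("my left", 30)]
print(filter_matches(sample))  # home policy,government policy
```

With the data frame from this example, the same helper could be applied to the column of raw matches with dframe1['matches'].apply(filter_matches) in place of the nested loop.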
Final Words
We’ve all been there. You are looking for a piece of information in a large text file and you don’t remember the exact spelling of what you are looking for. This is a common problem when working with log files and other text files. In this blog, we have given a basic introduction to fuzzy string matching and fuzzy text search in Python in the form of simple examples. We have also described some of the basic methods that are involved in fuzzy string matching. We hope you liked the blog. If you have any queries, feel free to ask in the comment section below.