Even though I was planning on having a nltk tutorial as the 3rd article, I came across the ‘Levenshtein distance ‘ or simply know as the ‘Minimum edit distance‘. I thought I would share some ideas on that first. Before going to writing codes lets try to see in what scenario do we need this.
Imagine a word processing tool we are using, say Microsoft Word, when we type some words with incorrect spellings it usually underlines it in red or something and then give us the suggestions right ? So how did it do that ?
These use a basic technique where it calculate the similarities of our word with the most frequent similar words with correct spellings. (of cause a tool like Microsoft Word is using much more complex algorithms)
For example if I type ‘supehero’, the engine will ask me something like ‘did you mean superhero’. So what it does is it sees that the ‘supehero’ is not a valid English word, therefore it tries to find out words which are having similar letters, say superhero, superman, supergirl, etc. Then it calculate Levenshtein distance to each word and then return the word with the least distance.
Usually depending on the algorithm and the use case the tool will suggest us one word or few word. In the image it suggested one word. So lets see how it does that.
Basically the ‘minimum edit distance’ is the number of operations that we have to do on a particular word in order to transform it to a different word. Okay, first question is ‘what are the allowed operations ?’
There are 3 operations that we can choose from.
- add a letter
- delete a letter
- substitute a letter
Lets see few example.
|Operation||Word Before||Word After||Comment|
|Add a letter||ello||hello||Add ‘h’|
|Delete a letter||heallo||hello||Delete ‘a’|
|Substitute a letter||helbo||hello||Replace ‘b’ with ‘l’|
In each of the previous examples we can see that we are able to apply a single operation and get the desired word. So in the above cases the minimum edit distance is 1.
Lets see another example
The m.e.d. is 2 in this case. How ? First we have to do the substitute operation for ‘r’ -> ‘h’. Then we have to do the delete operation to delete the unnecessary ‘b’.
(One might say it can be done in three operations as well. Add a ‘h’, remove the ‘r’ and remove the ‘b’. Yes that is true. We can do it in various ways. But the Levenshtein distance is the MINIMUM number of operations. Therefore we have to find the least number of operations that we need)
Lets see another example
|elephant||elegant||Distance = 2
(replace ‘p’ with ‘g’. Add ‘h’)
When we look at it there are many ways of doing the transformation. For simple few letter words finding the minimum number of operations are easy. But for longer words like above, how can we be sure that we have found the minimum distance, that there is no other solution which will be able to do this in lesser number of steps ? We will see that in the next article since that will take a bit more explaining than this.