Exploring NLP 04 -Minimum Edit Distance 2/4

In the previous article we looked in to what ‘Levenshtein distance‘ is and an example use case. We saw that when the word we are transforming has small number of letters it is easy to find the minimum edit distance, but as the number of letters in the word increases it is difficult to be sure that we have found the ‘minimum’. So we will be discussing a method that will make sure that we end up with the value for minimum edit distance.

We will take the following example.

five four

From one glance we can see that we need minimum 3 operations to transform five to four (3 replace operations). Lets see how this new method would give that answer. First we have to make a grid like follows. Columns will have the word before transformation and the rows will have the word after transformation. The ‘ε‘  (epsilon) denotes an empty string.

ε f i v e
ε
f
o
u
r

We start from the top and we fill the cells. We read the grid from top to left. For example, to fill the first cell we have to think ‘how many operations does it take to convert an empty string (ε) to an empty string (ε) ?’ The answer is zero. So we put 0 there.

ε f i v e
ε  0
f
o
u
r

Next, to fill the second cell we have to think ‘how many operations does it take to convert a ‘f’ to an empty string (ε) ?’ The answer is 1 (delete ‘f’). So we put 1 there.

ε f i v e
ε  0  1
f
o
u
r

Next to fill the third cell we have to think ‘how many operations does it take to convert ‘fi’ to an empty string (ε) ?’ The answer is 2 (delete ‘f’, delete ‘i’). So we put 2 there.

ε f i v e
ε  0  1  2
f
o
u
r

We can fill the first row in a similar manner and then we get the following.

ε f i v e
ε  0 1  2  3  4
f
o
u
r

Next to fill the first cell in the second row we have to think ‘how many operations does it take to convert an empty string ‘ε‘ to ‘f’ ?’ The answer is 1 (add ‘f’). So we put 1 there.

ε f i v e
ε 0 1 2 3 4
f 1
o
u
r

Next to fill the second cell in the second row we have to think ‘how many operations does it take to convert ‘f’ to ‘f’ ?’ The answer is 0. So we put 0 there.

ε f i v e
ε 0 1 2 3 4
f 1 0
o
u
r

similarly we can keep on going filling.

ε f i v e
ε  0 1 2 3 4
f  1  0 1 2 3
o 2 1 1  x
u
r

One more example, so how do we fill the above ‘x’ ?  We have to think ‘how many operations does it take to convert ‘fiv’ to ‘fo’ ?’ The answer is 2 (replace ‘i’ with ‘o’. delete ‘v’). So we put 2 there.

ε f i v e
ε  0 1 2 3 4
f 1 0 1 2 3
o 2 1 1 2
u
r

Finally we end up with something like this.

ε f i v e
ε 0 1 2 3 4
f 1 0 1 2 3
o  2 1 1 2 3
u  3 2 2 2 3
r  4 3 3 3 3

The ‘Minimum Edit Distance’ (Levenshtein distance) is the value of the last cell.

So we get the same answer 3. For the example I took a small word, but when dealing with longer words this grid method is much useful in making sure we got the minimum number of operations.

Okay so we now know the grid method, but it takes time right ? What if there is a cheat method? a short cut to fill this table ? There is. We will look in to it in the next part, and we will finish up with the python code for this calculation in the article after that.

Exploring NLP 05 – Minimum Edit Distance 3/4

 

Exploring NLP 03 -Minimum Edit Distance 1/4

Even though I was planning on having a nltk tutorial as the 3rd article, I came across the ‘Levenshtein distance ‘ or simply know as the ‘Minimum edit distance‘.  I thought I would share some ideas on that first. Before going to writing codes lets try to see in what scenario do we need this.

Imagine a word processing tool we are using, say Microsoft Word, when we type some words with incorrect spellings it usually underlines it in red or something and then give us the suggestions right ? So how did it do that ?

These use a basic technique where it calculate the similarities of our word with the most frequent similar words with correct spellings. (of cause a tool like Microsoft Word is using much more complex algorithms)

For example if I type ‘supehero’, the engine will ask me something like ‘did you mean superhero’. So what it does is it sees that the ‘supehero’ is not a valid English word, therefore it tries to find out words which are having similar letters, say superhero, superman, supergirl,  etc. Then it calculate Levenshtein distance to each word and then return the word with the least distance. nlp1

Usually depending on the algorithm and the use case the tool will suggest us one word or few word. In the image it suggested one word. So lets see how it does that.

Basically the ‘minimum edit distance’ is the number of operations that we have to do on a particular word in order to transform it to a different word. Okay, first question is ‘what are the allowed operations ?’

There are 3 operations that we can choose from.

  • add a letter
  • delete a letter
  • substitute a letter

Lets see few example.

Operation Word Before Word After Comment
Add a letter ello hello Add ‘h’
Delete a letter heallo hello Delete ‘a’
Substitute a letter helbo hello Replace ‘b’ with ‘l’

In each of the previous examples we can see that we are able to apply a single operation and get the desired word. So in the above cases the minimum edit distance is 1.

Lets see another example

rabbit habit

The m.e.d. is 2 in this case. How ?  First we have to do the substitute operation for ‘r’ -> ‘h’. Then we have to do the delete operation to delete the unnecessary ‘b’.

(One might say it can be done in three operations as well. Add a ‘h’, remove the ‘r’ and remove the ‘b’. Yes that is true. We can do it in various ways. But the Levenshtein distance is the MINIMUM number of operations. Therefore we have to find the least number of operations that we need)

Lets see another example

elephant elegant Distance = 2

(replace ‘p’ with ‘g’. Add ‘h’)

When we look at it there are many ways of doing the transformation. For simple few letter words finding the minimum number of operations are easy. But for longer words like above, how can we be sure that we have found the minimum distance, that there is no other solution which will be able to do this in lesser number of steps ? We will see that in the next article since that will take a bit more explaining than this.

Exploring NLP 04 – Minimum Edit Distance 2/4