Recently, I had an interesting discussion with our TX Spell .NET team lead. I only wanted to know what he was currently working on, which is a question you had better not ask an enthusiastic developer unless you have plenty of time. But I took the time to learn more about creating suggestions for misspelled words.
The big question is:
How do you create appropriate suggestions for a misspelled word in a language you may not be familiar with?
We spend a lot of time researching such questions in order to provide you with the best components on the market. I won't disclose the secrets of why TX Spell .NET is so fast and accurate, but I would like to share some basics to give you an idea of the complexity of this subject.
Two main steps are required to generate a list of appropriate suggestions:
Transformation and Permutation
In the first step, all possible transformations and permutations of the misspelled word must be created up to a specific depth level. This is the most time-consuming part of the process: characters must be removed, added, replaced or shifted. The performance of this algorithm is the key element of the whole process.
Evaluation and Rating
After all possible candidates have been created, they must be weighted in some way. This should increase the probability that the first suggestion is the word the user originally intended to type.
But how do you rank such a candidate?
There are many factors that must be included in such algorithms. The most obvious factor is phonetic replacement. Consider the following word:
ENOUF -> ENOUGH
F should be replaced with its phonetic counterpart, GH.
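A phonetic replacement like this can be sketched with a small substitution table. The pairs below are a hypothetical handful of English examples chosen for illustration, not the rules used in TX Spell .NET, and the function name is an assumption as well.

```python
# Hypothetical pairs of spellings that can sound alike in English.
PHONETIC_PAIRS = [
    ("f", "gh"),
    ("f", "ph"),
    ("k", "c"),
    ("k", "ck"),
    ("s", "c"),
    ("z", "s"),
]


def phonetic_variants(word):
    """Return all strings made by swapping one sound-alike spelling in `word`."""
    variants = set()
    for a, b in PHONETIC_PAIRS:
        # Try each pair in both directions, e.g. f -> gh and gh -> f.
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                variants.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    variants.discard(word)
    return variants


if __name__ == "__main__":
    print(sorted(phonetic_variants("enouf")))
```

A candidate that can be reached through such a phonetic swap, like "enough" from "enouf", could then be rated higher than a candidate that needs an arbitrary character change.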
But this is just the simplest way to rate a suggestion. More complex considerations are required to find a highly probable replacement word. Another approach is to measure the distances between the keys on the keyboard currently in use. On a US English keyboard, the probability of pressing the "S" key instead of the "A" key is much higher than that of hitting the "L" key, which is on the other side of the keyboard. But at the same time, the algorithm must decide whether the pressed "L" was intended and the "A" was simply missed. As you can see, this is a very complex task that took us a lot of time and effort. And we still faced the problem of how to weight the different kinds of changes against each other in a suggestion.
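The keyboard distance idea can be illustrated with a short Python sketch. The key coordinates are a simplified model of a US QWERTY layout with an assumed row stagger, and the scoring formula is just one plausible choice; none of it is taken from TX Spell .NET.

```python
import math

# Letter rows of a US QWERTY keyboard with an assumed horizontal offset
# (in key widths) for each row.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]

# Map each letter to an (x, y) position measured in key widths.
KEY_POS = {
    ch: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def key_distance(a, b):
    """Euclidean distance between two letter keys, in key widths."""
    (x1, y1), (x2, y2) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return math.hypot(x1 - x2, y1 - y2)


def slip_likelihood(typed, intended):
    """Hypothetical score in (0, 1]: closer keys are more likely slips."""
    return 1.0 / (1.0 + key_distance(typed, intended))


if __name__ == "__main__":
    print(key_distance("a", "s"), key_distance("a", "l"))
    print(slip_likelihood("s", "a"), slip_likelihood("l", "a"))
```

In this model "S" is one key width away from "A" while "L" is eight key widths away, so a typed "S" is rated as a far more likely slip for an intended "A". How such a score is combined with other factors, such as phonetic matches, is exactly the weighting problem described above.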
If you want to build something exceptional, then do something exceptional.
Following this lead, our TX Spell .NET team analyzed internal chat logs for misspelled words and typos. Chat histories are very useful because we type fast when chatting and don't necessarily correct typos before sending a message. The analysis revealed a varied picture, with several different factors playing a role.
This is just a very simple overview of the approaches to creating appropriate suggestions. All of these results are or will be implemented in TX Spell .NET. You can focus on your core business while we take care of the word processing part.