𝗗𝗦𝗔 𝗜𝗡 𝗧𝗛𝗘 𝗪𝗜𝗟𝗗 - 𝗣𝗔𝗥𝗧 𝟮
Today I want to write about spell-checking algorithms, which you frequently come across as a 𝑟̼𝑒̼𝑑̼ 𝑠̼𝑞̼𝑢̼𝑖̼𝑔̼𝑔̼𝑙̼𝑦̼ 𝑙̼𝑖̼𝑛̼𝑒̼ under a misspelt word; once again, the implementation is a͟͞b͟͞s͟͞t͟͞r͟͞a͟͞c͟͞t͟͞e͟͞d͟͞ from you. Not that you 'need' to know, but knowledge is useful.
Before that, for the sake of closure, and continuing from the last post of this series, I want to note a few more things about the "diff" algorithm.
Firstly, you can handle big files by breaking them into smaller parts at points of common content (which can be found by brute force); you then find the diff within each non-matching part and continue with the next, as in the sketch below.
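In code, that splitting idea might look like this hypothetical sketch (picking a line that occurs exactly once in both files as the anchor; this same trick is also the heart of the 'patience diff' variant):

from collections import Counter

def split_on_anchor(a_lines, b_lines):
    # Brute force: find a line occurring exactly once in both files,
    # and split both files around it. Each half can then be diffed
    # independently, recursing until no anchor remains.
    ca, cb = Counter(a_lines), Counter(b_lines)
    for i, line in enumerate(a_lines):
        if ca[line] == 1 and cb.get(line, 0) == 1:
            j = b_lines.index(line)
            return (a_lines[:i], b_lines[:j]), line, (a_lines[i + 1:], b_lines[j + 1:])
    return None  # no unique common line; diff this chunk as a whole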
Also, running it per word instead of per line of the two files would consume far more space, since there are far more words than lines to compare.
The Hunt-McIlroy approach is basically the LCS (longest common subsequence) solution. It modifies the standard algorithm to improve on space complexity at the cost of time complexity, though it regularly beats its worst case on typical inputs. This is done by not storing the whole DP table, instead working with it on the fly, plus some more 'clever' tricks to construct the LCS. Details in "The Hunt-Szymanski Algorithm for LCS" and "An Algorithm for Differential File Comparison".
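For reference, the baseline it optimizes, the textbook LCS dynamic program, looks like this (a sketch of the standard DP, not the Hunt-McIlroy refinement itself):

def lcs_length(a, b):
    # Classic O(m*n) time *and* space table; Hunt-McIlroy's whole point
    # is to avoid materializing all of this.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]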
Myers' algorithm works by recursively finding the middle 'snake' of the two sequences: a central stretch of matches lying on a shortest edit script. Once found, only that match is remembered, and the subsequences before and after it are compared recursively in turn. This approach has the advantage of occupying only O(m+n) memory, and it executes in O((m+n)d) time, d being the length of the shortest edit script. (A relevant blog series by James Coglan.)
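A minimal sketch of the greedy core of Myers' algorithm, computing only the length d of the shortest edit script (the full algorithm also recovers the script itself, and the linear-space variant adds the middle-snake recursion described above):

def myers_distance(a, b):
    # For each cost d, track the furthest-reaching x on every
    # diagonal k = x - y; matches ("snakes") are followed for free.
    n, m = len(a), len(b)
    v = {1: 0}  # furthest x per diagonal, seeded for the first step
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]      # step down: insertion from b
            else:
                x = v[k - 1] + 1  # step right: deletion from a
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1  # follow the snake
            v[k] = x
            if x >= n and y >= m:
                return d  # length of the shortest edit script
    return n + m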
Now, spell checkers:
𝙀𝙖𝙧𝙡𝙮 𝘿𝙚𝙫𝙚𝙡𝙤𝙥𝙢𝙚𝙣𝙩𝙨:
Glantz and Blair's method flagged misspelled words but couldn't suggest corrections. It used a large wordbank of correct words and compared each typed word against it.
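A minimal sketch of that wordbank idea (the word list here is a tiny stand-in; real programs loaded a large dictionary):

WORDBANK = {"the", "quick", "brown", "fox"}  # stand-in for a real dictionary

def flag_misspellings(text):
    # Flag every token that is not in the bank of known-correct words.
    return [w for w in text.lower().split() if w not in WORDBANK]

print(flag_misspellings("the quikc brown fox"))  # -> ['quikc']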
Ralph Gorin's program went further and suggested plausible correct spellings within an edit distance of one, an 'edit' being a single insert, delete, or change. Gorin's approach was significant, but it could not handle spelling errors more than one edit away.
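In the spirit of Gorin's approach (a sketch, not his actual code), candidates one edit away can be generated and filtered against the wordbank:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # Every string one insert/delete/change away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    changes = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + changes + inserts)

def suggest(word, wordbank):
    # Keep only candidates that are real words.
    return sorted(edits1(word) & wordbank)

print(suggest("quik", {"quick", "quirk", "brown"}))  # -> ['quick', 'quirk']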
𝙇𝙚𝙫𝙚𝙣𝙨𝙝𝙩𝙚𝙞𝙣 𝘿𝙞𝙨𝙩𝙖𝙣𝙘𝙚 𝘼𝙡𝙜𝙤𝙧𝙞𝙩𝙝𝙢:
A mathematical recurrence defining the edit distance between two strings.
Its naive recursive evaluation is exponential in the string lengths, making it impractical for direct use in spellchecking.
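Written directly from the recurrence, the impracticality is easy to see: every mismatch spawns three recursive calls (a sketch, not any historical implementation):

def lev(a, b):
    # Base cases: distance to an empty string is the other's length.
    if not a: return len(b)
    if not b: return len(a)
    if a[0] == b[0]:
        return lev(a[1:], b[1:])  # first characters match, no cost
    return 1 + min(
        lev(a[1:], b),      # delete a[0]
        lev(a, b[1:]),      # insert b[0]
        lev(a[1:], b[1:]),  # change a[0] into b[0]
    )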
𝙒𝙖𝙜𝙣𝙚𝙧-𝙁𝙞𝙨𝙘𝙝𝙚𝙧 𝘼𝙡𝙜𝙤𝙧𝙞𝙩𝙝𝙢:
Utilizes dynamic programming to compute edit distance efficiently.
Considerably more efficient than the naive recursion: O(m·n) time, by filling a table instead of recomputing overlapping subproblems. Not very impressive by today's standards, but they were the first ones to do it.
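A sketch of the Wagner-Fischer table (each cell reuses its three already-computed neighbours, which is what kills the exponential blowup):

def wagner_fischer(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): dp[i][0] = i  # delete everything
    for j in range(n + 1): dp[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # change / keep
    return dp[m][n]

print(wagner_fischer("kitten", "sitting"))  # -> 3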
𝕄𝕠𝕕𝕖𝕣𝕟 𝕊𝕡𝕖𝕝𝕝𝕔𝕙𝕖𝕔𝕜𝕖𝕣𝕤:
Examples include Word Check and Microsoft Word in the '80s and '90s, which built on these algorithms.
Today's spellcheckers go further, pairing these distance measures with context understanding and techniques like natural language processing.
credits: this video from b001 and some relevant web articles
