Building Open Source Tamil Spellchecker – Released Iyal Spellchecker
Iyal Tamil Spellchecker
I am working on a Free/Open Source Tamil Spellchecker and have released it as Iyal at https://iyal.kaniyam.ca
Iyal means Prose/Text in Tamil. (It is my daughter's name too.)
Sharing a few notes here.
A good Free/Open Source Tamil Spellchecker has been a dream of mine for decades. I started exploring this around 2020.
I realized that we need a huge word bank as a base. For a year, I collected words from blogs and websites, 600+ sites in all.
Collecting a huge word list
Wrote a Python script to do the following:
- Download recent articles
- Extract the text
- Clean the text: remove HTML keywords/symbols, English words, numbers
- Add the unique words to a text file, with a word count
- Increment the count if the word already exists
- Create a frequently used word list from the words with 50+ usages
- Create a Bloom filter model file from this frequently used word list
Now, these files become the base asset for the spellchecker.
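The cleaning and counting steps above can be sketched roughly as below. This is only a minimal sketch, not the actual Iyal script: it assumes the article HTML is already downloaded, treats any run of characters from the Unicode Tamil block (U+0B80 to U+0BFF) as a word, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Crude HTML tag stripper; a real scraper would likely use an HTML parser.
TAG_RE = re.compile(r"<[^>]+>")
# Runs of characters from the Unicode Tamil block (assumption: this is
# good enough to drop English words, numbers and symbols).
TAMIL_WORD_RE = re.compile(r"[\u0B80-\u0BFF]+")


def extract_tamil_words(html):
    """Strip tags, then keep only the Tamil-script runs."""
    text = TAG_RE.sub(" ", html)
    return TAMIL_WORD_RE.findall(text)


def update_counts(counts, html):
    """Add new words and increment the count of words already seen."""
    counts.update(extract_tamil_words(html))


def frequent_words(counts, min_count=50):
    """Words seen at least min_count times, sorted for stable output."""
    return sorted(word for word, n in counts.items() if n >= min_count)


counts = Counter()
update_counts(counts, "<p>தமிழ் hello 123 தமிழ் மொழி</p>")
```

The threshold of 50 matches the note above; everything else here (regexes, names) is an assumption.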
We now have 150819 unique, frequently used words in our collection. See the word list here – https://github.com/KaniyamFoundation/iyal-tamil-spellchecker/tree/main/collect_words
Backend
A Bloom filter can quickly tell whether a word is present in a huge word list. A BK-Tree can suggest similar words.
With these two, I built a small spellchecker in Python.
Added Flask to expose it as an API and to make a web version.
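To show the idea behind the Bloom filter lookup, here is a small standard-library sketch. It is not the model file format or the library used in Iyal; the class name, the bit-array size and the number of hashes are all assumptions. One thing to keep in mind: a Bloom filter never misses a word that was added, but it can occasionally say "present" for a word that was not.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus several hash positions."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive num_hashes positions by salting SHA-256 with a counter byte.
        data = word.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True only if every position is set; false positives are possible.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


word_bank = BloomFilter()
word_bank.add("பேருந்து")
```

With this, `"பேருந்து" in word_bank` is a constant-time check no matter how many words were added.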
Front End
Used simple JavaScript and HTML to build a web interface. Added a word counter and “In-Progress/DONE” messages. Huge content is processed in small chunks to make sure the site does not break.
Pre-filters
There are three files.
rightwordlist.txt – stores the words that are always right
wrongwordlist.txt – stores the words that are always wrong
replacements.txt – holds custom dictionary-based replacements, like suggesting good Tamil words for English words. Example – பஸ் | பேருந்து
These three files are processed as the first filter.
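A rough sketch of how such a first filter might work is below. The "word | replacement" line format comes from the example above; the function names, the return values and the order in which the three lists are consulted are my assumptions, not the actual Iyal code.

```python
def parse_word_list(lines):
    """One word per line; blank lines are ignored."""
    return {line.strip() for line in lines if line.strip()}


def parse_replacements(lines):
    """Parse 'source | target' lines into a dict; skip malformed lines."""
    table = {}
    for line in lines:
        if "|" not in line:
            continue
        source, target = (part.strip() for part in line.split("|", 1))
        if source and target:
            table[source] = target
    return table


def prefilter(word, right_words, wrong_words, replacements):
    """Return (verdict, suggestion); verdict is None if no list decides."""
    if word in replacements:
        return "replace", replacements[word]
    if word in right_words:
        return "right", None
    if word in wrong_words:
        return "wrong", None
    return None, None


replacements = parse_replacements(["பஸ் | பேருந்து"])
verdict = prefilter("பஸ்", set(), set(), replacements)
```

In real use, the `lines` arguments would come from opening the three text files with UTF-8 encoding.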
Adding LanguageTool.org for simple grammar checking
The spellchecker is working well, powered by the big word list. Still, there are many things to improve. We need to add a Sandhi checker and a basic grammar checker.
Some 15 years ago, Elanchezhiyan and Prof Ilantamil from Malaysia, Thamizha Community, added many basic Tamil grammar rules to LanguageTool.org. It is a generic grammar engine for all languages.
Fortunately, it is still a maintained project and works well when hosted locally.
Installed LanguageTool and integrated its checks/suggestions with the current Iyal Spellchecker.
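A locally hosted LanguageTool server can be queried over HTTP. The sketch below uses its `/v2/check` endpoint with the `text` and `language` form parameters. The server address (localhost, port 8081) and the `ta-IN` language code are assumptions about the local setup, and the helper names are made up; the actual Iyal integration may look different.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of a locally hosted LanguageTool server.
LT_URL = "http://localhost:8081/v2/check"


def build_request(text, language="ta-IN"):
    """Build the form-encoded POST request for /v2/check."""
    data = urllib.parse.urlencode({"text": text, "language": language})
    return urllib.request.Request(LT_URL, data=data.encode("utf-8"), method="POST")


def extract_suggestions(response):
    """Reduce the JSON response to offset, length, message and suggestions."""
    results = []
    for match in response.get("matches", []):
        results.append({
            "offset": match["offset"],
            "length": match["length"],
            "message": match.get("message", ""),
            "suggestions": [r["value"] for r in match.get("replacements", [])],
        })
    return results


def check_text(text):
    """Send text to the local server (needs the server to be running)."""
    with urllib.request.urlopen(build_request(text), timeout=10) as resp:
        return extract_suggestions(json.load(resp))
```

Only `check_text` touches the network; the other two helpers can be tried without a running server.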
Adding Tamilinaiya Vaani Spellchecker
Around 2020, Mr. Neechalkaran, creator of the Vaani Spellchecker, released a mini version of his spellchecker as open source, as part of a grant program by Tamil Virtual Academy. Code is here – https://github.com/Tamil-Virtual-Academy/Tamilinaiya-Spellchecker
It is a rule-based spellchecker. All the rules are in the DB.json file. The code is in C#. My friends Manik and Ashok Ramachandran helped to port it to Python. The port is here – https://github.com/tshrinivasan/Tamilinaiya-Spellchecker/tree/master/PythonPort
At that time, we made it a command-line spellchecker in Python. Thanks to Neechalkaran and Tamil Virtual Academy for the nice work and for releasing it as Free/Open Source Software. It would be so good if all government and university sponsored projects, all over the world, were released as Free/Open Source software.
The word bank based Iyal spellchecker is good. But, as Tamil is an ever-growing language, collecting and adding all the words in Tamil takes a lot of time. So I thought of integrating Tamilinaiya Vaani into Iyal as a preprocessor. It works as expected.
Suggestions
BK-Tree is a simple algorithm to suggest similar words. When a word is not found in the word bank, the BK-Tree suggests nearby words. When Tamilinaiya Vaani has a suggestion, it is added. LanguageTool suggestions are also added to the suggestion menu.
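For readers new to BK-Trees, here is a small sketch of the idea: each child is stored under its edit distance from its parent, so a search can skip whole branches whose distance is too far from the query's. It assumes plain Levenshtein distance over Unicode code points (a Tamil letter with a vowel sign counts as two characters), and the class and method names are illustrative, not taken from the Iyal code.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self):
        self.root = None  # node = (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance=1):
        """Words within max_distance, nearest first."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = edit_distance(word, node_word)
            if d <= max_distance:
                results.append((d, node_word))
            # Triangle inequality: only these branches can hold matches.
            for dist, child in children.items():
                if d - max_distance <= dist <= d + max_distance:
                    stack.append(child)
        return [w for _, w in sorted(results)]


tree = BKTree()
for known in ["பேருந்து", "தமிழ்", "மொழி"]:
    tree.add(known)
suggestions = tree.search("பேருந்த", max_distance=1)
```

Here `suggestions` holds the known words within one edit of the misspelt input.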
Handling huge content
When huge content is pasted, it is divided into multiple chunks (200 words per chunk) and then processed, to make sure the server stays alive. A timer is added to show how long it will take to finish processing all the words.
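The chunking and the time estimate can be sketched as below. This is written in Python only to show the logic; in Iyal the splitting may well happen in the browser JavaScript. The function names and the simple "average time per finished chunk" estimate are my assumptions.

```python
def chunk_words(text, chunk_size=200):
    """Yield the text in pieces of at most chunk_size words."""
    words = text.split()
    for start in range(0, len(words), chunk_size):
        yield " ".join(words[start:start + chunk_size])


def estimate_remaining_seconds(elapsed, done_chunks, total_chunks):
    """Estimate time left from the average time per finished chunk."""
    if done_chunks == 0:
        return None  # nothing finished yet, so no estimate
    return elapsed / done_chunks * (total_chunks - done_chunks)


sample_text = " ".join(f"w{i}" for i in range(450))
chunks = list(chunk_words(sample_text))
```

Each chunk would then be sent to the server as its own request, one after another.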
Architecture
[ USER UI (Vanilla JS) ]
|
| (Batch Streaming POST)
v
[ FLASK BACKEND (app.py) ]
|
+--- 1. CUSTOM OVERRIDES (whitelist.txt, blacklist.txt, replacements.txt)
|
+--- 2. L1 CACHE (Bloom Filter: Instant Dictionary Check)
|
+--- 3. L2 ENGINE (Tamilinaiya Vaani: Morphological Rule-Check)
|
+--- 4. L3 LanguageTool Suggestions
|
+--- 5. L4 FALLBACK (BK-Tree: Fuzzy Similarity Search)
|
v
[ JSON RESPONSE ] --> (UI Highlight / Suggestion Menu)
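The layering in the diagram can be read as a chain where each layer only runs if the earlier ones did not settle the word. A hedged sketch of that control flow follows. The five callables are hypothetical stand-ins for the real components, and the exact order and merging of suggestions in app.py may differ from this.

```python
def check_word(word, overrides, in_dictionary, rule_check,
               grammar_suggest, fuzzy_suggest):
    """Run one word through the layers, top to bottom."""
    # 1. Custom overrides decide first, if they have an opinion.
    decided = overrides(word)
    if decided is not None:
        return decided
    # 2. Bloom filter: a known word needs no further work.
    if in_dictionary(word):
        return {"word": word, "ok": True, "suggestions": []}
    # 3. Rule engine: may accept the word or offer corrections.
    valid, rule_suggestions = rule_check(word)
    if valid:
        return {"word": word, "ok": True, "suggestions": []}
    # 4 and 5. Collect grammar and fuzzy suggestions, dropping duplicates
    # while keeping the order in which they were found.
    merged = list(rule_suggestions) + list(grammar_suggest(word)) + list(fuzzy_suggest(word))
    return {"word": word, "ok": False, "suggestions": list(dict.fromkeys(merged))}


result = check_word(
    "பேருந்த",
    overrides=lambda w: None,
    in_dictionary=lambda w: False,
    rule_check=lambda w: (False, ["பேருந்து"]),
    grammar_suggest=lambda w: [],
    fuzzy_suggest=lambda w: ["பேருந்து"],
)
```

Keeping the cheap checks on top means most correct words never reach the slower layers.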
Live at iyal.kaniyam.ca
With the current design, I am happy to release the Iyal Tamil Spellchecker at https://iyal.kaniyam.ca
Code – https://github.com/KaniyamFoundation/iyal-tamil-spellchecker
The current version is 0.0.3. It is still in beta.
Tamil scholars may find this spellchecker an elementary one. Still, it is a good working version. Give it a try. Test it with the Tamil content you write or read. If you have any suggestions for improvement, raise an issue on GitHub. If you have code contributions, send a PR.
Send a mail to [email protected] with your feedback.
What next?
- Keep adding more words to the word bank
- Check and remove any wrong words from the word bank
- Add more replacement words
- Add any more available open source spellcheckers, as layers.
- Make the site secure and robust
- Make it fully HTML compatible. Currently it works only for plain text and basic HTML formatting like bold, italic, and headings.
- Make extensions for browsers, word processors, editors across all operating systems.
Will keep working on them. Please contribute to improve these.
Read all the daily notes on building the Tamil spellchecker.
- Study notes on open-tamil spellchecker – day 1
- Building Tamil Spellchecker – Day 2 – Bloom Filter to quick query on dataset
- Building Tamil Spellchecker – Day 3 – Collecting all Tamil Nouns
- Building Tamil Spellchecker – Day 4 – Shall we collect ALL Tamil Words?
- Building Tamil Spellchecker – Day 5 – started collecting ALL Tamil Words
- Building Open Source Tamil Spellchecker – Day 6 – How fast is bloom filter for 24 lakh words?
- Building Open Source Tamil Spellchecker – Day 7 – Scrapping websites to get more words
- Building Open Source Tamil Spellchecker – Day 8 – Porting from C# to Python
- Building Open Source Tamil Spellchecker – Day 9 – Ported from C# to Python
- Building Open Source Tamil Spellchecker – Day 10 – Released Iyal Tamil Spellchecker
#bkTree #bloomFilter #spellchecker #tamil #tamilinaiyaVaani