In my last post, I wrote about the static site generator I wrote for this website, mostly as a labor of love. As part of this effort, I wanted to add a number of automated checks to do things such as detecting broken links and also spellchecking.
There are free command-line tools which will automatically check the spelling of words such as aspell
, but in practice I find that they can be pretty terrible because they report too many false positives. There are many technical terms that are not in the aspell
dictionary. Because I don't always write in a very formal style, I tend to use neologisms which are not in every dictionary. It also tends to report company names and even the last names of non-English indivuals as spelling mistakes. It's not uncommon to get something like 10 or even 20 false positives for a blog post.
I have 174 markdown documents in total on this blog, and ideally I would like to be able to spellcheck them all and automate the process in such a way that my deploy script will flag errors and refuse to deploy if a spelling error is found. I could create a custom word list for aspell
and add thousands of custom words to it, but I would probably still run into more false positives in the future. Another issue is that aspell
is just a spell checker. I know from experience that I often end up with unintentional word repetitions such as "the the" or "is is" as I edit text, introducing new errors without noticing.
Ultimately, it's hard to completely automate spellchecking, and if you want more automated grammar checking, it feels like the best tool for the job would be a neural network. In that spirit I tried to use a smaller Llama model with a custom prompt. Unfortunately, this MacBook Air only has 16GB of RAM and the models that can fit in this much memory are not very good. The resulting output was nonsense. I could try to perform an API call to a bigger model, but that requires a network connection and a subscription, which makes it not my favorite solution. I feel like we might see more capable LLMs that can operate with less memory in the future, but for now they're not up to this task.
Something that occurred to me is that it's easy to create a list of misspelled words and patterns that should automatically be flagged. This is obviously not a perfect solution, but I thought I could pretty easily Google and find a list of common English misspellings. I could also add a list of word repetitions to be automatically flagged to this. I implemented the matching using a keyword trie, and immediately found many errors in my existing blog posts that way:
and and
alot
anyways
betweeen
cababilities
dischordant
does't
espensive
imlementation
publically
rythm
unfortunatly
logarithic
likelyhood
persistance
probaby
threadmill
the the
I knew there were probably other errors in my blog posts that were not in my list, so I decided to compile a list of all unique words, sort it, and write it to a text file so that I could visually inspect everything. In total there were about 9000 words including alternate spellings, and sure enough, I found several misspelled words in there, which I added to the list.
As the title of this post hinted, this is pretty low-tech and it obviously won't catch all errors. However, it's nice because it can be reliably automated, it does catch many errors and it doesn't produce false positives, at least not at the scale of one single person always writing in English. What's also nice is that I can curate a custom list of errors I've made in the past, so that they'll be automatically flagged in the future. I'm still going to paste this post into Claude and ask it to check my grammar before publishing it, but now I have a tool that can guard any edits I make to all existing documents as well.