Okapi, or “What if ripgrep Could Edit?”
66 points by buffalo7
Interesting circle of history.
Grep was produced in a night based on the source code of Ed.
Grep is in fact short for the ed command g/<regular expression>/p shortened to g/re/p and then dropping the slashes.
What is the value of this over ed, which comes with most Unix-like systems?
ed Translations
| Okapi Command | GNU ed Command | Explanation |
|---|---|---|
| okapi III | /III/ | Literal string match. Identical in both. |
| okapi "Dan[^l ]\b" | /Dan[^l ]\>/ | Uses \> for the end-of-word boundary constraint. |
| okapi "Mich\wl" -e "Michel" | v/Michel/g/Mich\wl/ | v skips lines containing "Michel", then g searches the remainder. |
| okapi Fli -c ..15 | /^.\{,14\}Fli/ | ^ anchors the search, \{,14\} allows up to 14 characters before Fli. |
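The v//g// composition in the third row can be approximated with two chained greps (a rough sketch of the same filter-then-search idea, not an exact ed equivalent; the sample file is made up):

```shell
# Sample names with one scanno: "Michal" where "Michel" was meant.
printf 'Michel Dupont\nMichal Dupont\nMichael Smith\n' > names.txt

# ed's v/Michel/g/Mich\wl/ skips lines containing "Michel" and then
# searches the remainder; grep -v piped into grep -E is analogous.
# (POSIX ERE has no \w, so [[:alnum:]_] stands in for it.)
grep -v 'Michel' names.txt | grep -E 'Mich[[:alnum:]_]l'
# prints: Michal Dupont
```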
Neat trick with v/REGEX/g/REGEX2/, I didn't know about that!
I will admit to not being that familiar with ed, but AFAIK it doesn't allow multi-selection. It's line-by-line, right? My process is to search for things that might need fixing, then find classes of issues within that set of lines in the virtual buffer. I can repair them all with multi-select once I know what they are, but it would be very fiddly (impossible) to enumerate them ahead of time, and error-prone to attempt to list the replacements. I gave some examples in the blog post.
Note also that, if you want, you can provide the --columns flag a discontiguous set (..15,20..25,80..). I've used that a couple of times, not sure how you'd achieve that with GNU regex.
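For what it's worth, a discontiguous column set can be faked in plain grep with alternation over prefix lengths, though it gets ugly fast. A sketch, assuming ..15 means the match must start by column 15 (okapi's exact column semantics may differ):

```shell
# Test lines: "Fli" at column 5, column 17, and column 21.
printf 'abcdFli\n0123456789012345Fli\n01234567890123456789Fli\n' > cols.txt

# ..15,20..25 as alternation: up to 14 characters before the match,
# or between 19 and 24 characters before it.
grep -E '^(.{0,14}|.{19,24})Fli' cols.txt
# prints:
# abcdFli
# 01234567890123456789Fli
```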
I did a project like this around 2008 https://theknowledgeexchangeblog.com/2014/02/25/a-unique-insight-into-uk-new-towns/ ... the New Towns Record was originally scanned documents that were pushed through some proprietary hypertext system and released on a set of CDs in '96 by our commercial library service. A decade later, I got the job of reformatting it as html for online publication. It was about 5GB of text; not much now but as much as my PC could handle back then. I found that the previous releases had been full of repeated "scannos" and asked to spend a couple of weeks fixing those up too.
What I did was more like: count all unique space-separated words in the text-without-markup, filter out any that were in a dictionary (which I added to over time), then start at the top of the list replacing the most frequent typos first with emacs - replacing similar errors many times was way quicker than proofreading the text, tho I did that too. The okapi author says they "needed the precision of regex combined with the power of a text editor."...but, that's what dired-do-find-regexp-and-replace does in emacs.
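That counting pass is simple enough to sketch with standard tools (the sample text and dictionary below are invented for illustration):

```shell
# Tiny stand-ins for the OCR'd text and the known-good word list.
printf 'the quick brown fox\nthe qvick brown fox\nthe qvick fox\n' > ocr.txt
printf 'the\nquick\nbrown\nfox\n' > dict.txt

# One word per line, drop dictionary words, then rank what is left
# by frequency: the most common survivors are likely scannos.
tr -s '[:space:]' '\n' < ocr.txt | grep . | grep -vxFf dict.txt \
  | sort | uniq -c | sort -rn
# prints a line like "2 qvick" (uniq -c pads with leading spaces)
```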
This was also one of my first uses of git, which had only recently been released - it let me quickly checkpoint what I was doing after each round of edits, where svn was just too slow.
I felt at the time there was a gap in the market for a spellchecker that finds likely OCR errors instead of likely typos (but it's very niche)
Author here, that project does indeed sound very similar. I really like your "token dictionary" approach. I've started down that path by parsing the names and looking for tokens which only appear 1–2 times. This has identified many errors, but it doesn't address repeated scannos (Edvd instead of Edwd), so I'm going to see about running this other sort of filter. Thanks for the suggestion!
As someone coming from the vim side of things, I wasn't aware of dired-do-find-regexp-and-replace. Thanks for the pointer! Even if I'd known about that, though, I still think I'd have wanted to build this. For one thing, it allows you to exclude lines that would otherwise match the regex due to their location in the file (-c) or because they match a different regex (-e). Also, from a space efficiency perspective, most of my matches are 1–2 per file across tens to hundreds of files. Putting the path above the matched lines would make much less efficient use of vertical space in my use case.
But most importantly, as described later on in the post, having all the items in a Sublime buffer allowed me to overlay the line image, which really speeds up the proofreading cycle.
Oh, and I wrote another little tool to wrap okapi that finds common scannos. It has helped a lot, although it's far from foolproof.
Cool. I think there are some GUI editors that let you edit from inside the search results window. Was it BBEdit?
Emacs lets you edit grep results (whether they came from rg/ag/grep/whatever) with the wgrep ("writable grep") package.
Also if you use the Emacs builtin occur which is a bit like grep, you can "toggle edit mode" to edit the results buffer, with changes going back to the original files.
(And in a similar fashion, the builtin dired shows a directory listing and if you turn on wdired ("writable dired") you can edit the buffer and thus rename files, using regex or keyboard macros or rectangle editing or multiple cursors or an llm or piping through a shell command or any of the many other methods emacs provides for turning text into text. It is emacs at its most emacs.)
If you don’t have Emacs, https://man.archlinux.org/man/extra/moreutils/vidir.1.en does the same with any editor.
Don't ignore the i here, I was reading the manpage for vdir and was wondering how this would help me edit directories.
Off topic, I guess, but I recently found that dired buffers can be editable with dired-toggle-read-only.
Basically just open a directory (or use find-dired), toggle read-only, edit the filenames and access modes in the buffer, and then toggle read-only back on to save all the changes.
"oh yeah... good old M-x butterflies"
Jokes aside, cool emacs workflows you cover in such a small amount of text.
Zed allows this too, I found it handy a few times when I was still using it.
Ah, interesting! This is indeed the closest I've seen to what I wanted to do with this tool. Unfortunately, there's some weirdness with finding within the found lines, since it's already in find mode. So Cmd-D selects repeated instances, even across files (good!), but Cmd-E followed by Cmd-G doesn't change the search query to what you had selected; it just keeps looking for the next instance of the original query (grrr).
Zed can definitely come close. I've been using it to do the same kind of edits while I digitize a cookbook from the 80s, which is itself a collection of copies. Your tooling setup looks much better though. I'm really amazed by the sublime plugin that shows the source line! That's much better than my flipping back and forth between the scanned png/pdf and text editor.
Yes, I was pretty amazed that it all worked, TBH! If I had it to do again, I might explore AWS Rekognition, which I believe provides OCRed text as well as bounding boxes. I'm not sure about the quality vs olmOCR, but if it were good enough, it could simplify matters.
Thanks! I feel like there must be someone who has done something similar in a GUI editor before, but I haven't found one with that feature. I was using BBEdit 25 years ago, but haven't touched it in quite a while. There is a multi-file search view, which is quite nice, but it still only allows you to edit one file at a time.
I know Sublime shows results in a text window, but while it's editable, it doesn't seem to actually commit any changes you make.
> it doesn't seem to actually commit any changes you make.
It must be editable somehow, because the feature was cloned and made available in Vim/NeoVim as CtrlSF.vim
I've found the FindResultsApplyChanges package, which claims to do this. But it's not installing on 4200. And the Find Results buffer gets appended to on subsequent finds, so I'd be scared to use this for anything serious. Could easily find yourself clobbering previous changes without realizing it.
This is quite neat. What I've done before is just ripgrep -l and xargs the output into either Neovim (to :bufdo) or just sed. Maybe there was already a tool to do this, but I liked the multiview in one buffer.
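That pipeline looks roughly like this (sketched with grep -rl standing in for rg -l, and assuming GNU sed's -i; the files and the Edvd/Edwd scanno are made up for the demo):

```shell
# Two files sharing the scanno "Edvd" for "Edwd".
mkdir -p demo
printf 'Edvd Smith\n' > demo/a.txt
printf 'Edvd Jones\nJane Doe\n' > demo/b.txt

# List only the files that match, then let sed rewrite them in place.
grep -rl 'Edvd' demo | xargs sed -i 's/Edvd/Edwd/g'

cat demo/a.txt demo/b.txt
# prints:
# Edwd Smith
# Edwd Jones
# Jane Doe
```

The trade-off versus okapi's approach is exactly the one mentioned above: this commits the replacement blindly, with no chance to eyeball each match first.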
I wonder how much time has been expended by humanity in the act of cleaning up OCR'd data? I've had to do it for a couple projects (and made some rough tools to help, somewhat like Okapi though not as good), I have several friends who have had to do it for projects, and I've read many a blog/forum post about others doing it.