After almost two months, I’ve finished the first version of #BeanBeaver: it parses grocery receipts into #Beancount records.
https://github.com/Endle/beanbeaver
I hope I’m not the only person who cares about a grocery-by-item breakdown 🦫
@nobodyinperson More than "might" ;-) This is essentially one large chunk of the same puzzle. @zhenboli would you be interested in collaborating in this space? I just recently started work on a project called acceptarium, aiming to do roughly the same thing you seem to be doing but in a slightly more modular way (so e.g. how assets are stored, which OCR engine and extraction LLM are used, and which output format is targeted are all adjustable).
Hi Caleb, I'm so excited to find that I'm not the only person interested in item-level breakdowns for #beancount or other #plaintextaccounting tools.
My current progress: given this receipt,
it generates the following beancount output:
In the first stage, I'm using #PaddleOCR
https://github.com/PaddlePaddle/PaddleOCR
Their docs say they support Windows, macOS and Linux. For simplicity, I wrapped the Python dependency into podman/docker, so it's Linux-only for now. If there are potential users other than me, I guess it won't be too hard to make it cross-platform.
https://github.com/Endle/beanbeaver-ocr
Before PaddleOCR, I first tried #docTR
https://github.com/mindee/doctr
Some Reddit posts claimed that docTR was the best. It worked pretty well for English (Latin characters), but it doesn't support Chinese: it would try to recognize a Chinese character as a combination of Latin characters, with relatively high confidence.
PaddleOCR supports Chinese recognition, but I switched it to English-only mode. For the T&T receipt I showed, PaddleOCR assigns very low confidence to the Chinese words (https://github.com/Endle/beanbeaver/blob/master/demo/receipt_groups/tnt_20251202/receipt_20260217_200222_debug.png), so beanbeaver can parse this bilingual receipt using the English parts.
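To illustrate the idea of dropping low-confidence text, here is a minimal sketch. It is not beanbeaver's actual code: the `OcrWord` type, the `keep_confident` helper, the 0.8 threshold, and the sample values are all made up for this example.

```python
from typing import NamedTuple


class OcrWord(NamedTuple):
    """Hypothetical container for one recognized text fragment."""
    text: str
    confidence: float  # recognition score in the range 0.0 to 1.0


def keep_confident(words, threshold=0.8):
    """Keep only fragments whose score is at least `threshold` (assumed value)."""
    return [w for w in words if w.confidence >= threshold]


# Made-up sample: the middle entry stands in for a misread Chinese word.
words = [
    OcrWord("GREEN ONION", 0.97),
    OcrWord("#Y", 0.31),
    OcrWord("1.99", 0.97),
]
confident = keep_confident(words)
print([w.text for w in confident])
```

With a filter like this, a misread non-English fragment simply never reaches the parser, as long as its score falls below the chosen threshold.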
@alerque PaddleOCR's output is a list of bboxes (bounding boxes).
Example: https://github.com/Endle/beanbeaver/blob/master/tests/receipts_e2e/loblaw_20260211_censor.ocr.json
I have to admit that I had no idea how to parse these bboxes. I asked codex/claude to implement an OCR parser that reads the bboxes and generates a beancount file.
I would submit a receipt to beanbeaver, then tell the AI to fix the parser logic. The most common fix is making the parser rely more on the X-Y coordinates.
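As a rough illustration of what using X-Y data can mean, the sketch below groups boxes into receipt rows by vertical position and then orders each row left to right. This is an assumption about the general technique, not the real ocr_parser: the `Box` fields, the `group_rows` helper, the 10-pixel tolerance, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical simplified bounding box for one text fragment."""
    text: str
    x: float  # left edge, in pixels
    y: float  # vertical center, in pixels


def group_rows(boxes, y_tolerance=10.0):
    """Group boxes whose vertical centers are close into the same row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        # Compare with the last box added to the current row.
        if rows and abs(box.y - rows[-1][-1].y) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, read left to right.
    return [sorted(row, key=lambda b: b.x) for row in rows]


# Made-up sample: item names on the left, prices on the right.
boxes = [
    Box("3.49", 400, 102),
    Box("MILK 2L", 20, 100),
    Box("BANANA", 20, 140),
    Box("1.27", 400, 143),
]
lines = [" ".join(b.text for b in row) for row in group_rows(boxes)]
print(lines)
```

Once an item name and its price land on the same row, matching them becomes a per-line problem instead of a search over the whole page.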
I have a private test set, which contains about 20 receipts and their expected output. I run it after every big change to catch regressions. I've added some of the redacted receipts to the public repo: https://github.com/Endle/beanbeaver/tree/master/tests/receipts_e2e
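A regression check of this kind can be as simple as diffing the parser's output against a stored expected file. The sketch below only shows the comparison step; the `diff_against_expected` helper and the sample beancount text are invented for this example and do not come from the repo.

```python
import difflib


def diff_against_expected(actual: str, expected: str) -> list[str]:
    """Return unified-diff lines; an empty list means no regression."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


# Made-up expected output for one receipt.
expected = '2026-02-11 * "Grocery"\n  Expenses:Food:Dairy  3.49 CAD\n'
# Simulated parser output after a change that broke the price.
actual = '2026-02-11 * "Grocery"\n  Expenses:Food:Dairy  8.49 CAD\n'

same = diff_against_expected(expected, expected)
changed = diff_against_expected(actual, expected)
print("\n".join(changed))
```

Printing the diff rather than just a pass/fail flag makes it easier to hand the failing case back to the AI along with the receipt.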
I didn't use the CUDA version of PaddleOCR. Its CPU model runs fast enough on my Linux PC (less than 10 seconds per receipt).
Currently the ocr_parser of beanbeaver is deterministic
https://github.com/Endle/beanbeaver/tree/master/receipt/ocr_parser