After almost two months, I’ve finished the first version of #BeanBeaver: it parses grocery receipts into #Beancount records.

https://github.com/Endle/beanbeaver

I hope I’m not the only person who cares about a grocery-by-item breakdown 🦫

@alerque this might interest you

@nobodyinperson More than "might" ;-) This is essentially one large chunk of the same puzzle. @zhenboli would you be interested in collaborating in this space? I recently started work on a project called acceptarium that aims to do roughly the same thing you seem to be doing, but in a somewhat more modular way (so, e.g., how assets are stored, which OCR engine and extraction LLM are used, and which output format is targeted are all adjustable).

https://codeberg.org/plaintextaccounting/acceptarium/

@alerque

Hi Caleb, I'm so excited to find that I'm not the only person interested in itemized breakdowns for #beancount or other #plaintextaccounting tools.

My current progress: Given this receipt

https://github.com/Endle/beanbeaver/blob/master/demo/receipt_groups/tnt_20251202/receipt_20260217_200222.jpg

It generates this beancount output:

https://github.com/Endle/beanbeaver/blob/master/demo/receipt_groups/tnt_20251202/2025-12-02_t_t_supermarket_32_70.beancount

@alerque

In the first stage, I'm using #PaddleOCR

https://github.com/PaddlePaddle/PaddleOCR

Their docs say it supports Windows, macOS, and Linux. For simplicity, I wrapped the Python dependency in a podman/docker container, so it's Linux-only for now. If there are potential users other than me, I guess it won't be too hard to make it cross-platform.

https://github.com/Endle/beanbeaver-ocr

Before PaddleOCR, I first tried #docTR

https://github.com/mindee/doctr

Some Reddit posts claimed that docTR was the best. It worked pretty well for English (Latin characters), but it doesn't support Chinese. It would try to recognize a Chinese character as a combination of Latin characters, with relatively high confidence.

PaddleOCR supports Chinese recognition, but I switched it to English-only mode. For the T&T receipt I showed, PaddleOCR assigns very low confidence to the Chinese words (https://github.com/Endle/beanbeaver/blob/master/demo/receipt_groups/tnt_20251202/receipt_20260217_200222_debug.png), so beanbeaver can parse this bilingual receipt using only the English parts.
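
To illustrate the idea of relying on confidence scores, here is a minimal sketch of dropping low-confidence detections before parsing. The `Detection` class, the `keep_confident` helper, the sample values, and the 0.8 threshold are all assumptions made for illustration; this is not PaddleOCR's output format or beanbeaver's actual code.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one OCR result (not PaddleOCR's real format)."""
    text: str
    confidence: float


def keep_confident(detections, min_confidence=0.8):
    """Return only the detections at or above the (assumed) confidence threshold."""
    return [d for d in detections if d.confidence >= min_confidence]


# Made-up sample: the middle entry stands in for a misread Chinese word.
sample = [
    Detection("ENOKI MUSHROOM", 0.97),
    Detection("JE1E", 0.31),
    Detection("2.99", 0.99),
]
kept = keep_confident(sample)
print([d.text for d in kept])
```

In practice the threshold would need tuning against real receipts, since a faded English line can also score low.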

@alerque PaddleOCR's output is a list of bboxes (bounding boxes).

Example: https://github.com/Endle/beanbeaver/blob/master/tests/receipts_e2e/loblaw_20260211_censor.ocr.json

I have to admit that I had no idea how to parse these bboxes. I asked codex/claude to implement an OCR parser that reads the bboxes and generates a beancount file.

I would submit a receipt to beanbeaver and tell the AI to fix the parser logic. The most common problem was that the parser needed to make better use of the X-Y coordinate data.
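
One common way to use X-Y data is to group boxes into receipt rows by vertical position and then order each row left to right. The sketch below is only an illustration of that general technique: the `Box` fields, the `group_into_rows` helper, the tolerance value, and the sample coordinates are hypothetical, and beanbeaver's real parser may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical simplified bounding box (real OCR output has four corners)."""
    text: str
    x: float       # left edge
    y: float       # vertical center
    height: float


def group_into_rows(boxes, tolerance=0.5):
    """Group boxes whose vertical centers are close, then read each row left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        # Same row if the center is within a fraction of the box height
        # of the previous box placed in the current row.
        if rows and abs(box.y - rows[-1][-1].y) <= tolerance * box.height:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [" ".join(b.text for b in sorted(row, key=lambda b: b.x)) for row in rows]


# Made-up coordinates: item names on the left, prices on the right.
boxes = [
    Box("2.99", 400, 101, 20),
    Box("ENOKI MUSHROOM", 20, 100, 20),
    Box("TOFU", 20, 130, 20),
    Box("1.49", 400, 131, 20),
]
lines = group_into_rows(boxes)
print(lines)
```

Pairing each item name with the price on the same row is what makes the per-item breakdown possible; skewed or crumpled receipts would need a more forgiving grouping rule.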

I have a private test set, which contains about 20 receipts and their expected output. I run it after every big change to catch regressions. I've added some redacted receipts to the public repo: https://github.com/Endle/beanbeaver/tree/master/tests/receipts_e2e
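
A regression check of this kind can be sketched as comparing the parser's output with a stored expected result for each receipt. Everything here is hypothetical: `find_regressions`, the shape of `cases`, and the toy parser are stand-ins for illustration, not beanbeaver's test harness.

```python
def find_regressions(cases, parse):
    """Return the names of cases whose parsed output no longer matches the expectation.

    `cases` maps a case name to an (ocr_input, expected_output) pair;
    `parse` is whatever function turns OCR input into beancount text.
    """
    failures = []
    for name, (ocr_input, expected) in sorted(cases.items()):
        actual = parse(ocr_input)
        # Ignore leading/trailing whitespace so a stray newline is not a failure.
        if actual.strip() != expected.strip():
            failures.append(name)
    return failures


def toy_parse(text):
    """Stand-in parser used only to demonstrate the harness."""
    return text.upper()


cases = {
    "receipt_a": ("tofu 1.49", "TOFU 1.49\n"),
    "receipt_b": ("enoki 2.99", "ENOKI 3.99"),  # deliberately wrong expectation
}
failed = find_regressions(cases, toy_parse)
print(failed)
```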

@alerque

I didn't use the CUDA version of PaddleOCR. Its CPU model runs fast enough on my Linux PC (less than 10 seconds per receipt).

Currently the ocr_parser of beanbeaver is deterministic

https://github.com/Endle/beanbeaver/tree/master/receipt/ocr_parser

@zhenboli Deterministic is a great property! Even though I'd like to experiment with non-deterministic LLM-based conversions (and some people have already reported great results), I am trying to put together a workflow where each piece has at least one deterministic alternative, so people can go that route. Wrapping a call to PaddleOCR does seem like one good option.
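
A workflow with swappable pieces could look roughly like the sketch below, where any extraction step that satisfies a small interface can be plugged in. The `Extractor` protocol, `RegexExtractor`, and `run_pipeline` are invented names for illustration and do not describe acceptarium's or beanbeaver's actual design.

```python
import re
from typing import Optional, Protocol


class Extractor(Protocol):
    """Hypothetical interface: any extraction step maps OCR text to fields."""

    def extract(self, ocr_text: str) -> dict: ...


class RegexExtractor:
    """Deterministic alternative: the same text always yields the same result."""

    TOTAL = re.compile(r"TOTAL\s+(\d+\.\d{2})")

    def extract(self, ocr_text: str) -> dict:
        match = self.TOTAL.search(ocr_text)
        total: Optional[str] = match.group(1) if match else None
        return {"total": total}


def run_pipeline(ocr_text: str, extractor: Extractor) -> dict:
    """Run whichever extractor was plugged in; an LLM-backed one could go here too."""
    return extractor.extract(ocr_text)


result = run_pipeline("T&T SUPERMARKET\nTOTAL 32.70", RegexExtractor())
print(result)
```

An LLM-backed class with the same `extract` method could replace `RegexExtractor` without touching `run_pipeline`.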