If you're looking to help with Arcalibre development, it's now a bit easier to do so! The rereading/pyreading repo on Codeberg builds two Python packages that are used by Arcalibre, but that can be developed, built, and tested separately from any other Arcalibre code.

They're both Maturin-based: thin Python wrappers around Rust logic, so that the only runtime dependency is glibc.
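For anyone who hasn't used it, the build config for a Maturin-backed package is only a few lines of `pyproject.toml`. A minimal sketch (not the repo's actual config; the `abi3` feature line is just one common choice):

```toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[project]
name = "spellsnake"
requires-python = ">=3.9"

[tool.maturin]
# Build against the stable CPython ABI so one wheel covers many Python versions
features = ["pyo3/abi3-py39"]
```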

https://codeberg.org/rereading/pyreading/


The spellsnake package provides a Python interface to spellbook, a hunspell-like spellchecking engine with minimal dependencies.

The gardenpath package provides a Python interface to the html5ever parser that *should* (but doesn't yet) allow for iterating over nodes with an ElementTree-compatible API, as well as XPath-based searching.
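To make the target concrete, the ElementTree-compatible iteration and XPath-style searching described above would look roughly like this. Since gardenpath's API isn't implemented yet, this is sketched with the stdlib `xml.etree.ElementTree` on plain XML input:

```python
import xml.etree.ElementTree as ET

# Stand-in for the tree gardenpath would build from HTML5 input
doc = ET.fromstring(
    "<html><body>"
    "<div class='post'><p>first</p></div>"
    "<div class='post'><p>second</p></div>"
    "</body></html>"
)

# ElementTree-style iteration over all nodes, in document order
tags = [node.tag for node in doc.iter()]
print(tags)  # ['html', 'body', 'div', 'p', 'div', 'p']

# The limited XPath subset that ElementTree supports natively
posts = doc.findall(".//div[@class='post']")
print([p.find("p").text for p in posts])  # ['first', 'second']
```

The stdlib only supports a small XPath subset; full XPath 1.0 over an HTML5-built tree is the part that doesn't exist off the shelf.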

For gardenpath in particular, my goal is to have a drop-in replacement for packages like html5lib and html5_parser. While they both work, html5lib is very slow for larger pages, and html5_parser is based on a deprecated HTML5 parsing library that would be good to swap out for something under active maintenance.
*Ideally* everything should work for local development if you just have PDM and Cargo up and running, *optionally* with rust-analyzer and the do-runner tool I made last night (`uv tool install do-runner`). All other dependencies should come in with `pdm sync --dev`.

My goal is to add more packages to the repo as things progress, separating out different parts of Arcalibre into standalone packages, so that you don't need to know all of the inner workings of Arcalibre to help out.

So far, it's just those two wheels, but that's a start!

Anyway, a number of folks had made kind offers to contribute, so I thought I'd mention a way it's become a bit easier! ♥

@xgranade is `lxml.html` an option? I don't know how it performs relative to `html5lib` but I've always been happy with its performance.

Of course, having some crab flavor in the pot here I'm sure allows some great Rust options too

@SnoopJ I tried, but there are some subtle issues with how it handles features that exist in HTML5 but don't have an equivalent XHTML representation. There aren't many HTML5 parsers that generate XML-like trees out there, and even fewer that are based on native code... the pure-Python ones are cool but very *very* slow for some reason. I don't think that's true of Python in general, but it seems to be for this application at least?
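A tiny stdlib-only illustration of the mismatch: HTML5 tag soup that any HTML parser accepts is a hard error to an XML parser, which is roughly the class of problem an XHTML-shaped tree runs into:

```python
import html.parser
import xml.etree.ElementTree as ET

snippet = "<p>line one<br>line two"  # valid HTML5, not well-formed XML

# The HTML parser accepts it without complaint
class TagCollector(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(snippet)
print(collector.tags)  # ['p', 'br']

# The XML parser rejects the same input outright
try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError:
    xml_ok = False
print(xml_ok)  # False
```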

@xgranade Python's just kind of the wrong place to put a parser that needs that kind of throughput, IMO. That's the kind of domain that *really* benefits from deferral to an extension module and I doubt without a really good JIT that it can be competitive if done in Python. The overhead eats you alive.

Makes sense that `lxml` is too narrow for the general case, just figured I'd ask. I often use it to parse HTML5 that is not going to give me [exactly that headache]

@SnoopJ Yeah, no, it's a good suggestion, and would greatly simplify my life if it fit the bill. I tried a bunch of different things and, while I won't claim to have totally exhausted the space, I also wasn't able to find a solution that didn't involve writing a new crate.

The hack that html5_parser uses is to take an ABI dependency on lxml and build a tree that way, but that *entirely* breaks how pip tracks dependencies when installing the final built wheel.

@xgranade my biggest love for `lxml` is its pretty-good support for XPath, even if it's a bit old? I don't really need things beyond XPath 1.0 very often; the W3C kinda nailed it on the first rev, y'know?

But it's always been happy to chew through whatever I feed it, even unreasonably large files.

Anyway, I should get more familiar with the state of the project and the challenge that merits this kind of thing before I go throwing around context-free suggestions!

@SnoopJ Yeah, that is very nice, I really wish I could go pure lxml for this...

Anyway, I went with xee-xpath for gardenpath, as that's based on an arena-allocated XML representation that's really easy to manipulate without incurring huge overheads.