Mastodawn

Yeah that's a good observation. XML's closing tags give the model structural anchors during generation — it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting the more likely the model loses track of brackets.

We see this especially with arrays of objects where each object has optional nested fields. The model will get 18 items right and then drop a closing bracket on item 19, or a invalid field of wrong type. That's why we put effort into the repair/recovery/sanitization layer — validate field-by-field and keep what's valid rather than throwing everything out.

Show thread

andrew_zhong 3d ago

Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

Show thread

andrew_zhong 3d ago

Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

andrew_zhong 3d ago

Show HN: Robust LLM Extractor for Websites in TypeScript

We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.

LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:

- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters — the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.

We got tired of rebuilding this stack for every project, so we extracted it into a library.

Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:

- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely — if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction

We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.

GitHub: https://github.com/lightfeed/extractor npm: npm install @lightfeed/extractor Apache 2.0 licensed.

Happy to answer questions or hear feedback.

https://github.com/lightfeed/extractor

GitHub - lightfeed/extractor: Using LLMs and AI browser automation to robustly extract web data

Using LLMs and AI browser automation to robustly extract web data - lightfeed/extractor

GitHub

Official	https://
Support this service	https://www.patreon.com/birddotmakeup