Real-time request logs are now live in the Byteful dashboard.

Every request your scraper fires, you see as it happens. Filter by status code, response time, or target domain, with 7 days of history on every one. Click into a request and you get the headers, timing, which proxy it routed through, and the final status.

Logging is async, so it won't add latency. Built for the people debugging the pipeline at 2am.

Live now: https://dashboard.byteful.com

#webscraping #proxies #dataengineering #devtools

Byteful Dashboard


AI agents need access to public web data, right now. Tools connected to web scraping APIs empower agents to return live data quickly. https://www.zyte.com/blog/building-robust-agentic-ai-workflows-with-rapid-web-data?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

Building robust agentic AI workflows with rapid web data


Zyte

A viral clip from Claude Code's creator put a name to something a lot of us have been circling: loop engineering. Here's why web scraping may be its best-fit domain — and what that means in practice. https://www.zyte.com/blog/now-what-exactly-is-loop-engineering-and-where-do-anthropics-fable-5-model-and-web-scraping-fit-in?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

Now what exactly is loop engineering? And where do Anthropic's Fable 5 model and web scraping fit in?


Zyte

Web scraping gets blocked by weak headers, broken sessions, poor IP reputation, fast requests, and careless proxy rotation. https://hackernoon.com/why-scrapers-fail-headers-sessions-ip-reputation-and-request-patterns #webscraping

Why Scrapers Fail: Headers, Sessions, IP Reputation, and Request Patterns | HackerNoon


It Worked on My Machine (Literally)

I have a TRMNL on my desk. If you haven’t seen one, it’s a little e-ink display from trmnl.com that shows you whatever you tell it to: your calendar, the weather (but in haiku form), a Far Side comic, a random Studio Ghibli picture. The whole device runs on plugins, and the nice thing is you can write your own. I’d been meaning to build a TRMNL plugin since I got my device, and I finally landed on an idea that was small enough to actually finish: show what I’m currently reading on StoryGraph.

Just three things, really. My profile name, what I’m currently reading, and the next couple of books in my to-read pile. That’s it. A small project. I even said the words “basic, simple plugin” out loud, which in hindsight was me daring the universe.

The plan

TRMNL plugins can fetch their data a few different ways. The one that fit was polling: TRMNL pings a URL on a schedule, gets back some JSON, and renders it with a Liquid template. So I needed a small server that returns my reading data as JSON, plus the templates to lay it out on the screen.

The catch: StoryGraph doesn’t have a public API. No tidy endpoint to call. If I wanted the data, I’d have to scrape it off my public profile page. I found a reference project, storygraph-api, that does exactly this, and it gave me the lay of the land: the URLs to hit (`/currently-reading/username`, `/to-read/username`) and the HTML structure of a book on the page.

I wanted to keep this lightweight. Plain Ruby where I could, a real framework only if I needed one. For a service with two or three JSON routes, plain Ruby plus Rack is plenty. No Rails, no Hanami, just a Rack app and Nokogiri to parse the HTML. Easy.

The first wall

Before writing a line of application code, I did the one thing I always tell other people to do: I tested the riskiest assumption first. Could I even fetch a StoryGraph page?

$ curl https://app.thestorygraph.com/profile/christine_s
HTTP 403

Hm. I added a browser User-Agent. Still 403. I added the full set of Chrome headers, the `sec-ch-ua` bits, a cookie jar, all of it. Still 403. Then I looked at the response headers and saw the actual story:

cf-mitigated: challenge
server: cloudflare

StoryGraph sits behind a Cloudflare managed challenge. My polite little `curl` request was getting waved off at the door before it ever reached their servers. And here’s the part that surprised me: it wasn’t about the headers at all. Cloudflare was fingerprinting the TLS handshake itself. Real browsers negotiate TLS in a particular, recognizable way (the cipher order, the extensions, the whole shape of the “hello”), and `curl` does it differently. You can spoof every header in the world and you’ll still look like a robot, because the giveaway happens one layer down, before any headers are sent.
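Those response headers make the block easy to detect in code, which helps when a scraper needs to tell "challenged" apart from an ordinary failure. Here is a minimal sketch; `cloudflare_challenge?` is a hypothetical helper name, and it assumes the headers arrive as a plain hash of names to values.

```ruby
# Hypothetical helper: true when the response headers carry Cloudflare's
# challenge marker. Header names are compared case-insensitively.
def cloudflare_challenge?(headers)
  normalized = headers.to_h { |name, value| [name.to_s.downcase, value.to_s.downcase] }
  normalized["cf-mitigated"] == "challenge"
end

blocked = { "cf-mitigated" => "challenge", "server" => "cloudflare" }
passed  = { "Content-Type" => "text/html", "Server" => "cloudflare" }

puts cloudflare_challenge?(blocked) # true
puts cloudflare_challenge?(passed)  # false
```

Checking `cf-mitigated` rather than only the `server` header matters because a page that loads fine through Cloudflare still reports `server: cloudflare`.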

The thing that actually worked

The fix turned out to be a tool I’d never had a reason to use before: curl-impersonate. It’s `curl` rebuilt to mimic a real browser’s TLS fingerprint exactly. Same ciphers, same curves, same handshake shape as Chrome.

$ curl_chrome136 -s -o /dev/null -w '%{http_code}' \
    https://app.thestorygraph.com/currently-reading/elliek
200
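One way to drive curl-impersonate from Ruby is to shell out with `Open3` and ask curl to append the status code after the body. This is a sketch under stated assumptions, not the plugin's actual fetcher: `fetch`, `fetch_command`, and `split_status` are hypothetical names, and it assumes the `curl_chrome136` wrapper from the command above is on the PATH.

```ruby
require "open3"

# Assumption: the curl-impersonate wrapper is installed under this name.
CURL_BINARY = "curl_chrome136"

# Build the argument list on its own so it can be inspected without running it.
# -s silences progress output; -w appends the status code on a final line.
def fetch_command(url, binary: CURL_BINARY)
  [binary, "-s", "-w", "\n%{http_code}", url]
end

# Split curl's output into [status_code, body] at the last newline.
def split_status(output)
  body, _newline, code = output.rpartition("\n")
  [code.to_i, body]
end

# Runs the command and returns [status_code, body].
# Open3 raises Errno::ENOENT if the binary is not installed.
def fetch(url, binary: CURL_BINARY)
  output, status = Open3.capture2(*fetch_command(url, binary: binary))
  raise "#{binary} exited with #{status.exitstatus}" unless status.success?

  split_status(output)
end
```

Passing the arguments as an array, rather than one interpolated string, keeps the URL out of the shell, so a username with odd characters cannot change the command.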

Two hundred. The door opened. Watching that `403` flip to `200` was easily the most satisfying moment of the whole project. The challenge wasn’t checking who I claimed to be, it was checking how I spoke, and now I had the correct vocab.

Building the actual thing

With the hard part de-risked, the rest came together quickly, which is how these things usually go once the scary unknown is gone.

The service is a small Rack app. One real endpoint, `/reads.json`, that takes a username and a limit. It fetches two pages through curl-impersonate, hands the HTML to a Nokogiri scraper that pulls out each book’s title, author, and cover, and returns a clean little JSON payload. There’s a `/health` route and a tiny index page, and that’s the whole surface area.
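For a sense of how little a plain Rack app needs, here is a rough sketch of that surface area. It is not the plugin's code: the class name `ReadsApp`, the `username` and `limit` query parameter names, the default limit of 3, and the injected `fetch_reads` callable are all assumptions made for illustration.

```ruby
require "json"
require "uri"

# A Rack-compatible app is any object whose call(env) returns
# [status, headers, body]. The scraper is injected as fetch_reads so the
# sketch runs without touching the network.
class ReadsApp
  JSON_HEADERS = { "content-type" => "application/json" }.freeze

  def initialize(fetch_reads:)
    @fetch_reads = fetch_reads
  end

  def call(env)
    case env["PATH_INFO"]
    when "/health"
      respond(200, { status: "ok" })
    when "/reads.json"
      params = URI.decode_www_form(env["QUERY_STRING"].to_s).to_h
      username = params["username"].to_s
      return respond(400, { error: "username is required" }) if username.empty?

      # Fall back to the default when limit is missing or not a number.
      limit = Integer(params.fetch("limit", "3"), exception: false) || 3
      respond(200, @fetch_reads.call(username, limit))
    else
      respond(404, { error: "not found" })
    end
  end

  private

  def respond(status, payload)
    [status, JSON_HEADERS, [JSON.generate(payload)]]
  end
end
```

Served from a `config.ru`, this would be wired up with something like `run ReadsApp.new(fetch_reads: ...)`, where the callable wraps the real fetch-and-parse step.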

A few decisions I’m happy with:

  • Caching. Scraping is slow and I didn’t want to hammer StoryGraph every time TRMNL polls. An in-memory cache with a thirty-minute TTL means repeated polls cost nothing and I stay a good citizen.
  • Failing soft. If a scrape fails, the endpoint still returns `200` with an `error` field instead of a `500`. A blank e-ink screen tells you nothing. A screen that says “couldn’t load, is the profile public?” at least tells you where to look.
  • Retries. StoryGraph occasionally drops a rapid second request, so the fetcher retries with a short backoff.
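The three decisions above compose naturally: the cache wraps the retrying fetch, and a rescue around the whole thing turns any failure into an `error` field. A minimal sketch, assuming hypothetical names (`TtlCache`, `with_retries`, `reads_payload`) and illustrative retry counts and delays that are not taken from the plugin:

```ruby
# In-memory cache with a time-to-live. The clock is injectable for tests.
class TtlCache
  def initialize(ttl_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
  end

  # Returns the cached value for key, or runs the block and stores its result.
  # A block that raises stores nothing, so failures are never cached.
  def fetch(key)
    entry = @entries[key]
    return entry[:value] if entry && @clock.call < entry[:expires_at]

    value = yield
    @entries[key] = { value: value, expires_at: @clock.call + @ttl }
    value
  end
end

# Retry a block, sleeping a little longer after each failure.
def with_retries(attempts: 3, base_delay: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    raise if tries >= attempts

    sleeper.call(base_delay * tries)
    retry
  end
end

# Fail soft: any error becomes a payload with an error field.
def reads_payload(username, cache:, &scrape)
  cache.fetch(username) { with_retries { scrape.call(username) } }
rescue StandardError => e
  { error: "couldn't load #{username}: #{e.message}" }
end

CACHE = TtlCache.new(ttl_seconds: 30 * 60) # thirty minutes, as described above
```

Because errors are returned rather than cached, the next poll after a failed scrape tries again instead of serving the error for thirty minutes.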

Then the templates. TRMNL supports four layout sizes (full, half horizontal, half vertical, and quadrant), and I wrote Liquid for each, with the empty and error states baked in so the display always has something sensible to show. I wrapped it all in a Docker image that installs the right curl-impersonate build for the architecture, and I had a passing test suite running against saved HTML fixtures so I wasn’t hitting the network on every run.

It worked. Locally, it really worked.

The second wall (this one was my fault)

I pointed the scraper at my own profile and got a redirect to a sign-in page. My books were nowhere.

It took me an embarrassing minute to realize: my StoryGraph profile was private. Of course it was. Public profiles scrape fine; private ones bounce you to the login wall, exactly as they should. The fix was a single toggle in my StoryGraph settings, and suddenly there I was in JSON form: Eloquent Ruby, Effective Testing with RSpec 3, The Staff Engineer’s Path. Reader, my to-read pile is exactly as on-brand as you’d expect.

To see it on the actual device, I ran the container locally and pointed a cloudflared tunnel at it, which gave me a temporary public URL to paste into TRMNL. A minute later my little e-ink screen lit up with my current reads. I may have done a small chair dance.

The twist

The tunnel was never meant to be permanent (it runs off my laptop, and the URL changes every time it restarts), so the next step was deploying somewhere real. I built the Docker image for Fly.io, set my username, and shipped it.

The health check was green. The scrape failed. Every. single. time.

Same code. The exact same image that had just pulled my books down on my Mac, now returning “couldn’t load the profile” from the cloud, over and over. I retried. I checked the profile was still public. I stared at it for a while.

Then it clicked, and it’s the lesson I keep coming back to. curl-impersonate beats Cloudflare’s fingerprint check. It does nothing about Cloudflare’s IP reputation check. My Mac sits behind a residential IP that looks like a person. Fly’s machines sit on datacenter IP ranges that Cloudflare knows perfectly well belong to a hosting provider, and it blocks them on sight, accent or no accent. The request from my laptop and the request from Fly were byte-for-byte identical in every way I controlled. The only difference was where they came from, and that difference was the whole game.

It worked on my machine. The single most clichéd sentence in software, and here it was, completely literal and completely true.

What I actually learned

The code was never the hard part. I spent maybe an afternoon on the Rack app, the scraper, the templates, all of it. I spent far longer learning that a request has properties I’d never had to think about: the shape of its handshake, the reputation of the address it leaves from. Those live underneath the application entirely, and no amount of clean code touches them.

There are real ways forward from here. I could run it from a residential connection (an always-on box at home behind a stable tunnel). I could route the outbound requests through a service that provides residential IPs and handles the Cloudflare dance for me. Each is a tradeoff between cost, complexity, and how much of my own hardware I want babysitting a reading list. For now, the laptop tunnel does the job, and I’ve left the deploy config in the repo for when I commit to a permanent home.

I’m planning to share the code once it’s had the cleanup it badly needs. It works, but “works” and “ready to show people” are two different states, and right now there are a few rough edges I’d rather not hand to anyone. When it’s tidied up I’ll post the repo, so if you want to build something similar for your own TRMNL, keep an eye out.

But the plugin itself is done, and it’s genuinely lovely to glance over at my desk and see what I’m reading rendered in crisp e-ink. A simple project that turned into a short tour of everything that happens to a web request before your code ever sees it. I’ll take it. I just won’t call the next one “simple” out loud.

#cloudflare #docker #flyIo #rack #ruby #storygraph #trmnl #webScraping

No-one likes an out-of-touch AI assistant. Fortunately, rapid refreshing can keep AI models aware of the very latest public information. https://www.zyte.com/blog/enhancing-ai-model-performance-with-fresh-web-data?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

Enhancing AI model performance with fresh web data


Zyte

AI agents can generate code, suggest selectors, and draft crawl logic. What they can't do is design the system that decides when to stop, what to trust, and how to recover when the web pushes back. That job still belongs to a human. https://www.zyte.com/blog/whats-becoming-of-web-scraping-developers-in-the-age-of-ai-agents?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

What's becoming of web scraping developers in the age of AI agents?


Zyte

Building superior AI models with quality web data

Training data quality can make or break AI model effectiveness. So how are engineers sourcing the web’s best input?

Zyte

News Sites Are Blocking Internet Archive Over AI Scraping Fears

Especially in this era of the Internet, the role of the Internet Archive’s Wayback Machine has become increasingly essential as more and more web content vanishes into the ether or is surrept…

Hackaday