Ladon – typed, resumable web crawlers in Python
Ladon is a Python framework for building structured, resumable web crawlers, aimed at domains where data quality matters. Its SES protocol (Source, Expander, Sink) enforces typed domain objects at every stage, which is useful where schema correctness is essential, such as in LLM training pipelines. Infrastructure concerns like HTTP request retries, backoff, proxy support, and robots.txt compliance are built in, so you can focus on domain logic. It also supports asynchronous crawling, and the ladon-hackernews adapter provides a real-world usage example. It is currently released under the AGPL-3.0 license, with a commercial license also available.

https://github.com/MoonyFringers/ladon

#python #webcrawler #llm #async #datapipeline

GitHub - MoonyFringers/ladon: A Python framework for building structured, resumable web crawlers — designed for domains where data quality matters.

GitHub
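
The post doesn't show Ladon's actual API, so here is a minimal sketch of the Source → Expander → Sink shape it describes, with types enforced at each stage. The names (StoryRef, Story, source, expander, sink) and the Hacker News flavor are assumptions for illustration, not Ladon's real interface.

```python
# Illustrative only: Ladon's real API may differ. This sketches the
# Source -> Expander -> Sink pipeline with typed domain objects.
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class StoryRef:
    """Seed item emitted by a source (e.g. a Hacker News story id)."""
    story_id: int


@dataclass(frozen=True)
class Story:
    """Fully expanded domain object with a fixed schema."""
    story_id: int
    title: str
    url: str


def source() -> Iterator[StoryRef]:
    # A source yields typed seed items; here a hard-coded stand-in
    # for an HTTP listing fetch.
    yield StoryRef(story_id=1)


def expander(ref: StoryRef) -> Story:
    # An expander turns a seed into a full domain object, typically via
    # HTTP requests with retries/backoff handled by the framework.
    return Story(story_id=ref.story_id, title="Example", url="https://example.com")


def sink(stories: Iterable[Story]) -> None:
    # A sink persists validated objects (database, JSONL, etc.).
    for s in stories:
        print(s)


sink(expander(ref) for ref in source())
```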

RT @glenngabe: Interesting: According to the top 1,000 websites tracked by @AIoriginality, there was a spike in pages blocking Semrushbot right when the Adobe acquisition was announced. If that's accurate, the Adobe acquisition unsettled some people. :) See the spike below, around November 20, 2026...

more at Arint.info

#Adobe #DigitalMarketing #Semrushbot #SEO #WebCrawler #Übernahme #arint_info

https://x.com/glenngabe/status/2051624562673005050#m

Arint - SEO+KI (@[email protected])

Mastodon Glitch Edition

Oh, this is #fun.

#Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.

I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.

And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.

Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
e.g. https://support.apple.com/en-ca/119829
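
For anyone who wants to run the same check, here is a minimal sketch of forward-confirmed reverse DNS: reverse-resolve the client IP, require a hostname under applebot.apple.com per the support page linked above, then confirm the forward lookup maps back to the original IP.

```python
# Forward-confirmed reverse DNS, the same check the post describes.
import socket


def is_verified_applebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]          # reverse lookup
        if not host.endswith(".applebot.apple.com"):
            return False
        forward = socket.gethostbyname_ex(host)[2]  # forward lookup
        return ip in forward                        # must round-trip
    except OSError:
        return False


print(is_verified_applebot("203.0.113.7"))  # replace with an IP from your logs
```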

So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.
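
For contrast, here's roughly what a well-behaved fetcher would do: issue a conditional GET so an unchanged robots.txt costs a 304 Not Modified instead of a full re-download. A minimal sketch using requests; the URL is an example.

```python
# Conditional GET: revalidate with ETag / Last-Modified instead of
# re-fetching the same unchanged file hundreds of times a day.
import requests

url = "https://example.com/robots.txt"

first = requests.get(url, timeout=10)
headers = {}
if first.headers.get("ETag"):
    headers["If-None-Match"] = first.headers["ETag"]
if first.headers.get("Last-Modified"):
    headers["If-Modified-Since"] = first.headers["Last-Modified"]

second = requests.get(url, headers=headers, timeout=10)
print(second.status_code)  # 304 if the file hasn't changed
```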

Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.

#Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer

About Applebot - Apple Support (CA)

Learn about Applebot, the web crawler for Apple.

Apple Support
ICYMI: Google-Agent joins the crawler list as AI browsing gets an official identity: Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users.

https://ppc.land/google-agent-joins-the-crawler-list-as-ai-browsing-gets-an-official-identity/

#GoogleAgent #AIBrowsing #ProjectMariner #WebCrawler #ArtificialIntelligence
Google-Agent joins the crawler list as AI browsing gets an official identity

Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users.

PPC Land
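
If you want to see whether the new fetcher is already hitting your site, a minimal sketch: count access-log lines that mention the Google-Agent token. The log path is an assumption, and the exact user-agent string should be confirmed against Google's crawler documentation.

```python
# Count access-log lines whose user agent mentions "Google-Agent".
count = 0
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Google-Agent" in line:
            count += 1

print(f"Google-Agent requests: {count}")
```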

What is the YandoriRSSBot?

I just happened to have my NLJ logs open (I had opened them when the site was slow for a moment). I saw something called the YandoriRSSBot requesting the NLJ ATOM feed. That's not unprecedented, but almost all feed fetchers ask for the regular RSS feed. I decided to search for the user agent to see if it was coming from a new feed reader I had never heard of. Unfortunately, Known Agents has no information about it beyond the fact that it has been reported in the wild. But I ran another […]

https://social.emucafe.org/naferrell/what-is-the-yandorirssbot-02-25-26/
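
A sketch of the kind of log check described above: tally the user agents requesting feed URLs so unfamiliar fetchers like YandoriRSSBot stand out. The combined log format, log path, and feed-path filter are assumptions; adjust them to your server.

```python
# Tally user agents hitting feed URLs to surface unfamiliar bots.
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

agents = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # Crude feed filter; adjust to your feed URLs.
        if "/feed" not in line and "atom" not in line.lower():
            continue
        m = UA_RE.search(line)
        if m:
            agents[m.group(1)] += 1

for ua, n in agents.most_common(10):
    print(f"{n:6d}  {ua}")
```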

[Note] What is the YandoriRSSBot?

I saw YandoriRSSBot in my server logs. I undertook an investigation and learned that it is connected to a new product shared on Hacker News.

The Emu Café Social

Top developers master concurrency, safety, memory, and core coding basics.

#coding #webcrawler #developer

The Website-Crawler tool extracts data from websites as JSON or CSV, making it well suited for use with large language models (LLMs). It supports crawling or scraping an entire website quickly and is easy to use. #WebCrawler #DataExtraction #LLM #AI #WebScraping #MachineLearning

https://www.reddit.com/r/LocalLLaMA/comments/1qt0t3g/github_websitecrawler_extract_data_from_websites/
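
This is not the linked tool, just a minimal stand-in showing the crawl-to-JSON idea: fetch pages, follow same-site links, and dump records an LLM pipeline could ingest. It assumes requests and BeautifulSoup are installed, uses https://example.com/ as an example start URL, and caps the crawl small.

```python
# Minimal same-site crawl that emits JSON records (url, title, text).
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "https://example.com/"
HOST = urlparse(START).netloc

seen, queue, records = set(), [START], []
while queue and len(seen) < 50:  # small cap for the sketch
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = (soup.title.string or "") if soup.title else ""
    records.append({"url": url, "title": title,
                    "text": soup.get_text(" ", strip=True)[:2000]})
    for a in soup.find_all("a", href=True):
        nxt = urljoin(url, a["href"])
        if urlparse(nxt).netloc == HOST:  # stay on the same site
            queue.append(nxt)

print(json.dumps(records, ensure_ascii=False, indent=2))
```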

Cory – Blocking Countries because of scrapers

What the title says: Cory is blocking countries due to misbehaved scrapers. We do a bit of this at work, blocking misbehaving countries when they flood our sites with traffic. There is very little reason for anyone to visit a city's website unless they live in that city, maybe the city next door, or at least in the city's own country.

99% of the time, when a site gets DDoSed, the traffic is coming from outside the country. The leading sources are India, China, and North Korea. Sure, a single person or a family could be researching a city, but that doesn't explain the traffic floods.

Many of our customers use Cloudflare so we just block them at the Cloudflare level and call it a day. I go back after a few weeks and remove the block because some valid traffic is reasonable.

I had to take a similar line on my own site: block a bunch of offending scrapers and bots by country. It sucks to stop regular people from visiting my site, but I've already dealt with a bill of $5k in a month that should have been $50, and I don't need another one.
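
The blocking described above happens at the Cloudflare edge; as an origin-side fallback, here is a minimal sketch using Cloudflare's CF-IPCountry request header, which is present when IP Geolocation is enabled. The blocked-country set is a placeholder, not a recommendation.

```python
# Origin-side country block based on Cloudflare's CF-IPCountry header.
from wsgiref.simple_server import make_server

BLOCKED = {"XX"}  # placeholder ISO 3166-1 codes you choose to block


def app(environ, start_response):
    country = environ.get("HTTP_CF_IPCOUNTRY", "")
    if country in BLOCKED:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Blocked in your region.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello!\n"]


if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```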

#webCrawler #webScraper
Blocking entire countries because of scrapers

Cory Dransfeldt
https://curtismchale.ca/2026/01/14/cory-blocking-countries-because-of-scrapers/

#LinksOfInterest #WebCrawler #WebScraper

Exa AI Research Blog | Semantic Search & Neural Network Search Engine

Discover the latest in AI research and semantic search technology on the Exa blog. Learn how our neural network search engine provides high-quality web data for AI applications.