What is the YandoriRSSBot?

I just happened to have my NLJ logs open (I had opened them when the site was slow for a moment). I saw something called the YandoriRSSBot requesting the NLJ ATOM feed. While not unprecedented, almost all the feed fetchers ask for the regular RSS feed. I decided to search for the user agent to see if it was coming from a new feed reader I had never heard of. Unfortunately, Known Agents has no information about it beyond the fact that it has been reported in the wild. But I ran another […]

https://social.emucafe.org/naferrell/what-is-the-yandorirssbot-02-25-26/

[Note] What is the YandoriRSSBot?

I saw YandoriRSSBot in my server logs. I undertook an investigation and learned that it is connected to a new product shared on Hacker News.
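The kind of log check described above can be sketched as a small script. This is a hypothetical example, not the author's actual tooling: it assumes an Apache/nginx "combined" log format, and the sample bot string and feed path are illustrative.

```python
import re
from collections import Counter

# Matches the common "combined" access-log format. The field names below
# (ip, time, method, path, status, referrer, agent) are for readability.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_feed_agents(lines, feed_path="/atom.xml"):
    """Count the user agents that requested the given feed path."""
    agents = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") == feed_path:
            agents[m.group("agent")] += 1
    return agents
```

Running this over an access log quickly surfaces unfamiliar fetchers like the one described in the post.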


Top developers master concurrency, safety, memory, and core coding basics.

#coding #webcrawler #developer

The Website-Crawler tool collects data from websites as JSON or CSV, suitable for use with large language models (LLMs). It supports crawling or scraping entire websites quickly and is easy to use. #WebCrawler #DataExtraction #LLM #AI #Tools #WebScraping #MachineLearning

https://www.reddit.com/r/LocalLLaMA/comments/1qt0t3g/github_websitecrawler_extract_data_from_websites/

Cory – Blocking Countries because of scrapers

What the title says: Cory is blocking countries due to misbehaved scrapers. We do a bit of this at work, blocking misbehaving countries when they flood our sites with traffic. There is very little reason for anyone to be visiting the website of a city unless they live in that city, maybe the city next door, or maybe the country the city is in.

99% of the time, when a site gets DDoSed by something, it's coming from somewhere outside the country. The leading countries are India, China and North Korea. Sure, a single person or a family could be researching a city, but that doesn't explain the traffic floods.

Many of our customers use Cloudflare, so we just block the offending countries at the Cloudflare level and call it a day. I go back after a few weeks and remove the block, since some legitimate traffic does come from those countries.

I had to take a similar line on my own site as well: block a bunch of offending scrapers and bots by country. It sucks to stop regular people from visiting my site, but I've already dealt with a $5k bill in a month that should have been $50, and I don't need another one.
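Country blocking of this kind is normally done at the Cloudflare edge, as the post describes, but the idea can be sketched as application-level middleware too. Cloudflare sets a `CF-IPCountry` header on proxied requests; the country codes in the blocklist below are placeholders, not anyone's actual policy, and this WSGI sketch is an illustration rather than a recommended deployment.

```python
# Hypothetical blocklist of ISO 3166-1 alpha-2 codes ("XX"/"YY" are placeholders).
BLOCKED_COUNTRIES = {"XX", "YY"}

def country_block(app):
    """Wrap a WSGI app, rejecting requests from blocked countries."""
    def middleware(environ, start_response):
        # Cloudflare exposes the visitor's country as the CF-IPCountry header,
        # which WSGI surfaces as HTTP_CF_IPCOUNTRY.
        country = environ.get("HTTP_CF_IPCOUNTRY", "")
        if country in BLOCKED_COUNTRIES:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Access from your region is temporarily blocked.\n"]
        return app(environ, start_response)
    return middleware
```

Blocking at the edge is still preferable in practice, since blocked traffic never reaches (or bills) the origin server at all.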

#webCrawler #webScraper
Blocking entire countries because of scrapers

Cory Dransfeldt
https://curtismchale.ca/2026/01/14/cory-blocking-countries-because-of-scrapers/
#LinksOfInterest #WebCrawler #WebScraper

Exa AI Research Blog | Semantic Search & Neural Network Search Engine

Discover the latest in AI research and semantic search technology on the Exa blog. Learn how our neural network search engine provides high-quality web data for AI applications.

🦀 Crab.so – a free, lightweight web crawler tool for SEO. Developed as a side project; it's not yet a Screaming Frog competitor, but it's useful for site audits. All feedback and suggestions for improvement are welcome! #SEO #WebCrawler #FreeTool #Crawl #SideProject #SEOTools

https://www.reddit.com/r/SideProject/comments/1qc09ox/a_free_lightweight_screaming_frog_alternative/

I've checked on #YaCy from time to time because the project seemed very interesting, but the resource requirements (disk space and memory) were too high to run it on cheap hardware as a hobby. I don't know of any other #OpenSource, (optionally) #distributed #searchEngine with a #webCrawler included (independent of Google and co., unlike metasearch engines).
I thought maybe somebody would rewrite it in Rust or something, but no luck so far. There was an announcement of significant optimisations once, but the resources needed still seem to be huge.
Sadly, the focus nowadays seems to be on adding #AI to it. I guess I'll wait until the bubble is gone. 😕

Is there a standard hostname/domain to use in the documentation for a web spider? Ideally the host/domain should exist, have multiple webpages, and be OK with random traffic from people testing the web spider example code.

#webspider #webcrawler

Researchers Hack ChatGPT Memories and Web Search Features

Attackers can set up a new website that is likely to show up in web search results for niche topics. ChatGPT relies on Bing and OpenAI's crawler for web searches.

#chatgpt #openai #bing #webcrawler #security #cybersecurity #hackers #hacking #hacked

https://www.securityweek.com/researchers-hack-chatgpt-memories-and-web-search-features/


Researchers recently discovered seven new ChatGPT vulnerabilities and attack techniques that can be exploited for data theft.

SecurityWeek