What is the YandoriRSSBot?
I just happened to have my NLJ logs open (I had opened them when the site was slow for a moment) and saw something called the YandoriRSSBot requesting the NLJ Atom feed. While not unprecedented, almost all feed fetchers ask for the regular RSS feed. I decided to search for the user agent to see whether it was coming from a new feed reader I had never heard of. Unfortunately, Known Agents has no information about it beyond the fact that it has been reported in the wild. But I ran another […]
https://social.emucafe.org/naferrell/what-is-the-yandorirssbot-02-25-26/
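Spotting an unfamiliar fetcher like this in access logs is easy to script. A minimal sketch in Python, assuming combined-format log lines where the user agent is the last quoted field (the sample lines and log entries below are hypothetical, illustrating the kind of entry described in the post):

```python
import re
from collections import Counter

# Combined-log-format lines end with "referer" "user-agent";
# grab the last quoted field as the user agent.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(log_lines):
    """Tally user-agent strings across access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample entries, not real NLJ log data.
sample = [
    '203.0.113.9 - - [25/Feb/2026:10:00:00 +0000] "GET /feed/atom HTTP/1.1" 200 512 "-" "YandoriRSSBot/1.0"',
    '198.51.100.7 - - [25/Feb/2026:10:00:05 +0000] "GET /feed/rss HTTP/1.1" 200 2048 "-" "SomeReader/2.3"',
]
print(count_user_agents(sample).most_common())
```

Sorting the resulting counter makes new or suddenly chatty agents stand out at a glance.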
Top developers master concurrency, safety, memory, and core coding basics.
Website-Crawler is a tool that collects data from websites as JSON or CSV, suitable for use with large language models (LLMs). It supports crawling or scraping an entire site quickly and is easy to use. #WebCrawler #DataExtraction #LLM #AI #Tools #WebScraping #MachineLearning
Cory – Blocking Countries because of scrapers
What the title says: Cory is blocking countries due to misbehaving scrapers. We do a bit of this at work, blocking misbehaving countries when they flood our sites with traffic. There is very little reason for anyone to visit a city's website unless they live in that city, maybe the city next door, or maybe elsewhere in that city's own country.
99% of the time, when a site gets DDoSed, the traffic is coming from outside the country. The leading sources are India, China, and North Korea. Sure, a single person or a family could be researching a city, but that doesn't explain the traffic floods.
Many of our customers use Cloudflare, so we just block the offending countries at the Cloudflare level and call it a day. I go back after a few weeks and remove the block, because some valid traffic is reasonable.
I had to take a similar line on my own site as well: block a bunch of offending scrapers and bots by country. It sucks to stop regular people from visiting my site, but I've already dealt with one monthly bill of $5k that should have been $50, and I don't need another.
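Blocking at the Cloudflare level can also be mirrored at the origin. Cloudflare forwards a visitor's two-letter country code in the `CF-IPCountry` request header (when IP geolocation is enabled), so a country block can be sketched as a small WSGI middleware. This is a minimal illustration, assuming Cloudflare is in front of the site; the blocked-country set and the `hello` app are hypothetical, not anyone's actual configuration:

```python
# Illustrative list only, not a real policy from the post.
BLOCKED_COUNTRIES = {"CN", "IN", "KP"}

def country_block(app):
    """WSGI middleware that returns 403 for requests from blocked countries.

    Reads Cloudflare's CF-IPCountry header, which WSGI exposes as
    HTTP_CF_IPCOUNTRY. Requests without the header pass through.
    """
    def middleware(environ, start_response):
        country = environ.get("HTTP_CF_IPCOUNTRY", "").upper()
        if country in BLOCKED_COUNTRIES:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def hello(environ, start_response):
    """Placeholder application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

guarded = country_block(hello)
```

Doing it at the edge is still cheaper, since blocked requests never reach the origin and never show up on the bandwidth bill.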
🦀 Crab.so is a free, lightweight web crawler for SEO. It was built as a side project and isn't a Screaming Frog competitor yet, but it's useful for site audits. All feedback and suggestions for improvement are welcome! #SEO #WebCrawler #FreeTools #Crawl #SideProject #WebScraper
https://www.reddit.com/r/SideProject/comments/1qc09ox/a_free_lightweight_screaming_frog_alternative/
Is there a standard hostname/domain to use in the documentation for a web spider? Ideally the host/domain should exist, have multiple webpages, and be OK with random traffic from people testing the web spider example code.
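Whatever domain ends up in the docs, the spider example itself can be written so it is testable offline. A sketch of the link-extraction step using only the standard library (`https://example.com` below is just the RFC 2606 reserved placeholder domain, not a recommendation for live testing):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> attributes on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))

# Offline demo: parse a small HTML fragment instead of fetching a live site.
page = '<html><body><a href="/about">About</a> <a href="https://example.org/x">X</a></body></html>'
extractor = LinkExtractor("https://example.com/")
extractor.feed(page)
print(extractor.links)  # → ['https://example.com/about', 'https://example.org/x']
```

Feeding the parser canned HTML like this keeps the example runnable without sending anyone's test traffic to a real host.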