Search engines opened up the web, only to then centralize it. Almost the moment the web launched, the first search engines appeared. They were simple, launched alongside directory sites, and searched mainly among the sites and links listed in their own directory.

https://blog.zaramis.se/2026/06/05/sokmotorerna-oppnade-upp-webben/
Why Google’s new AI-saturated search page will be a disaster

Google didn’t invent full-text search of the Internet – that honour belongs to early pioneers such as WebCrawler, Lycos and AltaVista. But for the last 25 years or so, Google has been synonymous with online searching, providing the quickest and most effective way to find things online (although its results may be getting worse.) More recently, it has been adding to its search engine more […]

#agentic #agents #ai #altavista #blackBox #chatbot #creators #dependency #google #interface #links #llms #lycos #magazines #newspapers #publishing #search #training #webcrawler #worldWideWeb https://walledculture.org/why-googles-new-ai-saturated-search-page-will-be-a-disaster/

Spider v1.0.0 released.

Spider is not just another web crawler -- it is a purpose-built wordlist and ngram processor for hash cracking workflows.

URL Mode:
Point it at a URL and Spider crawls the target, extracts words, and generates frequency-sorted wordlists and/or ngrams.

But Spider does not stop at web crawling...

File Mode:
Feed it local files and it brings the same word-processing engine to your own datasets, scraped content, notes, dumps, configs, or any other plaintext source you want to turn into a targeted wordlist or ngram set.
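To make the wordlist and ngram idea concrete, here is a minimal Python sketch of frequency-sorted word and bigram extraction from plain text. It is not Spider's implementation; the function names, the tokenizing regex, and the tie-breaking rule are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes (assumed rule).
    return re.findall(r"[a-z0-9']+", text.lower())


def frequency_sorted(items):
    # Most frequent first; ties broken alphabetically so the output is stable.
    counts = Counter(items)
    return sorted(counts, key=lambda item: (-counts[item], item))


def ngrams(words, n):
    # Sliding window of n consecutive words, joined by a single space.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "the quick fox and the lazy dog and the cat"
words = tokenize(sample)
wordlist = frequency_sorted(words)
bigrams = frequency_sorted(ngrams(words, 2))
```

The same functions work whether the text came from a crawled page or a local file, which is roughly the split the URL and File modes describe.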

More info:
https://forum.hashpwn.net/post/52

#spider #webcrawler #wordlist #generator #sort #ngram #cyclone #hashpwn #hashcracking

RT @glenngabe: Interesting: according to the top 1,000 websites tracked by @AIoriginality, there was a rise in sites blocking Semrushbot right when the Adobe acquisition was announced. If that's true, the Adobe acquisition unsettled some people. :) See the spike below, around 20.11.2026...

More at Arint.info

#Adobe #DigitalMarketing #Semrushbot #SEO #WebCrawler #Übernahme #arint_info

https://x.com/glenngabe/status/2051624562673005050#m

Oh, this is #fun.

#Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.

I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.
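For readers who want to check what such rules mean for a compliant crawler, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are hypothetical stand-ins; the post does not show the actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit of the post: Applebot gets nothing.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that honours the file should get False for Applebot here.
applebot_allowed = parser.can_fetch("Applebot", "https://example.com/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/page")
```

With rules like these, any page fetch by Applebot beyond robots.txt itself would be a violation.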

And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.

Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
e.g. https://support.apple.com/en-ca/119829
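The "DNS matches" check is usually a forward-confirmed reverse DNS lookup. A hedged Python sketch follows; the function name and the injectable resolvers are illustrative, and the hostname suffix to pass in (for example `.applebot.apple.com`) is an assumption that should be confirmed against the Apple page linked above.

```python
import socket


def is_verified_crawler(ip, allowed_suffix, reverse_lookup=None, forward_lookup=None):
    # Resolvers are injectable so the logic can be tested without network access.
    # The defaults use the standard library; gethostbyname_ex is IPv4-only.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)  # step 1: IP -> hostname
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(allowed_suffix):
        return False  # hostname is outside the crawler's documented domain
    try:
        return ip in forward_lookup(hostname)  # step 2: hostname -> IPs must include ip
    except OSError:
        return False
```

The forward step matters because anyone who controls reverse DNS for their own address range can make it claim any hostname; only the domain owner controls the forward record.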

So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.
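As an illustration of the caching that is missing, here is a minimal Python sketch of HTTP conditional requests with `ETag` and `If-None-Match`. The `CachedResource` class, the `fetch` callable, and the fake server are all hypothetical; nothing here reflects how Applebot is actually built.

```python
class CachedResource:
    """Remembers the last body and ETag so unchanged content is not re-downloaded."""

    def __init__(self, fetch):
        # fetch(url, headers) -> (status, response_headers, body); hypothetical transport.
        self.fetch = fetch
        self.etag = None
        self.body = None

    def get(self, url):
        headers = {"If-None-Match": self.etag} if self.etag else {}
        status, response_headers, body = self.fetch(url, headers)
        if status == 304:
            # Server confirmed our copy is current; reuse it.
            return self.body
        self.etag = response_headers.get("ETag")
        self.body = body
        return body


def fake_server(url, headers):
    # Stand-in for a web server holding a robots.txt that never changes.
    fake_server.calls += 1
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, b""
    fake_server.full_responses += 1
    return 200, {"ETag": '"v1"'}, b"User-agent: *\nDisallow: /\n"


fake_server.calls = 0
fake_server.full_responses = 0

resource = CachedResource(fake_server)
bodies = [resource.get("https://example.com/robots.txt") for _ in range(800)]
```

In this sketch the file body is transferred once and the other 799 requests get an empty 304. A politer client would also respect `Cache-Control` or simply re-check once a day, which avoids most of the requests altogether.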

Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.

#Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer

About Applebot - Apple Support (CA)

Learn about Applebot, the web crawler for Apple.

Apple Support

ICYMI: Google-Agent joins the crawler list as AI browsing gets an official identity

Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users.

https://ppc.land/google-agent-joins-the-crawler-list-as-ai-browsing-gets-an-official-identity/

#GoogleAgent #AIBrowsing #ProjectMariner #WebCrawler #ArtificialIntelligence

What is the YandoriRSSBot?

I just happened to have my NLJ logs open (I had opened them when the site was slow for a moment). I saw something called the YandoriRSSBot requesting the NLJ ATOM feed. While not unprecedented, almost all the feed fetchers ask for the regular RSS feed. I decided to search for the user agent to see if it is coming from a new feed reader that I had never heard of. Unfortunately, Known Agents has no information about it beyond the fact that it has been reported in the wild. But I ran another […]

https://social.emucafe.org/naferrell/what-is-the-yandorirssbot-02-25-26/

[Note] What is the YandoriRSSBot?

I saw YandoriRSSBot in my server logs. I undertook an investigation and learned that it is connected to a new product shared on Hacker News.

The Emu Café Social

Top developers master concurrency, safety, memory, and core coding basics.

#coding #webcrawler #developer

The Website-Crawler tool collects data from websites as JSON or CSV, suitable for use with large language models (LLMs). It supports crawling or scraping an entire website quickly and is easy to use. #WebCrawler #DataExtraction #LLM #AI #CôngCụ #WebScraping #MachineLearning

https://www.reddit.com/r/LocalLLaMA/comments/1qt0t3g/github_websitecrawler_extract_data_from_websites/

Cory – Blocking Countries because of scrapers

What the title says: Cory is blocking countries because of misbehaving scrapers. We do a bit of this at work, blocking countries when they flood our sites with traffic. There is very little reason for anyone to visit a city's website unless they live in that city, maybe the city next door, or at least the same country.

99% of the time when a site gets DDoSed, the traffic is coming from somewhere outside the country. The leading countries are India, China and North Korea. Sure, a single person or a family could be researching a city, but that doesn’t explain the traffic floods.

Many of our customers use Cloudflare, so we just block the offending countries at the Cloudflare level and call it a day. I go back after a few weeks and remove the block, because some of that traffic is legitimate.

I had to take a similar line on my own site, blocking a bunch of offending scrapers and bots by country. It sucks to stop regular people from visiting my site, but I’ve already dealt with a $5k bill in a month that should have been $50, and I don’t need another one.

#webCrawler #webScraper

Blocking entire countries because of scrapers

Cory Dransfeldt