Mastodawn

#Development #Challenges
Webspace invaders · Let’s level up our anti-AI scraping game! https://ilo.im/16ahl8

_____
#AI #Crawlers #RobotsTxt #RateLimiting #WAFs #Cloudflare #IndieWeb #WebDev #Frontend #Backend

Webspace Invaders · Matthias Ott

There’s something happening on the Web at the moment that almost feels like watching that old arcade game Space Invaders play out across our servers. Bots and scrapers marching in formation, attacking our servers wave after wave, systematically requesting page after page, relentlessly filling their data stores while we watch our access logs fill up.

Matthias Ott – User Experience Designer

Virebent 4d ago

📝 New article: Why We Reject Google: Our Anti-Surveillance SEO Policy

An in-depth look at why Virebent.art deliberately blocks Google and other surveillance-based crawlers, and our strategy for visibility in a privacy-first web.

🔗 https://www.virebent.art/blog/seo-policy.html

#antiseo #robotstxt #surveillancecapitalism

Why We Reject Google: Our Anti-Surveillance SEO Policy | Virebent.art

Our SEO strategy is an anti-surveillance strategy. Learn why we block mainstream crawlers and how we build visibility on an ethical, privacy-first web.

Habr Jan 26

[Translation] The Quiet Death of robots.txt

For decades, robots.txt governed the behavior of web crawlers. But today, as unscrupulous AI companies chase ever larger volumes of data, the basic social contract of the web is starting to fall apart. For three decades, a tiny text file kept the Internet from descending into chaos. This file had no particular legal or technical weight, and was not even especially complicated. It is a handshake deal between the pioneers of the Internet to respect each other's wishes and build the Internet in a way that benefits everyone. It is a mini-constitution of the Internet, written in code. The file is called robots.txt; it usually lives at yourwebsite.com/robots.txt. This file lets anyone who owns a site, whether a small cooking blog or a multinational corporation, tell the web what is allowed on it and what is not. Which search engines may index your site? Which archival projects may download and preserve versions of a page? Can competitors keep tabs on your pages? You decide and declare it to the web. The system is not perfect, but it works. Or at least it used to. For decades, the main focus of robots.txt was search engines; the site owner allowed scraping, and in return the search engines promised to bring users to the site. Today AI has changed that equation: companies around the world use sites and their data to assemble enormous training datasets in order to build models and products that may not acknowledge the existence of the original sources at all. The robots.txt file works on a give-and-take basis, but a great many people have the impression that AI companies only like to take. So much money has been poured into AI, and technological progress is moving so fast, that many site owners cannot keep up. And the fundamental agreement underlying robots.txt and the web as a whole may be losing its force too.

https://habr.com/ru/companies/ruvds/articles/987416/

#robotstxt #вебкраулер #crawling #openai #ruvds_перевод
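
To make the mechanism in the Habr post concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The file contents, the bot names (`ExampleAIBot`, `SomeSearchBot`) and the `example.com` URLs are made up for illustration, and nothing here forces a crawler to run this check, which is exactly the weakness the article describes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler everywhere,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every request.
blocked = parser.can_fetch("ExampleAIBot", "https://example.com/recipes/soup")
allowed = parser.can_fetch("SomeSearchBot", "https://example.com/recipes/soup")
private = parser.can_fetch("SomeSearchBot", "https://example.com/private/notes")
```

The check is purely voluntary: a scraper that never calls anything like `can_fetch` is not stopped by the file at all.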

The Quiet Death of robots.txt

Habr

teufelswerk Jan 24

Robots.txt Generator - Retro Terminal Edition - More than 200 bots in the free version. Pure HTML, JavaScript and a bit of CSS. No third parties, no framework, no CDN, no cookies, no tracking, no ads, no Big Tech nonsense, no AI, very privacy-friendly. Simple and effective, in retro style. Coming online soon.

#teufelswerk #HTML #javascript #app #entwicklung #code #retro #css #robotstxt #generator #stopbots #bots #crawler #scraper #keineKI #cookieless #datenschutz
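
The teufelswerk tool is described as pure HTML and JavaScript and its code is not shown here, so the following is an unrelated Python sketch of the general idea behind such a generator: take a list of crawler names and emit a robots.txt that blocks them while leaving everyone else alone. The function name `generate_robots_txt` is hypothetical, and `GPTBot` and `CCBot` are just two example user-agent tokens.

```python
from typing import Iterable


def generate_robots_txt(blocked_bots: Iterable[str]) -> str:
    """Build a robots.txt that blocks the named bots and allows all others."""
    lines = []
    bots = list(blocked_bots)
    if bots:
        # Several User-agent lines may share one group of rules.
        lines.extend(f"User-agent: {bot}" for bot in bots)
        lines.append("Disallow: /")
        lines.append("")  # blank line between groups for readability
    lines.append("User-agent: *")
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"


# Example: block two crawlers by their user-agent tokens.
robots_txt = generate_robots_txt(["GPTBot", "CCBot"])
```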

Frontend Dogma Jan 22

Generative AI, by @christianliebel and @yash-vekaria.bsky.social and others (@httparchive.org):

https://almanac.httparchive.org/en/2025/generative-ai

#webalmanac #studies #research #metrics #ai #robotstxt #llmstxt

Generative AI | 2025 | The Web Almanac by HTTP Archive

Generative AI chapter of the 2025 Web Almanac covering the transition to local browser-based AI, the adoption of WebNN and Built-in AI, new discoverability standards like llms.txt, and the emergence of AI fingerprints on the web.

Layar Kosong Jan 20

A guide to understanding Cloudflare's three options for robots.txt configuration: Content Signals Policy, Instruct AI bots to not scrape, and Disable configuration. Learn how to give instructions to AI crawlers.

#fediverse #Repost #WartaTekno #Mengelola #Robotstxt #Cloudflare

https://dalam.web.id/artikel/cloudflare-robots-txt-control

Managing robots.txt on Cloudflare: Controlling AI Bot Access 📜

Layar Kosong

Frontend Dogma Jan 19

SEO, by @httparchive.org:

https://almanac.httparchive.org/en/2025/seo

#webalmanac #studies #research #metrics #seo #robotstxt #llmstxt #links #content #structureddata #amp #html #internationalization

SEO | 2025 | The Web Almanac by HTTP Archive

SEO chapter of the 2025 Web Almanac covering crawlability, indexability, page experience, on-page SEO, links, AMP, internationalization, and more.

Maciek Jan 18

https://contentsignals.org is a nice idea, but if my reading of RFC 9309 is correct, it might lead to agent-specific blocks being ineffective for bots that don't recognise content signals, because in case of multiple sections of robots.txt matching, the "allow" rules take precedence over the "disallow" rules.

#robotsTxt
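
For readers who want to experiment with the precedence question raised above, here is a simplified sketch of rule evaluation as RFC 9309 describes it: rules from all groups that match a crawler are combined, the longest matching path wins, and an `allow` rule wins when an `allow` and a `disallow` rule are equally specific. The helper `is_allowed` is hypothetical, and the sketch assumes plain ASCII prefixes with no `*` or `$` wildcards and no percent-encoding.

```python
from typing import List, Tuple

# A rule is ("allow" | "disallow", path prefix), already merged
# from every group that matched the crawler's user-agent.
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; on a tie, 'allow' wins."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        n = len(prefix)  # equals the octet count for ASCII prefixes
        if n > best_len or (n == best_len and kind == "allow"):
            best_len = n
            best_allow = kind == "allow"
    return best_allow


# Equal-length rules: the allow rule takes precedence.
tie = is_allowed([("disallow", "/"), ("allow", "/")], "/blog/post")
# A longer disallow rule beats a shorter allow rule.
specific = is_allowed([("allow", "/"), ("disallow", "/private/")], "/private/x")
```

In this sketch `tie` comes out allowed and `specific` comes out disallowed, which is the kind of interaction the post is worried about when a blanket `Allow: /` ends up merged with an agent-specific `Disallow: /`.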

Content Signals

An up-to-date guide to the IETF's proposed new AI Preferences (aipref): a new way for website publishers to control how automated systems use their content.

Leonardo Di Ottio Jan 12

@piccalilli My (admittedly cynical) assumption is that they will still hoover up anything they can find on your site; they’re just no longer showing it to anyone outside Google.

#Google #RobotsTxt #SEO