@theregister
Checklist for Protecting Content from AI Webcrawlers
1. Basic Preventive Steps (Easy, Low Effort)
Robots.txt file: Add a blanket directive (User-agent: * followed by Disallow: /) or block specific AI bots by name (e.g., User-agent: GPTBot followed by Disallow: /). Note that robots.txt is purely advisory: it only deters crawlers that choose to honor it.
Meta tags: Insert <meta name="robots" content="noai, notrain"> in your site’s HTML. Be aware that noai and notrain are informal, vendor-proposed values rather than part of the robots meta standard, so support varies by crawler.
Copyright notice: Place a visible copyright line on your content.
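The robots.txt step above can be sketched as a concrete file. GPTBot, CCBot, and Google-Extended are crawler names their vendors have published, but any such list goes stale quickly, so verify the current names against each vendor’s documentation:

```
# Block known AI training crawlers entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow all other crawlers
User-agent: *
Allow: /
```

Serve this as /robots.txt at your site root; again, it only stops crawlers that honor it.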
2. Intermediate Safeguards (Moderate Effort)
Rate limiting & bot detection: Use services such as Cloudflare’s bot management or a CAPTCHA (e.g., reCAPTCHA) to detect and block unusual scraping patterns, such as rapid sequential page fetches from a single client.
Invisible watermarks:
For text → subtle misspellings, metadata, or zero-width Unicode markers.
For images/audio/video → steganographic watermarks that don’t affect normal use.
Canary phrases: Insert harmless but unique markers (e.g., “AuthoredBy_Alice2025”) so if it shows up in a model, you know your data was ingested.
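The rate-limiting idea does not require an external service; a minimal in-memory sliding-window limiter can be sketched as below. The class name, thresholds, and keying by client ID are illustrative assumptions, not a production design:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Track request timestamps per client and reject bursts."""

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client_id -> recent timestamps

    def allow(self, client_id, now=None):
        """Return True if this request fits within the rate limit."""
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        # Discard timestamps that have fallen out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # burst detected: throttle or block this client
        q.append(now)
        return True
```

In practice you would call allow() from web-server middleware, keyed by IP address or user-agent string, and return a 429 when it refuses.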
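A text watermark of the zero-width-Unicode kind can be sketched in a few lines; here an invisible character sequence encodes a short site ID. The encoding scheme is an illustrative assumption and is easily stripped by text normalization, so treat it as tamper evidence, not protection:

```python
# Zero-width characters: invisible when rendered, but survive copy-paste
ZW0 = "\u200b"  # zero-width space      -> bit 0
ZW1 = "\u200c"  # zero-width non-joiner -> bit 1

def embed(text, mark):
    """Hide `mark` as zero-width bits after the first word of `text`."""
    bits = "".join(f"{ord(c):08b}" for c in mark)
    payload = "".join(ZW1 if b == "1" else ZW0 for b in bits)
    head, sep, tail = text.partition(" ")
    return head + payload + sep + tail

def extract(text):
    """Recover the hidden mark, or return '' if none is present."""
    bits = "".join("1" if c == ZW1 else "0"
                   for c in text if c in (ZW0, ZW1))
    chars = [chr(int(bits[i:i + 8], 2))
             for i in range(0, len(bits) - 7, 8)]
    return "".join(chars)
```

If extract() later recovers your mark from a model’s output, that serves the same evidentiary role as a canary phrase.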
3. Advanced Legal & Technical Protections
Terms of Service (ToS): Explicitly forbid scraping or AI training use of your content.
Licensing banners: Display notices like “Not licensed for AI model training.”
Content provenance tools (C2PA / CAI): Embed cryptographically signed authorship data in your files, using the Coalition for Content Provenance and Authenticity (C2PA) standard or Content Authenticity Initiative (CAI) tooling.
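Real C2PA manifests are signed with X.509 certificates and embedded in the media file itself by dedicated tooling; the simplified stand-in below only illustrates the core idea of binding an authorship claim to a content hash with a key. All names here are illustrative, not the C2PA API:

```python
import hashlib
import hmac
import json

# NOTE: illustrative placeholder only; a real deployment would use a
# certificate-based signature, not a shared secret.
SECRET = b"replace-with-a-real-signing-key"

def make_claim(content: bytes, author: str) -> dict:
    """Produce a signed authorship claim for a piece of content."""
    claim = {"author": author,
             "sha256": hashlib.sha256(content).hexdigest()}
    sig = hmac.new(SECRET, json.dumps(claim, sort_keys=True).encode(),
                   hashlib.sha256)
    claim["signature"] = sig.hexdigest()
    return claim

def verify_claim(content: bytes, claim: dict) -> bool:
    """Check that the claim matches the content and was signed by us."""
    body = {k: v for k, v in claim.items() if k != "signature"}
    sig = hmac.new(SECRET, json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256)
    return (hmac.compare_digest(sig.hexdigest(), claim["signature"])
            and body["sha256"] == hashlib.sha256(content).hexdigest())
```

The design point this captures is the same as C2PA’s: any alteration of the content or the authorship claim invalidates the signature.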
4. Emerging / Community-Level Defenses
Opt-out registries: Join efforts (when available) that signal “do not crawl/do not train” at a broader level.
Collective blocks: Join with other creators/publishers to block AI-specific crawlers at the server/network level.
Legal monitoring: Stay aware of new regulations (e.g., EU AI Act) that may strengthen your rights.
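Server-level blocking of AI-specific crawlers can be sketched as a tiny WSGI middleware. The blocklist entries are crawler names that have appeared in published robots.txt guidance, but any such list needs regular maintenance:

```python
# Known AI crawler user-agent substrings; keep this list updated
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Google-Extended")

def is_blocked(user_agent: str) -> bool:
    """Case-insensitive substring match against listed crawler names."""
    ua = user_agent.lower()
    return any(name.lower() in ua for name in BLOCKED_AGENTS)

def block_ai_crawlers(app):
    """WSGI middleware: return 403 Forbidden to listed AI crawlers."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"AI crawling not permitted.\n"]
        return app(environ, start_response)
    return middleware
```

A shared, community-maintained BLOCKED_AGENTS list is exactly the kind of collective defense this section describes; note that a determined scraper can simply change its user-agent string, so this only stops honestly labeled crawlers.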