📸🤦‍♂️ Nathan Rooy discovers that flashy websites are like McDonald's cheeseburgers: popular for being just "good enough." Instead of a gourmet web experience, it's a buffet of #mediocrity sourced from Common Crawl's greatest hits. Web connoisseurs, prepare to feast on the bland! 🍔💻
https://nry.me/posts/2025-10-09/small-web-screenshots/ #flashywebsites #webdesign #cheeseburgers #CommonCrawl #HackerNews #ngated
One million (small web) screenshots

nry.me
The Company Quietly Funneling #Paywalled Articles to #AI Developers
#CommonCrawl's website states that it scrapes the internet for "freely available content" without "going behind any '#paywall.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their #LLMs on high-quality journalism for free.
In #2020, #OpenAI used Common Crawl’s archives to train #GPT3.
https://www.msn.com/en-us/money/news/the-company-quietly-funneling-paywalled-articles-to-ai-developers/ar-AA1PMBHE
MSN

Mashable: Common Crawl accused of feeding paywalled content to AI companies. “In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes.”

https://rbfirehose.com/2025/11/09/mashable-common-crawl-accused-of-feeding-paywalled-content-to-ai-companies/

ResearchBuzz: Firehose | Individual posts from ResearchBuzz
Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good https://commoncrawl.org/blog/setting-the-record-straight-common-crawls-commitment-to-transparency-fair-use-and-the-public-good #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad, needing this rebuttal)
Common Crawl defends archive practices amid deletion claims: Nonprofit Common Crawl issued November 4 statement defending data collection methods, citing technical constraints preventing content deletion. https://ppc.land/common-crawl-defends-archive-practices-amid-deletion-claims/ #CommonCrawl #DataArchive #Nonprofit #DigitalPreservation #DataCollection
PPC Land
Common Crawl supplies paywalled content to AI companies despite publisher objections: Nonprofit organization Common Crawl provides major AI companies access to millions of paywalled news articles while claiming compliance with publisher removal requests, investigation reveals. https://ppc.land/common-crawl-supplies-paywalled-content-to-ai-companies-despite-publisher-objections/ #AI #Journalism #DataEthics #Paywall #CommonCrawl
PPC Land
Common Crawl Is Doing the AI Industry’s Dirty Work

“You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” Common Crawl’s executive director says.

The Atlantic

"Robots are people too," says Rich Skrenta, executive director of Common Crawl, a nonprofit that crawls billions of web pages and has allegedly created a backdoor that lets AI models be trained covertly on paywalled articles. Skrenta told The Atlantic, in "The Nonprofit Doing the AI Industry's Dirty Work" of November 4, 2025, that requests to remove such content from the database are "a total nuisance," and he argues that bots should be allowed to read everything for free.

Common Crawl's customers include OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon.

AI steals everything, everywhere... I can't think of anything else to say about it right now, except: 🤮🤮🤮🤮🤮

#kishit #ki #ai #openai #google #meta #amazon #anthropic #nvidia #commoncrawl

Several French #media companies are protesting the unauthorized use of their content by #AI systems.

The focus is on freely accessible datasets such as #CommonCrawl, whose contents are used to train #languagemodels.

The #publishers are demanding the removal of copyrighted content and have announced legal action.

https://www.n-tv.de/ticker/Frankreichs-Medien-protestieren-gegen-die-illegale-Nutzung-von-Inhalten-durch-die-KI-article26002424.html

#Urheberrecht #KünstlicheIntelligenz #Frankreich #Verwertungsrechte

Protected by copyright: France's media protest the illegal use of content by AI

French newspapers and magazines are protesting the illegal use of their content by AI programs.

n-tv NACHRICHTEN