Can we estimate the age of web pages when only a single crawl is available?

This question occupied Ira Kokoshko and Robert Jäschke, and they presented the Ancient GeoCities dataset at Web Science 2026 in Braunschweig. You can read in the paper how well an LLM performs at estimating the age of web pages:

https://dl.acm.org/doi/10.1145/3795766.3799783

#GeoCities #WebScience #WebArchiving

I was the first person to archive a webpage from Internet Archive Europe on the Internet Archive’s Wayback Machine.

LoL

#InternetArchive #InternetArchiveEurope #WaybackMachine #archive #archiving #WebArchiving #WebPreservation #inception

“People aren’t sure what’s true, and what libraries are here for is to help with that.”

Brewster Kahle, digital librarian of the Internet Archive, discusses the future of the #WaybackMachine on ABC Radio National (🇦🇺 Australia) in “Wayback Machine: The internet’s archive in peril,” a look at how media companies are restricting the preservation of the web itself.

🎧 Listen ⤵️
https://www.abc.net.au/listen/programs/sundayextra/wayback-machine/106604988

#InternetHistory #WebArchiving @abcaustraliarss @brewsterkahle

"Common Crawl mirrors its monthly crawl archive to the Hugging Face Hub as a Storage Bucket. Alongside the raw pages, it now publishes the columnar URL index — one parquet row per crawled page (host, language, MIME type, fetch status, and a pointer to the page's bytes). That makes the whole crawl queryable without touching the petabytes of underlying WARCs."
https://huggingface.co/spaces/davanstrien/common-crawl-april-2026
#webarchiving
The April 2026 Web by the Numbers - a Hugging Face Space by davanstrien

Query 2.19B Common Crawl pages with DuckDB, zero download

NiemanLab: More than 340 local news outlets are limiting the Internet Archive’s access to their journalism. “Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today […]

https://rbfirehose.com/2026/05/21/niemanlab-more-than-340-local-news-outlets-are-limiting-the-internet-archives-access-to-their-journalism/

National Library of Finland: Principles for Finnish Web Archive content selection published. “The National Library of Finland is responsible for the diverse and representative preservation of online material. To make this work more transparent, we produced a document entitled Content selection for the Finnish Web Archive, outlining the principles for content selection in thematic and continuous […]

https://rbfirehose.com/2026/05/13/national-library-of-finland-principles-for-finnish-web-archive-content-selection-published/

The web never stands still 🌐 ... and neither do the challenges of preserving it.

The #DPC is preparing for the return of its Web Archiving Special Interest Group (WA-SIG), bringing DPC Members together in a welcoming and transparent space where Members can exchange ideas, surface challenges, and learn from one another’s approaches.

The renewed WA-SIG gets together on 7 July.

Read more & join us 😊: https://www.dpconline.org/news/dpc-prepares-return-of-web-archiving-special-interest-group

#DigitalPreservation #Coalition #DPC #WebArchiving #Archives

Tom’s Hardware: Internet archival sites struggling to preserve the internet because of skyrocketing hard drive prices due to the AI boom — Wayback Machine and Wikimedia punished by stratospheric storage pricing and stricter anti-scraping measures blocking the wrong bots. “The internet is getting harder to archive because the AI boom has caused a storage crisis, with both NAND and mechanical […]

https://rbfirehose.com/2026/05/09/toms-hardware-internet-archival-sites-struggling-to-preserve-the-internet-because-of-skyrocketing-hard-drive-prices-due-to-the-ai-boom-wayback-machine-and-wikimedia-punished-by-stratosphe/