Mastodawn

Seen on HN: https://kage.tamnd.com/
kage renders every page in headless Chrome, snapshots the final DOM, removes every script and event handler, and downloads and rewrites the CSS, images, and fonts.

It saves in ZIM format; in the comments, the author says it will support WARC too: https://news.ycombinator.com/item?id=48529990

#webarchiving

kage

kage (影, shadow) clones any website into a self-contained folder you can browse offline, with all the JavaScript stripped out. Render in headless Chrome, remove every script, localise the CSS, images, and fonts, from one pure-Go binary.
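
To make the "remove every script and event handler" step concrete, here is a minimal sketch in Python using only the standard library. It is not kage's code (kage is a Go binary working on a DOM rendered by Chrome); the `ScriptStripper` class and `strip_scripts` function are hypothetical names, and treating every attribute that starts with `on` as an event handler is a simplifying assumption. It does not touch `javascript:` URLs or rewrite assets.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and on* attributes."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a <script> element

    def _render(self, tag, attrs, self_closing=False):
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):  # assumption: on* means event handler
                continue
            if value is None:  # boolean attribute such as "disabled"
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + (" />" if self_closing else ">")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.skip_depth:
            self.out.append(self._render(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if tag == "script":
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:  # script bodies arrive here and are dropped
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html_text):
    """Return html_text without <script> elements and on* attributes."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<div onclick="x()" class="a"><script>alert(1)</script><p>Hi</p></div>'
    print(strip_scripts(page))
```

A real archiver would apply this kind of filter to the DOM snapshot taken after rendering, which is why content injected by JavaScript survives even though the scripts themselves are gone.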

raffaele 1d ago

Farewell Digilander: on June 9, a piece of the digital history of the first Italian internet users goes dark
The historic Libero Community platform, including Digilander, is preparing for its permanent shutdown. Users will need to save their content before deactivation.
https://www.libero.it/tecnologia/addio-digilander-libero-community-chiude-salvare-contenuti-116560 #webarchiving

Libero shuts down Digilander: nostalgia for an entire generation

The historic Libero Community is preparing for its permanent shutdown. From blogs to Digilander, here is how to save data and content before June 9.

Self-Hosted Feed 5d ago

🗃️ wabarc/wayback

Archives web pages via CLI or chat bots (Telegram, Discord, Matrix, etc.) to Internet Archive, IPFS, archive.today and more, with Tor support and offline storage options

⭐ Stars: 2198
📅 Last Update: Jun 08, 2026

https://github.com/wabarc/wayback

#selfhosted #homelab #selfhost #selfhosting #opensource #webarchiving #cli

GitHub - wabarc/wayback: An archiving tool with an IM-style interface that prioritizes privacy and accessibility, integrated with various archival services including Internet Archive, archive.today, Ghostarchive, IPFS, Telegraph, and file systems.

angelo Jun 6

I just wanted to share a (not so late) night rant with you.
In three days, the Italian web portal Libero.it is going to shut down thousands of early blogs that were originally created through the platform ItaliaOnline and later rebranded as Digilander.
I think this is a paradigmatic case of what we are going to experience more and more often in the near future. 1/

#digitaloblivion #webarchiving #Digilander #Italianwebhistory
#lostinternet #earlyblogs

IBI HU Berlin Jun 3

Can we estimate the age of web pages when only a single crawl is available?

This question occupied Ira Kokoshko and Robert Jäschke, and they presented the Ancient GeoCities dataset on it at Web Science 2026 in Braunschweig. You can read in the paper how well an LLM performs at estimating the age of web pages:

https://dl.acm.org/doi/10.1145/3795766.3799783

#GeoCities #WebScience #WebArchiving

Tommi 🤯 May 28

I was the first person to archive a webpage from Internet Archive Europe on the Internet Archive’s Wayback Machine.

LoL

#InternetArchive #InternetArchiveEurope #WaybackMachine #archive #archiving #WebArchiving #WebPreservation #inception

internetarchive May 27

“People aren’t sure what’s true, and what libraries are here for is to help with that.”

Brewster Kahle, digital librarian of the Internet Archive, discusses the future of the #WaybackMachine in ABC Radio National (🇦🇺 Australia)’s “Wayback Machine: The internet’s archive in peril,” a look at how media companies are restricting the preservation of the web itself.

🎧 Listen ⤵️
https://www.abc.net.au/listen/programs/sundayextra/wayback-machine/106604988

#InternetHistory #WebArchiving @abcaustraliarss @brewsterkahle

raffaele May 27

"Common Crawl mirrors its monthly crawl archive to the Hugging Face Hub as a Storage Bucket. Alongside the raw pages, it now publishes the columnar URL index — one parquet row per crawled page (host, language, MIME type, fetch status, and a pointer to the page's bytes). That makes the whole crawl queryable without touching the petabytes of underlying WARCs."
https://huggingface.co/spaces/davanstrien/common-crawl-april-2026
#webarchiving

The April 2026 Web by the Numbers - a Hugging Face Space by davanstrien

Query 2.19B Common Crawl pages with DuckDB, zero download
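
The "pointer to the page's bytes" idea can be sketched as follows: once an index row tells you which WARC file a page lives in and where, a single HTTP range request is enough to fetch that one record. This is an illustration only, assuming the pointer consists of a WARC file path, a byte offset, and a record length; the `WarcPointer` and `range_request` names are hypothetical, and the default base URL is an assumption that may not match the bucket described in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarcPointer:
    """Location of one archived page inside a WARC file (assumed layout)."""
    filename: str  # path of the WARC file relative to the archive root
    offset: int    # byte offset where the record starts
    length: int    # size of the record in bytes


def range_request(pointer, base_url="https://data.commoncrawl.org/"):
    """Build the URL and headers that would fetch exactly one record.

    Nothing is sent over the network; the caller passes the result to
    whatever HTTP client it prefers.
    """
    if pointer.offset < 0 or pointer.length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    last_byte = pointer.offset + pointer.length - 1
    url = base_url.rstrip("/") + "/" + pointer.filename.lstrip("/")
    headers = {"Range": f"bytes={pointer.offset}-{last_byte}"}
    return url, headers


if __name__ == "__main__":
    # Made-up pointer values, for illustration only.
    url, headers = range_request(WarcPointer("crawl-data/example.warc.gz", 100, 50))
    print(url, headers)
```

Filtering on columns such as host, language, or MIME type happens entirely in the index, so only the handful of records you actually want ever need to be downloaded.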

ResearchBuzz: Firehose May 21

NiemanLab: More than 340 local news outlets are limiting the Internet Archive’s access to their journalism. “Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today […]

https://rbfirehose.com/2026/05/21/niemanlab-more-than-340-local-news-outlets-are-limiting-the-internet-archives-access-to-their-journalism/

raffaele May 19

RE: https://fedihum.org/@aiucd/116600978409273225

#webarchiving