Guess where I am. https://netpreserve.org/ga2024/
🇳🇱 🇨🇭
Now, panel at #WAC "Archiving Social Media In An Age of APIcalypse" (Twitter closed its API).
[Wondering how to archive the fediverse.]
Facebook closed its API long ago, pretending it was because of Cambridge Analytica (but the real reason was commercial), shutting down many research projects on social media.
Several speakers in the panel do not follow the title: they talk about what they did *before*, when API use was possible.
Anat Ben-David, on the contrary, explains what could be done in the future. APIs were not so good, after all (for instance, you are never sure of what they hide).
Jérôme Thièvre (INA) is the first to mention Bluesky (which apparently has a working API but little content).
TikTok has an API, but using it requires that every research paper based on it be pre-approved by TikTok!
In the end, back to traditional Web scraping, and data donations.
(Medialab SciencesPo currently crawls and scrapes #Doctissimo.)
Now, back to identifiers, at #WAC "The Potentials and Challenges for Researchers and Web Archives Using the Persistent Web IDentifier (PWID)" (The speakers even have T-shirts branded "PWID".)
(The English-language Wikipedia page on PWID is not the expected one.)
An example of PWID:
urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk
(from the Internet-Draft)
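To make the structure of that example visible, here is a minimal sketch that splits it into its parts. The layout (archive, UTC timestamp, coverage keyword, archived URI) is read off the single example above, not taken from the full specification, so the regular expression is an assumption; `Pwid` and `parse_pwid` are hypothetical names.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class Pwid(NamedTuple):
    archive: str           # e.g. "archive.org"
    archived_at: datetime  # archival time, UTC
    coverage: str          # e.g. "page"
    target: str            # the archived URI


# Assumed layout: urn:pwid:<archive>:<YYYY-MM-DDThh:mm:ssZ>:<coverage>:<target>
# The timestamp and the target both contain colons, so a plain split(":")
# would not work; the pattern pins down the timestamp shape instead.
_PWID_RE = re.compile(
    r"urn:pwid:(?P<archive>[^:]+):"
    r"(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<coverage>[a-z]+):"
    r"(?P<target>.+)"
)


def parse_pwid(urn: str) -> Pwid:
    m = _PWID_RE.fullmatch(urn)
    if m is None:
        raise ValueError(f"not a PWID in the assumed layout: {urn!r}")
    when = datetime.strptime(m["time"], "%Y-%m-%dT%H:%M:%SZ")
    return Pwid(m["archive"], when.replace(tzinfo=timezone.utc),
                m["coverage"], m["target"])


p = parse_pwid("urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk")
print(p.archive, p.archived_at.isoformat(), p.coverage, p.target)
```

Other timestamp precisions or coverage keywords that the real registration may allow are not handled here.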
The Internet-Draft is expired and there is no plan to revive it. The only specification for #PWID seems to be https://www.iana.org/assignments/urn-formal/pwid
Now, "bit preservation" (preserving bits for the long term, not taking semantics into account).
* several copies (and no SPOF)
* check them
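The two points above can be sketched as a tiny fixity check: hash every copy and flag the ones that disagree. This is only an illustration, not what the speakers presented; `audit_copies` is a hypothetical helper, and trusting the majority digest is an assumption (a real archive would rather compare against a checksum recorded at ingest).

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return (majority digest, sorted names of copies that differ from it)."""
    if not copies:
        raise ValueError("no copies to audit")
    digests = {name: sha256_hex(blob) for name, blob in copies.items()}
    # Assumption: the digest shared by most copies is the good one.
    majority, _ = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(name for name, d in digests.items() if d != majority)
    return majority, damaged


# Three copies at three (hypothetical) sites, one with a flipped character.
copies = {
    "site-a": b"WARC/1.1 payload",
    "site-b": b"WARC/1.1 payload",
    "site-c": b"WARC/1.1 pay1oad",
}
majority, damaged = audit_copies(copies)
print(damaged)  # ['site-c']
```

With only two copies a mismatch tells you that something is wrong but not which copy, which is one more argument for "several".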
"Decentralized Web Archiving and Replay via InterPlanetary Archival Record Object (IPARO)" is an attractive title.
If I understand correctly, the goal is to put the Internet Archive on IPFS (as WARC files).
But IPNS (IPFS naming system) has limits. Hence the new type, the IPARO, a list of IPFS names pointing to the various versions of an archived Web page.
Plus links going directly to some points in the list (for resiliency, since IPFS does not guarantee persistence).
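To convince myself of the resiliency argument, a toy model (not the actual IPARO format, which I only know from the talk): each version is stored under the hash of its content and links to the previous version plus, as an extra direct link, to the first one. The in-memory `store`, the record layout, and the choice of which extra link to keep are all assumptions made for the illustration.

```python
import hashlib
import json

# Stand-in for a content-addressed store such as IPFS: key = hash of value.
store: dict[str, bytes] = {}


def put(record: dict) -> str:
    blob = json.dumps(record, sort_keys=True).encode()
    cid = hashlib.sha256(blob).hexdigest()
    store[cid] = blob
    return cid


def add_version(content: str, history: list[str]) -> str:
    """Store a new version linking to the previous one and to the first one."""
    links = []
    if history:
        links.append(history[-1])          # ordinary "previous version" link
        if history[0] != history[-1]:
            links.append(history[0])       # extra direct link (assumed policy)
    cid = put({"content": content, "links": links})
    history.append(cid)
    return cid


def reachable(start: str) -> set[str]:
    """All versions that can still be fetched when walking links from start."""
    seen, todo = set(), [start]
    while todo:
        cid = todo.pop()
        if cid in seen or cid not in store:  # missing = not persisted
            continue
        seen.add(cid)
        todo.extend(json.loads(store[cid])["links"])
    return seen


history: list[str] = []
v1 = add_version("capture 1", history)
v2 = add_version("capture 2", history)
v3 = add_version("capture 3", history)
del store[v2]                       # the middle version vanished
print(reachable(v3) == {v3, v1})    # True: v1 survives thanks to the direct link
```

With only "previous" links, losing the middle version would have cut off everything older.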
ReproZip-Web, a tool to capture (in the mathematical sense) everything needed to reproduce a very dynamic Web site, with its dependencies. (Something that ordinary crawlers cannot get.)
https://github.com/reprozip-news-apps/reprozip-web
"You need a Linux operating system and it can be difficult to get access at such a system.."
Example done with a Web site depending on PHP 5 :-)
Facebook gave Netflix all of your private messages in exchange for all your watch history. #fediverse rocks.