Mastodawn

Lambert Heller Jun 5, 2025

A thousand thanks to my colleagues @mickylindlar, @tobschalle and @israelh: I was interviewed by the Hannoversche Allgemeine about their #DarkArchive of @arXiv here at @tibhannover. 🔬

The occasion: this evening's Techniksalon, hosted by TIB and @unihannover. Together with @lavaeolus and @fuzzyleapfrog, I will be talking about rescuing research data from state power. Come along or tune in to the livestream and ask questions or comment live. 🤗📺

https://openbiblio.social/@tibhannover/114585512050604490

#hannover

Christian May 16, 2025

🚨 Science is under threat and our Library @tibhannover is stepping up! To safeguard vital research, TIB has built a dark archive of #arXiv, preserving millions of scientific preprints in case the main platform goes down. A great step for resilient, decentralized access to knowledge worldwide!

#library #science #darkarchive #USPol #OpenScience #DigitalPreservation #ResearchIntegrity

https://blog.tib.eu/2025/05/14/protecting-science-tib-builds-dark-archive-for-arxiv/

Protecting Science: TIB builds Dark Archive for arXiv - TIB-Blog

Research and science are international; it is not for nothing that we speak of international specialist communities. Although a service such as arXiv is operated by an institution based in the USA, namely Cornell University, it is used by researchers worldwide. Part of arXiv‘s funding has also been internationalised since 2010 with the introduction of arXiv membership. The TIB finances the German contribution together with the Helmholtz Association of German Research Centres (HGF) and the Max Planck Society (MPG). The TIB has now set up a so-called dark archive for the arXiv content in order to make the backed-up data accessible in the event that the data located in the USA is lost.

TIB-Blog

Esther Tobschall May 14, 2025

Protecting Science: TIB builds Dark Archive for arXiv


Research and science are international, which is why we speak of international scientific communities. A service such as arXiv may be operated by a US-based institution, Cornell University, but it is used by researchers worldwide, as the submission statistics, for example, impressively show. Moreover, since the introduction of arXiv Membership in 2010, the funding of arXiv has been partially internationalised. TIB funds the German contribution together with the Helmholtz Association of German Research Centres (HGF) and the Max Planck Society (MPG).

What is arXiv?

The platform arXiv.org is a freely accessible online archive for scientific preprints, i.e. publications of scientific works that have not yet (fully) been peer-reviewed. The arXiv preprint service is of great importance for providing information to physics, mathematics, computer science, and neighbouring subjects. Via arXiv, researchers can access the latest research results even before their formal publication in a quality-assured scientific journal. Since its founding in 1991 as the first online preprint service, arXiv has served as a model for the development of preprint services in other subjects (cf. Rzayeva et al. 2025, https://doi.org/10.31235/osf.io/xdwc4_v2).

So when the Trump administration makes decisions that have fatal consequences for science and research in the US, the repercussions reach far beyond the Gulf of Mexico: in recent days, reports have been mounting in German media that researchers fear not only the loss of data but also the loss of established information portals such as PubMed.

Research data under threat

Initiatives such as “Safeguarding Research and Culture” are scrambling to save threatened research data and websites for scientific communities and for posterity. Content under threat ranges from the social sciences (e.g. research on LGBTQIA+ topics) and medicine (e.g. vaccines) to the natural sciences (e.g. climate research).

While research linked to political debates is subject to the most blatant and egregious reprisals, in principle all research can be threatened by “cost cutting” and restructuring measures. One example is the planned shutdown of the renowned 120-year-old atomic spectroscopy group at the National Institute of Standards and Technology (NIST).

Decentralised scientific infrastructures

Unfortunately, a further escalation of the already dismal curtailing of academic freedom in the US appears likely. Not least because of the great importance of US institutions in the international academic system, these developments affect research infrastructures worldwide. As “Safeguarding Research and Culture” write in their mission statement, this calls for a change of thinking, among other things towards more decentralised and thus more resilient infrastructures.

For arXiv, a system that could have helped here, at least for a while, was in place until last year. In the early days of the internet, which were also the early days of arXiv, the main server arxiv.org was complemented by a worldwide network of arXiv mirror sites that let users access a copy of the arXiv content geographically closer to them. A legendary example was the Augsburg arXiv mirror, which often impressed with its shorter access and response times.

Over the years, technical progress evened out the performance differences between the local mirrors (among others at the European Organization for Nuclear Research (CERN), at Los Alamos National Laboratory (LANL), in France, and in Japan) and the main server arxiv.org. As a result, more than 90 % of the traffic went via the main server and the mirror sites saw little use. In the view of the arXiv team, the expense of maintaining and updating the mirrors was therefore no longer matched by their “utility and utilization”, as can be read in the arXiv blog entry “Attention arXiv users: arXiv mirrors to shut down September 15th, 2024”.

After the arXiv system had migrated to a completely cloud-centric architecture for its services over the last couple of years, those responsible for arXiv came to the conclusion that

“The arXiv mirror network served a role – acting as a backup for the corpus, allowing some degree of load distribution, and providing improved access for users who were geographically closer to a mirror – that is no longer necessary. arXiv now has multiple backups for the arXiv corpus in place, and the Fastly CDN (Content Delivery Network) that we use to deliver content provides excellent service throughout the world.“

As a European institution, we have always taken a somewhat different view, and recent developments unfortunately appear to confirm our reservations. We have always advocated preserving the mirrors while also looking for alternatives. Some processes turned out to be protracted and difficult, partly because of legal constraints around licensing. (Open Access is not necessarily Open Access if authors have granted the right to make their work available to arXiv alone.) Others we may still be able to pursue.

Why TIB is archiving arXiv data

What we have done over the last few weeks is build a Dark Archive of the arXiv content:

The first step in building a Dark Archive is, of course, rights clearance. TIB had already commissioned a legal opinion back in 2016, in the context of a possible cooperation with arXiv.org. It included a study of the licences used by arXiv, which broadly fall into the categories “arXiv.org licence”, “Creative Commons”, and “Public Domain”.

While nothing stands in the way of archiving the data and metadata as such, the rights situation would have to be examined in detail before the data could be made accessible through a public-facing service. This is especially relevant for resources under arXiv licences, since this licence type has gone through several versions over the years. Between 1991 and 2003, users could even upload objects without explicitly stating a licence.

But before a user service can even be set up, the data need to be ingested into the TIB infrastructure. arXiv itself offers several methods for obtaining full texts. Since both PDF and (La)TeX sources are to be part of the TIB Dark Archive, we opted for the download via Amazon S3. arXiv offers this as a “Requester Pays buckets” method, meaning that TIB as the fetching entity covers the costs charged by Amazon Web Services (AWS); see https://info.arxiv.org/help/bulk_data_s3.html. For 2,686,172 fetched datasets with a data volume of just under 10 terabytes, the S3 transfer came to about 900 euros.
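To illustrate what a Requester Pays download involves, here is a minimal sketch in Python. It is not TIB's actual ingest code. It assumes a client object shaped like boto3's S3 client, whose `get_object` call accepts `RequestPayer="requester"`; the function name and the bucket and key values are hypothetical placeholders.

```python
from typing import Any, Protocol


class S3Client(Protocol):
    """Minimal shape of the client this sketch needs (boto3's S3 client fits)."""

    def get_object(self, **kwargs: Any) -> dict: ...


def fetch_requester_pays(client: S3Client, bucket: str, key: str) -> bytes:
    """Download one object from a Requester Pays bucket and return its bytes.

    Hypothetical helper: passing RequestPayer="requester" tells S3 that the
    caller, not the bucket owner, accepts the request and transfer charges.
    """
    response = client.get_object(
        Bucket=bucket,
        Key=key,
        RequestPayer="requester",
    )
    # The response body is a stream; read it fully into memory for this sketch.
    return response["Body"].read()
```

With boto3 installed and AWS credentials configured, `client` could be `boto3.client("s3")`. For real bulk transfers one would stream objects to disk rather than hold them in memory.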


Because arXiv metadata have long been a data source for the TIB portal, there was no need to establish a new workflow. In the longer term, this also makes it easier to provide access to the datasets via the TIB portal. One possibility is to supply the arXiv datasets in the TIB portal with a second download link in the background. If the first download link, which points to the arXiv source, is no longer reachable, the second link comes into play and points to the copy now held at TIB. Users of the TIB portal could thus seamlessly access arXiv records even during an outage of the main platform at Cornell. As mentioned earlier, however, this access is contingent on the specific licences.

Moreover, now that the first complete transfer of the arXiv holdings is done, a process needs to be implemented that fetches new arXiv records, as well as versioning information for existing records, at regular intervals.
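The two-link idea can be sketched as a small decision function. This is an illustration under stated assumptions, not the TIB portal's implementation: the function name, the `is_reachable` callback, and the boolean licence flag are all hypothetical.

```python
from typing import Callable, Optional


def resolve_download_link(
    primary_url: str,
    fallback_url: str,
    is_reachable: Callable[[str], bool],
    licence_allows_access: bool,
) -> Optional[str]:
    """Pick the link to offer for one record.

    Prefer the arXiv source; use the archived copy only if the source is
    unreachable AND the record's licence permits serving the copy.
    """
    if is_reachable(primary_url):
        return primary_url
    if licence_allows_access:
        return fallback_url
    # Source is down and the licence does not permit access to the copy.
    return None
```

Passing the reachability check in as a callback keeps the decision logic separate from any network code, so it can be tested without an actual outage.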
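One way to picture such a recurring process is as a comparison between what the archive already holds and what the source currently lists. The sketch below assumes, purely for illustration, that both sides can be summarised as a mapping from record identifier to latest version number; the function name and data shapes are hypothetical and say nothing about how TIB will actually implement the harvest.

```python
def plan_incremental_fetch(
    local: dict[str, int], remote: dict[str, int]
) -> list[tuple[str, int]]:
    """Return the (record_id, version) pairs that are missing locally.

    `local` and `remote` map a record identifier to the highest version
    number held in the archive and listed at the source, respectively.
    """
    to_fetch: list[tuple[str, int]] = []
    for record_id, latest in sorted(remote.items()):
        have = local.get(record_id, 0)  # 0 means the record is entirely new
        # Queue every version between what we hold and the latest one.
        for version in range(have + 1, latest + 1):
            to_fetch.append((record_id, version))
    return to_fetch
```

For example, with `local = {"a": 1, "b": 2}` and `remote = {"a": 3, "b": 2, "c": 1}`, the plan contains versions 2 and 3 of `a` and version 1 of the new record `c`, while `b` is already up to date.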

“Building a Dark Archive is an expression of our longstanding commitment for a reliable, international academic provision, and as a partner of arXiv. Even though the Dark Archive today only works in the background, it is a key element in safeguarding digital research contents in the long term, because in case of a crisis, we could open the archive.”

Dr Irina Sens, Deputy Director of TIB

Dark Archive: Data stored, but not openly accessible

The data are stored, but if push came to shove, further steps would be needed to make them publicly available, because a database service is much more than a mere backup copy of the data. Operating a production user-facing service requires not only technical resources but, first and foremost, a committed team that takes care of aspects such as quality assurance, content curation, and (technical) development in the background.

In the case of arXiv, there is more than the accessibility of the papers, the search functionality, the upload services for authors, and further technical services. Rather, it is the integration within the scientific communities that is the heart of arXiv: numerous researchers volunteer to take on roles on various boards, in content moderation, or as volunteer developers!

#arXiv #DarkArchive #data #LizenzCCBY40INT

2023 institution submissions - arXiv info

Esther Tobschall May 13, 2025

Protecting science: TIB builds a Dark Archive for arXiv


Research and science are international; it is not for nothing that we speak of international specialist communities. A service such as arXiv is operated by an institution based in the USA, Cornell University, but it is used by researchers worldwide, as the submission statistics, for example, impressively show. Part of arXiv's funding has also been internationalised since 2010 with the introduction of arXiv Membership. TIB finances the German contribution together with the Helmholtz Association of German Research Centres (HGF) and the Max Planck Society (MPG).

What is arXiv?

The arXiv.org platform is a freely accessible online archive for scientific preprints, that is, advance publications of scientific papers that have not yet been peer-reviewed, or not yet conclusively. The preprint server arXiv is of great importance for information provision in physics, mathematics, computer science, and neighbouring subjects. Through arXiv, researchers can access the latest research results even before their formal publication in a quality-assured journal. Since its founding in 1991 as the first online preprint service, arXiv has served as a model for the development of preprint services in other subjects (cf. Rzayeva et al. 2025, https://doi.org/10.31235/osf.io/xdwc4_v2).

So when the Trump administration takes decisions that have fatal consequences for science and research in the USA, this also has consequences far beyond the Gulf of Mexico: in recent days, reports have been accumulating in the German media that testify to researchers' fear not only of data loss but also of the loss of established information portals such as PubMed.

Securing threatened research data

Initiatives such as “Safeguarding Research and Culture” are working to secure threatened research data and websites for the research communities and for posterity. The threatened content ranges from the social sciences (for example research on LGBTQIA+ topics) and medicine (for example vaccines) to the natural sciences (for example climate research). While research connected to political debates is exposed to the most obvious and harshest reprisals, in principle any research can be threatened by “savings” and restructuring measures. This can be seen, for example, in the planned closure of the long-established atomic spectroscopy group at the National Institute of Standards and Technology (NIST).

Decentralised infrastructures in science

Unfortunately, a further escalation of the already oppressive restrictions on academic freedom in the USA is to be expected. Not least because of the great importance of US institutions in the international science system, these developments affect research infrastructures worldwide. As “Safeguarding Research and Culture” write in their mission statement, this calls for a rethink, among other things towards decentralised and therefore more resilient infrastructures.

Until last year, arXiv had a system that could have helped here, at least for a time. In the early days of the internet, which were also the early days of arXiv, there was, alongside the main server arXiv.org, a worldwide network of mirrors, the arXiv mirror sites, which made it possible to access a geographically closer copy of the arXiv content. A legendary example was the Augsburg arXiv mirror de.arXiv.org, which often impressed with shorter access and response times.

Over the years, and with technical progress, differences in performance between the local mirrors (among others at the European Organization for Nuclear Research (CERN), at Los Alamos National Laboratory (LANL), in France, and in Japan) and the main server arXiv.org could no longer be detected, so that more than ninety percent of the traffic ran via the main server and the mirrors saw little use. In the view of the arXiv team, the effort of maintaining and updating the mirrors was therefore no longer in reasonable proportion to their use, as can be read in the arXiv blog under “Attention arXiv users: arXiv mirrors to shut down September 15th, 2024”.

After the arXiv system had moved to a fully cloud-centric architecture for its services in recent years, those responsible for arXiv came to the conclusion that

„The arXiv mirror network served a role – acting as a backup for the corpus, allowing some degree of load distribution, and providing improved access for users who were geographically closer to a mirror – that is no longer necessary. arXiv now has multiple backups for the arXiv corpus in place, and the Fastly CDN (Content Delivery Network) that we use to deliver content provides excellent service throughout the world.“

As an institution based in Europe, we have always seen this somewhat differently, and current developments unfortunately seem to confirm our reservations. We have always advocated keeping the mirrors and have looked for alternatives. Some processes unfortunately proved to be protracted and difficult, partly because of the licensing framework. (Open Access is not necessarily Open Access if the authors have granted the right to make their work available to arXiv alone.) Others may still be pursued further.

Why TIB is archiving arXiv data

What we have done in recent weeks, however, is build a Dark Archive of the arXiv content:

The first step in building a Dark Archive must, of course, be rights clearance. TIB had already commissioned a legal opinion in 2016 in the context of a possible cooperation with arXiv.org. It also examined the licences used by arXiv, which can be roughly divided into the categories “arXiv.org licence”, “Creative Commons”, and “Public Domain”. While nothing stands in the way of archiving the data and metadata as such, this rights situation must be examined more closely before the data are made accessible as part of a service. This applies in particular to objects carrying arXiv licences, since this licence type has gone through several versions over the years. Between 1991 and 2003, users even uploaded objects without an explicit licence.

But before a user service can be set up at all, the data must first be loaded into the TIB infrastructure. arXiv itself provides several methods for obtaining the full texts. Since both PDF and (La)TeX sources are to be part of the TIB Dark Archive, we decided on the download via Amazon S3. arXiv offers this option as a “Requester Pays buckets” method, which means that TIB, as the fetching entity, bears the costs incurred at Amazon Web Services (AWS). For the 2,685,172 datasets fetched, with a data volume of just under 10 terabytes, the S3 transfer cost about 900 euros.
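As a rough plausibility check on these figures, the following Python sketch derives the unit costs. The inputs are the rounded values quoted in this paragraph; since "just under 10 terabytes" and "about 900 euros" are approximations, the results are order-of-magnitude estimates only, and the function name is hypothetical.

```python
def transfer_unit_costs(
    records: int, volume_tb: float, cost_eur: float
) -> dict[str, float]:
    """Derive rough per-unit figures from the totals quoted in the text."""
    return {
        # Euros per terabyte transferred.
        "eur_per_tb": cost_eur / volume_tb,
        # Euros per thousand records fetched.
        "eur_per_1000_records": cost_eur / records * 1000,
        # Average record size in megabytes (decimal units: 1 TB = 1,000,000 MB).
        "avg_record_mb": volume_tb * 1_000_000 / records,
    }


# Rounded inputs taken from the paragraph above; 10.0 TB is an upper bound.
estimates = transfer_unit_costs(records=2_685_172, volume_tb=10.0, cost_eur=900.0)
```

With these inputs the transfer works out to roughly 90 euros per terabyte, about a third of a euro per thousand records, and an average record size of a little under 4 MB.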

Since the metadata from arXiv have long been an established data source of the TIB portal, no new workflow had to be set up for them. In the longer term, this also makes it easier to provide access to the datasets via the TIB portal. One possibility is, for example, to store a second download link in the background of the arXiv datasets in the TIB portal. If the first download link, which points to the arXiv source, is no longer reachable, the second link takes over and points to the copy now held at TIB. Users of the TIB portal thus have seamless access to the arXiv records even if the actual platform at Cornell fails. As mentioned above, however, this access depends on the respective licences.

Likewise, now that the first complete retrieval of the arXiv holdings is done, a process must be set up that regularly fetches newly added arXiv datasets as well as versioning information for existing datasets.

“Building a Dark Archive is an expression of our long-standing commitment to reliable, international provision of scientific information and of our role as a partner of arXiv. Even though the Dark Archive only works in the background today, it is a key building block for safeguarding digital research content in the long term, because in a crisis we can open the archive.”

Dr Irina Sens, Deputy Director of TIB

Dark Archive: data stored, but not publicly accessible

So the data are there, but should the worst happen, some further steps would be needed to make them public. After all, a database service is much more than a mere backup copy of a dataset: operating it in the interest of researchers requires not only technical resources but above all a committed team that takes care of the many aspects in the background, such as quality assurance and further development of content and technology.

In the case of arXiv, there is more than the accessibility of the articles, the search function, the upload services for authors, and further technical services. Rather, it is the anchoring in the scientific community that is the heart of arXiv: a large number of researchers who are active on various boards, in moderating content, or as volunteer developers! Re-rooting this entire “ecosystem” of a service would be a far greater task than making a backup copy of the data accessible under a new URL. It is therefore just as important to raise public awareness of academic freedom as it is, within science, to appreciate the importance of services such as arXiv and to make them as resilient and sustainable as possible.

#arXiv #DarkArchive #DigitaleLangzeitarchivierung #LizenzCCBY40INT #OpenAccess