I'd like to put my lab servers to work archiving US federal data that's likely to get pulled - climate and biomed data seem most likely. The most obvious strategy to me seems like setting up mirror torrents on academictorrents. Anyone compiling a list of at-risk data yet?

edit (2025-09-21): this became https://sciop.net and it's still going, in case anyone in this thread missed it

edit (2025-09-21 pt 2): do note the date on the original post: it's nearly a year old, and we have been rolling on sciop since February or so

#sciop

SciOp - Public Information Preservation

Preserving Public Information

Tons of good leads, thanks everyone, i'll start compiling a wiki page and seeing what's already in progress. More organized archival efforts will certainly lead but I want to see if there's a way to put all the random TBs sitting around in play
@jonny no, but please include me in that list when we do so I can help. I'll happily rebuild my home server for it.
@jonny (I wrote “no” but actually meant “I don’t know”)
@jonny I did see another person putting together a list here on Mastodon yesterday, so I know there's like-minded folks running around. I didn't bookmark it, tho...
@jonny that sounds like a fascinating project!

@jonny saw some post that may be relevant: https://kolektiva.social/@504DR/113443811224416535 which references via screenshot this one: https://med-mastodon.com/@mloxton/113442545703055556

I also think I saw some post talking about something that my memory says is connected to your question, though I'm not sure if it was that post or some other - I bookmarked your post and hope I'll remember and find it when I find anything else

504 Battery Dr (@504DR@kolektiva.social)

Attached: 1 image Start saving federally-funded data now.

kolektiva.social
Eira Tansey (@[email protected])

guys I'm not trying to harsh anyone's vibes but a lot of people tried the whole "downloading every bit of climate data we could find from a .gov website" this time 8 years ago and most of those projects didn't actually work out very well. For example, the original DataRescue project basically petered out https://www.datarefuge.org/ and EDGI had to narrow its scope https://envirodatagov.org/edgis-data-program-examines-the-role-of-environmental-data-governance-through-the-lenses-of-power-justice-and-equity/

glammr.us Mastodon
@jonny I have a 10TB hard disk in my desktop PC that's mostly unused so I could set up a background task to pull quite a lot down. But it would help if someone who knows more about the datasets and APIs etc had documentation on it.
@jonny I'm looking at the National Center for Environmental Information as one example and it looks like there are caps on the amount a single user can download as part of an 'order'. So coordination may be very important.
@jonny I am probably being overly paranoid about the capabilities of the new administration but I also wonder if Americans in particular need to consider anonymising the download - e.g. using a pseudonym and possibly a VPN.
@jonny And if so a VPN provider and server from outside the US.

@alastair @jonny
The next administration isn’t in office yet.

Rather than trying to secretly squirrel away with their data, why not ask somebody at the agency if there’s a way you can do it above board? They’d know how, be able to provide permission and/or access, and might even know what data is at risk. Or perhaps they’ll point out that other friendly countries already run public mirrors of the data for their teams, and won’t delete the data.

@alastair @jonny Not a guarantee, but could save a lot of time
@ClickyMcTicker @jonny The processes are all above board and documented (though I suggested thinking about anonymity). You can register for access as I did to get an API key that works for many agencies - you don't even have to be an American. Staff may be helpful though.
@ClickyMcTicker @jonny To clarify regarding the anonymity, it's not because the new administration is in office now - it's because the record of users will persist when they are.
@jonny I'd suggest data.gov, any climate related data (such as the datasets hosted by the NOAA), CDC data, FDA nutrition data or otherwise. Off the top of my head, anyway.
@4raylee @jonny NASA also has a lot of climate datasets

@jonny A great deal of federal data of all kinds is hosted on ArcGIS servers.

1. I curate a list of ArcGIS server addresses. There are weekly updates. The federal servers are sorted by department. The list is a PDF at
https://mappingsupport.com/p/surf_gis/list-federal-state-county-city-GIS-servers.pdf

2. A lot of that data can be downloaded as a KMZ file by entering an ArcGIS "query" command into a browser. I wrote step-by-step instructions:
https://mappingsupport.com/p2/atak/pdf/atak_kml_link_arcgis.pdf
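
For anyone who wants to script the "query" step rather than paste it into a browser, here is a minimal sketch of how such a URL might be built. The server address is a made-up placeholder, and the parameters (`where`, `outFields`, `f=kmz`) are my assumption based on the standard ArcGIS REST layer query, not taken from the linked PDF. Whether a particular server actually returns KMZ depends on how that service is configured, and many services cap the number of records returned per request, so treat this as a starting point only.

```python
from urllib.parse import urlencode


def build_kmz_query_url(layer_url: str, where: str = "1=1", out_fields: str = "*") -> str:
    """Build an ArcGIS REST 'query' URL that asks a layer for KMZ output.

    layer_url is the address of one layer, e.g. .../MapServer/0.
    where="1=1" selects every feature; out_fields="*" requests all attributes.
    """
    params = {
        "where": where,
        "outFields": out_fields,
        "f": "kmz",  # requested output format (assumed to be enabled on the server)
    }
    return f"{layer_url.rstrip('/')}/query?{urlencode(params)}"


# Hypothetical layer address, for illustration only.
example_layer = "https://gis.example.gov/arcgis/rest/services/Example/MapServer/0"
print(build_kmz_query_url(example_layer))
```

The printed URL can be opened in a browser or handed to any download tool. Real layer addresses would come from a server list such as the PDF in item 1.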

@jonny Not sure if anyone is already archiving CDC wastewater and other COVID information (and I guess now bird flu), but redundancy wouldn't be a bad thing. I expect basically all of that to be wiped.

I saw https://eotarchive.org/ - but not sure how much COVID info they're planning to save.

End of Term Web Archive

The End of Term Web Archive is a collaborative initiative that collects, preserves, and makes accessible United States Government websites at the end of presidential administrations.

@eladnarra @jonny Careful you guys don't slashdot the whole government.
@jonny The Obama administration set up all kinds of servers for transferring data. I suspect those capabilities are being utilized now.
@jonny Why were people waiting until the last minute to do this? Unbelievable, as it happened during the last Trump administration. With him running again, one would think it deserved an earlier start than this.
@steter Feel free to go back in time to address it

@jonny Have you already seen this post from @researchfairy ? https://scholar.social/@researchfairy/113443840640821794

(Also, I think the two of you would get along swimmingly in many ways, if you somehow haven't already crossed paths on here.)

The research fairy (@researchfairy@scholar.social)

Content warning: US politics adjacent; Clinical trials research

Scholar Social

@jonny
Can you set up a fund to buy physical storage media? Terabytes of SSDs, maybe? Sites like the Internet Archive (if they are willing to help) are great but if they ever get shut down, then so does your archived data.

Archiving your data in multiple, geographically separate places is important, too. One backup is effectively the same as zero backups.

@jonny I'm interested in this too. I agree about climate data, and probably data pertaining to underprivileged communities, are the most at risk. I don't know where to even look for such data.

If you use Lemmy and/or Reddit, I'd recommend asking in the /c/datahoarder and /r/DataHoarder communities, respectively. I'd also recommend using the #archiving and #datahoarding hashtags in the Fediverse. In the meantime, I'll boost this and see if anyone else knows.

@hyperreal @jonny

As to where to look for climate data, a few are:
NOAA (tons of valuable climate/weather data there)
EPA
Globalchange.gov

There are a few places to look for civil rights stuff- I recommend prioritizing getting civil rights data off the Department of Education's site just because the Rs have been chomping at the bit to dissolve it for years and that website has a good chance of vanishing. Department of Justice would be another place I'd look.

@jonny Yeah, we should start soon. Is there a venue for folks to chat / discord?
@jonny Following because I’m interested in helping.

@jonny CDC data on not just COVID but the flu. i saw anomalies on the data being reported in 2018 and had no idea who to report to.

in scraping, look for what looks like dead links. Trump wanted them to misreport flu deaths. flu deaths had been out of control in 2017-2018 and he was fighting with the CDC well into 2019 about the flu numbers.

then the pandemic happened.

try to scrape for what is not readily visible from 2017 onward.

@jonny Careful you don't get Aaron Swartzed.
@jonny
I'd expect at the very least FTC, CDC, NIH, DoH

@jonny

Some international #OpenAccess science data repositories initiatives should take the lead to facilitate global mirrors...

https://en.wikipedia.org/wiki/Scholarly_Publishing_and_Academic_Resources_Coalition

https://neuromatch.social/@INCF/113435735755117885

Scholarly Publishing and Academic Resources Coalition - Wikipedia

@jonny

totally agree w your points. good luck, you prolly know that omfg NCBI has so much data.
i bet NOAA/NWS are not much different
plus the climate dot gov data.

so much information will be lost... it's insane

https://www.ncbi.nlm.nih.gov/guide/data-software/

at least copernicus and ensembl are european.

Data & Software - Site Guide - NCBI

@jonny i have a few free TB and my uTorrent is always on. Count me in!
@jonny If people start gathering a list, make sure to share please. I'm starting my own list asap
@jonny last time around (2017) https://github.com/datarefuge was coordinating a lot of archiving.
Realistically, the Climate data was pulled first, shortly followed by everything on NSF/NIH related to health, drug trials, public health, gun violence, crime statistics and anything that could be used to fact-check the narrative of the elected candidate.
Data Refuge

Repositories that will be useful for, or were made as part of the Data Rescue Philly - Data Refuge

GitHub
@jonny I have 16 TB to burn and a fat pipe. Happy to host a European mirror.
@jonny It is probably worth reaching out to archiveteam: https://wiki.archiveteam.org/
Archiveteam

@jonny I'd also be happy to help out! Not sure if people are coordinating this effort somewhere, but would love to join.

@jonny Not having that much spare space in the lab... but I got a couple 10s of TB on my "hobby project"; Any recommendations or coordinated projects atm, one can contribute to?

Grab domain, setup site (to list decentral mirrors), mirror data?

@jonny There's again an End of Term Presidential Harvest and it's accepting nominations.
https://digital2.library.unt.edu/nomination/eth2024/about/

Some of the past datasets seem amenable to torrenting.
https://eotarchive.org/data/

#digipres

Nomination Tool: About Project

@jonny why not use IPFS? could be more resilient

@jonny

Saw this late: I'm a data librarian and I know a lot about it and also how much I don't know. You probably want to contact EDGI.

For US climate large source data, primary things are:

1) Inventory of GHG Emissions and Sinks: required by Paris Agreement which Trump plans to leave

https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks

1/3

Inventory of U.S. Greenhouse Gas Emissions and Sinks | US EPA

The national greenhouse gas inventory is developed each year to track trends in U.S. emissions and removals. Find emissions by source, economic sector and greenhouse gas.

US EPA

@jonny

2) GHGRP (GHG Reporting Program): legislatively mandated by rider to appropriations, could be de-legislated the same way

https://www.epa.gov/ghgreporting

Greenhouse Gas Reporting Program (GHGRP) | US EPA

Site provides information on EPA's GHG reported data starting with RY 2010 to the present. The site also provides information on regulatory requirements, applicability, how to register a facility and report data, and how to access the GHG data.

US EPA

@jonny

3) various agency data collections focussed on electricity gen sources. Who knows what mandates them, prob vulnerable to Chevron deference

a) eGRID

https://www.epa.gov/egrid

b) DOE data

https://www.eia.gov/

That's just source data, not NOAA, satellites, monitoring, models, etc.

Emissions & Generation Resource Integrated Database (eGRID) | US EPA

Data about the electric power generated in the United States. The data includes air emissions for nitrogen oxides, sulfur dioxide, carbon dioxide, methane, and nitrous oxide; emissions rates; net generation; resource mix; and many other attributes.

US EPA
@jonny @kaoudis boosting this since I’m also interested. Think I have a couple TB to spare on the NAS.
@jonny nuclear data from Oak Ridge. We lost access during the government shutdown. I foresee efficiency mandates to make it user pays. #archiving #datahoarding
@jonny You may want to chat with @Lydie
Edit: just saw this is nearly a year old 🤪
@jonny
That's brilliant, thank you.
@jonny, full year helps. 25/09/21 looks too much like 25/09/2021…
@lp0_on_fire
Updated, though there is only one true ordering of date parts ;)