We've been collecting and mirroring whatever public scrapes we can find of data that has recently gone missing from federal sites or is likely to in the near future. The repos here include public data from CDC, NIH, and NOAA. Be warned that some of these repos are quite large!

https://git.lsit.ucsb.edu/publicdata

#datascience #cdc #nih #noaa

publicdata

Archives of Public Data Sets

Git for LSIT at UCSB
Stay tuned. There's more on the way!
Now including data from #DeptEd as well. Added more CDC data. More on the way!

https://git.lsit.ucsb.edu/publicdata/DeptEd
DeptEd

Public Data from Department of Education and NCES.


Here's the additional #CDC data that wasn't included in the original dump, with some of these reports going back decades, as well as reports on LGBTQ and HIV/AIDS.

https://git.lsit.ucsb.edu/publicdata/CDC-Data-2025/src/branch/main/other_reports

CDC-Data-2025

US Centers for Disease Control and Prevention data archive from January 2025.

climate-gov-data

Archived data from https://www.climate.gov


Added more DeptEd and CDC data today and added globalchange.gov data.

https://git.lsit.ucsb.edu/publicdata/globalchange-gov

globalchange-gov

globalchange.gov archives

I do appreciate those who have pointed out data that I've missed or otherwise haven't archived yet. Please do let me know if you see such things. Unfortunately, some data has use restrictions and I'm only hosting public data here. If it's not public domain or clearly marked Creative Commons, etc., then I can't host it here.

@vwbusguy
Are you on bluesky? This request came across my feed:

Data Rescue 2025 @datarescue2025.bsky.social · 1h
If any of you are on mastodon, please help us connect with people who might be doing similar work. we don’t have the capacity to have both accounts at this time. We could use some data rescue ambassadors.

@d_himes Nope. I'm mainly on fediverse these days. But you're welcome to pass on the link!
@d_himes Do they have an IRC, Matrix, or Signal group?
Data Rescue Efforts

Data Rescue Project Updated 2025-02-20 These suggestions come from various sources, including IASSIST, RDAP, Data Curation Network, BlueSky, LinkedIn, and others. You are free to take information from this page, but if you plan to copy it please credit the work as Data Rescue Project and point p...

Google Docs
@d_himes As it so happens, I'm familiar with that particular document and you can see a link already there (which I added myself earlier this week), so if it's the group I think it is, I'm already in communication. 🙂
@vwbusguy is this affiliated with archive team at all?

@aburka I'm not sure what "archive team" means, but we're not the only higher ed folks doing this. We're all working to survive this!

University of Washington, for example:

https://github.com/UW-CALMA/datarescue

GitHub - UW-CALMA/datarescue

Contribute to UW-CALMA/datarescue development by creating an account on GitHub.

GitHub

@vwbusguy https://wiki.archiveteam.org/

I've been running one of their containers this week grinding on the US government project

Archiveteam

@aburka That's awesome! It's been a mix of things here. Stuff that's already in a git repo somewhere is cake to mirror because we are already using @forgejo, so no worries about Microsoft (GitHub) taking ours down. The CDC dump is from the Internet Archive. Other stuff was fetched with various bespoke tools I cobbled together yesterday and today (JavaScript console in Firefox, wget -r, etc.).
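
For anyone who wants to try the two approaches mentioned in the post above (mirroring an existing git repo and a recursive `wget` fetch), here is a minimal sketch. It is not the exact tooling used for these archives. The helper names, the destination paths, the one-second wait, and the choice of `--no-parent` are all assumptions for illustration, and whether the Forgejo URL clones as written is also an assumption. By default it only prints the commands rather than running them.

```python
import shlex
import subprocess
from pathlib import Path


def git_mirror_command(repo_url: str, dest: Path) -> list[str]:
    """Build a `git clone --mirror` command (bare copy with all refs)."""
    return ["git", "clone", "--mirror", repo_url, str(dest)]


def wget_mirror_command(site_url: str, dest: Path, wait_seconds: int = 1) -> list[str]:
    """Build a recursive wget command that stays below the starting path."""
    return [
        "wget",
        "--recursive",                 # long form of the -r flag
        "--no-parent",                 # assumption: do not climb above the start URL
        f"--wait={wait_seconds}",      # assumption: pause between requests to be polite
        f"--directory-prefix={dest}",  # hypothetical local destination
        site_url,
    ]


def run(command: list[str], dry_run: bool = True) -> str:
    """Return the shell-quoted command; execute it only when dry_run is False."""
    printable = shlex.join(command)
    if not dry_run:
        subprocess.run(command, check=True)
    return printable


if __name__ == "__main__":
    # Dry run only: prints the commands without touching the network.
    print(run(git_mirror_command(
        "https://git.lsit.ucsb.edu/publicdata/DeptEd", Path("mirrors/DeptEd.git"))))
    print(run(wget_mirror_command(
        "https://www.climate.gov/", Path("mirrors/climate-gov"))))
```

Passing `dry_run=False` to `run` would actually execute the command, so check the printed form and the available disk space first, since some of these repos are large.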
@vwbusguy is anyone bittorrenting this?
Scott Williams 🐧 (@[email protected])

It would be cool for #forgejo to have a way to generate and seed a torrent for a repo release tag. #git #torrent #bittorrent

Mastodon
@Aminorjourney All that to say, if someone felt motivated to set up a torrent of this data, it would be a great way to help out. The more people that get this data, the more likely it is to survive.
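
For anyone curious what setting up a torrent involves, the following is a rough sketch of building a single-file BitTorrent v1 metainfo (`.torrent`) payload for something like a release tarball. It is not a Forgejo feature and not something used for these repos. The function names, the tracker URL, the file name, and the 256 KiB piece size are hypothetical choices for illustration, and a real client or dedicated tool would be the safer route for anything large.

```python
import hashlib
import io
from typing import BinaryIO


def bencode(value) -> bytes:
    """Encode ints, str/bytes, lists, and dicts using BitTorrent bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def hash_pieces(stream: BinaryIO, piece_length: int) -> tuple[bytes, int]:
    """Return concatenated SHA-1 piece digests and the total byte count.

    Assumes a buffered stream whose read() returns full pieces until EOF.
    """
    digests, total = [], 0
    while chunk := stream.read(piece_length):
        digests.append(hashlib.sha1(chunk).digest())
        total += len(chunk)
    return b"".join(digests), total


def single_file_torrent(name: str, stream: BinaryIO, announce_url: str,
                        piece_length: int = 262144) -> bytes:
    """Build the bencoded metainfo for one file read from `stream`."""
    pieces, total = hash_pieces(stream, piece_length)
    info = {
        "name": name,
        "length": total,
        "piece length": piece_length,
        "pieces": pieces,
    }
    return bencode({"announce": announce_url, "info": info})


if __name__ == "__main__":
    # Tiny in-memory demo; the tracker URL and file name are placeholders.
    payload = single_file_torrent(
        "example.tar.gz", io.BytesIO(b"hello world"), "http://tracker.invalid/announce")
    print(len(payload), "bytes of metainfo")
```

Because the file is hashed piece by piece from a stream, the whole archive never has to fit in memory, which matters given how large some of these repos are.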
@vwbusguy Hey! I'm downloading the ERIC database right now; I didn't realize I was late to the party. How can I send it your way?

@ashtonandrepont Given that ERIC is not public domain, CC, etc., I probably can't host it here, unless you are only fetching the public domain articles.

https://eric.ed.gov/?copyright

ERIC - Content Disclaimers – Website and FAQs

ERIC is an online library of education research and information, sponsored by the Institute of Education Sciences (IES) of the U.S. Department of Education.

@vwbusguy Ah, understandable. Is there a good place for me to put it?
@ashtonandrepont I am not a lawyer, but the Internet Archive might be a possibility.
@vwbusguy Me neither, but I was planning on uploading it there when I get them all downloaded.
@ashtonandrepont Ah! I see the public data. I'll grab it shortly. Thanks for bringing this to my attention!
@vwbusguy Happy to help! o7
@ashtonandrepont If you happen to know of a way to search ERIC by license so I can isolate the public domain stuff from the copyrighted stuff, then I can grab more material, but I didn't see that in the search options.

@vwbusguy I don't see it either :/

I do have a list of all the URL links to the files if that'd be helpful.

DeptEd

Public Data from Department of Education and NCES.

@vwbusguy Holy crap, you did that fast! It's taking me a whole day at this rate to download what I'm trying to download.