McDonald Observatory: HETDEX Opens Massive Cosmic Dataset to Scientists, Novices, and AI. “Today, the Hobby-Eberly Telescope Dark Energy Experiment (HETDEX) – which recently completed the largest survey ever taken of the early universe – has released all of its immense, information-rich database to the public.”

https://rbfirehose.com/2026/06/04/mcdonald-observatory-hetdex-opens-massive-cosmic-dataset-to-scientists-novices-and-ai/

ResearchBuzz: Firehose

FedTech Magazine: How NIH Is Translating 70 Years of Health Data to Speak the Same Language. “The National Institutes of Health sits on one of the largest collections of biomedical research data in the world. Decades of federally funded studies on health issues such as heart disease, lung conditions, sleep disorders and genomics have generated petabytes of information. But for most of that […]

https://rbfirehose.com/2026/06/03/fedtech-magazine-how-nih-is-translating-70-years-of-health-data-to-speak-the-same-language/

Modern scraping problem:

Your parser is fine.
Your response is blocked 😅

I tested Bright Data Web Unlocker API with Python + BeautifulSoup to fetch protected, JS-rendered pages without managing proxies.

Full article 👇 https://medium.com/gitconnected/how-i-scraped-modern-protected-websites-in-python-without-managing-a-single-proxy-2e0f07d30208

#DataEngineering
#WebScraping
#Datasets
#MachineLearning
#RAG

How I Scraped Modern, Protected Websites in Python Without Managing a Single Proxy

The hard part is not parsing HTML. It is getting a usable response from modern websites in the first place.

Medium
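
To make the "parser is fine, response is blocked" point concrete, here is a minimal sketch that checks whether a response looks like a block page before handing it to a parser. It is not the article's code and it does not call the Bright Data API; it uses only the Python standard library. The status codes and marker strings are illustrative assumptions, and `looks_blocked` and `extract_title` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Assumed, illustrative signals; real block pages vary by site and vendor.
BLOCK_STATUSES = {403, 429, 503}
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human")


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only enter a <title> if none has been captured yet.
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(status, body):
    """Heuristic: True if the status code or body suggests a block page."""
    if status in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def extract_title(status, body):
    """Return the page title, or None if the response looks blocked."""
    if looks_blocked(status, body):
        return None
    parser = TitleParser()
    parser.feed(body)
    return parser.title.strip() or None
```

The check is deliberately crude: a legitimate page that mentions "captcha" would be flagged too, so the markers need tuning per site. The useful part is the separation, which lets a failed scrape be reported as "blocked" rather than as a parsing bug.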

I spent 3 days building a LinkedIn scraper.

Then I found the dataset already existed 😅

Sometimes the best engineering decision is not to scrape more — but to check whether structured data is already available.

Full article 👇
https://medium.com/gitconnected/i-spent-3-days-building-a-linkedin-scraper-then-i-found-the-dataset-already-existed-9e9093504ca1

#DataEngineering
#WebScraping
#Datasets
#MachineLearning
#RAG

I Spent 3 Days Building a LinkedIn Scraper. Then I Found the Dataset Already Existed

A practical data engineer’s review of Bright Data’s Dataset Marketplace - and why ready-made LinkedIn and e-commerce datasets can save…

Medium

One Open-source Project Daily

pix2code: Generating Code from a Graphical User Interface Screenshot

https://github.com/tonybeltramelli/pix2code

#1ospd #opensource #datasets #deeplearning #deepneuralnetworks #frontenddevelopment #graphicaluserinterface

Today marks the final day of the workshop, with a focus on mobilising #datasets to @[email protected]. We’re pleased to be joined by Francisco Pando and Katia Cezón from GBIF Spain, alongside expert contributions from Jeroen Creuwels (GBIF Netherlands) and Kessy Abarenkov (University of Tartu). 📊

#gov #BigData #datasets #FederalData

'This Federal Data Field Guide includes more than eighty specific examples of federal datasets across more than 50 federal agencies.'

https://www.ischool.berkeley.edu/news/2026/new-federal-data-field-guide-helps-americans-navigate-rich-diversity-our-federal-data

New Federal Data Field Guide Helps Americans Navigate the Rich Diversity of Our Federal Data Ecosystem

Denice Ross, who served as the nation’s second U.S. Chief Data Scientist, and her former White House colleague Christopher Marcum have launched the Federal Data Field Guide.

UC Berkeley School of Information

UC Berkeley: New Federal Data Field Guide Helps Americans Navigate the Rich Diversity of Our Federal Data Ecosystem. “Denice Ross, who served as the nation’s second U.S. Chief Data Scientist, and her former White House colleague Christopher Marcum have launched the Federal Data Field Guide, a free, plain-language resource designed to help Americans understand, use, and advocate for the full […]

https://rbfirehose.com/2026/05/28/uc-berkeley-new-federal-data-field-guide-helps-americans-navigate-the-rich-diversity-of-our-federal-data-ecosystem/

I am always surprised to see how, against all odds, CSV files remain the backbone of #data ingestion. What is even more surprising is how difficult it is to simply look at or quickly edit them without breaking something. So I built:

https://github.com/CedricBonjour/nanocell-csv

Coded with ❤️

I want to shape it around actual #dataengineering workflows. What feature would make this an instant bookmark for you?

Like it? Feel free to drop a star, or break it with some of your worst #datasets and open an issue!

GitHub - CedricBonjour/nanocell-csv: A free csv file viewer & editor

GitHub
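
As an aside on why quick CSV edits break things: a plain text search-and-replace ignores quoting, so a field containing a comma or a quote character can silently shift columns. The sketch below is not taken from nanocell-csv; it only illustrates the underlying problem using Python's standard `csv` module. The helper name `set_cell` is hypothetical.

```python
import csv
import io


def set_cell(csv_text, row_index, column, value):
    """Return csv_text with one cell changed, re-quoting fields as needed.

    row_index counts data rows (the header is not row 0).
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    rows = list(reader)
    fieldnames = reader.fieldnames
    if column not in fieldnames:
        raise KeyError(f"unknown column: {column}")
    rows[row_index][column] = value

    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # the writer quotes commas and doubles embedded quotes
    return out.getvalue()
```

For example, setting a cell to `says "hi", twice` yields a quoted field with doubled inner quotes, so the row still parses back into two columns. Note that the output is normalized (line endings and quoting style may differ from the input), which is itself one of the ways a "quick edit" can produce a noisy diff.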

"The open-source database of sanctions, watchlists, and politically exposed persons — aggregating hundreds of sources and relied on by compliance teams, investigators, and journalists. OpenSanctions is a financial crime data provider. "

#datasets #compliance #regulation

https://www.opensanctions.org/docs/about/

About OpenSanctions

OpenSanctions.org