Beyond the Dataset

On the recent season of Clarkson’s Farm, Jeremy Clarkson goes to great lengths to buy the right pub. As any sensible buyer would, the team does a thorough tear-down followed by a big build-up before the place opens for business. They survey how the place is built, located, and accessed. In the refresh they make sure every part of the pub is built with purpose. Even the tractor on the ceiling. The art is in answering one question: how was this place put together?

A data scientist should be equally fussy. Until we trace how every number was collected, corrected, and cleaned (who measured it, what tool warped it, what assumptions skewed it), we can’t trust the next step in our business to flourish.

Old Sound (1925), painting by Paul Klee, in high resolution. Original from the Kunstmuseum Basel. Digitally enhanced by rawpixel.

Two load-bearing pillars

While there are many flavors of data science, I’m concerned here with the analysis done in scientific spheres and startups. In this world, the structure is held up by two pillars:

  • How we measure — the trip from reality to raw numbers. Feature extraction.
  • How we compare — the rules that let those numbers answer a question. Statistics and causality.

Both of these relate to a deep understanding of the data-generating process, each from a different angle. A crack in either pillar and whatever sits on top crumbles: plots, significance tests, AI predictions mean nothing.

How we measure

A misaligned microscope is the digital equivalent of crooked lumber. No amount of massaging can birth a photon that never hit the sensor. In fluorescence imaging, the point-spread function tells you how a pinpoint of light smears across neighboring pixels; noise reminds you that light arrives at, and is recorded by, the sensor with some randomness. Misjudge either and the cell you call “twice as bright” may be a mirage.
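
To make that concrete, here is a minimal sketch with invented numbers: a Gaussian blur stands in for the true point-spread function, Poisson sampling stands in for shot noise, and a naive peak-pixel comparison misses the true two-to-one brightness ratio while a background-subtracted aperture sum roughly recovers it.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Ground truth: two point-like cells, one truly twice as bright as the other.
truth = np.zeros((64, 64))
truth[16, 16] = 2000.0   # cell A, total photons
truth[48, 48] = 1000.0   # cell B, half as bright

# Toy instrument model: a Gaussian blur stands in for the point-spread function,
# a flat background of 20 counts is added, and Poisson shot noise sits on top.
expected = gaussian_filter(truth, sigma=3.0) + 20.0
image = rng.poisson(expected).astype(float)

# Naive measurement: compare the brightest pixel near each cell.
peak_ratio = image[12:21, 12:21].max() / image[44:53, 44:53].max()

# More careful measurement: background-subtracted sums over generous apertures.
background = np.median(image)
sum_a = (image[6:27, 6:27] - background).sum()
sum_b = (image[38:59, 38:59] - background).sum()

print("true brightness ratio: 2.00")
print(f"naive peak-pixel ratio: {peak_ratio:.2f}")
print(f"background-subtracted aperture ratio: {sum_a / sum_b:.2f}")
```

The exact numbers shift with the random seed, but the pattern holds: the measurement you get depends on how well your analysis models the instrument.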

In this data-generating process, the instrument’s nuances control what you see. Understanding them lets us judge which kinds of post-processing are right and which may destroy or invent data. For simpler analyses, the post-processing can stop at cleaner raw data. For developing AI models, the process extends to labeling and to analyzing data distributions. Andrew Ng’s data-centric AI approach insists that tightening labels, fixing sensor drift, and writing clear provenance notes often beat fancier models.
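
What might a provenance note look like in practice? A minimal sketch follows; the record layout and field names are illustrative, not any established standard.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ProvenanceRecord:
    """One record per file: where the raw data came from and what touched it."""
    sample_id: str
    instrument: str               # which microscope / sensor produced the file
    acquired_on: str              # ISO date of acquisition
    calibration_ref: str          # pointer to the calibration used that day
    processing_steps: list[str] = field(default_factory=list)  # ordered, append-only

    def log_step(self, step: str) -> None:
        self.processing_steps.append(step)

record = ProvenanceRecord(
    sample_id="plate07_wellB3",
    instrument="widefield_scope_2",
    acquired_on="2024-03-14",
    calibration_ref="flatfield_2024-03-14.tif",
)
record.log_step("dark-frame subtraction")
record.log_step("flat-field correction")
print(json.dumps(asdict(record), indent=2))
```

Even this much, kept next to the data, answers the question a reviewer or a new hire will ask six months later: what exactly happened to this file before anyone analyzed it?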

How we compare

Now suppose Clarkson were to test a new fertilizer, fresh goat pellets, only on sunny plots. Any bumper harvest that follows says more about sunshine than about the pellets. Sound comparisons begin long before the data arrive. A deep understanding of the science behind the experiment is critical before any statistics are run. Poor randomization, missing controls, and lurking confounders eat away at the foundation of the statistics.
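
A toy simulation, every number invented, shows how the allocation alone can manufacture an effect: the pellets do nothing in this little world, yet the sunny-plots-only design reports a large one.

```python
import numpy as np

rng = np.random.default_rng(1)
n_plots = 200

# Sunshine varies across plots; yield depends on sunshine only.
sunshine = rng.uniform(4.0, 10.0, size=n_plots)           # hours of sun per day
harvest = 2.0 * sunshine + rng.normal(0.0, 1.0, n_plots)  # pellets truly do nothing

# Confounded design: pellets go only to the sunnier plots.
pellets_sunny = sunshine > 7.0
confounded_effect = harvest[pellets_sunny].mean() - harvest[~pellets_sunny].mean()

# Randomized design: same plots, treatment assigned by coin flip.
pellets_random = rng.integers(0, 2, size=n_plots).astype(bool)
randomized_effect = harvest[pellets_random].mean() - harvest[~pellets_random].mean()

print(f"apparent pellet effect, sunny-plots-only design: {confounded_effect:.2f}")
print(f"apparent pellet effect, randomized design:       {randomized_effect:.2f}")
```

The confounded design reports a hefty benefit; the randomized one correctly reports roughly nothing. No statistical test applied afterwards can repair the first design.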

This information is not in the data. Only an understanding of how the experiment was designed, and which events preclude others, lets us build a model of the world the experiment lives in. Taking this lightly carries large risks for startups with limited budgets and small experiments. A false positive leads to wasted resources, while a false negative carries opportunity costs.
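
Small experiments also miss real effects more often than intuition suggests. A quick simulation, assuming an invented effect of 0.4 standard deviations, gives a feel for how often a two-arm test at startup-scale sample sizes reaches significance at all.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

def detection_rate(n_per_arm, true_effect=0.4, n_trials=2000, alpha=0.05):
    """Fraction of simulated experiments whose t-test reaches p < alpha."""
    hits = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(true_effect, 1.0, n_per_arm)
        if ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_trials

for n in (10, 30, 100):
    print(f"n = {n:3d} per arm -> real effect found {detection_rate(n):.0%} of the time")
```

With a handful of samples per arm, a genuine effect is detected only a small fraction of the time; the rest are false negatives waiting to be mistaken for "it doesn't work."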

The stakes climb quickly. Early in the COVID-19 pandemic, some regions bragged of lower death rates. Age, testing access, and hospital load varied wildly, yet headlines crowned local policies as miracle cures. When later studies re-leveled the footing, the miracles vanished.
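
Age structure alone can do this. A crude illustration with made-up numbers: two regions share identical age-specific death rates yet report very different overall rates, simply because their detected cases skew younger or older.

```python
# Identical age-specific death rates in both regions (deaths per detected case)...
rates = {"young": 0.001, "old": 0.05}

# ...but very different age mixes among detected cases.
case_mix = {
    "Region A": {"young": 0.9, "old": 0.1},
    "Region B": {"young": 0.5, "old": 0.5},
}

for region, mix in case_mix.items():
    crude_rate = sum(mix[group] * rates[group] for group in rates)
    print(f"{region}: crude death rate = {crude_rate:.2%}")

# Same age-specific risk everywhere, yet Region A's crude rate looks ~4x lower,
# purely because its detected cases skew young.
```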

Why the pillars get skipped

Speed, habit, and misplaced trust. Leo Breiman warned in 2001 that many analysts chase algorithmic accuracy and skip the question of how the data were generated, a split he called the “two cultures.” Today’s tooling tempts us even more: auto-charts, one-click models, pretrained everything. They save time, right up until they cost us the answer.

The other issue is the lack of a culture that communicates and shares a common language. Only in academic training is a single person taught to understand the science, the instrumentation, and the statistics well enough for their research to be taken seriously, and even then we prefer peer review. There is no such scope in startups; tasks and expertise must be split. It falls to the data scientist to ensure clarity and to gather information horizontally across those splits. It is the job of leadership to enable this or accept dumb risks.

Opening day

Clarkson’s pub opening was a monumental task, with a thousand details tracked and tackled by an army of experts. Follow the journey from phenomenon to file, guard the twin pillars of measure and compare, and reinforce them with careful curation and an open culture. Do that, and your analysis leaves room for the most important thing: inquiry.

#AI #causalInference #cleanData #dataCentricAI #dataProvenance #dataQuality #dataScience #evidenceBasedDecisionMaking #experimentDesign #featureExtraction #foundationEngineering #instrumentation #measurementError #science #startupAnalytics #statisticalAnalysis #statistics
