A #p2p future for the web is the radical idea that it's bad to put all data in a single place owned by 3 companies and rented by a few hundred. The internet wasn't a mistake, the cloud was a mistake. Platforms were a mistake. A mistake where it's not only possible but routine for "everyone's health data" to get stolen. https://infosec.exchange/@patrickcmiller/112341111375581551

I need to share my health data with like 3 people that aren't me. Why on earth is that data in the same pile as literally everyone else's?

Patrick C Miller (@patrickcmiller@infosec.exchange):

UnitedHealth admits IT security breach could 'cover substantial proportion of people in America' https://go.theregister.com/feed/www.theregister.com/2024/04/23/unitedhealth_admits_breach_substantial/

I've said it a million times before: in research, if open data mandates mean that all data needs to exist on some cloud server, we are in for a catastrophe of the highest order, where cloud providers will be in a position to siphon off however much public funding they want - the "free open data" programs are bait. The corollary is the continual rise and sudden collapse of archives that miss one grant cycle or can't keep up with the infinitely expanding cloud bills.

It's not just a disaster for private data, but for public data and the whole of the web. If researchers want their disciplines to continue to exist, to contribute to a healthy information ecosystem, and to realize the promise of the web for science, they should be investing in and demanding p2p infrastructure.

@jonny my 2 cents based on my experience with my https://openlocal.uk data research project... Only a small number of research specialisms permit grant funding for longitudinal data research. The costs aren't just hosting, but mostly curation. So the work isn't done, or is precarious, or is forced to be semi-commercial to keep the lights on.
openLocal

openLocal.uk is a quarterly-updated commercial location database supporting research into business properties in England and Wales.

@GavinChait What if hosting costs were a one-time investment in on-site server infrastructure: a seed node that other seed nodes can replicate and rehost, so there is no single system to keep the lights on in the first place? And what if curation labor was not the responsibility of a single central research group, but a participatory process by everyone who uses the data? What if there was no platform at all?
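(A minimal sketch of the content-addressing idea underneath that seed-node model, assuming no particular p2p stack: a dataset is chunked, each chunk is named by its hash, and a small manifest lets any peer rehost the data while any downloader verifies it without trusting the host. File names here are hypothetical.)

```python
# Illustrative only: the mechanism that makes "anyone can rehost" safe.
# A dataset is split into chunks, each chunk is identified by its SHA-256
# hash, and a manifest lists them. Any peer holding the chunks can serve
# them; any downloader can verify them against the manifest.
import hashlib
import json
from pathlib import Path

CHUNK_SIZE = 1 << 20  # 1 MiB

def make_manifest(path: Path) -> dict:
    """Chunk a file and record the hash of each chunk."""
    chunks = []
    with path.open("rb") as f:
        while block := f.read(CHUNK_SIZE):
            chunks.append(hashlib.sha256(block).hexdigest())
    return {"name": path.name, "chunk_size": CHUNK_SIZE, "chunks": chunks}

def verify(path: Path, manifest: dict) -> bool:
    """Check data fetched from *any* peer against the manifest."""
    with path.open("rb") as f:
        for expected in manifest["chunks"]:
            block = f.read(manifest["chunk_size"])
            if hashlib.sha256(block).hexdigest() != expected:
                return False
        return not f.read(1)  # no trailing bytes

if __name__ == "__main__":
    demo = Path("dataset.bin")              # hypothetical dataset
    demo.write_bytes(b"example data" * 100_000)
    m = make_manifest(demo)
    print(json.dumps(m, indent=2))
    print("verified:", verify(demo, m))     # True from any peer holding the bytes
```

Because verification comes from the manifest rather than from a server, a mirror run by anyone is exactly as trustworthy as the original seed node.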

@jonny I get the principle, & it may work for relatively mainstream data (sort of like OpenStreetMap labelling), but it's a _lot_ more challenging for specialist / niche data, where the learning curve is steep & the result is interesting only to a handful of specialists.

I'm not sure what good examples would be, other than the obvious Wikipedia & OpenStreetMap, & they still need paid staff to build software & maintain data integrity, even if you took away hosting costs.

What is your ideal example?

@GavinChait the learning curve for p2p systems is not intrinsically any steeper than for cloud systems; in fact there is potential for it to be much simpler, since much less needs to be concentrated in a single system. With generalized p2p, the distinction between generalist and specialist data also becomes much less of a problem: there doesn't need to be a specialist database or platform for each type of data.

neither wikipedia nor openstreetmap are p2p systems, and that is one of the reasons why they, as platforms, need paid staff to maintain data integrity.

there aren't any such systems yet, and there are good reasons for that, having to do with the history of the platformization of the web. the things that come closest are private bittorrent trackers, which maintain excellent quality archives with next to zero budget under highly adversarial conditions.

here: https://jon-e.net/infrastructure/

Decentralized Infrastructure for (Neuro)science

@jonny Sorry, I meant the learning curve for curation itself. Complex data with dodgy probity are inherently difficult to interpret & restructure/label. Decentralisation is one aspect of a complex open data ecosystem (architecture & implementation of open data being my milieu). So, while I get how p2p can help with hosting/distribution, the underlying complexities of architecture, data interoperability, curation, etc. are fairly severe. RDA has multiple research groups working on interoperability.

I also built a whole tabular data restructuring system (at https://whyqd.com), & my next roadmap steps are to support linked data for categorisation, plus shareable schemas & crosswalks.

What you raise is important, I'm just not clear on where the blockers are. Is it curation/interoperability, or distribution, or hosting? I've led multiple national open data publication projects, & it's common for the realisation of just how difficult curation for publication is to end up killing projects.

whyqd.com

Perform schema-to-schema transforms for interoperability and data reuse. Transform messy data into structured schemas using readable, auditable methods.

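(To make the crosswalk idea above concrete: a toy schema-to-schema transform in the spirit of what whyqd describes. This is not whyqd's API; the column names and mapping are invented for illustration. The point is that the mapping itself is declarative data, so it can be shared, audited, and reused.)

```python
# Toy schema crosswalk: a declarative mapping from one publisher's
# column names to a destination schema, applied as a readable,
# auditable transform. NOT whyqd's actual API; names are hypothetical.
from typing import Callable

# A row as one of many publishers might supply it.
source_rows = [
    {"Business Name": " Acme Ltd ", "Rateable Value (GBP)": "12,500"},
]

# destination field -> (source column, conversion function)
crosswalk: dict[str, tuple[str, Callable[[str], object]]] = {
    "name":           ("Business Name",        str.strip),
    "rateable_value": ("Rateable Value (GBP)", lambda v: int(v.replace(",", ""))),
}

def apply_crosswalk(row: dict) -> dict:
    """Restructure a source row into the destination schema."""
    return {dest: convert(row[src]) for dest, (src, convert) in crosswalk.items()}

print([apply_crosswalk(r) for r in source_rows])
# [{'name': 'Acme Ltd', 'rateable_value': 12500}]
```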

@GavinChait i'm definitely not saying p2p solves data formats or curation, but that it changes those questions compared to a cloud platform model, which frequently bundles formats, curation, schema, interface, and more into a single vertical stack. like: what does curation look like? who does it? when does it happen? what are the bounds of heterogeneity? and so on.

that piece has a whole section on my thoughts on barriers, and they're also interspersed throughout, so i'll skip that part except w.r.t. the example of stuff like whyqd (which looks cool!).

I am a huge fan of making lots of maps between formats, like a huge huge fan of that practice, and it's cool to see what looks like a well done version of it. i tend to think about three things here. the first is the tooling that causes people with data to so often end up with data like this in the first place (though no matter what, there is always slippage between formats). the second is why our schema authoring and manipulation tools are so poor that we so often end up with people shoving complex things into single tables. the third is why 'cleaning for publication' is a discrete step where some internal format needs to be matched to an external one, rather than manipulating the form and relating it to other forms being a continuous part of the life of the data.

anyway, that's all covered at length in that piece, so i won't repeat myself

@jonny Well 😜 the short answer is that people aren't data literate, & it "seems" faster to start hacking away at Excel than to sit down & research a data schema. I'm facilitating an RDA discussion looking into interoperable learning objectives for syllabus discovery & reuse, & this type of problem goes **deep**. There's also an RDA group researching anything-to-anything schema interoperability. But the challenge is the same: getting data owners to recognise that the first step is curation, not production.
@GavinChait ya so basically i have been working on several ends: lowering the barriers of formalization, raising the floor of tool output, and making better authoring tools. most people are quite literate and conversant in their particular internal format, so i tend to think it's the notion of formats that needs to change more than the people. part of that is, yes, using p2p to dissolve the notion of platforms that necessitates the kinds of formats we use in science now.

@jonny In academia, there really isn't any excuse for not using formal data methods (although, if you've ever supervised dissertations, you'll know the blood-rush to dive straight into analysis without a data plan). It's in the bulk of work-based public data - local & national government particularly (from an open data perspective) - that drivers, motivations & skills just fail.

Focusing on getting things right in academia would be tremendous. I don't hold out much hope for the public sector.

@GavinChait there is lots of excuse in academia! and there are lots of good reasons too! I have been making tools for a while that try to bridge that process, so that one can responsibly dive into an experiment without a formal data plan, because the data format develops alongside the experiment. I don't think about it in terms of getting data 'right' so much as facilitating fluid expression.
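(One way to read "the format develops alongside the experiment", as a minimal sketch rather than any specific tool of jonny's: infer and widen a schema as records arrive, so formalization happens continuously instead of as an up-front plan. All names here are invented for illustration.)

```python
# Illustrative sketch: a schema that grows with the data, so collection
# can start before a formal data plan exists, and the formal schema is
# exported from what was actually collected.
import json

class GrowingSchema:
    def __init__(self):
        self.fields: dict[str, set[str]] = {}

    def observe(self, record: dict) -> None:
        """Widen the schema to cover this record's fields and types."""
        for key, value in record.items():
            self.fields.setdefault(key, set()).add(type(value).__name__)

    def export(self) -> str:
        """The declared schema falls out of the data's actual life."""
        return json.dumps({k: sorted(v) for k, v in self.fields.items()}, indent=2)

schema = GrowingSchema()
schema.observe({"subject": "m01", "trial": 1, "response_ms": 412.3})
schema.observe({"subject": "m01", "trial": 2, "response_ms": 388.0, "note": "late start"})
print(schema.export())
```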

@jonny That's my intention behind Whyqd as well: that you don't need to formally declare a schema, or ensure interoperability, as it can progress naturally. But, as we find on the OpenLocal project when working with 350 different publishers - none of whom make any effort to adhere even to government-regulated classification standards (like types of tax) - it's a challenge.

Have you seen https://cedar.metadatacenter.org? It's a means of creating interoperable research forms.

Project Cedar

@GavinChait i feel like we're talking past each other a bit :) with respect, because it's getting quite late here: what i'm talking about is thinking about the problem differently, so that the question isn't "how do we get 350 different things into the same thing" ;) ttys
@jonny Timezones ;[ I think the problem is an elephant, & each of us has come to it from a slightly different place. I work directly with data & infrastructure for public/government open data, so my perspective is slightly different, but I follow your work with interest.
@GavinChait yes yes yes, and it's a perspective i love to hear. i am just trying to log off and doing the usual 'trying to say goodbye but there's one more thing to think about as i'm putting on my coat, so i come back in the door' and so on <3