A #p2p future for the web is the radical idea that it's bad to put all data in a single place owned by 3 companies and rented by a few hundred. The internet wasn't a mistake, the cloud was a mistake. Platforms were a mistake. A mistake where it's not only possible but routine for "everyone's health data" to get stolen. https://infosec.exchange/@patrickcmiller/112341111375581551

I need to share my health data with like 3 people that aren't me. Why on earth is that data in the same pile as literally everyone else's?

Patrick C Miller :donor: (@[email protected])

UnitedHealth admits IT security breach could 'cover substantial proportion of people in America' https://go.theregister.com/feed/www.theregister.com/2024/04/23/unitedhealth_admits_breach_substantial/


I've said it a million times before: in research, if open data mandates mean that all data needs to exist on some cloud server, we are in for a catastrophe of the highest order, where cloud providers will be in a position to siphon off however much public funding they want - the "free open data" programs are bait. The corollary is the continual rise and sudden collapse of archives that miss one grant cycle or can't keep up with the infinitely expanding cloud bills.

It's not just a disaster for private data, but for public data and the whole of the web. If researchers want their disciplines to continue to exist, to contribute to a healthy information ecosystem, and to realize the promise of the web for science, they should be investing in and demanding p2p infrastructure.

i am not an infosec person but it would be cool if more ppl saw the sources of vulnerabilities not in some specific vuln in, like, microsoft SoftServ2024.1, but in the basic architecture of SaaS. seems like even bigass corporations could benefit from systems that sharded their data across a ton of peers s.t. it's easier to detect when someone tries to access and freeze all your independent peers at the same time; maybe later-accessed ones could get some kind of advance warning or whatever, you know.

I am aware that, like, 'most breaches are social engineering' or whatever, but it still seems like whatever makes it possible for huge tranches of data to be stolen at once, and for single systems to be so precious, could do with a little peerwise behavior. we have tons of abstractions, like orchestration, for distributing tasks and services to independent nodes and for making many nodes appear like one node, but very few for giving those nodes independent behavior s.t. they can coordinate those tasks among themselves
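To make the 'peerwise behavior' idea concrete, here is a minimal Python sketch (all names hypothetical, not any real system): shards of a record live on independent peers, every read is reported to a shared monitor, and the monitor flags a record when most of its peers are read within a short window, the signature of someone trying to grab or freeze everything at once.

```python
import time
from collections import defaultdict

class AccessMonitor:
    """Flags a record when most of its peers are read within one short
    window: the signature of all-at-once access to independent shards."""
    def __init__(self, threshold, window=1.0):
        self.threshold = threshold          # peers-per-window that looks suspicious
        self.window = window                # seconds
        self.accesses = defaultdict(list)   # record_id -> [(peer_id, timestamp)]

    def report(self, record_id, peer_id, t):
        self.accesses[record_id].append((peer_id, t))

    def suspicious(self, record_id, now):
        recent = {p for p, t in self.accesses[record_id] if now - t <= self.window}
        return len(recent) >= self.threshold

class Peer:
    """One independent node holding a single shard of each record."""
    def __init__(self, peer_id, monitor):
        self.peer_id = peer_id
        self.monitor = monitor
        self.shards = {}

    def store(self, record_id, shard):
        self.shards[record_id] = shard

    def read(self, record_id, now=None):
        # every read is reported to the shared monitor before data leaves
        self.monitor.report(record_id, self.peer_id,
                            now if now is not None else time.time())
        return self.shards.get(record_id)
```

A later-accessed peer could consult `suspicious()` before serving its shard and refuse or delay, which is the 'advance warning' behavior described above.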

@jonny tldr: adding complexity makes it harder to secure

But yes, there are a LOT of things where we introduce segmentation in order to isolate one part of a system from another

Most breaches involve so many issues long before the attacker ever reaches the data. Making the data storage and isolation design more complex is a harder challenge than addressing the many, many low-hanging fruit that a breached company typically has.

There's also generally a large number of data collections, not just one place with all the data.

@saraislet thank you for the context <3.

i miswrote a little bit in framing the whole thing as 'even large companies need this', because the point i was trying to make was 'we shouldn't have such large targets in the first place, but even for the large targets...'

from all i have read and heard, that seems to be the one consistent thing: the real source of the problem is people just being sort of sloppy and forgetful, and no technology fixes that, so that makes sense.

@saraislet i'm sure there is a lot of Discourse in the Field about what constitutes an 'insecurable' system at some scale - like, nothing is perfectly secure in principle, but what constitutes a system that is insecurable in practice, and when do you make the call that it shouldn't exist

@jonny putting it in threat modeling terms could help clarify how we think about it:
1. Who are the threat actors? What resources do they have, and how much expertise/sophistication do they have? (e.g., nation states)

2. What drives the threat actors? (e.g., financial gain of selling data/content that they steal?)

3. What do the threat actors want to achieve? (e.g., personal data theft, financial data theft, content theft, outages, reputation damage)

4. What do we (the organization/subject/target) want to protect? (e.g., data, service availability, reputation — and in which order?)

5. What's the attack surface, architecture, system design, etc?

Often the attackers with the most resources aren't the ones that are most motivated to achieve a given goal — like a nation state isn't likely to burn a zero day in order to pirate the next Ubisoft game before release day.
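The questions above can be sketched as a toy prioritization, just to show why capability alone doesn't rank the threats; the actors and scores here are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class ThreatActor:
    name: str
    capability: int   # 1-5: resources and sophistication
    motivation: int   # 1-5: how much they want *this* target

# invented actors and scores, purely for illustration
actors = [
    ThreatActor("nation state", capability=5, motivation=1),
    ThreatActor("ransomware crew", capability=3, motivation=5),
    ThreatActor("bored script kiddie", capability=1, motivation=3),
]

def priority(actor):
    # capability alone doesn't make a threat; weight it by motivation
    return actor.capability * actor.motivation

ranked = sorted(actors, key=priority, reverse=True)
# the well-resourced nation state ranks below the motivated ransomware crew
```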

@saraislet this reframing is always the one i need and i really really appreciate you breaking it down into specific factors like this, it often feels like a nebulous process to me. i want to plot out a few thoughts on these axes when back at desk
@jonny
As I like to say, I didn't originally learn threat modeling in security. I learned how to threat model by being a queer kid wanting to avoid harassment, bullying, and being outed.
@saraislet this is part of why it feels so nebulous to me, because "it's just the way you live, right?" so seeing it put concisely is amazing to me

@jonny
So typically when there's a new product or a new part of a system we're working on, we develop a threat model to guide our reasoning on what risks we need to prepare for and how to protect that system or product. The first step is identifying attackers, motivations, etc. If it's a sophisticated and high resource attacker with strong motivations (low cost and high value to carry out an attack), we're going to give very different advice on what level of security investment we need to make in designing security controls (prevention, detection, and response). Because we cannot put a high investment in everything.

In practice, it's very rare that we come across risks that can't be mitigated down to a reasonable level — but yes, I've worked on projects where I explained the risk and the consequences and the challenges of a situation where we don't have the ability to apply strong controls, and the engineering leaders made the decision to shut down that area of the project because the project wasn't valuable enough against the risks and consequences on the table, it just wasn't worth it.

@jonny
More frequently, we have high risk projects that are also high value, and where we can build whatever controls we need, so we figure out the right level of investment to manage our risks, and we do what we need to do.

However, systems and data sets aren't just one thing, one system. There's a lot of things, and most of it is technically "connected". We draw a lot of boundaries, and there's a lot of places where the only exchange across a boundary is a very specific, limited transaction like a simple https GET request that can only be answered with a specific limited set of data, like a CDN. And at the end of the day, "user downloads file from CDN" is, well, the basic user story of a CDN.

@jonny
At risk of oversimplifying yet again: there's definitely data sets in the world that I believe shouldn't exist (e.g., credit scores!?). But the data gathering I most often hear demonized is deeply out of touch with what's actually, genuinely dangerous and should be stopped.

I cannot share most of those stories, but the things that we take for granted are often the most dangerous. Examples I can point out include VPNs and auto manufacturers. But I'm sure using duckduckgoose will sell someone a bridge.

@saraislet i am extremely curious to hear stories about where you think the threat doesn't match the story, and i love hearing 'the real problem was in plain sight all along' stories as much as the next person, but obvi respect the need for confidentiality. maybe it's easier to share what you mean in the other direction: the kinds of practices that are overcriticized but actually not as big of a deal as people think?

(gonna drop c/w bc i feel like it's weird that you are posting with it bc you are absolutely not out of your lane lol)

@jonny demanding end to end encryption for everything is a wild fantasy dragon sorely searching for its Saint George

Having E2EE for some critical services like Signal is absolutely essential to be available to the people who need it. Most of us do not need that most of the time — but normalizing Signal is relatively good and relatively harmless.

Virtually no one needs E2EE for corporate or personal video meetings, or for email or chat or file sharing. And it's a shitty user experience to bind to any kind of durable identity (otherwise it's all ephemeral and thrown away in each new interaction, which isn't much of an email or file sharing system). But then that identity often defeats half the reason to have E2EE. And when users inevitably lose their keys, they get mad that they can't access their data anymore. It's a disaster.

No one wants to spy on our weekly girls night chat. And it's not even worth trying to decrypt non-E2EE active video streams for my corporate security conversations. Too much effort, and too little value. How would they identify a conversation of value, get in the right place to intercept it, record all the packets, work to decrypt them, and then find out we discussed kittens for half the meeting?

(There are some few who very much do need E2EE file sharing, though many fewer need video. Files are generally a better route anyway.)

@jonny my 2 cents based on my experience with my https://openlocal.uk data research project... Only a small number of research specialisms permit grant funding for longitudinal data research. The costs aren't just hosting; they're mostly curation. So the work isn't done, or is precarious, or is forced to be semicommercial to keep the lights on.
openLocal

openLocal.uk is a quarterly-updated commercial location database supporting research into business properties in England and Wales.

@GavinChait
What if hosting costs were a one-time investment in on-site server infrastructure, a seed node that other seed nodes can replicate and rehost, so there is no single system to keep the lights on in the first place? And what if curation labor was not the responsibility of a single central research group but a participatory process by everyone who uses the data? What if there was no platform at all?
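As a rough sketch of the seed-node idea (hypothetical, minimal Python, not any real protocol): any node that already holds the dataset can seed it to new volunteers, so replicas propagate peer-to-peer and no single host has to stay alive:

```python
class SeedNode:
    """A node that can both hold a dataset and seed it to peers, so any
    replica can bootstrap new replicas: no single host to keep alive."""
    def __init__(self, node_id, dataset=None):
        self.node_id = node_id
        self.dataset = dataset   # None until replicated from a peer
        self.peers = []

    def join(self, other):
        # symmetric peering link
        self.peers.append(other)
        other.peers.append(self)

    def replicate(self):
        # pull the dataset from any peer that already has a copy
        for peer in self.peers:
            if peer.dataset is not None:
                self.dataset = dict(peer.dataset)
                return True
        return False

# one funded origin node, then volunteers rehost from each other
origin = SeedNode("university-lab", {"records": [1, 2, 3]})
a = SeedNode("volunteer-a"); a.join(origin); a.replicate()
b = SeedNode("volunteer-b"); b.join(a); b.replicate()  # never touches origin
```

The point of the design is the last line: the second volunteer replicates from the first, so the origin can disappear (miss a grant cycle, say) without taking the archive with it.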

@jonny I get the principle, & it may work for relatively mainstream data (sort of like OpenStreetMap labelling), but it's a _lot_ more challenging for specialist / niche data where the learning curve is steep & the result interesting only to a handful of specialists.

I'm not sure what good potentials would be, other than the obvious Wikipedia & OpenStreetMap, & they still need paid staff to build software & maintain data integrity, even if you took away hosting costs.

What is your ideal example?

@GavinChait the learning curve for p2p systems is not intrinsically any steeper than for cloud systems; in fact there is potential for it to be much simpler, since much less needs to be concentrated in a single system. With generalized p2p, the distinction between generalist and specialist data matters much less - there doesn't need to be a specialist database or platform for each type of data.

neither wikipedia nor openstreetmap are p2p systems, and that is one of the reasons why they, as platforms, need to maintain paid staff to maintain data integrity.

there aren't any such systems yet, and there are good reasons for that, having to do with the history of the platformization of the web. the things that come closest are private bittorrent trackers, which maintain excellent-quality archives with next to zero budget under highly adversarial conditions.

here: https://jon-e.net/infrastructure/

Decentralized Infrastructure for (Neuro)science

@jonny Sorry, I meant the learning curve for curation itself. Complex data with dodgy probity are inherently difficult to interpret & restructure/label. Decentralisation is one aspect of a complex open data ecosystem (architecture & implementation of open data being my milieu). So, while I get how p2p can help with hosting/distribution, the underlying complexity of architecture, data interoperability, curation, etc are fairly severe. RDA has multiple research groups working on interoperability.

I also built a whole tabular data restructuring system (at https://whyqd.com), & my next roadmap steps are to support linked data for categorisation, plus shareable schema & crosswalks.

What you raise is important, I'm just not clear on where the blockers are. Is it curation/interoperability, or distribution, or hosting? I've led multiple national open data publication projects, & it's common for the realisation of just how difficult curation for publication is to end up killing projects.

whyqd.com

Perform schema-to-schema transforms for interoperability and data reuse. Transform messy data into structured schemas using readable, auditable methods.


@GavinChait i'm definitely not saying p2p solves data formats or curation, but that it changes those questions compared to a cloud platform model which frequently bundles formats, curation, schema, interface, and more into a single vertical stack. like what does curation look like? who does it? when does it happen? what are the bounds of heterogeneity? and so on.

that piece has a whole section on my thoughts on barriers and they're also interspersed throughout the piece, so i'll skip that part except w.r.t. the example of stuff like whyqd (which looks cool!).

I am a huge fan of making lots of maps between formats, like a huge huge fan of that practice, and it's cool to see what looks like a well done version of it. i tend to think about three things here: first, the tooling that causes people with data to so often end up with data like this in the first place (though no matter what there is always slippage between formats); second, why our schema authoring and manipulation tools are so poor that people so often end up shoving complex things into single tables; and third, why 'cleaning for publication' is a discrete step where some internal format needs to be matched to an external one, rather than manipulating the form and relating it to other forms being a continuous part of the life of the data.
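A tiny sketch of the 'maps between formats' practice, in the spirit of (but not the actual API of) tools like whyqd: a crosswalk renames source columns onto a shared schema and recodes values. Every name and code below is a made-up example:

```python
# a made-up crosswalk: rename source columns onto a shared target schema
# and recode values, loosely in the spirit of schema-to-schema transforms
CROSSWALK = {
    "columns": {"biz_name": "business_name", "rates_code": "tax_category"},
    "values": {"tax_category": {"NDR": "non_domestic_rates", "CT": "council_tax"}},
}

def apply_crosswalk(row, crosswalk):
    out = {}
    for src, value in row.items():
        dst = crosswalk["columns"].get(src, src)                       # rename column
        out[dst] = crosswalk["values"].get(dst, {}).get(value, value)  # recode value
    return out

row = {"biz_name": "Corner Shop", "rates_code": "NDR", "postcode": "AB1 2CD"}
apply_crosswalk(row, CROSSWALK)
# {'business_name': 'Corner Shop', 'tax_category': 'non_domestic_rates', 'postcode': 'AB1 2CD'}
```

The appeal of keeping the crosswalk as data rather than code is that it is auditable and shareable, which is exactly what makes 350 publishers' worth of divergence tractable at all.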

anyway that's all again at length in that piece so i won't repeat myself

@GavinChait doing some cool stuff re: interop in linkml that is already relatively public, and soon to be demoed and written up, by splitting up the stack in a lil bit of a different way than usual
@jonny Well 😜 short answer is people aren't data literate, & it "seems" faster to start hacking away at Excel than sit down & research a data schema. I'm facilitating an RDA discussion looking into interoperable learning objectives for syllabus discovery & reuse, & this type of problem goes **deep**. There's also an RDA group researching anything-to-anything schema interoperability. But the challenge is the same, getting data owners to recognise that the first step is curation, not production.
@GavinChait ya so basically i have been working on several ends - lowering barriers to formalization, raising the floor of tool output, and making better authoring tools. most people are quite literate and conversant in their particular internal format, so i tend to think it's the notion of formats that needs to change more than the people. part of that is, yes, using p2p to dissolve the notion of platforms that necessitate the kind of formats we use in science now.

@jonny In academia, there really isn't any excuse for not using formal data methods (although, if you've ever supervised dissertations, you'll know the blood-rush to dive direct into doing analysis without a data plan). It's the bulk of work-based public data - local & national government particularly (from an open data perspective) where drivers, motivations & skills just fail.

Focusing on getting things right in academia would be tremendous. I don't hold out much hope for the public sector.

@GavinChait there are lots of excuses in academia! and there are lots of good reasons too! I have been making tools for a while that try to bridge that process so that one can responsibly dive into an experiment without a formal data plan, because the data format develops alongside the experiment. I don't think about it in terms of getting data 'right' so much as facilitating fluid expression.

@jonny That's my intention behind Whyqd as well, that you don't need to formally declare a schema, or ensure interoperability, as it can progress naturally. But, as we find on the OpenLocal project, when working with 350 different publishers - none of whom make any effort to adhere even to government regulated classification standards (like, types of tax) - it's a challenge.

Have you seen https://cedar.metadatacenter.org? It's a means of creating interoperable research forms.


@GavinChait i feel like we're talking past each other a bit :) with respect, bc it's getting quite late here, what i'm talking about is thinking about the problem differently such that it wasn't "how do we get 350 different things into the same thing" ;) ttys
@jonny Timezones ;[ I think the problem is an elephant, & each of us has come to it from a slightly different place. I work direct with data & infrastructure for public/government open data so my perspective is slightly different, but I follow your work with interest.
@GavinChait yes yes yes and it's a perspective i love to hear. i am just trying to log off and doing the usual 'trying to say goodbye but theres one more thing to think about as i'm putting on my coat and so come back in the door' and so on <3

@jonny

Tell me about it - our non-redundant approach to digital materials is a big worry.

I've been trying to access various volumes of the Smithsonian Contributions to Zoology over the last couple of days and all I get is this.

@jonny companies are allergic to capital expenditures, and selling off your servers in exchange for a monthly fee can really make a quarter pop.

"We own nothing, we're so nimble. All we do is rent factories in China with guaranteed outputs. We're basically like any other incestment firm but we have a very special list of customers/suppliers and maybe some intellectual property, which is very real. Invest in us! Who needs a moat?"

@cykonot
Exactly. An internet for the people won't come from companies