> (…) split the index, which is to say the part of Google Search that scrapes the web and makes that content searchable, from the search user interface, and manage that index as a public utility that different search services could rely on and pay for, an idea that was suggested in a recent paper.

That would be incredible. I wish regulators would make Commons Enforcement a regular part of their playbook.

Commonify the monopolizers’ complements!

https://gwern.net/complement

https://mastodon.social/@robin/112933151964297678

Laws of Tech: Commoditize Your Complement

A classic pattern in technology economics, identified by Joel Spolsky, is layers of the stack attempting to become monopolies while turning other layers into perfectly-competitive markets which are commoditized, in order to harvest most of the consumer surplus; discussion and examples.

@robin in other words, force Google and the other data barons to participate in the crawling of the web as a public utility, since it has been proven too valuable/powerful as proprietary property of any single company.

There’s also an efficient-resourcing argument to be made, because many sites are hit hard by swarms of crawlers all collecting the same data. Frequently crawled websites would be considerably cheaper to run if mass-crawling were more regulated and operated as shared infrastructure.

@erlend @robin also, then I’d finally have access to Reddit, since only Google could afford to index Reddit. 😒
@erlend Yes, exactly. I don't know if they'll support that of course, but we're not the only ones to have mentioned it. It would be an easy case to make if we could prove that it's a natural monopoly, but that's not in the conclusions of the case at all so I don't know how realistic it is, as cool as it would be!
@erlend @robin The EU project @openwebsearcheu is trying to create a web index as a public utility.
@djoerd @erlend @openwebsearcheu Yes, I've heard about it, I hope something works out!
@erlend OMG this idea is so cool

@erlend Kept thinking about this. A coalition of companies could do this if they wanted. See for example how Amazon, Meta, Microsoft, TomTom formed Overture to compete with Google Maps https://overturemaps.org/

Tech knows the playbook: to disrupt an entrenched competitor, open data and collaboration are the only way.

Then the question is: why hasn't it happened yet?

@erlend 👀
@luis_in_brief @erlend I haven't read that piece yet (@robin writes too much lol), but is that not what exists already with common crawl, and was used by many AI companies for their training?
@fabrice @luis_in_brief @erlend No, Common Crawl is just a periodic crawl of the web. You can't really build a real search engine on it, it doesn't update often enough. (I mean you can, but people will be disappointed.) This would be more like an API service.
@robin @luis_in_brief @erlend Right, but common crawl is a good base to prototype such an API in order to figure out what it could look like. I guess we'll need a bit more flexibility than with opensearch endpoints.
@fabrice @luis_in_brief @erlend Sure, but building from CC isn't necessarily for the faint of heart, for starters you need a good chunk of terabytes!
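For a sense of what "building from CC" looks like at the small end: the Common Crawl CDX index lets you pull individual captures with HTTP Range requests instead of mirroring the whole crawl. A minimal sketch — the record values below are made up, but the index fields (`filename`, `offset`, `length`) and the byte-range fetch pattern are how Common Crawl actually exposes its WARC data:

```python
import json

# One result line from the Common Crawl CDX index
# (https://index.commoncrawl.org/), as returned with output=json.
# The values here are illustrative, not a real capture.
sample = (
    '{"urlkey": "com,example)/", "timestamp": "20240722000000", '
    '"url": "https://example.com/", "mime": "text/html", "status": "200", '
    '"filename": "crawl-data/CC-MAIN-2024-30/segments/000/warc/file.warc.gz", '
    '"offset": "12345", "length": "6789"}'
)

def warc_range(record: dict) -> tuple[str, str]:
    """Build the URL and HTTP Range header needed to fetch just this
    one capture from the public Common Crawl bucket, rather than
    downloading the multi-terabyte crawl it lives in."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range header ends are inclusive
    url = "https://data.commoncrawl.org/" + record["filename"]
    return url, f"bytes={start}-{end}"

record = json.loads(sample)
url, byte_range = warc_range(record)
print(url)
print(byte_range)  # bytes=12345-19133
```

That gets you single pages cheaply; it's assembling and refreshing the full corpus that needs the terabytes.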
@robin @fabrice @erlend sure though … any serious attempt to do a search engine is going to require that! You can’t do ranking and searching without having most of a local copy yourself. This is not a project for the weak of budget, no matter how much you want to put on the ~~Library~~ Cache of Congress side.
@luis_in_brief @fabrice @erlend Not if you have good infra. At this kind of volume there are compute-over-data approaches (the code goes to the data instead of the other way around) that don't require you to store everything yourself. If the signals are already indexed, you can compute your own ranking over them with your own weighting and filtering, which already makes a big difference. We could have search diversity not for free, but for much, much cheaper.
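A toy sketch of that signals idea — the signal names and scores below are hypothetical, but it shows how different engines could share one index of precomputed signals while each applying its own weights:

```python
# Hypothetical per-document signals, as a shared index might expose
# them after crawling. Names and values are made up for illustration.
docs = {
    "doc-a": {"text_match": 0.9, "freshness": 0.2, "inlinks": 0.7},
    "doc-b": {"text_match": 0.6, "freshness": 0.9, "inlinks": 0.3},
    "doc-c": {"text_match": 0.8, "freshness": 0.5, "inlinks": 0.1},
}

def rank(docs, weights):
    """Score each document as a weighted sum of its precomputed
    signals and return doc ids sorted best-first. Each consumer of
    the shared index plugs in its own weights (and filters) here,
    without re-crawling or re-indexing anything."""
    def score(signals):
        return sum(weights.get(name, 0.0) * value
                   for name, value in signals.items())
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

# A freshness-focused engine vs. a relevance-focused one,
# same index, different results:
print(rank(docs, {"text_match": 0.2, "freshness": 1.0}))
print(rank(docs, {"text_match": 1.0, "inlinks": 0.5}))
```

Two engines, one crawl: the expensive part (crawling and signal extraction) is shared, and only the cheap re-ranking differs.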

@erlend @tanquist The problem is that demands to remove stuff from the global index have to go somewhere, and they're judgment calls. DMCA takedown requests, Fosta/Sesta/Kosa/Comstock-du-jour pearl clutching, doxing, deplatforming nazis, leaked medical records with SSNs and bank account numbers, the EU "right to be forgotten"...

It's great to say "split the objective parts from the subjective ones," but you haven't. Information generated by humans has inherent issues.