> (…) split the index, which is to say the part of Google Search that scrapes the web and makes that content searchable, from the search user interface, and manage that index as a public utility that different search services could rely on and pay for, an idea that was suggested in a recent paper.

That would be incredible. I wish regulators would make Commons Enforcement a regular part of their playbook.

Commonify the monopolizers’ complements!

https://gwern.net/complement

https://mastodon.social/@robin/112933151964297678

Laws of Tech: Commoditize Your Complement

A classic pattern in technology economics, identified by Joel Spolsky, is layers of the stack attempting to become monopolies while turning other layers into perfectly-competitive markets which are commoditized, in order to harvest most of the consumer surplus; discussion and examples.

@erlend 👀
@luis_in_brief @erlend I haven't read that piece yet (@robin writes too much lol), but isn't that what already exists with Common Crawl, which many AI companies used for their training?
@fabrice @luis_in_brief @erlend No, Common Crawl is just a periodic crawl of the web. You can't really build a real search engine on it; it doesn't update often enough. (I mean, you can, but people will be disappointed.) This would be more like an API service.
@robin @luis_in_brief @erlend Right, but common crawl is a good base to prototype such an API in order to figure out what it could look like. I guess we'll need a bit more flexibility than with opensearch endpoints.
@fabrice @luis_in_brief @erlend Sure, but building from CC isn't necessarily for the faint of heart, for starters you need a good chunk of terabytes!
@robin @fabrice @erlend sure though … any serious attempt to do a search engine is going to require that! You can’t do ranking and searching without having most of a local copy yourself. This is not a project for the weak of budget, no matter how much you want to put on the ~~Library~~ Cache of Congress side.
@luis_in_brief @fabrice @erlend Not if you have good infra. When dealing with this kind of volume, there are compute-over-data approaches (where the code goes to the data instead of the other way around) that don't require you to store everything yourself. If the signals are already indexed, you can compute your own ranking over them with your own weighting and filtering, which already makes a big difference. We could have search diversity: not for free, but for much, much less.
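(The re-ranking part of that last point can be sketched concretely: if per-document signals are already computed by the shared index, a competing engine only needs to supply its own weighting. The signal names and weights below are made up for illustration:)

```python
# Illustrative sketch of ranking over pre-indexed signals with custom
# weights -- the signal names and weight values are invented for the
# example, not taken from any real index.

def score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over whatever signals the shared index exposes."""
    return sum(weights.get(name, 0.0) * value
               for name, value in signals.items())

def rank(docs: list[dict], weights: dict[str, float]) -> list[dict]:
    """Order documents by a caller-supplied weighting -- the part each
    search service would differentiate on."""
    return sorted(docs, key=lambda d: score(d["signals"], weights),
                  reverse=True)

docs = [
    {"url": "a.example", "signals": {"freshness": 0.9, "inlinks": 0.2}},
    {"url": "b.example", "signals": {"freshness": 0.1, "inlinks": 0.9}},
]

# Two engines over the same index, differing only in their weights,
# produce different top results:
fresh_first = rank(docs, {"freshness": 1.0, "inlinks": 0.1})
authority_first = rank(docs, {"freshness": 0.1, "inlinks": 1.0})
```

(The expensive part, crawling and signal extraction, happens once in the shared index; the cheap part, the weighting, is where the diversity comes from.)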