There is an alternate timeline where the semantic web took off and there was wide investment in ontological tooling to ensure that the information in academic papers, websites, and applications was structured and accessible to future processing.

We instead live in a world where all the useful data is trapped inside proprietary formats, and entangled in meaningless prose - a world primed for large language models to come along and hallucinate the data that might be contained therein.

I wanted to actually implement a transformer before I voiced an opinion on large language models. The whole self-attention structure is really cool and I highly recommend sitting down and playing with the ideas directly.

I just wish we had done a lot of other work to fully take advantage of them.

@sarahjamielewis
Not just proprietary formats; a lot of academic papers are also locked up by the publishers

Anyone can run their language model on Wikipedia; only a few researchers manage to get a hard disk from Elsevier

@sarahjamielewis Do you have any reading recommendations for the semantic web stuff? I'm very lost.

@Madagascar_Sky @sarahjamielewis I found this one painful and I don't agree with all of it, but very well written:

https://twobithistory.org/2018/05/27/semantic-web.html

In my opinion it misses Wikidata, but I'm biased.

Whatever Happened to the Semantic Web?

In 2001, Tim Berners-Lee, inventor of the World Wide Web, published an article in Scientific American.

@vrandecic @Madagascar_Sky @sarahjamielewis

Okay, right, Denny is biased, he helped make #Wikidata the right way :)

I used to teach how worthless the promises of the Semantic Web were. But since Wikidata reached a useful size, there are now quite a lot of useful things one can do with the data, so now I teach that there is a glimmer of hope called Wikidata and federated databases....
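
As a rough illustration of the kind of thing one can do with Wikidata, here is a minimal sketch that builds a request URL for the public Wikidata Query Service. The query (ten items that are an instance of "house cat", Q146) and the helper name `build_request_url` are illustrative assumptions, not something from the thread; the sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: ten items that are an instance of (P31) house cat (Q146),
# with English labels supplied by the label service.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    params = {"query": query.strip(), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Printing the URL keeps the sketch offline; fetch it with any HTTP client.
    print(build_request_url(QUERY))
```

Swapping the class in the `wdt:P31` pattern is usually enough to ask the same question about a different kind of thing.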

@WiseWoman @vrandecic @Madagascar_Sky @sarahjamielewis I once heard Peter Norvig perfectly summarize the problems with the Semantic Web to a True Believer who was trying to proselytize him: "People are lazy, and they lie." (One can add other human traits too: they are mistaken, they disagree, etc.)

@shriramk @WiseWoman @Madagascar_Sky @sarahjamielewis

Those are very relevant observations, and this leads to the one large question the Semantic Web never answered: what does the incentive infrastructure look like? The few parts of the Semantic Web that provided a decent answer to that question were the ones that were successful: schema.org, Wikidata and the wider GLAM world, usage inside emails, inside organizations as a data integration technology.

@vrandecic @shriramk @WiseWoman @Madagascar_Sky @sarahjamielewis

Very interesting debate, I guess one upcoming incentive for semantic tech could be archiving and note keeping. For those that encounter so much info on feeds this could provide better private and collaborative means to collect stuff: ordering, selecting, describing, developing. I am thinking of the experience of using pinterest, tumblr, pinboard, obsidian, etc. as kind of cross-platform extensions to feeds and online collections.

@lukasfx @shriramk @WiseWoman @Madagascar_Sky @sarahjamielewis

Yes, I agree, I think Semantic Web technology is underused in the personal and collaborative note taking space. Hypothes.is is an interesting approach in that direction.

But in this space you have to explain how it differs from bookmarking extensions in your browser, delicious, and the Google Sidewiki, and why it would succeed where these things are not widely used today.

@vrandecic
@shriramk @WiseWoman @Madagascar_Sky @sarahjamielewis
the hope is we can build it p2p ;)
https://jon-e.net/infrastructure/
I'm v much on Aaron Swartz page in that the "people lie" argument is a strawman - if you're trying to make the semweb as a space of communication rather than "true" and uniform data, it is no longer the fatal problem it's presented as.
Decentralized Infrastructure for (Neuro)science

@sarahjamielewis Google has managed to get a bunch of sites to add RDF metadata to their sites in the last few years by calling it JSON-LD.

It seems that telling people it'll increase their search engine performance works pretty well.
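
To make the JSON-LD point concrete, here is a minimal sketch of what such embedded metadata can look like and why it counts as graph data. The article values are placeholders, and `to_script_tag` and `flat_statements` are hypothetical helpers written for this illustration; `flat_statements` is a naive walk over the structure, not a conforming JSON-LD processor.

```python
import json

# Placeholder metadata using schema.org vocabulary; the values are invented.
article = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "headline": "An Example Paper",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2023-01-01",
}


def to_script_tag(doc: dict) -> str:
    """Serialize the document into the script element that pages embed."""
    payload = json.dumps(doc, indent=2)
    # Escape "</" so the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def flat_statements(doc: dict, subject: str = "_:b0"):
    """Yield naive (subject, property, value) statements for nested objects."""
    counter = [0]  # mutable counter for blank-node labels

    def walk(node, subj):
        for key, value in node.items():
            if key == "@context":
                continue  # the context defines vocabulary, it is not a statement
            if isinstance(value, dict):
                counter[0] += 1
                child = f"_:b{counter[0]}"
                yield (subj, key, child)
                yield from walk(value, child)
            else:
                yield (subj, key, value)

    yield from walk(doc, subject)


if __name__ == "__main__":
    print(to_script_tag(article))
    for statement in flat_statements(article):
        print(statement)
```

The site author only ever sees a JSON blob in a script tag; the statement view is what a consumer of the data can recover from it.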

I think you describe the problem with the semantic web pretty well. It is not presented in a way people care about.

One should have presented it to researchers as: if you do it, your h-index will increase.

Please don't throw h-index hate my way for saying that. The reason it will increase is pretty simple: easier to use data implies more citations.

@sarahjamielewis NIH changed the rules on data sharing very recently so that all newly funded research will share all data in a timely way - including the code that generated the published results. It's not a standardization of language though, or of reporting which would have also helped but it's a start I think.
#Science #Epidemiology #NIH #SciencePolicy

@sarahjamielewis

Maybe if that had taken off, inventing the appropriate ontological tools would have effectively been == creating AI.

@red_concrete @sarahjamielewis yes, and it would have been a very different AI
@sarahjamielewis I blame the time wasted in Byzantine discussions on high order logic representations instead of investing time in useful tools for common developers. A wasted opportunity.
@sarahjamielewis Well, if the experience with the Semantic Web taught us anything, then it is that it is impossible to encode human knowledge in structured information. So, it is an alternate timeline in the same sense as Harry Potter being real could be an alternate timeline.
@sarahjamielewis I believe the main reason why the semantic web failed isn't proprietary formats, but the fact that people are too lazy to annotate and categorize content. As a result, we get at best tagging (without any kind of ontology) and most of the time just automatic classification from the content.
@sarahjamielewis It is not only about proprietary data. Still the most compatible format for the most thoroughly data-modelled domain possible, #bibliographic databases, is #BibTeX, which nobody would ever have suspected was meant seriously as a data format.
@sarahjamielewis I'm a total noob, but isn't it actually improving?
@sarahjamielewis I'm TRYING but I can't do everything alone 😭
@sarahjamielewis proprietary and even portable document formats... 🤣 😭😭😭😭
@sarahjamielewis @Dan_Blick “hallucinate the data”… can’t imagine what we are in for.

@sarahjamielewis Nice thread you launched there. Back in the day I was heavily involved at W3C and kind of TimBL's loyal opposition, a Semantic-Web skeptic. I still sort of am, but remain open-minded, there's a there there but we haven't found it. In this timeline anyhow.

Having said all that, I object to the phrase “entangled in meaningless prose”. That prose is the real payload, we are language-centric creatures.

@sarahjamielewis and worse: #research that is usually 100% tax-funded is #paywalled by #rentseeking publishers that charge obscene amounts to even read their publications at all.
@sarahjamielewis
check this out we're on the same page and there are a growing number of ppl working on making this real ❤️❤️
https://jon-e.net/surveillance-graphs/
Surveillance Graphs

Vulgarity and Cloud Orthodoxy in Linked Data Infrastructures - A critical history of the semantic web and linked data, grappling with the next generation of surveillance capitalism, where grand corporate knowledge graphs devour the planet and sell it back to us as glassy-eyed LLM personal assistants, will we remain stuck in the ideology of the cloud, or can we have better dreams?

@sarahjamielewis There is an alternate timeline where the semantic web took off and black hat SEOs had tons of fun spamming the crap out of it with fake entries promoting their products, and then people moved on to better platforms.