The entire BBC In Our Time archive browsable by Dewey-Decimal code? Yes please
I made a website to find old episodes of In Our Time to listen to. There are almost a thousand; it's my starting point for any new topic
Very early, suggestions welcome
There's heavy use of GPT-3 in making this work! Both in extracting machine-readable data, and classifying episodes by library code
I feel like this programmatic use of LLMs is where AI gets really interesting
Details on the About page https://genmon.github.io/braggoscope/about
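The About page has the real details, but the gist of using an LLM as a categorising tool can be sketched like this. The prompt wording, the helper functions, and the example reply below are all illustrative assumptions, not the site's actual code:

```python
# Hedged sketch: prompting an LLM to assign a single Dewey Decimal
# class to an episode, then parsing its reply into structured data.
# build_prompt and parse_reply are hypothetical helpers for illustration.

def build_prompt(title: str, synopsis: str) -> str:
    """Ask the model for one Dewey Decimal code plus a label."""
    return (
        "Assign a Dewey Decimal classification to this radio episode.\n"
        f"Title: {title}\n"
        f"Synopsis: {synopsis}\n"
        "Answer in the form: <code> - <label>"
    )

def parse_reply(reply: str) -> tuple[str, str]:
    """Split a reply like '530 - Physics' into (code, label)."""
    code, _, label = reply.partition(" - ")
    return code.strip(), label.strip()

prompt = build_prompt(
    "The Photon",
    "Melvyn Bragg and guests discuss the quantum of light.",
)
# A model reply might look like this (example value, not a real API call):
code, label = parse_reply("530 - Physics")
print(code, label)  # 530 Physics
```

The point is that the model's free-text answer gets forced into a machine-readable shape, which is what makes the directory browsable.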
And some bonus material!
Here are the episodes on a chart (hover to see the title). Embeddings -> principal component analysis -> first 2 components (i.e. most significant) plotted. Similar episodes are "nearby". Code provided by OpenAI, I didn't do anything special here
Could this lead somewhere interesting? Thinking...
https://interconnected.org/more/2023/02/in_our_time-PCA-plot.html
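The pipeline behind that chart (embeddings, then PCA, then keep the first two components) can be sketched in a few lines. Random vectors stand in for the real OpenAI embeddings here, and the 1536 dimension is an assumption based on OpenAI's ada-002 embedding size:

```python
# Hedged sketch: reduce episode embeddings to their first two principal
# components so that similar episodes land nearby on a 2D scatter plot.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for ~1000 episode embeddings (real ones would come from the API)
embeddings = rng.normal(size=(1000, 1536))

pca = PCA(n_components=2)
points = pca.fit_transform(embeddings)  # (1000, 2): x/y for the plot

print(points.shape)
```

Each row of `points` is one episode's position on the chart; the two retained components are the directions of greatest variance in the embedding space.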
@genmon The scatter plot is super interesting. I want to learn more about the outliers! I’ve probably listened to every single episode that’s been podcasted. Beyond a Dewey-Decimal number, could you ask it for top 3-5 tags? Then we could find “Money” topics across economics, society, history, etc.
What IoT topics cut across multiple categories?
@briansuda as it happens I did also request tags! They're unreliable, it turns out -- it seems you need a well-known controlled vocab to pin it down. And GPT is really bad at assigning multiple, different topics to the same episode
Even when they did work, browsing wasn't significantly different from using "Similar episodes"
So I think maybe playing more with embedding space is the way forward. There's a technique called TCAVs I want to try
@jamesking no! that's the amazing thing -- there's tons of automation, really only possible because of GPT-3 as a web scraping and categorising tool
Details on the About page https://genmon.github.io/braggoscope/about
@thatandromeda @jamesking eyeball. There seem to be a few arguable placements, and one out-and-out GPT misfire that I've spotted so far (Lawrence of Arabia under History of the Ancient World)
A big problem with this technique is it's not very tuneable. So I'm looking for alternatives (still with automation)... there's a technique called TCAVs which is interesting (proximity in embedding space) but some digging required there
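One simple variant of that proximity-in-embedding-space idea (a mean-difference concept direction rather than TCAV's trained linear probe) might look like this. The vectors are random stand-ins, and the "money" framing is borrowed from the earlier reply, purely for illustration:

```python
# Hedged sketch: derive a "concept" direction in embedding space from a
# few positive examples, then score every episode by how far it projects
# along that direction. Not the site's actual code.
import numpy as np

rng = np.random.default_rng(1)
episodes = rng.normal(size=(100, 64))        # all episode embeddings
concept_examples = rng.normal(size=(5, 64))  # embeddings of "money"-ish episodes
background = rng.normal(size=(20, 64))       # random contrast set

# Concept vector: mean of the concept examples minus mean of the background
direction = concept_examples.mean(axis=0) - background.mean(axis=0)
direction /= np.linalg.norm(direction)

scores = episodes @ direction                # projection of each episode
top = np.argsort(scores)[::-1][:5]           # five most "money"-like episodes
print(top)
```

Unlike a one-shot GPT classification, the direction itself is tuneable: change the example episodes and the whole ranking shifts with them.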
@genmon I hope that's not the only way of navigating through this data.
#Dewey is really a very bad idea in general, and this dirty workaround should be replaced by more practical alternatives as soon as possible.
I really don't get why so many people still use #DCC as a hierarchy. Its structure is a frozen snapshot of hundred-year-old bias that hasn't reflected our reality for many decades.
https://en.wikipedia.org/wiki/Dewey_Decimal_Classification#Influence_and_criticism
@G0OXO @genmon Yes, I know that we've got many old-school systems that are so deep down in DCC that it's hard to migrate to a better concept. However, that doesn't make DCC any better. It's an old dinosaur that refuses to get extinct. 😔
On https://www.reddit.com/r/datacurator/ there is a fanbase of DCC, applying it even for computer file management. 🤦‍♂️ 🤷
@feelinglistless @magslhalliday yeah here’s the directory!
https://genmon.github.io/braggoscope/directory
(Check the About page for how it works)