Mastodawn

Alexander Doria May 29, 2025

@alys @glyph Synthetic data is used for the instruct/specialized version (not the base model). And it always relies on permissive content as a seed.

Alexander Doria May 29, 2025

@alys @glyph No, the actual reason is that you can’t raise money for ethically trained data (also the leading factor why nothing else is happening). The models have now grown to be state of the art for retrieval-augmented generation. https://arxiv.org/pdf/2504.18225v1

Alexander Doria Feb 22, 2025

@ada @esm Hi. Common Corpus coordinator here. The base models were released very recently and we are now preparing the post-trained version, which will be usable for RAG and other use cases.

The corpus is recent too, but it is fast becoming a standard. At least 7-8 European LLMs will be trained at least partially on Common Corpus in 2025.

The latest version of the corpus includes document-level license attribution, so it is easy to see whether a document is PD or under a free license.

Alexander Doria Nov 17, 2023

@clive @_thegeoff Hi. MonadGPT creator here. Thanks a lot for the mention and the nice words, I’m really touched.

Alexander Doria Nov 17, 2023

Clive Thompson Nov 15, 2023

Pierre-Carl Langlais custom-trained a version of ChatGPT on English texts from the 17th century and earlier ...

... so that it speaks like a 17th-century learned monk ...

... whose factual knowledge of the world ends in the 17th century

It's called "Monad-GPT"

Here's a sample of the dialogue

Item #1 in my latest "Linkfest" newsletter, here: https://buttondown.email/clivethompson/archive/linkfest-13-17th-century-chatgpt-the-merovingian/

Linkfest #13: 17th-century ChatGPT, the "Merovingian" knot, and CT scans of knockoff Airpods

Welcome to the latest edition of the Linkfest! "The opposite of doomscrolling", as I call it 😅 Thank you for subscribing -- and if you’re enjoying it, spread...

Alexander Doria Nov 1, 2023

A small call for contributions: I’m looking for French reading-comprehension exercises under a free license. Ideally, a multiple-choice test based on texts on varied topics (literature, current affairs, economics, etc.), somewhat like the English exercises given in the final year of high school, but in French.

(And yes, it’s for evaluating LLMs.)

Alexander Doria Sep 27, 2023

@josquindebaz @adulau We were just talking about it this afternoon, but still no news (phew!)

Alexander Doria Sep 21, 2023

@Zestryon I think we will mostly see a clear split between a strictly non-commercial free ecosystem and more closed applications with strong guarantees against the risk of copyright infringement.

Alexander Doria Jul 31, 2023

Irénée Régnauld 🐑&🚀 Jul 31, 2023

After skimming it when it was published, I dove back into this paper by @Dorialexander on #ChatGPT ("How it works"). It is really well presented, very clear, and an ideal length. I highly recommend it!
https://scoms.hypotheses.org/1059

ChatGPT: how does it work?

Everyone is talking about it: ChatGPT is revolutionizing teaching, programming, propaganda, marketing, politics… And yet, who is ChatGPT? First of all, two different models, often confused. GPT is Generative Pre-trained Transformer 3, a giant text-prediction model trained by OpenAI on 500 billion words. GPT-3 is not only capable of writing …

Sciences communes

Alexander Doria Jul 1, 2023

@Caligans @LaurentChemla For me, it’s the whole feed that no longer refreshes (ironically, since this message)

Wikidata	https://www.wikidata.org/wiki/Q27538435
Blog	https://scoms.hypotheses.org/