Luca

@lfoppiano

- Paper: https://arxiv.org/abs/2512.11192
- Models: https://huggingface.co/collections/scilons/scilons-models
- Dataset: https://huggingface.co/collections/scilons/scilons-datasets (available as cleaned English-only text, TEI XML, JSON, and Markdown)

4/4

SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding, including scholarly document processing.

arXiv.org
Finally, I was pleasantly surprised to see such a large number of people using and talking about #Grobid.
3/4
Moreover, we met in person after working remotely for a few years. Tall people don't look that tall on a video conference.
2/4
Great experience at #LREC2026! With colleagues and friends from DFKI and Common Crawl, we presented our paper on the SciLaD 🥗 corpus and models: https://arxiv.org/abs/2512.11192
1/4
5/ ⭐ One last thing: if you find GROBID useful, please star us on GitHub — it goes a long way in helping the project grow.
https://github.com/grobidOrg/grobid
Come help shape what's next. 🙌
5/5
4/ 🎨 And finally — we're refreshing the GROBID logo, and you get to pick.
Vote here → https://forms.gle/aGDNma9QznnwhUbB9
4/5
GROBID Logo

We are looking to select the GROBID logo. Something not serious and slightly stereotypically French.

Google Docs
3/ 💬 Prefer real-time chat? Join our new Discord server to hang out with users, contributors, and maintainers.
Invite → https://discord.gg/yuEaC4tYnz
3/5
2/ 📬 We've launched an official mailing list (EN/FR) for announcements, discussions, and Q&A.
Subscribe → https://groupes.renater.fr/sympa/info/grobid
2/5
1/ 📣 Three GROBID community updates in one go:
📬 New mailing list
💬 New Discord server
🎨 Logo vote is open
Details below 👇
1/5

This builds on the foundational harvesting work by Patrice Lopez & James Howison (SoftCite project), and is a collaboration with @DFKI, @HUBerlin, @CommonCrawl & Uni Mannheim.

Attending LREC? Let's connect!👋

#NLP #ScientificNLP #MultilingualNLP #SciLaD #ScienciaLAB #grobid
5/5