Finally, I was pleasantly surprised to see such a large number of people using and talking about #Grobid.
3/4

This builds on the foundational harvesting work by Patrice Lopez & James Howison (SoftCite project), and is a collaboration with @DFKI, @HUBerlin, @CommonCrawl & Uni Mannheim.

Attending LREC? Let's connect!👋

#NLP #ScientificNLP #MultilingualNLP #SciLaD #ScienciaLAB #grobid
5/5

5/ Infrastructure: JDK 21, Gradle 9, TensorFlow 2.17 (Python 3.10–3.11), pdfalto 0.6.0, wapiti 1.5.1, virtualenv/conda support for DeLFT.
6/ Full release notes → https://github.com/kermitt2/grobid/releases/tag/0.9.0

#GROBID #OpenSource #NLP #ScholarlyInfrastructure
6/6

Release 0.9.0 · grobidOrg/grobid (GitHub)

What's Changed. Added: Conflict of interest and author contributions statement extraction in header and segmentation models (#1319); Extract figures, tables and equations from back/annex sections (#1215)...
Under the hood it's powered by #Grobid, a battle-tested machine learning library for extracting structured data from scientific documents. The same technology used to process millions of PDFs at scale — now doing one job really well, in one click.
5/7

Here's an experiment with prompting #Claude to write an #llm-powered #TEI #annotation pipeline with #evaluation. I also prompted it to run an experiment comparing the performance of Gemini 2.0 Flash vs. Llama 3.3 70b (via saia.gwdg.de) when annotating <tei:bibl> elements.

https://github.com/cboulanger/tei-annotator
https://github.com/cboulanger/tei-annotator/blob/main/docs/batch-annotation-experiment.md

Very early stage, but promising results so far. Don't want to reinvent the wheel though - what other projects (besides #Grobid) engage in LLM/ML-based TEI annotation?

On 26-27 November we held the #Grobid Camp at the Centre de #Inria Paris.
The goal was to bring together the major players in the French community, which spans government institutes, companies and large-scale projects.
1/4

We use #grobid and the plos1000 #goldstandard as a baseline to compare the performance of LLM-based solutions.

Takeaways:

- Grobid is still the better choice for literature similar to the type it was trained on (mostly English-language STEM scholarship), since it is much faster & less resource-intensive
- For footnoted literature, experiments with LLamore/#Gemini show 3x better performance
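
To make the kind of comparison behind these takeaways concrete, here is a minimal sketch of scoring extracted references against a gold standard. It assumes exact matching after light normalization; the function names `normalize` and `score` are hypothetical, and this is not necessarily how the experiments above were scored.

```python
import re


def normalize(ref: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    ref = re.sub(r"[^\w\s]", " ", ref.lower())
    return " ".join(ref.split())


def score(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Precision, recall and F1 of predicted reference strings against gold ones.

    Assumption: a prediction counts as correct only if its normalized form
    exactly equals a normalized gold reference (no fuzzy or field-level match).
    """
    pred_set = {normalize(r) for r in predicted}
    gold_set = {normalize(r) for r in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running `score` once per tool (for example, once on Grobid output and once on LLM output) over the same gold set gives directly comparable numbers. A real evaluation would likely compare individual fields such as author, title and year rather than whole strings.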

Reminder: PKP invites communities to register for its Software Development Update webinar on December 16th, 2024, at 8 AM PST.

Topics

> #OpenJournalSystems (#OJS) / #OpenMonographPress (OMP) / #OpenPreprintSystems (OPS) version 3.5.0 preview and release timeline

> Typesetting workflow
> Tasks and Discussions
> Receiving emails in OJS
> Breaking the upload / download pattern with #WebDAV
> Pre-filling metadata automatically with #Grobid

Registration: https://pkp.sfu.ca/2024/11/28/pkp-software-development-update-registration-for-december-16-2024/

⚙ You are invited to PKP's next Software Development Update webinar!

December 16 2024, 8 AM PST

Topics

* #OpenJournalSystems (#OJS) / #OpenMonographPress (OMP) / #OpenPreprintSystems (OPS) v3.5.0 preview and release timeline

* Typesetting workflow

* Tasks / Discussions

* Receiving emails in OJS

* Breaking upload / download pattern with #WebDAV

* Pre-filling #metadata automatically with #Grobid

Details and registration:

https://pkp.sfu.ca/2024/11/28/pkp-software-development-update-registration-for-december-16-2024/

Hope to meet you there!

@osma @storytracer Hi, just found this old thread. We're currently working on a #referenceextraction & #evaluation workflow involving #LLMs to measure their performance using a hand-annotated dataset of older scholarly articles with #footnotes. Untrained #GROBID performs very badly, but that does not mean it will once properly trained on a good dataset.