Here's an experiment in prompting #Claude to write an #llm-powered #TEI #annotation pipeline with #evaluation. I also prompted it to run an experiment comparing the performance of Gemini 2.0 Flash and Llama 3.3 70b (via saia.gwdg.de) when annotating <tei:bibl> elements.

https://github.com/cboulanger/tei-annotator
https://github.com/cboulanger/tei-annotator/blob/main/docs/batch-annotation-experiment.md

Very early stage, but promising results so far. Don't want to reinvent the wheel though - what other projects (besides #Grobid) engage in LLM/ML-based TEI annotation?

On 26-27 November we held the #Grobid Camp at the Centre de #Inria Paris.
The goal was to bring together the major players in the French community, which spans government institutes, companies and large-scale projects.
1/4

We use #grobid and the plos1000 #goldstandard as a baseline to compare the performance of LLM-based solutions.

Takeaways:

- Grobid is still the better choice for literature similar to the type it was trained on (mostly English-language STEM scholarship), since it is much faster and less resource-intensive
- For footnoted literature, experiments with LLamore/#Gemini show 3x better performance
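For readers wondering how such a comparison against a gold standard can be scored, here is a minimal sketch in Python. It is not the evaluation code used in these experiments: the field names, the position-based alignment of references and the normalisation rule are all assumptions made for illustration.

```python
from itertools import zip_longest


def normalize(value):
    """Lowercase and collapse whitespace; None becomes an empty string."""
    if value is None:
        return ""
    return " ".join(str(value).split()).lower()


def field_scores(gold, predicted):
    """Field-level precision, recall and F1.

    `gold` and `predicted` are lists of dicts, one dict per reference,
    assumed to be aligned by position (a simplification; real pipelines
    usually need to match references to each other first).
    """
    tp = fp = fn = 0
    for g, p in zip_longest(gold, predicted, fillvalue={}):
        for field in set(g) | set(p):
            g_val = normalize(g.get(field))
            p_val = normalize(p.get(field))
            if g_val and p_val and g_val == p_val:
                tp += 1  # field extracted and correct
            else:
                if p_val:
                    fp += 1  # field extracted but wrong or spurious
                if g_val:
                    fn += 1  # gold field missed or extracted wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example data, not taken from any real gold standard
gold = [{"author": "Lovelace", "title": "An example article", "date": "1843"}]
predicted = [{"author": "Lovelace", "title": "An Example  Article", "date": "1844"}]
scores = field_scores(gold, predicted)
```

In this toy case two of the three fields match after normalisation, so precision, recall and F1 all come out at 2/3. A wrong value counts both as a false positive and as a false negative, which is one common convention but not the only one.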

Reminder: PKP invites communities to register for its Software Development Update webinar on December 16th, 2024, at 8 AM PST.

Topics

> #OpenJournalSystems (#OJS) / #OpenMonographPress (OMP) / #OpenPreprintSystems (OPS) version 3.5.0 preview and release timeline

> Typesetting workflow
> Tasks and Discussions
> Receiving emails in OJS
> Breaking the upload / download pattern with #WebDAV
> Pre-filling metadata automatically with #Grobid

Registration: https://pkp.sfu.ca/2024/11/28/pkp-software-development-update-registration-for-december-16-2024/

⚙️ You are invited to PKP's next Software Development Update webinar!

December 16 2024, 8 AM PST

Topics

* #OpenJournalSystems (#OJS) / #OpenMonographPress (OMP) / #OpenPreprintSystems (OPS) v3.5.0 preview and release timeline

* Typesetting workflow

* Tasks / Discussions

* Receiving emails in OJS

* Breaking upload / download pattern with #WebDAV

* Pre-filling #metadata automatically with #Grobid

Details and registration:

https://pkp.sfu.ca/2024/11/28/pkp-software-development-update-registration-for-december-16-2024/

Hope to meet you there!

@osma @storytracer Hi, just found this old thread. We're working on a #referenceextraction & #evaluation workflow that uses a hand-annotated dataset of older scholarly articles with #footnotes to measure the performance of #LLMs. Untrained #GROBID performs very badly on this material, but that does not mean it would still do so once properly trained on a good dataset.

Do you want to run the #GROBID PDF-to-#TEI conversion library/server with #Apptainer, for example for #ReferenceExtraction? There was a problem converting the #Docker image, but here's how to solve the problem: https://github.com/kermitt2/grobid/issues/1150#issuecomment-2350942263

Curious surprise!

Grobid has started using LaTeXML for processing LaTeX inputs (I think just recently), as part of its TEI-based pipeline.

Details at:
https://grobid.readthedocs.io/en/latest/Principles/

#TeXLaTeX #latexml #grobid #TEI

Has anyone used large language models for extracting (#bibliographic style, e.g. #DublinCore) #metadata from fulltext (PDF) documents? I tried this with a fine-tuned #OpenAI #GPT3 Curie model and the results were outrageously good at least for doctoral theses. Much better than traditional NLP methods like #GROBID.

#AI #machinelearning #LLM

[#BeautifulSoup #Pandas] Parsing TEI XML documents [from #grobid] with Python | Data, code and science https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/
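As a rough illustration of what parsing GROBID-style TEI looks like, here is a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` rather than the BeautifulSoup/Pandas approach of the linked blog post. The sample document is hand-written for this example and only imitates the shape of GROBID's reference output; the function name and the returned dict keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; elements in TEI documents live in it
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample (an assumption, not real GROBID output)
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct>
      <analytic>
        <title level="a" type="main">An example article</title>
        <author>
          <persName>
            <forename type="first">Ada</forename>
            <surname>Lovelace</surname>
          </persName>
        </author>
      </analytic>
      <monogr>
        <title level="j">Journal of Examples</title>
        <imprint><date type="published" when="1843"/></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def extract_references(tei_xml):
    """Return one dict per <biblStruct> found inside a <listBibl>."""
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        title = bibl.find("tei:analytic/tei:title", TEI_NS)
        journal = bibl.find("tei:monogr/tei:title", TEI_NS)
        date = bibl.find("tei:monogr/tei:imprint/tei:date", TEI_NS)
        surnames = []
        for author in bibl.iterfind(".//tei:author", TEI_NS):
            surname = author.find(".//tei:surname", TEI_NS)
            if surname is not None and surname.text:
                surnames.append(surname.text)
        refs.append({
            "title": title.text if title is not None else None,
            "journal": journal.text if journal is not None else None,
            "year": date.get("when") if date is not None else None,
            "authors": surnames,
        })
    return refs


references = extract_references(SAMPLE_TEI)
```

The key point is the namespace mapping: without the `tei:` prefix and `TEI_NS`, the lookups would silently find nothing.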