I have asked Claude Opus 4.6 (via GitHub Copilot Chat) to summarize various approaches to XML-plaintext-NLP-XML roundtripping, providing it with the respective GitHub repositories (listed in the report).

Claude finds FIVE different approaches. IMHO, it skims over some points where it should have gone into detail, but as an overview it is quite good. What do you think?

https://pad.gwdg.de/wwNnTvaETHKuzFiyhIYHog?view#Response-Report-Approaches-to-XML%E2%86%94Plaintext-Conversion-with-Annotation-Preservation

@eeditiones @davidlassner @TEIConsortium
@aboutgeo @cmboulanger

#TEIXML #NLP #StandoffAnnotation #TEIPublisher #Recogito

(TEI) XML plaintext Roundtripping Review - HedgeDoc

@anwagnerdreas @eeditiones @davidlassner @TEIConsortium @cmboulanger

Thanks for sharing! I'll read through the details later with interest!

FWIW: Claude's assessment of "Family E" is perhaps based more on the eeditiones repo. At any rate, it relates only vaguely to text-annotator-js ;-)

As far as Recogito is concerned, here's what we are using for exactly the described use case instead:

https://github.com/recogito/tei-standoffconverter-js

It's essentially a TypeScript port of the "Family A" code, slightly adapted to our use case.
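For readers new to the idea, here is a minimal Python sketch of the general standoff technique: flatten the XML to plain text while recording each element's character span. It is not the code of tei-standoffconverter-js or of the "Family A" project; the function name `flatten` and the sample sentence are made up for illustration, and real TEI would also need namespace handling (ElementTree reports tags as `{namespace}name`).

```python
import xml.etree.ElementTree as ET


def flatten(xml_string):
    """Return (plain_text, spans); each span is (tag, start, end) in plain_text."""
    root = ET.fromstring(xml_string)
    chunks = []  # text fragments in document order
    spans = []   # standoff table: one entry per element
    pos = 0      # current character offset in the plain text

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text belongs to the parent, so it is added after the child closes.
            if child.tail:
                chunks.append(child.tail)
                pos += len(child.tail)
        spans.append((elem.tag, start, pos))

    walk(root)
    return "".join(chunks), spans


# Hypothetical sample input, not taken from any of the repositories discussed.
text, spans = flatten("<p>Visit <placeName>Rome</placeName> in <date>1820</date>.</p>")
for tag, start, end in spans:
    print(tag, start, end, repr(text[start:end]))
```

An NLP tool can then work on `text` alone, and its character offsets can be compared against `spans` to decide where new markup would go.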

GitHub - recogito/tei-standoffconverter-js: Convert between TEI/XML and plaintext without losing markup context.

@aboutgeo @anwagnerdreas @eeditiones @davidlassner @TEIConsortium @cmboulanger It's not full round-tripping, but perhaps of interest: at the KNAW Humanities Cluster we developed a pipeline for one direction, from TEI XML (or any XML, for that matter) to plain text plus stand-off annotations (supporting W3C Web Annotations). We do so using STAM (https://annotation.github.io/stam/); one of its tools, `stam fromxml`, lets us define a mapping from XML to text and stand-off annotations: https://annotation.github.io/stam/specs/tools/docs/fromxml/
STAM: Stand-off Text Annotation Model

STAM is a standalone data model for stand-off annotation on text. It allows you to describe annotations on text in your own terms and offers practical tooling to do so.

@proycon @aboutgeo @eeditiones @davidlassner @TEIConsortium @cmboulanger

Right, thank you! In fact, in an earlier chat I had included STAM, here is a comparison where STAM features as well: https://github.com/copilot/share/081e422c-0a24-80f7-b010-ac40c481281f

At the time I had forgotten to include Recogito, and had not taken into account that STAM does not do full roundtripping, so I had it have another go with the link I posted in the original toot...

@anwagnerdreas
Ha, it was nice that the overview cited the "tei-annotator" as "Remarkable: The context-anchor resolution approach (finding spans by searching for context strings in the source) is unique and necessary because LLMs don’t produce reliable character offsets." - although I wrote none of the code, that idea actually came from my prompt rather than from the agent. 🙂
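To make the quoted idea concrete, here is a rough Python sketch of context-anchor resolution as described in that sentence: the model returns the annotated string plus a bit of surrounding context, and the offsets are recovered by searching the source for that context. This is not tei-annotator's actual code; `resolve_anchor`, its parameters, and the sample sentence are hypothetical.

```python
def resolve_anchor(source, context, target):
    """Return (start, end) of target in source, located via its context, or None."""
    ctx_start = source.find(context)
    if ctx_start == -1:
        return None  # context string does not occur in the source
    if source.find(context, ctx_start + 1) != -1:
        return None  # context occurs more than once: refuse rather than guess
    offset = context.find(target)
    if offset == -1:
        return None  # target is not part of the context it came with
    start = ctx_start + offset
    return start, start + len(target)


# "Paris" occurs twice; the context string picks out the second occurrence.
source = "Paris wrote to Helen. Later, Helen sailed to Paris."
span = resolve_anchor(source, "sailed to Paris", "Paris")
print(span, source[span[0]:span[1]])
```

Returning `None` for a missing or ambiguous context is a design choice of this sketch: the caller can then ask the model for a longer context instead of silently annotating the wrong span.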