I have asked Claude Opus 4.6 (via GitHub Copilot Chat) to summarize various approaches to XML-plaintext-NLP-XML roundtripping, providing it with the respective GitHub repositories (listed in the report).

Claude finds five different approaches. IMHO it skims over some points where it should have gone into detail, but as an overview it is quite good. What do you think?

https://pad.gwdg.de/wwNnTvaETHKuzFiyhIYHog?view#Response-Report-Approaches-to-XML%E2%86%94Plaintext-Conversion-with-Annotation-Preservation

@eeditiones @davidlassner @TEIConsortium
@aboutgeo @cmboulanger

#TEIXML #NLP #StandoffAnnotation #TEIPublisher #Recogito

@anwagnerdreas @eeditiones @davidlassner @TEIConsortium @cmboulanger

Thanks for sharing! I'll read through the details later with interest!

FWIW: Claude's assessment of "Family E" is perhaps based more on the eeditiones repo. In any case, it relates only vaguely to text-annotator-js ;-)

As far as Recogito is concerned, here's what we are using for exactly the described use case instead:

https://github.com/recogito/tei-standoffconverter-js

It's essentially a TypeScript port of the "Family A" code, slightly modified to our use case.
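
For readers new to the idea, here is a minimal Python sketch of the general stand-off technique: flatten the XML to plaintext while recording the character span of every element, so that offsets coming back from an NLP tool can be related to the markup again. It is not the API of tei-standoffconverter-js or of the "Family A" code; the names `flatten` and `enclosing` are made up for illustration, and it only covers the XML-to-plaintext direction.

```python
import xml.etree.ElementTree as ET


def flatten(xml_string):
    """Return (plaintext, spans); spans are (tag, start, end) character offsets."""
    root = ET.fromstring(xml_string)
    chunks = []  # text fragments in document order
    spans = []   # one entry per element, appended when the element closes
    pos = 0      # current offset into the plaintext

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text follows the child but still belongs to the parent.
            if child.tail:
                chunks.append(child.tail)
                pos += len(child.tail)
        spans.append((elem.tag, start, pos))

    walk(root)
    return "".join(chunks), spans


def enclosing(spans, start, end):
    """Tags of all elements whose span fully contains [start, end)."""
    return [tag for tag, s, e in spans if s <= start and end <= e]


xml = "<p>Hello <hi>dear</hi> World</p>"
text, spans = flatten(xml)
print(text)                        # Hello dear World
print(spans)                       # [('hi', 6, 10), ('p', 0, 16)]
print(enclosing(spans, 11, 16))    # ['p']
```

An NER hit on `text[11:16]` ("World") can then be located inside `<p>` but outside `<hi>`. Writing the new annotation back into the XML tree is the harder half of the roundtrip and is left out here.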

@aboutgeo Ah, sorry, I was just under the general impression that Recogito does something like this, but wasn't sure at all where the respective code could be found. So I submitted a couple of Recogito repos that all sounded like good candidates. It seems, for whatever reason, I missed the one you mentioned. Thanks for pointing me to it.

@anwagnerdreas no worries – the code is pretty scattered and there are a lot of repositories :-)

text-annotator-js is specifically for handling (manual) annotation interaction on TEI/XML rendered with CETEIcean in the browser. There's only a bit of mapping between DOM selection ranges and TEI XPath expressions.

The other repo I shared is for your exact use case, and we use it to generate "text-annotator-js-compatible" annotations from NER.

Happy to chat about things if you want to know more about it!

@aboutgeo

Yes, that's also something I noticed in the report: it seems somewhat arbitrary where Claude goes into detail and actually looks things up, and where it relies on hunches, speculation, or keywords that merely suggest how something is solved. Well, running the whole thing in GitHub Copilot gives it access to the explicitly provided repositories, but prevents it from doing other web searches. Maybe it would have been better to put the question in a different setting with autonomous web search enabled...

@aboutgeo @anwagnerdreas @eeditiones @davidlassner @TEIConsortium @cmboulanger It's not full round-tripping, but perhaps of interest: at the KNAW Humanities Cluster we developed a pipeline for the direction TEI XML (or any XML, for that matter) to plain text and stand-off annotations (supporting W3C Web Annotations). We do so using STAM (https://annotation.github.io/stam/). One of its tools (`stam fromxml`) allows us to define a mapping from XML to text and stand-off annotations: https://annotation.github.io/stam/specs/tools/docs/fromxml/

@proycon @aboutgeo @eeditiones @davidlassner @TEIConsortium @cmboulanger

Right, thank you! In fact, in an earlier chat I had included STAM, here is a comparison where STAM features as well: https://github.com/copilot/share/081e422c-0a24-80f7-b010-ac40c481281f

Since at the time I had forgotten to include Recogito, and had not considered that STAM does not do full roundtripping, I had it take another go, which produced the report I linked in the original toot...

@anwagnerdreas
Ha, it was nice that the overview cited the "tei-annotator" as "Remarkable: The context-anchor resolution approach (finding spans by searching for context strings in the source) is unique and necessary because LLMs don’t produce reliable character offsets." While I wrote none of the code, the idea actually came from my prompt and wasn't proposed by the agent. 🙂
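
To illustrate the idea in the quote, here is a rough Python sketch of context-anchor resolution: instead of trusting character offsets from the LLM, ask it for the annotated string plus a bit of surrounding context and search for that in the source. This is not the tei-annotator code; the function `resolve_anchor` and its parameters `before` and `after` are hypothetical, and real input would probably also need whitespace normalisation or fuzzy matching.

```python
def resolve_anchor(source, text, before="", after=""):
    """Return (start, end) of `text` in `source`, using context to disambiguate.

    Returns None if the anchored string is absent or matches more than once.
    """
    needle = before + text + after
    first = source.find(needle)
    if first == -1:
        return None  # the model quoted something that is not in the source
    if source.find(needle, first + 1) != -1:
        return None  # still ambiguous: more context is needed
    start = first + len(before)
    return start, start + len(text)


source = "Paris met Helen. Later, Paris burned."

# "Paris" alone occurs twice, so the anchor cannot be resolved.
print(resolve_anchor(source, "Paris"))                    # None

# With preceding context the second mention is uniquely identified.
print(resolve_anchor(source, "Paris", before="Later, "))  # (24, 29)
```

Refusing ambiguous matches rather than picking the first hit is a deliberate choice in this sketch: a silently misplaced annotation seems worse than a dropped one.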