Mastodawn

King Naga Calyo Lucere-Delphi Aug 11, 2025

Last night thanks to @orman I had the epiphany of why:

XML parsers don't use or even need regex to parse XML in the first place.

XML parsers go through the text one char at a time. When they hit a <, >, </, or />, those chars act as flags that tell the parser whether it's entering or leaving a tag, and whether that tag is a closing or self-closing one. Each flag switches the parsing rules, and the parser builds a node tree on the fly.

This will be useful for the UTC.
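
A minimal sketch of the char-at-a-time idea described above, assuming a tiny XML-like syntax with no attributes, comments, or entities. The function name parseTree and the { name, children } node shape are made up for illustration; nothing here is the actual UTC code or how any real XML parser is implemented.

```javascript
// Hypothetical sketch: one pass over the input, one character at a time.
// Assumes tags look like <name>, </name>, or <name/> and nothing fancier.
function parseTree(input) {
  const root = { name: "#root", children: [] };
  const stack = [root]; // open elements; the last entry is the current parent
  let text = "";        // literal text collected outside of tags
  let tag = "";         // characters collected between < and >
  let inTag = false;

  // Attach any pending literal text to the current parent node.
  const flushText = () => {
    if (text !== "") {
      stack[stack.length - 1].children.push(text);
      text = "";
    }
  };

  for (const ch of input) {
    switch (ch) {
      case "<": // flag: entering a tag
        if (inTag) throw new Error("unexpected < inside a tag");
        flushText();
        inTag = true;
        tag = "";
        break;
      case ">": // flag: leaving a tag
        if (!inTag) {
          text += ch; // a stray > outside a tag is treated as plain text
          break;
        }
        inTag = false;
        if (tag.startsWith("/")) {
          // Closing tag: it must match the innermost open element.
          const name = tag.slice(1).trim();
          if (stack.length < 2 || stack[stack.length - 1].name !== name) {
            throw new Error("mismatched closing tag: " + name);
          }
          stack.pop();
        } else {
          // Opening or self-closing tag: add a node under the current parent.
          const selfClosing = tag.endsWith("/");
          const name = (selfClosing ? tag.slice(0, -1) : tag).trim();
          const node = { name, children: [] };
          stack[stack.length - 1].children.push(node);
          if (!selfClosing) stack.push(node);
        }
        break;
      default:
        if (inTag) tag += ch;
        else text += ch;
    }
  }

  if (inTag) throw new Error("unterminated tag");
  if (stack.length !== 1) throw new Error("unclosed element: " + stack[stack.length - 1].name);
  flushText();
  return root;
}
```

Calling parseTree("a<b>hi<br/></b>c") gives a root whose children are the text "a", a b node holding "hi" plus an empty br node, and the text "c". The stack is what lets a single loop keep track of nesting without recursion.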

King Naga Calyo Lucere-Delphi Aug 11, 2025

Although DOMParser() does not provide the functionality I want, I can build my own rudimentary parser that does, and that builds a DOM with the structure I need for turning raw input into formatted output.

@orman

King Naga Calyo Lucere-Delphi

This approach is actually going to be A LOT more flexible than attempting to develop The One Regex To Rule Them All for the UTC, and it'll also be A LOT less computationally expensive, since parsing a string by iterating straight through it is just a for loop with a switch statement inside.

I'm pretty sure String.prototype.matchAll() also walks the string from start to end, but with the added overhead of a regex engine trying to match patterns starting from where each previous match ends.

@orman

Rubber Nero BLN Aug 11, 2025

@dragonarchitect @orman What about a SAX based parser? One pass, event driven.
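
For anyone following along, a rough sketch of what "one pass, event driven" could look like, assuming the same tiny tag syntax (<name>, </name>, <name/>, no attributes). The name saxParse and the onOpen / onClose / onText handler names are invented for this example and are not taken from any real SAX library.

```javascript
// Hypothetical sketch of a SAX-style parser: it never builds a tree,
// it only reports what it sees, in order, through callbacks.
function saxParse(input, handlers) {
  const { onOpen = () => {}, onClose = () => {}, onText = () => {} } = handlers;
  let pos = 0;

  while (pos < input.length) {
    const lt = input.indexOf("<", pos);
    if (lt === -1) {
      onText(input.slice(pos)); // no more tags: the rest is literal text
      break;
    }
    if (lt > pos) onText(input.slice(pos, lt)); // text before the next tag

    const gt = input.indexOf(">", lt);
    if (gt === -1) throw new Error("unterminated tag at index " + lt);

    const body = input.slice(lt + 1, gt); // everything between < and >
    if (body.startsWith("/")) {
      onClose(body.slice(1).trim());
    } else if (body.endsWith("/")) {
      // A self-closing tag is reported as an open immediately followed by a close.
      const name = body.slice(0, -1).trim();
      onOpen(name);
      onClose(name);
    } else {
      onOpen(body.trim());
    }
    pos = gt + 1;
  }
}
```

The caller decides what to do with each event, for example pushing formatted output onto an array as the events arrive. Whether tags are balanced is left to the handlers in this sketch.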

Orman Aug 11, 2025

@rubber_nero_bln @dragonarchitect a one-pass approach was kinda what I was suggesting in the Discord conversation because outright constructing a parse tree is kinda overly elaborate, and a top down approach involves heavy recursion which I'm not sure would be approachable

King Naga Calyo Lucere-Delphi Aug 11, 2025

@orman @rubber_nero_bln Yeah, I was thinking of a one-pass approach as well: just use the angle brackets as literal flags in the input to switch the parsing logic, then go word by word and process the input accordingly.

Orman Aug 11, 2025

@dragonarchitect @rubber_nero_bln you can technically do it that way, but IMO it'll be a lot more ergonomic to do one and a half passes: use matchAll to find where all the tags are, then use the indices you get to chop the input into a sequence of tags and literal text fragments. Otherwise you effectively end up with two parsers nested inside each other, because you have to parse each tag one character at a time and then, on top of that, parse the tree from each tag as the inner parse loop completes it
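
A small sketch of the "one and a half passes" idea, under the assumption that a tag is anything between a < and the next > with no angle brackets inside. The regex, the name splitTagsAndText, and the { type, value } piece shape are all illustrative guesses, not something settled in this conversation.

```javascript
// Hypothetical sketch: find every tag with matchAll, then use the match
// indices to chop the input into an ordered list of tags and text fragments.
function splitTagsAndText(input) {
  const tagPattern = /<[^<>]*>/g; // assumed, deliberately simple; matchAll requires the g flag
  const pieces = [];
  let last = 0; // index just past the previous tag

  for (const match of input.matchAll(tagPattern)) {
    // Anything between the previous tag and this one is literal text.
    if (match.index > last) {
      pieces.push({ type: "text", value: input.slice(last, match.index) });
    }
    pieces.push({ type: "tag", value: match[0] });
    last = match.index + match[0].length;
  }

  // Trailing text after the final tag, if any.
  if (last < input.length) {
    pieces.push({ type: "text", value: input.slice(last) });
  }
  return pieces;
}
```

For "a<b>hi</b>c" this yields five pieces in order: text "a", tag "<b>", text "hi", tag "</b>", text "c". A second, much simpler loop over that list can then handle nesting without ever looking at individual characters.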

Orman Aug 11, 2025

@dragonarchitect @rubber_nero_bln the pure one-pass method is technically the more efficient one, but the regex involved should be simple enough, and the inputs short enough, that it won't matter