A thing that has always frustrated me about GitHub/Bitbucket, as a language designer, is that you can't teach the forge to syntax highlight files in your own custom formats.

Now the existence of Codeberg/git.gay means I could potentially send a PR to Forgejo to add this feature, and it would land on the forges I actually use. Perhaps at some point I will do this.

Anybody know off the top of their heads what syntax-highlighter format Codeberg/Forgejo even uses?

Oh. it's… oh.

It's… custom Go code… on a per-language basis. They use something called Chroma, and the way Chroma works is that custom lexers get written in Go for each language they want to support. Um. Hm.

This is actually the one single approach they could have attempted which prevents custom pluggable highlighters on a per-repo basis.

https://hey.hagelb.org/@technomancy/statuses/01KNQJ9H3R64BEHE1QWNBXZKVW

@mcc last I checked it was https://github.com/alecthomas/chroma ;  I remember sending a patch to support Fennel and it was handled pretty promptly

@mcc you would really think that "one widely-supported declarative non-executable grammar format for syntax highlighting" would be a solved problem by now but it kinda feels like tree sitter is sucking up all the oxygen in that space; don't love how that's going

@technomancy Do you have negative opinions about treesitter, and if so, why?

@mcc the main thing is that grammars are opaque blobs of executable code, which sucks! technically you can often compile them from a declarative data source but afaict you can't do this without npm

a well-designed format would make the unit of distribution a purely declarative data format but instead we ended up with this situation where you could install a grammar that segfaults your editor or steals your SSH keys; gross!

@technomancy in treesitter, they are?

yikes

@mcc @technomancy The parsing itself is driven by a state machine, and you do get a description of those in pure-data JSON files. So in theory you could reimplement the parsing algorithm in another language and not have to dlopen random .so files.

But no one has tried that, because the scanners are largely implemented as arbitrary C/C++ code, and so reimplementing the parser driver wouldn't be enough.

@dcreager @technomancy Well, if I'm implementing my scanner on purpose, and I'm intending to deploy in a situation , is this a viable option for like… me? (Let's never mind that codeberg doesn't actually use treesitter.)

@mcc @technomancy Hmm, not sure I completely understand the question, so apologies in advance if I answer the wrong thing.

So if the question is "could I write a pure-Go tree-sitter clone, which would work with existing grammars, and which would support a runtime-open set of grammars without having to load untrusted .so files", then I think the short answer is "close but not quite".

@mcc @technomancy

When you run `tree-sitter generate`, you get a parser.c file, but if you look at it, there's no _code_ in it, it's just a bunch of static arrays encoding the parse table. The actual parsing code is in the tree-sitter runtime library [1], and you could definitely port that to any language you choose.

[1] https://github.com/tree-sitter/tree-sitter/blob/master/lib/src/parser.c
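
To make the "tables as data, driver as a portable library" split concrete, here is a minimal sketch in Go. It is not tree-sitter's table format or algorithm (tree-sitter's runtime is considerably more involved); it is a hand-built LR table for a toy grammar, `S → ( S ) | x`, and every name in it (`Table`, `Key`, `Parse`, `parenTable`) is hypothetical. The point is only that the grammar-specific part is plain data and the driver loop knows nothing about the grammar.

```go
package main

import "fmt"

// ActionKind says what the driver does for a (state, lookahead) pair.
type ActionKind int

const (
	Shift ActionKind = iota
	Reduce
	Accept
)

// Action is one parse-table cell. Arg is the target state for Shift
// and the rule index for Reduce.
type Action struct {
	Kind ActionKind
	Arg  int
}

// Key indexes the tables by parser state and grammar symbol.
type Key struct {
	State  int
	Symbol string
}

// Rule records only what the driver needs to perform a reduction.
type Rule struct {
	LHS string
	Len int
}

// Table is pure data: it could just as well be loaded from a file.
type Table struct {
	Actions map[Key]Action
	Gotos   map[Key]int
	Rules   []Rule
}

// parenTable is a hand-built table for the toy grammar S -> ( S ) | x.
// "$" stands for end of input.
var parenTable = Table{
	Actions: map[Key]Action{
		{0, "("}: {Shift, 2},
		{0, "x"}: {Shift, 3},
		{1, "$"}: {Accept, 0},
		{2, "("}: {Shift, 2},
		{2, "x"}: {Shift, 3},
		{3, ")"}: {Reduce, 2},
		{3, "$"}: {Reduce, 2},
		{4, ")"}: {Shift, 5},
		{5, ")"}: {Reduce, 1},
		{5, "$"}: {Reduce, 1},
	},
	Gotos: map[Key]int{
		{0, "S"}: 1,
		{2, "S"}: 4,
	},
	Rules: []Rule{
		{"S'", 1}, // rule 0: S' -> S (never reduced, accept handles it)
		{"S", 3},  // rule 1: S -> ( S )
		{"S", 1},  // rule 2: S -> x
	},
}

// Parse is the grammar-independent driver: a plain shift/reduce loop.
func Parse(t Table, tokens []string) bool {
	input := append(append([]string{}, tokens...), "$")
	stack := []int{0}
	pos := 0
	for {
		state := stack[len(stack)-1]
		act, ok := t.Actions[Key{state, input[pos]}]
		if !ok {
			return false // no table entry: syntax error
		}
		switch act.Kind {
		case Shift:
			stack = append(stack, act.Arg)
			pos++
		case Reduce:
			rule := t.Rules[act.Arg]
			stack = stack[:len(stack)-rule.Len]
			next, ok := t.Gotos[Key{stack[len(stack)-1], rule.LHS}]
			if !ok {
				return false
			}
			stack = append(stack, next)
		case Accept:
			return true
		}
	}
}

func main() {
	fmt.Println(Parse(parenTable, []string{"(", "(", "x", ")", ")"}))
	fmt.Println(Parse(parenTable, []string{"(", "x"}))
}
```

Swapping in a different grammar here means swapping `parenTable` and nothing else, which is the property being described for the generated parser.c.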

@mcc @technomancy

If your lexing can be expressed via regexps, then you don't need an external scanner, and that would be enough. You could distribute the parser state tables as data files, not as "static arrays in a .so", and you'd be good to go.
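
As a rough illustration of the regexp-only case, here is a sketch of a highlighter-style lexer whose rules are shipped as JSON and interpreted by a fixed Go program. The JSON schema, the token names, and the functions `Load` and `Tokenize` are all made up for this example; none of this is tree-sitter's, Chroma's, or Forgejo's actual format. One relevant property that is not an assumption: Go's `regexp` package is documented to run in time linear in the input size, so an untrusted pattern cannot trigger catastrophic backtracking.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"unicode/utf8"
)

// Rule is one entry of the hypothetical declarative grammar file.
type Rule struct {
	Token   string `json:"token"`
	Pattern string `json:"pattern"`
}

// Token is a classified slice of the source text.
type Token struct {
	Kind string
	Text string
}

type compiledRule struct {
	kind string
	re   *regexp.Regexp
}

// grammarJSON stands in for a file a repository could ship.
// It describes a tiny Lisp-like language; earlier rules win.
const grammarJSON = `[
  {"token": "comment", "pattern": ";[^\\n]*"},
  {"token": "string",  "pattern": "\"[^\"]*\""},
  {"token": "number",  "pattern": "[0-9]+"},
  {"token": "symbol",  "pattern": "[A-Za-z_+*/-][A-Za-z0-9_+*/-]*"},
  {"token": "paren",   "pattern": "[()]"},
  {"token": "space",   "pattern": "\\s+"}
]`

// Load parses the rule list and anchors each pattern at the start
// of the remaining input.
func Load(data []byte) ([]compiledRule, error) {
	var rules []Rule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, err
	}
	out := make([]compiledRule, 0, len(rules))
	for _, r := range rules {
		re, err := regexp.Compile(`^(?:` + r.Pattern + `)`)
		if err != nil {
			return nil, fmt.Errorf("rule %q: %w", r.Token, err)
		}
		out = append(out, compiledRule{r.Token, re})
	}
	return out, nil
}

// Tokenize repeatedly takes the first rule that matches a non-empty
// prefix. Unmatched input becomes one-rune "error" tokens.
func Tokenize(rules []compiledRule, src string) []Token {
	var out []Token
	for len(src) > 0 {
		matched := false
		for _, r := range rules {
			loc := r.re.FindStringIndex(src)
			if loc == nil || loc[1] == 0 {
				continue // no match, or an empty match that would loop forever
			}
			out = append(out, Token{r.kind, src[:loc[1]]})
			src = src[loc[1]:]
			matched = true
			break
		}
		if !matched {
			_, size := utf8.DecodeRuneInString(src)
			out = append(out, Token{"error", src[:size]})
			src = src[size:]
		}
	}
	return out
}

func main() {
	rules, err := Load([]byte(grammarJSON))
	if err != nil {
		panic(err)
	}
	for _, t := range Tokenize(rules, "(+ 1 foo) ; hi") {
		fmt.Printf("%-8s %q\n", t.Kind, t.Text)
	}
}
```

Nothing from the grammar file is ever executed; the only code that runs is the fixed loop in `Tokenize`, which is the trust property the thread is after.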

@mcc @technomancy

But if you need non-regular lexing, then tree-sitter does not have any declarative syntax for that. It's just code. You could ask grammar authors to e.g. write a scanner.go in addition to scanner.c, but at the end of the day, that's still untrusted code if you're loading those dynamically at runtime.

@dcreager @mcc @technomancy It's absolutely mind-boggling how it's necessary to write C code to deal with things like indentation, whitespace-sensitivity, semicolon inference etc.