A thing that has always frustrated me about github/bitbucket, as a language designer, is that you can't teach the forge to syntax highlight files in your own custom formats.

Now the existence of Codeberg/git.gay means potentially I could create a PR to forgejo to add this feature and it would get added to the forges I actually use. Perhaps at some point I will do this.

Anybody know off the top of their heads what syntax-highlighter format Codeberg/Forgejo even uses?

Oh. it's… oh.

It's… custom Go code… on a per-language basis. They use something called Chroma, and the way Chroma works is that they write a custom lexer in Go for each language they want to support. Um. Hm.

This is actually the one single approach they could have attempted which prevents custom pluggable highlighters on a per-repo basis.

https://hey.hagelb.org/@technomancy/statuses/01KNQJ9H3R64BEHE1QWNBXZKVW

technomancy (@technomancy@hey.hagelb.org)

@mcc last I checked it was https://github.com/alecthomas/chroma ;  I remember sending a patch to support Fennel and it was handled pretty promptly

hey.hagelb.org

So like, a thing I've been working on is a flyweight Lisp interpreter that I can embed into other programs in trivial ways. The entire appeal of this Lisp is that it's not bound to any one standard and is defined on a per-project basis, so I can make complete changes to the language on a repo-to-repo or commit-to-commit basis.

For example, here I made some one-off changes and transformed my LISP into an embedded macroassembler for z80 assembly.

https://git.gay/mcc/mermaid-nil/src/branch/project-twice/project/twice/game.l0

I can't submit upstream.

@mcc OMG git.gay, what an amazing name for a GIT host. 🤣
@mcc you would really think that "one widely-supported declarative non-executable grammar format for syntax highlighting" would be a solved problem by now but it kinda feels like tree sitter is sucking up all the oxygen in that space; don't love how that's going
@technomancy Do you have negative opinions about treesitter, and if so, why?

@mcc the main thing is that grammars are opaque blobs of executable code, which sucks! technically you can often compile them from a declarative data source but afaict you can't do this without npm

a well-designed format would make the unit of distribution a purely declarative data format but instead we ended up with this situation where you could install a grammar that segfaults your editor or steals your SSH keys; gross!

@technomancy in treesitter, they are?

yikes

@mcc some editors have configuration flows where it's just like "ok imma just curl a tree-sitter grammar .so file from gods know where and load it directly into the process; hope everything's fine and nothing bad happens!"

it's like the flow you'd come up with if you were a supply-chain attacker trying to maximize attack surface

Introducing arborium, a tree-sitter distribution

About two weeks ago I entered a discussion with the docs.rs team about, basically, why we have to look at this: When we could be looking at this: And of course, as always, there are reasons why thi...

fasterthanli.me

@c0dec0dec0de @technomancy *rubbing eyes* which problem does this fix, exactly? is the idea that instead of pulling in .SO's you pull in .WASMs and that's safer?

Unfortunately I don't think that's adequate for Codeberg's purposes, as they need to worry not only about VM breaks but also about DoS attacks. Nothing prevents wasm from simply being very slow.

@mcc @technomancy I think the idea is that you have all the bits together and accounted for rather than fetching random libraries from elsewhere.
@c0dec0dec0de @technomancy Chroma, the Go app, handles this by simply hardcoding every supported language into the executable.
@mcc @c0dec0dec0de I think they're saying this might be helpful to ameliorate some of the tragic downsides of tree-sitter in the common text editors config case, but not for your forge-specific use case
@mcc @c0dec0dec0de @technomancy It should be pretty easy to give the WASM VM a time budget to protect against that.
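
The time-budget idea can be sketched without any real WASM runtime. Below is a minimal illustration in Go, assuming the host owns the execution loop and charges the guest one unit of "fuel" per step. `Guest`, `Run`, `countdown` and `spin` are hypothetical names invented for this sketch; they are not the API of any actual VM.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOutOfFuel is returned when the guest exceeds its step budget.
var ErrOutOfFuel = errors.New("highlighter exceeded its step budget")

// Guest is a stand-in for untrusted highlighter code running inside a VM.
// Each call to Step performs one unit of work and reports whether it is done.
type Guest interface {
	Step() (done bool)
}

// Run drives the guest for at most fuel steps. The host, not the guest,
// owns the loop, so a guest that never finishes is simply cut off.
func Run(g Guest, fuel int) error {
	for i := 0; i < fuel; i++ {
		if g.Step() {
			return nil
		}
	}
	return ErrOutOfFuel
}

// countdown is a well-behaved guest that finishes after n steps.
type countdown struct{ n int }

func (c *countdown) Step() bool {
	c.n--
	return c.n <= 0
}

// spin is a hostile (or just very slow) guest that never finishes.
type spin struct{}

func (spin) Step() bool { return false }

func main() {
	// A guest that finishes inside the budget returns a nil error.
	fmt.Println(Run(&countdown{n: 10}, 1000))
	// A guest that never finishes is stopped once the fuel runs out.
	fmt.Println(Run(spin{}, 1000))
}
```

On a forge, the natural fallback when the budget runs out would be to render the file as plain text. The budget bounds CPU time per file; it does nothing about memory use, which would need its own limit.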
@robin @mcc @c0dec0dec0de @technomancy Going that route an alternative would be to run the highlighter as a separate process and talk LSP with it. Not sure if it's worth the complexity in setup though..
@mcc @c0dec0dec0de That particular project also introduces the other problem of LLM usage (disclaimer in README, Amos being very pro-slop)
@jamesnvc @mcc yeah, that’s still regrettable.

@mcc @technomancy The parsing itself is driven by a state machine, and you do get a description of those in pure-data JSON files. So in theory you could reimplement the parsing algorithm in another language and not have to dlopen random .so files.

But no one has tried that, because the scanners are largely implemented as arbitrary C/C++ code, and so reimplementing the parser driver wouldn't be enough.

@dcreager @technomancy Well, if I'm implementing my scanner on purpose, and I'm intending to deploy in a situation , is this a viable option for like… me? (Let's never mind that codeberg doesn't actually use treesitter.)

@mcc @technomancy Hmm, not sure I completely understand the question, so apologies in advance if I answer the wrong thing.

So if the question is "could I write a pure-Go tree-sitter clone, which would work with existing grammars, and which would support a runtime-open set of grammars without having to load untrusted .so files", then I think the short answer is "close but not quite".

@mcc @technomancy

When you run `tree-sitter generate`, you get a parser.c file, but if you look at it, there's no _code_ in it, it's just a bunch of static arrays encoding the parse table. The actual parsing code is in the tree-sitter runtime library [1], and you could definitely port that to any language you choose.

[1] https://github.com/tree-sitter/tree-sitter/blob/master/lib/src/parser.c
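
The "tables are data, the driver is code" split can be shown with a deliberately tiny analogue. This sketch uses a two-state DFA that recognizes identifiers, which is far simpler than tree-sitter's actual parse tables; `Table`, `classify`, `Match` and `identTable` are made-up names for illustration only.

```go
package main

import "fmt"

// Table is pure data: for each state, the next state per input class,
// plus which states are accepting. It could just as well be loaded
// from a data file instead of being compiled in.
type Table struct {
	Next   [][]int // Next[state][class] = next state, or -1 for "no transition"
	Accept []bool
}

// classify maps a byte to a table column.
// Hypothetical classes: 0 = digit, 1 = ASCII letter, 2 = anything else.
func classify(c byte) int {
	switch {
	case c >= '0' && c <= '9':
		return 0
	case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z':
		return 1
	default:
		return 2
	}
}

// Match is the generic driver. It knows nothing about any particular
// language; it returns the length of the longest prefix of s that the
// table accepts.
func Match(t Table, s string) int {
	state, last := 0, 0
	for i := 0; i < len(s); i++ {
		state = t.Next[state][classify(s[i])]
		if state < 0 {
			break
		}
		if t.Accept[state] {
			last = i + 1
		}
	}
	return last
}

// identTable encodes "a letter followed by letters or digits".
var identTable = Table{
	Next: [][]int{
		{-1, 1, -1}, // state 0: start, only a letter is allowed
		{1, 1, -1},  // state 1: inside an identifier
	},
	Accept: []bool{false, true},
}

func main() {
	fmt.Println(Match(identTable, "ab1 c")) // length of "ab1"
	fmt.Println(Match(identTable, "1ab"))   // no identifier at the start
}
```

Swapping in a different table changes what gets recognized without touching `Match`, which is the property that would let a forge accept grammars as data rather than as code.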

@mcc @technomancy

If your lexing can be expressed via regexps, then you don't need an external scanner, and that would be enough. You could distribute the parser state tables as data files, not as "static arrays in a .so", and you'd be good to go.
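
As a rough illustration of the regexp-only case, here is a minimal Go sketch of a lexer whose rules arrive as data. The JSON rule format, the token class names and the `LoadRules`/`Tokenize` functions are all assumptions made up for this example; this is neither tree-sitter's nor Chroma's actual grammar format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"unicode/utf8"
)

// Rule is one entry of a hypothetical declarative grammar file:
// a token class and the regular expression that recognizes it.
type Rule struct {
	Class   string `json:"class"`
	Pattern string `json:"pattern"`
}

// Token is a classified slice of the input.
type Token struct {
	Class string
	Text  string
}

type compiledRule struct {
	class string
	re    *regexp.Regexp
}

// Grammar is an ordered list of compiled rules.
type Grammar []compiledRule

// LoadRules parses the grammar data and compiles every pattern so that
// it only matches at the start of the remaining input.
func LoadRules(data []byte) (Grammar, error) {
	var rules []Rule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, err
	}
	g := make(Grammar, 0, len(rules))
	for _, r := range rules {
		re, err := regexp.Compile(`^(?:` + r.Pattern + `)`)
		if err != nil {
			return nil, fmt.Errorf("rule %q: %w", r.Class, err)
		}
		g = append(g, compiledRule{r.Class, re})
	}
	return g, nil
}

// Tokenize tries the rules in order at each position; the first rule
// with a non-empty match wins. Anything no rule matches is emitted one
// character at a time as a "text" token, so lexing always makes progress.
func Tokenize(g Grammar, src string) []Token {
	var toks []Token
	for len(src) > 0 {
		matched := false
		for _, r := range g {
			if m := r.re.FindString(src); m != "" {
				toks = append(toks, Token{r.class, m})
				src = src[len(m):]
				matched = true
				break
			}
		}
		if !matched {
			_, size := utf8.DecodeRuneInString(src)
			toks = append(toks, Token{"text", src[:size]})
			src = src[size:]
		}
	}
	return toks
}

// grammarJSON is an invented grammar for a small Lisp-like language.
const grammarJSON = `[
  {"class": "comment", "pattern": ";[^\\n]*"},
  {"class": "number",  "pattern": "[0-9]+"},
  {"class": "symbol",  "pattern": "[a-z][a-z0-9-]*"},
  {"class": "paren",   "pattern": "[()]"},
  {"class": "space",   "pattern": "\\s+"}
]`

func main() {
	g, err := LoadRules([]byte(grammarJSON))
	if err != nil {
		panic(err)
	}
	for _, t := range Tokenize(g, "(ld a 42) ; load") {
		fmt.Printf("%-8s %q\n", t.Class, t.Text)
	}
}
```

One relevant property of this particular choice: Go's `regexp` package guarantees matching time linear in the input size, so a hostile pattern cannot trigger exponential backtracking. The rule loop here is still quadratic in the worst case, so a real deployment would want a budget on top.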

@mcc @technomancy

But if you need non-regular lexing, then tree-sitter does not have any declarative syntax for that. It's just code. You could ask grammar authors to e.g. write a scanner.go in addition to scanner.c, but at the end of the day, that's still untrusted code if you're loading those dynamically at runtime.

@dcreager @mcc @technomancy It's absolutely mind-boggling how it's necessary to write C code to deal with things like indentation, whitespace-sensitivity, semicolon inference etc.

@technomancy @mcc Treesitter is the biggest engineering and design trash fire that I have seen in a long time.

If I had to give people advice on how to tackle the problem of grammars and editor support, I'd point them to TreeSitter and tell them to *not* do that.

#treesitter

@technomancy @mcc It's as if they looked at the existing problems and requirements, and then tried coming up with the dumbest "solutions" just for shits and giggles.

Even if I tried, I wouldn't be able to come up with the sheer density of painfully wrong decisions they made.

It's all "I'd love to understand the state of their mind that led them to believing that shit to be a valid design/engineering option" the way down.

#treesitter

@soc @technomancy Are you aware of any formats that are actively good for this, or is textmate the one option for data-driven approaches?

@mcc @technomancy No, I don't think good formats exist right now:

There is a rather empty niche between these simplistic, easy-to-build regex grammars and fully-featured IDE plugins.

TextMate2 grammars (and by extension Chroma grammars ... any basic syntax highlighting likely has TM2 grammars as a common ancestor) are "good" largely because of how little effort you need to spend to get things working.

@mcc @technomancy This was enough to get pretty nice syntax highlighting in IntelliJ (and other editors):

https://codeberg.org/core-lang/core/src/branch/main/tooling/core.tmbundle/Syntaxes/core.tmLanguage.json

The even more bare-bones Chroma grammar that's used on Codeberg looks like this:

https://github.com/alecthomas/chroma/blob/master/lexers/embedded/core.xml

If I wanted anything fancier, I would likely not invest further time into these grammars, but start implementing a language server or an IDE plugin.

@soc @technomancy hm, I'm confused. I glanced at the Chroma repo and it looked like Chroma was treesitter-like, with each language being handled by custom Go code. Was I missing something?
@soc @technomancy put a different way: Say codeberg uses chroma. Could chroma be made to extract one of these xml files out of a directory in a codeberg repo and syntax highlight based on the local xml file? Would chroma be easy to patch to dynamically load such an xml file at runtime?

@mcc @technomancy I think you are right in this regard – the big difference from my perspective is how the tools are used and sold respectively:

On the one hand, you have this big binary for various languages that you run to generate syntax highlighted code offline.

On the other hand, you have this big binary for a sole language, sold as a magical IDE thing that claims to solve all issues, applies some magic diffing, has completely insufficient documentation, requires writing unsafe C hooks, etc.

@mcc @technomancy I don't know – I'd expect Chroma to be much closer to that model than TreeSitter though.

@mcc @technomancy Ideally, you'd have one grammar format that can express "all" languages (i. e. without needing C hooks) and then "everyone" can write their interpreter/runtime for that format as they see fit.

The current situation is not good in that regard. TextMate2 grammars have the widest support, but are rather limited when trying to "fully" parse a language.

@mcc That really sucks, oof.
@xgranade I see the argument for it if you plan to support only a closed set of languages, but it does kinda put a bullet directly in the head of my goal of "support an open set of languages"
@mcc Yeah, it sucks in a completely reasonable way, but it still sucks...
@mcc @xgranade It might be neat if you could generate a parallel directory structure of language-independent substring attributes in a precommit hook that can be interpreted by the syntax highlighter when viewing the original file. Like a source map meets RTF.
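
The "source map meets RTF" idea might look something like the following Go sketch. It assumes the hook writes a sidecar file of byte ranges with class names and that the forge only applies them. The `Span` type, the JSON layout and `Render` are invented for illustration; no forge is known to support this.

```go
package main

import (
	"encoding/json"
	"fmt"
	"html"
	"sort"
	"strings"
)

// Span marks the byte range [Start, End) of the source file with a class.
type Span struct {
	Start int    `json:"start"`
	End   int    `json:"end"`
	Class string `json:"class"`
}

// Render applies precomputed spans to src and returns HTML. Spans that
// are out of range or overlap an earlier span are skipped, so a bad
// sidecar file can lose highlighting but cannot corrupt the output.
func Render(src string, spans []Span) string {
	sorted := append([]Span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })

	var b strings.Builder
	pos := 0
	for _, s := range sorted {
		if s.Start < pos || s.End <= s.Start || s.End > len(src) {
			continue
		}
		b.WriteString(html.EscapeString(src[pos:s.Start]))
		fmt.Fprintf(&b, `<span class="%s">%s</span>`,
			html.EscapeString(s.Class), html.EscapeString(src[s.Start:s.End]))
		pos = s.End
	}
	b.WriteString(html.EscapeString(src[pos:]))
	return b.String()
}

func main() {
	src := "(ld a 42)"
	// What a pre-commit hook might have written next to the source file.
	sidecar := `[{"start":1,"end":3,"class":"kw"},{"start":6,"end":8,"class":"num"}]`

	var spans []Span
	if err := json.Unmarshal([]byte(sidecar), &spans); err != nil {
		panic(err)
	}
	fmt.Println(Render(src, spans))
}
```

The appeal of this design is that the forge never runs the language's highlighter at all: applying spans is linear in the file size and needs no grammar. The cost is that the sidecar goes stale whenever someone commits without running the hook.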

@mcc the flip side of this is projects which do allow third-party syntax highlighters, but only in the form of general purpose javascript-based plugins that run with max permissions, because everything must be turing-complete and sandboxes don’t exist.

This feels like a problem that should have been solved 20+ years ago 😕

@serriadh we have multiple twenty, thirty year old standards for it… but that's the problem… we have multiple
@mcc last I checked it was https://github.com/alecthomas/chroma ;  I remember sending a patch to support Fennel and it was handled pretty promptly

@technomancy @mcc That's also my experience.

I sent Chroma a grammar for my language, waited until Forgejo updated its Chroma version and now I have syntax highlighting in Codeberg!

Example: https://codeberg.org/core-lang/core/src/branch/main/stdlib/string.core

@soc @technomancy I guess, in this case what I'm interested in is I am doing experiments with small unfinished languages whose grammars are not finalized, and which I don't expect people to use, but I'd like to show them to people to have conversations about language design. I was hoping to solve the larger problem of small situational things like WIP language impls or like, file formats for a single application. I guess maybe I just use something other than Codeberg to visualize them.

@mcc format?

lol maybe i am just that irreparably embittered but my first thought was i bet it's a bunch of ad hoc parsers

@mcc god i am so good at predicting things being shit, it's like my truest skill
@dysfun @mcc The big benefit of the TextMate2-style approach is that the grammar does not have to be complete to be useful for highlighting.