goddamn writing a recursive descent parser is a lot harder when you wanna do good syntax error recovery
@eniko I realized it way too late, but in case you haven't thought about it yet: designing the syntax so that there are unambiguous toplevel bits helps a lot. "This thing has to start a new definition, so no matter the mess before, I know again where I am."
@nikodemus you mean like "func void foo(){}" instead of just "void foo(){}"?

@eniko @nikodemus yep, those are referred to as synchronization points; when you encounter them, you just pop all the state from the stack until you're back at parsing top level functions, and report an error.

The semi-colon is a very common one, too. You don't tend to see it mid-expression, only at the end of a statement, so if you encounter one and expect more expression tokens, you get to close the expression and report an error.
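A minimal panic-mode recovery sketch of the two ideas above (token spellings and the `func` keyword are illustrative assumptions, not anyone's actual implementation): on a syntax error, skip tokens until a synchronization point such as `;`, `}`, or a keyword that can only start a top-level definition, then resume.

```python
# Panic-mode error recovery: skip ahead to a synchronization token.
# The token spellings and the `func` keyword are illustrative.

SYNC_TOKENS = {";", "}", "func"}

def recover(tokens, pos, errors):
    """Record one error, then skip to a synchronization point."""
    errors.append(f"syntax error at token {pos}")
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    # Consume a statement terminator so parsing resumes *after* it,
    # but stop *at* a `func` so the top-level loop can restart there.
    if pos < len(tokens) and tokens[pos] in {";", "}"}:
        pos += 1
    return pos

errors = []
tokens = ["x", "=", "+", "*", ";", "func", "f", "(", ")"]
pos = recover(tokens, 2, errors)  # error spotted mid-expression
print(pos, errors)                # resumes right after the ';'
```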

@jchmoe @nikodemus yeah, i have those, though atm im mostly only synchronising on } and ;
@jchmoe @eniko @nikodemus depending on the language you might have to account for opening and closing braces. For example, in Rust closures can have semicolons inside of a valid closure expression, so what I'd do there is swallow everything inside any new scopes encountered until I find a ; or }
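The "swallow everything inside new scopes" idea can be sketched like this (a hypothetical helper, not rustc's actual code): while scanning for a `;` or `}`, keep a depth counter so delimiters that belong to an inner block, e.g. a closure body, don't terminate recovery early.

```python
def skip_to_sync(tokens, pos):
    """Skip to a ';' or '}' at the *current* nesting depth,
    swallowing anything inside nested braces (e.g. closure bodies)."""
    depth = 0
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "{":
            depth += 1
        elif tok == "}":
            if depth == 0:
                return pos      # closing brace of our own scope
            depth -= 1
        elif tok == ";" and depth == 0:
            return pos          # statement end at our depth
        pos += 1
    return pos

# The ';' inside the closure body is swallowed; recovery stops at
# the outer ';' that ends the (broken) statement.
tokens = ["let", "f", "=", "|x|", "{", "x", ";", "}", ";", "g"]
print(skip_to_sync(tokens, 3))  # → 8
```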
@ekuber @jchmoe @eniko Very true. The reason I think identifiable toplevel syntax is important is that otherwise unbalanced delimiters tend to blow up much worse in languages with big source files.
@nikodemus @jchmoe @eniko my experience in rustc tells me that unbalanced delimiters need to be handled in the parser and not the lexer, which is not what we do 🫤
@ekuber @jchmoe @eniko Wait, what? rustc _lexer_ deals with unbalanced delims? Do you happen to know why it came to be that way?
@nikodemus the proc macros deal in TokenTrees, which need balanced delims, instead of TokenStreams, which are ignorant of delims. But by the time the parser kicks in it turns a TokenTree into a TokenStream. There hasn't been a huge need to change it, other than improving delim error recovery.
@eniko even that can be handled if you don't mind having unbounded lookahead in the parser and check for "ident ident openparen". It's recovery, it can afford to be wrong in edge cases.
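That lookahead check might look roughly like this (the token representation is an assumption): peek three tokens ahead without consuming anything, and treat `ident ident (` as the probable start of a new definition — accepting, as the post says, that the heuristic can be wrong in edge cases.

```python
def looks_like_definition(tokens, pos):
    """Recovery heuristic: does `ident ident (` start here?
    Tokens are (kind, text) pairs; the kinds are illustrative."""
    window = tokens[pos:pos + 3]
    if len(window) < 3:
        return False
    (k1, _), (k2, _), (k3, _) = window
    return k1 == "ident" and k2 == "ident" and k3 == "("

toks = [("ident", "void"), ("ident", "foo"), ("(", "("), (")", ")")]
print(looks_like_definition(toks, 0))  # → True
```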
@eniko the main problem I have with my own parser is decent error messages, I hesitate to think about error recovery, but it might have to be done in the future ... likely when I have to write a language server for it
@cheese3660 I've always punted on recovery (which really is needed for good error messages) so this time I wanted to do it right
@eniko welp, I guess I have to kill two birds with one stone somehow ... I guess the easiest way is say ... if it encounters an error go back to the top level and parse from there? But that might miss a lot of stuff. I'll have to look more into this because rn I have the parser die on the first syntax error.

@eniko
error recovery (and reporting) and also SOMEHOW the possibility of incremental parsing has eaten years of my life.

i don’t know if your parser is incremental but it’s a pretty big leap to try to find just the ast nodes associated with a certain character or source code edit.

@eniko
and source insertions are fairly straightforward but source deletions hahahaha just reparse the whole file.

@eniko
and on the topic of error recovery, thank goodness most IDEs automatically insert a closing paren on typing ‘(‘ because otherwise the rest of your program is now a function argument.

i guess i've been looking at this from the perspective of interactive parsing for intellisense/code completion. i suppose for more well-formed documents the challenge is less severe.

@pyromuffin fortunately mine is not incremental :'D
@eniko @thephd We’ve recently been rewriting the Swift parser, and have some opinions on how to do this well: good parser primitives (“expect this”, “match this”, “don’t go beyond that”, etc.), no diagnostics from parsing itself (represent all issues as missing/unexpected nodes in the tree), and handle diagnostics in a post-pass. The result is really nice and is easy to extend to new grammar terms. See https://github.com/apple/swift-syntax/blob/main/Sources/SwiftParser/SwiftParser.docc/ParserRecovery.md
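The "no diagnostics during parsing" approach can be sketched very roughly like this (a language-agnostic toy, not swift-syntax's actual API): `expect` never fails — it either consumes the token or records a `Missing` placeholder in the tree — and a separate post-pass walks the tree to produce diagnostics.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str

@dataclass
class Missing:
    kind: str  # what the parser expected but did not find

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def expect(self, kind):
        """Never raises: returns the token, or a Missing placeholder."""
        if self.pos < len(self.tokens) and self.tokens[self.pos].kind == kind:
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        return Missing(kind)

def diagnostics(tree):
    """Post-pass: turn Missing nodes into error messages."""
    return [f"expected '{n.kind}'" for n in tree if isinstance(n, Missing)]

# `f(` with the closing paren left out:
p = Parser([Token("ident"), Token("(")])
tree = [p.expect("ident"), p.expect("("), p.expect(")")]
print(diagnostics(tree))  # → ["expected ')'"]
```

One nice property of this split is that the tree is always fully formed, which is exactly what tooling (formatters, language servers) wants even for broken input.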
@eniko @dgregor79 @thephd oooh interesting. I fear this works better for languages structured like C than like Korn Shell though, right?
@mirabilos @eniko @thephd C is tricky because it’s ambiguous without doing name lookup, but beyond that—I think this works as long as you have a grammar to work with. Swift needs a bit of look ahead to resolve ambiguities, and that works fine
@dgregor79 @eniko @thephd yeah, for shell, even just POSIX shell, grammar is… a thing. Some things parse very differently depending on context.

@dgregor79 I had not seen this. This is magnificent.

Is there any thought of splitting the parser primitives into a separate library? I assume much more maintenance than the team cares to take on, but seems like it could be an amazing resource for other projects.

@inthehands we haven’t seriously considered it, no. To make it reusable, I think we’d want to separate out the part that generates a syntax tree from a grammar (tons of code gen), and then factor out the parser primitives. It’s a lot of work and we’re not really motivated to do it, not because it’s a bad idea (it would be very cool), but because of the opportunity cost. We’d much rather spend our time building more tooling pieces on top of the new parser and integrate them with the main compiler

@dgregor79 Fair, and as expected. And I would totally volunteer to do the library extraction if I didn’t have a family or a job.

Always love these windows into the work you all are doing.

@eniko try with an lalr(1) parser generator. my diploma thesis project from 1989 using yacc was about 50% error rules
@PeterSommerlad @eniko I originally struggled with error rules and Yaccob in Mono’s C# compiler, but once I got a hang of it, it was both a pleasure to use, but also I used the idioms to implement the REPL intelligence and the old IDE intellicode.
@eniko Maybe I'm suffering from Stockholm syndrome with Coq, but I've started to like that the compiler completely stops at the first error. I think the key thing that makes it work for Coq is that it completely processes declarations one-by-one, including parsing and type checking, before going to the next declaration.