goddamn writing a recursive descent parser is a lot harder when you wanna do good syntax error recovery
@eniko I realized it way too late, but in case you haven't thought about it yet: designing the syntax so that there are unambiguous toplevel bits helps a lot. "This thing has to start a new definition, so no matter the mess before, I know again where I am."
@nikodemus you mean like "func void foo(){}" instead of just "void foo(){}"?

@eniko @nikodemus yep, those are referred to as synchronization points; when you encounter them, you just pop all the state from the stack until you're back at parsing top level functions, and report an error.

The semi-colon is a very common one, too. You don't tend to see it mid-expression, only at the end of a statement, so if you encounter one and expect more expression tokens, you get to close the expression and report an error.
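A minimal panic-mode recovery sketch of the two ideas above (token spellings and the `func` keyword are illustrative assumptions, not anyone's actual implementation): on a syntax error, skip tokens until a synchronization point such as `;`, `}`, or a keyword that can only start a top-level definition, then resume.

```python
# Panic-mode error recovery: skip ahead to a synchronization token.
# The token spellings and the `func` keyword are illustrative.

SYNC_TOKENS = {";", "}", "func"}

def recover(tokens, pos, errors):
    """Record one error, then skip to a synchronization point."""
    errors.append(f"syntax error at token {pos}")
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    # Consume a statement terminator so parsing resumes *after* it,
    # but stop *at* a `func` so the top-level loop can restart there.
    if pos < len(tokens) and tokens[pos] in {";", "}"}:
        pos += 1
    return pos

errors = []
tokens = ["x", "=", "+", "*", ";", "func", "f", "(", ")"]
pos = recover(tokens, 2, errors)  # error spotted mid-expression
print(pos, errors)                # resumes right after the ';'
```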

@jchmoe @nikodemus yeah, i have those, though atm im mostly only synchronising on } and ;
@jchmoe @eniko @nikodemus depending on the language you might have to account for opening and closing braces. For example, in Rust closures can have semicolons inside of a valid closure expression, so what I'd do there is swallow everything inside any new scopes encountered until I find a ; or }
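The "swallow everything inside new scopes" idea can be sketched like this (a hypothetical helper, not rustc's actual code): while scanning for a `;` or `}`, keep a depth counter so delimiters that belong to an inner block, e.g. a closure body, don't terminate recovery early.

```python
def skip_to_sync(tokens, pos):
    """Skip to a ';' or '}' at the *current* nesting depth,
    swallowing anything inside nested braces (e.g. closure bodies)."""
    depth = 0
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "{":
            depth += 1
        elif tok == "}":
            if depth == 0:
                return pos      # closing brace of our own scope
            depth -= 1
        elif tok == ";" and depth == 0:
            return pos          # statement end at our depth
        pos += 1
    return pos

# The ';' inside the closure body is swallowed; recovery stops at
# the outer ';' that ends the (broken) statement.
tokens = ["let", "f", "=", "|x|", "{", "x", ";", "}", ";", "g"]
print(skip_to_sync(tokens, 3))  # → 8
```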
@ekuber @jchmoe @eniko Very true. The reason I think identifiable toplevel syntax is important is that otherwise unbalanced delimiters tend to blow up much worse in languages with big source files.
@nikodemus @jchmoe @eniko my experience in rustc tells me that unbalanced delimiters need to be handled in the parser and not the lexer, which is not what we do 🫤
@ekuber @jchmoe @eniko Wait, what? rustc _lexer_ deals with unbalanced delims? Do you happen to know why it came to be that way?
@nikodemus the proc macros deal in TokenTrees, which need balanced delims, instead of TokenStreams, which are ignorant of delims. But by the time the parser kicks in it turns a TokenTree into a TokenStream. There hasn't been a huge need to change it, other than improving delim error recovery.
@eniko even that can be handled if you don't mind having unbounded lookahead in the parser and check for "ident ident openparen". It's recovery, it can afford to be wrong in edge cases.
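That lookahead check might look roughly like this (the token representation is an assumption): peek three tokens ahead without consuming anything, and treat `ident ident (` as the probable start of a new definition — accepting, as the post says, that the heuristic can be wrong in edge cases.

```python
def looks_like_definition(tokens, pos):
    """Recovery heuristic: does `ident ident (` start here?
    Tokens are (kind, text) pairs; the kinds are illustrative."""
    window = tokens[pos:pos + 3]
    if len(window) < 3:
        return False
    (k1, _), (k2, _), (k3, _) = window
    return k1 == "ident" and k2 == "ident" and k3 == "("

toks = [("ident", "void"), ("ident", "foo"), ("(", "("), (")", ")")]
print(looks_like_definition(toks, 0))  # → True
```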
@eniko the main problem I have with my own parser is decent error messages, I hesitate to think about error recovery, but it might have to be done in the future ... likely when I have to write a language server for it
@cheese3660 I've always punted on recovery (which really is needed for good error messages) so this time I wanted to do it right
@eniko welp, I guess I have to kill two birds with one stone somehow ... I guess the easiest way is say ... if it encounters an error go back to the top level and parse from there? But that might miss a lot of stuff. I'll have to look more into this because rn I have the parser die on the first syntax error.

@eniko
error recovery (and reporting) and also SOMEHOW the possibility of incremental parsing has eaten years of my life.

i don’t know if your parser is incremental but it’s a pretty big leap to try to find just the ast nodes associated with a certain character or source code edit.

@eniko
and source insertions are fairly straightforward but source deletions hahahaha just reparse the whole file.

@eniko
and on the topic of error recovery, thank goodness most IDEs automatically insert a closing paren on typing ‘(‘ because otherwise the rest of your program is now a function argument.

i guess i've been looking at this from the perspective of interactive parsing for intellisense/code completion. i suppose for more well-formed documents the challenge is less severe.

@pyromuffin fortunately mine is not incremental :'D
@eniko @thephd We’ve recently been rewriting the Swift parser, and have some opinions on how to do this well: good parser primitives (“expect this”, “match this”, “don’t go beyond that”, etc.), no diagnostics from parsing itself (represent all issues as missing/unexpected nodes in the tree), and handle diagnostics in a post-pass. The result is really nice and is easy to extend to new grammar terms. See https://github.com/apple/swift-syntax/blob/main/Sources/SwiftParser/SwiftParser.docc/ParserRecovery.md
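The "no diagnostics during parsing" approach can be sketched very roughly like this (a language-agnostic toy, not swift-syntax's actual API): `expect` never fails — it either consumes the token or records a `Missing` placeholder in the tree — and a separate post-pass walks the tree to produce diagnostics.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str

@dataclass
class Missing:
    kind: str  # what the parser expected but did not find

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def expect(self, kind):
        """Never raises: returns the token, or a Missing placeholder."""
        if self.pos < len(self.tokens) and self.tokens[self.pos].kind == kind:
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        return Missing(kind)

def diagnostics(tree):
    """Post-pass: turn Missing nodes into error messages."""
    return [f"expected '{n.kind}'" for n in tree if isinstance(n, Missing)]

# `f(` with the closing paren left out:
p = Parser([Token("ident"), Token("(")])
tree = [p.expect("ident"), p.expect("("), p.expect(")")]
print(diagnostics(tree))  # → ["expected ')'"]
```

One nice property of this split is that the tree is always fully formed, which is exactly what tooling (formatters, language servers) wants even for broken input.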
@eniko @dgregor79 @thephd oooh interesting. I fear this works better for languages structured like C than like Korn Shell though, right?
@mirabilos @eniko @thephd C is tricky because it’s ambiguous without doing name lookup, but beyond that—I think this works as long as you have a grammar to work with. Swift needs a bit of look ahead to resolve ambiguities, and that works fine
@dgregor79 @eniko @thephd yeah, for shell, even just POSIX shell, grammar is… a thing. Some things parse very differently depending on context.

@dgregor79 I had not seen this. This is magnificent.

Is there any thought of splitting the parser primitives into a separate library? I assume much more maintenance than the team cares to take on, but seems like it could be an amazing resource for other projects.

@inthehands we haven’t seriously considered it, no. To make it reusable, I think we’d want to separate out the part that generates a syntax tree from a grammar (tons of code gen), and then factor out the parser primitives. It’s a lot of work and we’re not really motivated to do it, not because it’s a bad idea (it would be very cool), but because of the opportunity cost. We’d much rather spend our time building more tooling pieces on top of the new parser and integrate them with the main compiler

@dgregor79 Fair, and as expected. And I would totally volunteer to do the library extraction if I didn’t have a family or a job.

Always love these windows into the work you all are doing.

@eniko try with an lalr(1) parser generator. my diploma thesis project from 1989 using yacc was about 50% error rules
@PeterSommerlad @eniko I originally struggled with error rules and Yaccob in Mono’s C# compiler, but once I got a hang of it, it was both a pleasure to use, but also I used the idioms to implement the REPL intelligence and the old IDE intellicode.
@eniko Maybe I'm suffering from Stockholm syndrome with Coq, but I've started to like that the compiler completely stops at the first error. I think the key thing that makes it work for Coq is that it completely processes declarations one-by-one, including parsing and type checking, before going to the next declaration.