New blog post!

I've been investigating how various languages get away with not requiring semicolons.

I looked at 11 languages and found so many interesting cases I had to share!

https://terts.dev/blog/no-semicolons-needed/

#programming #roto

No Semicolons Needed | Terts Diepraam

Maybe the post didn't make this clear enough: I kind of agree with the people who say that we should just require semicolons. I only want to implement optional semicolons if that can be done well.

However, just making semicolons required is also a bit of a reductive argument. There are so many things to take into account! I truly think Gleam doesn't need semicolons for instance.

@terts older languages were all one statement per line, no semicolons. Like Fortran, except it had an explicit continuation column.

Other file formats also stick to one statement per line, but e.g. a leading space on a line makes it a continuation of the preceding line.

The use of semicolons is introduced when you no longer require one statement per line.

@terts I would urge you not to go down this road. Stick to mandatory semicolons. If it were entirely my call, I'd take a step *farther* toward mandatory statement terminators, and make

fn returns_b() -> rettype {
    a();
    b()
}

a syntax error; you would be obliged to write

fn returns_b() -> rettype {
    a();
    b();
}

and that would return the value returned by b, unlike Rust and current Roto. To return nothing, you would write

fn returns_unit() {
    a();
    b();
    ();
}

@zwol Could you explain why you feel that way? Do you like the explicitness? Do you fear the ambiguity even in the best approaches in this post?
@terts crossed messages, see my self-reply

@terts This is based on extensive experience with C, Rust, Python, Perl, awk, sh, R, and JavaScript, and some exposure to Ruby, Go, and Lua (all of which I can read, but avoid using for unrelated reasons).

My experience has been that only the extremes - Python's "the indentation _alone_ determines block structure; use semicolons only to cram multiple statements onto the same line" and C's "you must put a semicolon at the end of every statement" - avoid confusing people with edge cases.

@terts Notably, the R rule that you seemed to like pretty well has a nasty consequence: if you want to break a line at an operator (very long chains of expressions are common in R programming due to its |> pipelining operator), you must either parenthesize the entire expression or break the line _after_ the operator. In practice, people break the line after the operator, which I find less readable than breaking it before the operator.
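
Python's newline rule has a comparable effect on line breaks before an operator, so it makes a convenient sketch of the problem (this is an illustration in Python, not R, and the variable name `total` is made up). A line that already forms a complete statement ends at the newline, so an operator at the start of the next line begins a new statement unless the whole expression is parenthesized:

```python
# Break before the operator, no parentheses: the first line is already a
# complete statement, so "+ 2" becomes a separate expression statement.
ns_split = {}
exec("total = 1\n+ 2\n", ns_split)

# Same break inside parentheses: the newline is ignored until the ")".
ns_paren = {}
exec("total = (1\n    + 2)\n", ns_paren)

print(ns_split["total"], ns_paren["total"])  # prints "1 3"
```

Unlike R, Python rejects a trailing operator at the end of an unparenthesized line, so only the parenthesized form carries over directly.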

@terts And I dislike Rust's "leave off the last semicolon to make the block evaluate to the value of the last expression instead of to ()" rule because that makes the value of the block change depending on the presence or absence of one character that otherwise has minimal semantic significance, so your brain learns to ignore it.

The type checker will usually flag this when you get it wrong; it would be a much worse problem in a language where bugs like this are runtime errors.

@terts And finally I would argue that the sheer _variety_ of approaches to deciding where an expression ends, in the absence of an explicit terminator, is itself a reason not to go there. Because no matter what you do in Roto, it's going to be different from at least a few other languages, and that'll be a trap for people coming from those languages.
@zwol Yeah you're not wrong there. It's definitely something to consider. This post was an exploration of what's out there and I only want to implement something I feel confident in.
@zwol On R: I agree. Somebody on Reddit proposed a `..` at the start of the next line as an alternative. The reason I liked R is mostly that it's unambiguous and simple. I don't think I'll emulate it.
@zwol On Rust's semicolons: I agree that this feature shouldn't exist in a more dynamically typed language. Funnily enough, I think Swift kind of has what you describe but with no lines having semicolons, so all lines are treated equally, but the last expression can be the value that a block evaluates to. They also pull some tricks to not always require the trailing `()` (for better or for worse).
@terts Re "A Different Idea": I totally agree, I'd love to see a language which uses indentation to mark continuation. It feels really natural to me.
@bal4e @terts Maybe you’ve just been working with Internet protocols too much?

@terts In your Gleam example with 1 + 1 1 + 1 here are some more interesting cases to consider:

1 + 1 -1 + 1 (two expressions)
1 + 1 - 1 + 1 (one expression)
1 + 1 -x + 1 (one expression)
1 + 1 - x + 1 (one expression)

I verified on the Gleam playground that this is indeed how they parse.
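
One way to make sense of these four results is a toy model. This is a guess at the mechanism, not Gleam's actual implementation: it assumes the lexer treats a `-` directly followed by a digit as part of a negative number literal, and that an expression ends as soon as an operand is not followed by a binary operator. The names `tokenize` and `split_expressions` are made up for this sketch.

```python
import re

# Assumed lexer rule: "-" directly followed by a digit belongs to a negative
# number literal; any other "-" is the binary minus operator.
TOKEN = re.compile(r"\s*(-?\d+|[A-Za-z_]\w*|[+-])")


def tokenize(src):
    """Split the source into number, identifier, and operator tokens."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def split_expressions(src):
    """Group tokens into expressions of the form: operand (operator operand)*."""
    tokens = tokenize(src)
    exprs, i = [], 0
    while i < len(tokens):
        current = [tokens[i]]  # first operand of a new expression
        i += 1
        # Keep consuming "operator operand" pairs while they are available.
        while i + 1 < len(tokens) and tokens[i] in ("+", "-"):
            current += tokens[i:i + 2]
            i += 2
        exprs.append(" ".join(current))
    return exprs


for src in ["1 + 1 -1 + 1", "1 + 1 - 1 + 1", "1 + 1 -x + 1", "1 + 1 - x + 1"]:
    print(src, "->", split_expressions(src))
```

Under these assumptions only the first input splits in two (`1 + 1` and `-1 + 1`), which matches the four cases listed above. The model says nothing about other inputs.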

@terts IIRC, Swift has rules for "operator tokens" that take into account whitespace (and special delimiters) on the left and right side to determine prefix vs postfix vs infix uses, but it doesn't look like Gleam does anything like that. That is not a separable design issue because it interacts with how well "naturally ends" termination works out in practice.

@pervognsen That third Gleam case is...interesting 😄

I think I mention that rule that Swift has in the post!

@terts good survey! another language to look at is haskell which has a fancy “offside rule” for deciding when to add {;} which is mostly but not entirely based on indentation
@terts you're going to have fun if you check scala 😅

@terts The biggest issue with comparing semicolon inference in existing languages:

Most ship with an absolutely half-assed implementation because it's an aspect of syntax that is rarely possible to fix after it shipped.

Sadly, it appears your article managed to include only languages with such half-assed implementations.

@soc I'd love to hear about the full-assed implementations! Which ones do you consider to be good?

@terts Look for any language that checks both the token before and the token after the newline to determine whether a semicolon should be inserted.

For instance in this list, have a look at Scala: https://pling.jondgoodwin.com/post/semicolon-inference/#scala

Semicolon Inference

@terts The token set approach can largely emulate the grammar-driven approach of the languages in your blog (by poorly configuring the tokens in the before/after set), but not the other way around.

@terts The best approach I found to figure out what needs to go into these before/after sets:

Imagine a hypothetical variant of your language where semicolons are required, then treat any difference between that language and your semicolon-inferred language as a bug in the inference rules.

That pretty much decides 98% of the "how should I actually parse this" ambiguities you might encounter, including the "binary operation split across newline gets treated as unary operator" issue.
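
A minimal sketch of the before/after token set idea, in Python. The two sets and the function name `infer_semicolons` are hypothetical and belong to an imaginary language, not to Scala or any other real implementation. Tokens are simply split on whitespace to keep the example short.

```python
# Hypothetical token sets for an imaginary language.
CANNOT_END = {"+", "-", "*", "/", "=", ",", "(", "."}    # last token before the newline
CANNOT_START = {"+", "-", "*", "/", "=", ",", ")", "."}  # first token after the newline


def infer_semicolons(source, cannot_end=CANNOT_END, cannot_start=CANNOT_START):
    """Return the token stream with ";" inserted at inferred statement ends."""
    lines = [line.split() for line in source.splitlines()]
    lines = [tokens for tokens in lines if tokens]  # drop blank lines
    out = []
    for index, tokens in enumerate(lines):
        out.extend(tokens)
        is_last = index + 1 == len(lines)
        # The statement continues if the token before the newline cannot end
        # a statement, or the token after the newline cannot start one.
        continues = not is_last and (
            tokens[-1] in cannot_end or lines[index + 1][0] in cannot_start
        )
        if not continues:
            out.append(";")
    return out


source = "x = a\n  - b\ny = f (\n  x\n)\nz = 1\n"
print(" ".join(infer_semicolons(source)))
# prints "x = a - b ; y = f ( x ) ; z = 1 ;"
```

Passing an empty `cannot_start` set makes the sketch look only at the token before the newline, and then `x = a` followed by `- b` is split into two statements. That is the binary-versus-unary case mentioned above.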

@terts Most interesting behavior I have seen yet is Matlab, where a statement without a semicolon prints its value to the console and one with a semicolon doesn't :)
@terts one thing swift’s lexer does that you didn’t mention is track whether certain tokens occur at the start of a line or not, particularly brackets. that’s how we distinguish `foo(bar)` from `foo\n(bar)`, and iirc that’s why your `x = a + 2 y` example gets rejected without an explicit semicolon as well

@terts BCPL had optional semicolons. I suspect the rules are the same as in Go, working from memory and also Ken Thompson would probably have some interaction w/ BCPL.

For me hacking in Go, the optional semicolons are no problem at all, and never were, perhaps because of decades-ago exposure to BCPL. They really aren't necessary; if logic says they are, I think that implies a flaw in the assumptions. Perhaps it is self-selection, but Go users seem not to care.

@terts The "always use semicolons in javascript" advice is only half a solution - it prevents two statements from accidentally being concatenated to one, but it does NOT prevent the parser from breaking other statements in half.

Having the parser (or lexer) second guessing the coder is a double-edged sword that can easily lead to the coder second guessing the parser :-(

I enjoyed your write-up though!

@terts The SPSS language has a couple of syntax modes.

In one of them, a line that begins at the left margin starts a new command. If the first character on a line is + or - or ., then that character is ignored, which allows new commands to visually start indented except for that prefix character. This probably made some kind of sense on punch cards in the 1960s when the language originated.

The other syntax mode is more sensible. A command ends when a line ends in a period or when a blank line follows it.