Updated article: Encoding orders of Brahmic scripts

Documents the encoding orders that the OpenType Universal Shaping Engine assumes for the Brahmic scripts it supports. Understanding encoding orders is necessary when rendering or otherwise interpreting text in these scripts, as well as when entering text using input methods or otherwise generating text.

Updated for Unicode 17.0 and latest USE data.

https://lontar.eu/en/notes/encoding-orders-of-brahmic-scripts/index.html

@lontar This is a very useful resource, Norbert, but I worry that the term ‘encoding order’ as used here may confuse some people, since what you are describing is the order required for cluster shaping at the glyph level, which in various ways differs from the order in which the text is encoded at the character level (hence the need for and usefulness of the article). What you call ‘encoding order’ in the article is what I would call ‘shaping order’, to distinguish it from the underlying order of the text encoding.

I wonder how other people think/talk about this?

@TiroTypeworks

I do mean “encoding order”, the order in which characters should appear within a Unicode character sequence. The Universal Shaping Engine defines its cluster model in terms of Unicode character properties, and validates before any glyph-level processing, so it seems to agree with that.

One issue is the USE’s lack of compatibility with Unicode normalization. The USE applies some decompositions, and rendering systems may apply more steps (decomposition, reordering, composition) before passing text to the USE. However, the USE is not designed to guarantee that text is rendered identically independent of normalization. See USE bugs 905 and 568.
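To make the normalization point concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only shows that canonically equivalent strings can differ as code point sequences, so a shaping engine may be handed either form; it does not model anything the USE itself does. The Bengali example and the helper name `codepoints` are my own choices for illustration.

```python
import unicodedata

def codepoints(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# BENGALI LETTER KA + BENGALI VOWEL SIGN O (a two-part vowel sign)
text = "\u0995\u09CB"

# NFD splits U+09CB into its canonical decomposition U+09C7 U+09BE.
nfd = unicodedata.normalize("NFD", text)
# NFC recomposes the pair back into U+09CB.
nfc = unicodedata.normalize("NFC", nfd)

print(codepoints(text))  # U+0995 U+09CB
print(codepoints(nfd))   # U+0995 U+09C7 U+09BE
print(codepoints(nfc))   # U+0995 U+09CB
```

Both sequences are canonically equivalent, but whether they render identically depends on the shaping engine and the font.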

Do you know of other issues?

I’m not familiar with “shaping order” – can you point me to a definition?

@lontar So ‘encoding order’ is the order of characters in the underlying text string? or something else? To ask the question another way: ‘the order in which characters should appear within a Unicode character sequence’ at what stage of processing?

‘Shaping order’ is the term I use for the input order at the beginning of glyph processing: i.e. the order of the characters at the moment when layout passes from character-level processing to glyph-level processing. So that would mean after any initial reordering of characters as part of the shaping operations.

I suspect, based on my understanding of your tables, that we mean similar things with different terms. I find the word ‘encoding’ very broad and hence vague.

@TiroTypeworks

I think there should be well-defined encoding orders for Brahmic scripts shared between all stages of processing. Sadly the Unicode Standard doesn’t define such encoding orders, or defines them incompletely, or (for Khmer) defines one that’s unworkable. So for the scripts supported by the Universal Shaping Engine the encoding orders defined by the USE are currently your best bet.

I wrote in much more detail about this topic in “Order and disorder in Unicode”:

https://lontar.eu/en/notes/order-and-disorder-in-unicode/

@TiroTypeworks

“The moment when layout passes from character-level processing to glyph-level processing” isn’t well-defined in the Universal Shaping Engine – see USE bug 270.

An interesting point to look at glyph order is just before application of lookups starts. By then shaping systems have processed the input character sequence in several steps:

• (optional) apply full or partial Unicode normalization

• (USE) apply specified partial Unicode normalization

• (some shaping engines, incl. USE) insert dotted circles to ensure valid clusters

• (some shaping engines, not USE) decompose, insert, reorder specific characters

• map characters to glyphs per cmap, possibly applying (de-)composition

Would that be your shaping order?

@lontar Yes, what I call ‘shaping order’ is the state of the string at the end of any pre-processing of character order immediately before, in OTL terms, GSUB begins. It is the order when the shaping engine stops processing character codes and starts processing glyph IDs. I am happy to entertain other terms for this, but I do think it is helpful to be able to identify this order, because it directly determines, via cmap, the starting input order of glyphs for the lookups.

I now understand, if correct, that your ‘encoding order’ is the step previous to the kinds of pre-processing of character order you describe. Yes?

@lontar I have the sense that we have too few terms to unambiguously label the various stages and states of character ordering during text processing and display. E.g. ‘input character sequence’ as you use here: that could either mean the sequence as input by the author, or the sequence as taken as input by the shaping engine (which might have been normalised), or the sequence as provided as input to glyph processing.

@TiroTypeworks

“Input character sequence” is indeed a generic term, and needs to be tied to a process or algorithm to which the character sequence is an input. In this case, it’s an input to a shaping system, such as HarfBuzz.

@TiroTypeworks

I might have called your “shaping order” the “initial glyph sequence”, but as long as we know what we mean, the name doesn’t matter that much to me. And yes, this point in shaping is important because it’s the first time the font gets to see and influence what’s going on.

No, “encoding order” is not tied to any particular step. I use the term to specify how text for a given script *should be* encoded, what we consider “correct”. An actual character sequence may not conform to the specified encoding order. That’s why the USE and some other shaping engines validate their input and insert dotted circles where text does not conform to the specified encoding order.
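As a rough illustration of what “validate and insert dotted circles” means, here is a toy sketch in Python. It is not the USE's cluster model: the USE classifies characters with its own categories, whereas this sketch assumes a drastically simplified rule (any combining mark must follow a letter or another mark) based on Unicode general categories. The function name `insert_dotted_circles` is hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def insert_dotted_circles(text):
    """Insert U+25CC before any combining mark that has no base to attach to.

    Simplifying assumption: a "base" is any character whose general category
    starts with "L" (letter), and a "mark" is any character whose general
    category starts with "M". Real shaping engines use much finer classes.
    """
    out = []
    have_base = False
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and not have_base:
            # Orphaned mark: supply a dotted circle as a placeholder base.
            out.append(DOTTED_CIRCLE)
            have_base = True
        elif not is_mark:
            # Only letters count as bases in this toy model.
            have_base = unicodedata.category(ch).startswith("L")
        out.append(ch)
    return "".join(out)

# Conforming order (letter, then vowel sign) passes through unchanged.
valid = insert_dotted_circles("\u0995\u09BE")
# A vowel sign before its letter does not conform, so it gets a dotted circle.
repaired = insert_dotted_circles("\u09BE\u0995")
```

In this toy model, text that conforms to the assumed order comes out unchanged, and nonconforming text is visibly marked rather than silently reordered.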