Mastodawn

I wonder if anyone can suggest why text extraction from a PDF doesn't work for Burmese? Here I'm copying the text from a PDF in the first image and pasting into TextMate in the second.

The question came up in our webinar yesterday and I thought it might be to do with glyph naming, but this font has uniXXXX names. Hadn't realised this was such a bad problem, one that users are facing every day.

Is this reportable somewhere?


Bobby de Vos

@ohbendy I am not quite sure what is happening in this example, but in general I would have thought that the PDF would need ActualText specified. Otherwise, if you have U+1000 (ka), U+1031 (e), the e needs to be displayed to the left of the ka, so the glyph stream would be uni1031, uni1000. How is the PDF reader, without ActualText, going to be able to reverse the reordering? Likewise, with U+1000, U+1039 (virama), U+1000, the glyph stream is going to be uni1000, uni1000.sub, with no virama glyph at all.
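To make the two cases concrete, here is a minimal Python sketch of what a reader gets if it only maps glyph names back to code points. The function `naive_text_from_glyph_names` is hypothetical and is not how any particular PDF reader is known to work; the glyph names are the ones from the post.

```python
import re

def naive_text_from_glyph_names(glyph_names):
    """Hypothetical extractor: turn each uniXXXX glyph name into its
    code point, in glyph-stream order, ignoring suffixes like '.sub'."""
    chars = []
    for name in glyph_names:
        match = re.match(r"uni([0-9A-Fa-f]{4})", name)
        if match:
            chars.append(chr(int(match.group(1), 16)))
    return "".join(chars)

# Case 1: ka + vowel sign e. The e is drawn to the left of the ka,
# so the glyph stream is in visual order, not logical order.
logical_ke = "\u1000\u1031"
extracted_ke = naive_text_from_glyph_names(["uni1031", "uni1000"])

# Case 2: ka + virama + ka. The virama has no glyph of its own;
# the second ka is drawn as a subjoined form.
logical_kka = "\u1000\u1039\u1000"
extracted_kka = naive_text_from_glyph_names(["uni1000", "uni1000.sub"])

print(extracted_ke == logical_ke)    # reordered, so the strings differ
print(extracted_kka == logical_kka)  # virama lost, so the strings differ
```

In both cases the glyph stream alone does not carry enough information to get back to the original text, which is why something like ActualText would be needed.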


Khaled Hosny Dec 9

@devosb @ohbendy IIRC, Microsoft’s XPS had some sort of cluster mapping similar to what one gets out of shapers (like HarfBuzz) so it can handle decomposition and re-ordering and possibly even BiDi.
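As a rough illustration of the cluster-mapping idea, here is a Python sketch. It is not XPS's actual format and does not call HarfBuzz; the function `group_by_cluster` and the cluster values are assumptions made up for this example. Each glyph carries the index of the first character of the cluster it came from, so the original text can be recovered even when glyphs are reordered or a character has no glyph.

```python
def group_by_cluster(text, glyphs):
    """Hypothetical helper. `glyphs` is a list of (glyph_name, cluster),
    where cluster is the index in `text` of the first character of the
    cluster that produced the glyph. Returns (glyph_names, text_slice)
    pairs in logical order."""
    starts = sorted({cluster for _, cluster in glyphs})
    ends = starts[1:] + [len(text)]
    groups = []
    for start, end in zip(starts, ends):
        names = [name for name, cluster in glyphs if cluster == start]
        groups.append((names, text[start:end]))
    return groups

# ka + e, followed by ka + virama + ka (cluster values assumed).
text = "\u1000\u1031\u1000\u1039\u1000"
glyphs = [
    ("uni1031", 0),      # vowel sign e, drawn first
    ("uni1000", 0),      # ka, same cluster as the e
    ("uni1000", 2),      # ka
    ("uni1000.sub", 2),  # subjoined ka; the virama is inside this cluster
]

groups = group_by_cluster(text, glyphs)
recovered = "".join(chunk for _, chunk in groups)
print(recovered == text)
```

With this kind of side table, the reader never has to guess the text from glyph names; it just reads the stored characters for each cluster.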