We have a CI job that spots unwanted UTF-8 letters in #curl PRs. We noticed that GitHub will gladly show, for example, the (visually identical) Cyrillic version of a letter next to the Latin version in a diff, and it is, yes, entirely impossible for a human to spot the difference. I mean, the diff is shown, but its significance is not.

Changing just a single letter like that in a URL hostname opens up a world of grief.
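
To make the idea concrete, here is a minimal sketch of such a check. It is not curl's actual CI job; the function name `find_non_ascii` and the sample URL are made up for illustration, and it simply assumes that any character outside ASCII deserves a second look.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line_no, column, char, unicode_name) for every non-ASCII character.

    Hypothetical helper for illustration; a real check would also need an
    allow-list for files where non-ASCII text is expected.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# The Latin "a" in the hostname is replaced by Cyrillic "а" (U+0430).
sample = "https://ex\u0430mple.com/"
for hit in find_non_ascii(sample):
    print(hit)
```

Reporting the Unicode character name alongside the position is what makes the otherwise invisible substitution obvious to a reviewer.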

FOSS Unleashed May 12

@bagder I feel like there need to be tools that make safe handling of Unicode easier. Does anyone know of a full list of Unicode ranges? I know some sites give partial ones, but I'd like the information needed to tell "this sentence contains Unicode characters consistent with language X" from "this sentence contains Unicode characters from 45 different languages".

Peter Krefting

@fossunleashed @bagder All the #Unicode metadata is available at http://unicode.org/Public/

Have a look at https://unicode.org/Public/UNIDATA/Scripts.txt
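
As a rough sketch of how that file could be used: each data line has the form `code point or range ; script name # comment`. The parser below assumes that layout. The function names `parse_scripts` and `script_of` are hypothetical, and `SAMPLE` is a small illustrative excerpt in the file's format, not the real file, which you would download and read instead.

```python
def parse_scripts(text):
    """Parse Scripts.txt-style lines into (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_points, script = (field.strip() for field in line.split(";"))
        # A line holds either a single code point or a "first..last" range.
        first, _, last = code_points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges

def script_of(ch, ranges):
    """Look up the script of one character; "Unknown" if no range matches."""
    cp = ord(ch)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return "Unknown"

# Illustrative excerpt only (assumption: same layout as the published file).
SAMPLE = """\
# Script property excerpt
0041..005A    ; Latin # uppercase A to Z
0061..007A    ; Latin # lowercase a to z
0400..0481    ; Cyrillic # start of the Cyrillic block
"""

ranges = parse_scripts(SAMPLE)
print(script_of("a", ranges), script_of("\u0430", ranges))
```

A linear scan is fine for a sketch; with the full file, sorting the ranges and using binary search would be the obvious next step.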

Peter Krefting May 12

@fossunleashed @bagder These documents are also relevant in this case:

UTR#36: https://www.unicode.org/reports/tr36/
UTR#39: https://www.unicode.org/reports/tr39/

Klaus Stein May 13

@nafmo @fossunleashed @bagder

Some regex engines also provide this.
So one could, e.g., check for
/\p{Arabic}/ && /\p{Armenian}/ && /\p{Cyrillic}/ && … and give a warning if too many of these match.
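
Python's standard `re` module does not support `\p{...}` script classes, so the sketch below swaps in a different technique: it approximates a letter's script by the first word of its Unicode character name (for example `CYRILLIC SMALL LETTER A`). That is a heuristic, not the real Script property, and it will mislabel some characters such as fullwidth or mathematical letters. The function names and the `limit` parameter are made up for illustration.

```python
import unicodedata

def scripts_in(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of the Unicode character name is used as a
    stand-in for the script (e.g. "LATIN", "CYRILLIC", "ARABIC").
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found

def mixes_scripts(text, limit=1):
    """Return True if the text uses more than `limit` different scripts."""
    return len(scripts_in(text)) > limit

print(scripts_in("example.com"))           # Latin letters only
print(mixes_scripts("ex\u0430mple.com"))   # Latin mixed with Cyrillic U+0430
```

Counting distinct scripts rather than testing each one separately gives the "too many of these match" warning directly, and the threshold can be raised for text that is legitimately multilingual.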