We have a CI job to spot unwanted UTF-8 letters in #curl PRs, because we have noticed that GitHub will gladly show, for example, the (visually identical) Cyrillic version of a letter next to the Latin version in a diff, and it is, yes, entirely impossible for a human to spot the difference. I mean the diff is shown, but the significance of it is not.

Changing just a single letter like that in a URL hostname opens up a world of grief.
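
A minimal sketch of what such a check could look like, in Python. This is not the actual curl CI job; the function name `find_non_ascii` and the sample URLs are made up for illustration. It simply reports every character outside the ASCII range together with its Unicode name, so a lookalike letter becomes visible in the output even when it is invisible in the diff.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_no, column, char, unicode_name) for each non-ASCII character.

    Hypothetical helper for illustration only; line and column are 1-based.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# The second hostname uses U+0430 CYRILLIC SMALL LETTER A instead of Latin "a".
sample = "https://example.com/\nhttps://ex\u0430mple.com/"

for line_no, col, ch, name in find_non_ascii(sample):
    print(f"line {line_no}, col {col}: U+{ord(ch):04X} {name}")
```

A real check would need an allowlist for files where non-ASCII text is legitimate (contributor names, translations and so on).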

@bagder I feel like there need to be tools that make safer handling of Unicode easier. Anyone know of the full list of Unicode ranges? I know there are some sites that give partial ones. But I'd like the information needed to detect "this sentence contains Unicode characters consistent with language X" vs "this sentence contains Unicode characters for 45 different languages".

@fossunleashed @bagder These documents are also relevant in this case:

UTR#36: https://www.unicode.org/reports/tr36/
UTR#39: https://www.unicode.org/reports/tr39/


@nafmo @fossunleashed @bagder

Some regex engines also provide this through Unicode script properties.
So one could e.g. check for
/\p{Arabic}/ && /\p{Armenian}/ && /\p{Cyrillic}/ && … and give a warning if too many of these match.
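
A rough sketch of that idea in Python, under some assumptions. Python's built-in `re` module does not support `\p{...}` script properties (the third-party `regex` module does), so this version approximates a character's script by the first word of its Unicode name (LATIN, CYRILLIC, GREEK, ...). That is a heuristic, not the real Script property from the Unicode data files, and the names `scripts_in` and `is_mixed_script` are hypothetical.

```python
import unicodedata


def scripts_in(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of the Unicode character name stands in
    for the script. Digits, punctuation and spaces are ignored.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(text, max_scripts=1):
    """Return True if text uses more scripts than max_scripts allows."""
    return len(scripts_in(text)) > max_scripts


print(scripts_in("example.com"))             # only Latin letters
print(is_mixed_script("ex\u0430mple.com"))   # Latin plus one Cyrillic letter
```

The threshold is a judgment call: some languages legitimately mix scripts in ordinary text, so a hostname check would probably use `max_scripts=1` while prose would need something looser.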