I’m training up some machine learning models to 'disambiguate homographs' (aka heteronyms, or words with identical Latin alphabet spellings but different pronunciations like ‘bow/bow’, ‘tear/tear’). This will help solve one of the more annoying aspects of auto transliteration into #Shavian. It is both very exciting and intensely boring, since I’m having to make the data sets. Hard to believe there is almost no publicly available training data for this. #𐑖𐑱𐑝𐑾𐑯
𐑷𐑀𐑕𐑴, 𐑣𐑧𐑀𐑴—𐑦𐑑𐑕 π‘šπ‘°π‘― 𐑩 𐑒𐑲𐑀! #𐑖𐑱𐑝𐑾𐑯
@shavian 𐑢𐑧𐑤𐑒𐑪𐑥 𐑚𐑨𐑒!
@shavian 𐑣𐑱, 𐑢𐑧𐑤𐑒𐑩𐑥 𐑚𐑧𐑒!
@shavian Sounds like a great project! I have spent many an hour manually resolving homographs :)
@shavian This is a big problem with Persian texts too
@shavian Seems like part of speech classification would cover most of these.
@jdonland Most cases yes, but there are still about 80 or so words that are troublesome (e.g. if you count bow, bows, bowing, bowed as separate words).
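
To make the point of this exchange concrete, here is a minimal sketch of a part-of-speech lookup and where it stops helping. Everything in it is an assumption for illustration: the table `PRONUNCIATIONS`, the tag names `NOUN`/`VERB`, the helper names, and the broad IPA transcriptions are hypothetical and are not taken from the author's models or data sets. The part-of-speech tag is passed in by hand rather than produced by a tagger.

```python
# Hypothetical table: (spelling, coarse POS tag) -> candidate pronunciations.
# Transcriptions are approximate and only serve to tell the readings apart.
PRONUNCIATIONS = {
    ("use", "NOUN"): ["juːs"],
    ("use", "VERB"): ["juːz"],
    ("house", "NOUN"): ["haʊs"],
    ("house", "VERB"): ["haʊz"],
    # Same spelling and same POS, but two readings: POS alone cannot decide.
    ("bow", "NOUN"): ["baʊ", "boʊ"],     # ship's bow vs. archery bow
    ("tear", "NOUN"): ["tɪər", "tɛər"],  # teardrop vs. a rip
}


def candidates(word, pos):
    """Return the pronunciations listed for this spelling and POS tag."""
    return PRONUNCIATIONS.get((word.lower(), pos), [])


def resolve_by_pos(word, pos):
    """Return the pronunciation if the POS tag settles it, else None."""
    found = candidates(word, pos)
    return found[0] if len(found) == 1 else None


def needs_context_model(word, pos):
    """True when several readings survive the POS filter."""
    return len(candidates(word, pos)) > 1


if __name__ == "__main__":
    for word, pos in [("use", "VERB"), ("house", "NOUN"), ("bow", "NOUN")]:
        print(word, pos, resolve_by_pos(word, pos), needs_context_model(word, pos))
```

In this sketch, words like "use" and "house" are settled by the tag alone, while the noun readings of "bow" and "tear" fall through to whatever context-aware model handles the remaining troublesome cases.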