Draft #course #outline: #Subword Units for #Speech and #Language #Technologies

Formal and functional operations. Typology. IA. Morphological segmentation. Lexemes and WP analysis. Lemmatization, reinflection, paradigm completion.

Module 2: #Orthography

Descriptive phonetics, IPA, and phonemes. Unicode. Typology of orthographies. Tokenization. G2P and P2G.

Module 3: #Acoustic #Phonetics

General acoustics. DSP. Acoustic analysis. Applications of phonetics.

Utku Turk Nov 20, 2022

@davidmortensen Would love to help out in case you need anything from Turkish or any discussions on how morphology can be interesting and hard.

David Mortensen Nov 20, 2022

@utkuturk Thanks! Turkish examples are always welcome since they are simultaneously approachable and challenging. Good examples of #morphology as an issue in #Turkish #NLProc are especially welcome.

Utku Turk Nov 20, 2022

@davidmortensen *shameless ad*, you might like the first paragraph and table in section 2 of this paper: https://link.springer.com/article/10.1007/s10579-021-09558-0#Sec2 We show that even a simple word can have 8 different morphological parses. Before that, we note that a Turkish word can carry 8-9 inflectional and up to 6 derivational morphemes!

Resources for Turkish dependency parsing: introducing the BOUN Treebank and the BoAT annotation tool - Language Resources and Evaluation

In this paper, we introduce the resources that we developed for Turkish dependency parsing, which include a novel manually annotated treebank (BOUN Treebank), along with the guidelines we adopted, and a new annotation tool (BoAT). The manual annotation process that we employed was shaped and implemented by a team of four linguists and five Natural Language Processing (NLP) specialists. Decisions regarding the annotation of the BOUN Treebank were made in line with the Universal Dependencies (UD) framework as well as our recent efforts for unifying the Turkish UD treebanks through manual re-annotation. To the best of our knowledge, the BOUN Treebank is the largest Turkish UD treebank. It contains a total of 9761 sentences from various topics including biographical texts, national newspapers, instructional texts, popular culture articles, and essays. In addition, we report the parsing results of a state-of-the-art dependency parser obtained over the BOUN Treebank as well as two other treebanks in Turkish. Our results demonstrate that the unification of the Turkish annotation scheme and the introduction of a more comprehensive treebank lead to improved performance with regards to dependency parsing.

David Mortensen Nov 20, 2022

@utkuturk Yes, it's really impressive (though not without parallel). What students doing #NLProc often don't realize is that morphologically impoverished languages like English and Chinese are actually outliers. While most languages are not Turkish, their word structure is a lot more complicated than that of English (and multilingual language technologies need to be developed with this in mind).

Utku Turk Nov 20, 2022

@davidmortensen Luckily, I have been seeing a lot of efforts in multilingual and/or morphology-driven models out there!