Mastodawn

David Bamman Jan 13, 2023

New #OpenAccess #NLProc paper and dataset:
OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization.
https://github.com/andreasvc/openboek

Andreas Jan 13, 2023

Inspired by @dbamman's LitBank, this paper presents a Creative Commons-licensed dataset of 19th-century Dutch literature from Project Gutenberg with several layers of annotation. A distinguishing characteristic is that we annotate long fragments (10k+ tokens), considerably longer than the documents in most other coreference datasets.

Andreas Jan 13, 2023

Dutch spelling has changed considerably since the 19th century (English spelling has been much more resistant to change). Gertjan van Noord and I introduce a simple rule-based spelling normalization tool, which we show reduces the number of downstream errors made by NLP tools designed for contemporary Dutch spelling.
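
To give a rough idea of what rule-based normalization can look like, here is a minimal Python sketch. It is not the tool from the paper: the `RULES` table and the `normalize` function are hypothetical names, and the four word pairs are only illustrative examples of well-known 19th-century Dutch spellings, chosen for this sketch.

```python
import re

# Illustrative word-level rules only (an assumption for this sketch);
# these are NOT the rules used by the tool described in the paper.
RULES = {
    "mensch": "mens",
    "menschen": "mensen",
    "zoo": "zo",
    "visch": "vis",
}

TOKEN = re.compile(r"\w+")


def normalize(text: str) -> str:
    """Replace known historical spellings with their modern forms."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        modern = RULES.get(word.lower())
        if modern is None:
            return word  # no rule applies, keep the word unchanged
        if word[0].isupper():
            # restore the initial capital of the original word
            return modern[0].upper() + modern[1:]
        return modern

    return TOKEN.sub(replace, text)


print(normalize("Zoo sprak de mensch."))
```

A word list like this is the safest kind of rule: a blanket pattern such as "drop final -sch" would also rewrite modern words like "logisch", so a real tool needs more careful conditions than this sketch shows.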

Andreas Jan 13, 2023

A demo of the coreference system and spelling normalization evaluated in the paper is available at https://demo.let.rug.nl/andreas/coref/