#CLS #CCLS24 #Journal #JCLS https://jcls.io
This paper presents BookNLP-fr: the adaptation to French of BookNLP, an existing NLP pipeline tailored to literary texts in English. We provide an overview of the challenges involved in adapting such a pipeline to a new language, from data annotation to the development of specialized modules for entity recognition and coreference resolution. Moving beyond the technical aspects, we explore practical applications of BookNLP-fr with a canonical task for computational literary studies: subgenre classification. We show that BookNLP-fr provides more relevant and, even more importantly, more interpretable features for automatic subgenre classification than the traditional bag-of-words approach. BookNLP-fr makes NLP techniques available to a wider public and constitutes a new toolkit for processing large numbers of digitized books in French, allowing the field to gain a deeper literary understanding through the practice of distant reading.
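To make the comparison concrete, the bag-of-words baseline mentioned above can be sketched in pure Python: each text becomes a vector of word counts, and a new text is assigned to the subgenre whose summed-count centroid it most resembles. All texts and labels below are invented toy examples, not the paper's corpus or method.

```python
# Minimal bag-of-words subgenre classifier: word-count vectors plus
# cosine similarity to per-class centroids. Toy data, illustrative only.
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a Counter of lowercased tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

training = {
    "crime":   ["le détective examina la scène du crime",
                "l'inspecteur suivit la piste du meurtrier"],
    "romance": ["elle rêvait de son premier amour",
                "leurs coeurs battaient à l'unisson"],
}

# One centroid (summed word counts) per subgenre.
centroids = {label: sum((bow(t) for t in texts), Counter())
             for label, texts in training.items()}

def classify(text):
    v = bow(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))

print(classify("le meurtrier quitta la scène du crime"))  # → crime
```

The paper's point is that such surface-frequency features, while effective, are hard to interpret; pipeline-derived features (entities, coreference chains) name linguistic phenomena directly.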
Seneca's authorship of Octavia and Hercules Oetaeus is disputed. This study employs established computational stylometry methods based on character n-gram frequencies to investigate the case. In a Principal Component Analysis (PCA) of stylistic similarities within the Senecan corpus, Octavia and Phoenissae emerge as outliers, while Hercules Oetaeus stands out only when the text is split in half. Subsequently, applying PCA and Bootstrap Consensus Trees (BCT) to a corpus of distractor texts, both disputed plays align with the Senecan cluster/branch. The General Impostors method confidently reports Seneca as the author of the disputed plays under various scenarios. However, closer examination of text segments yields indications of mixed authorship. On the basis of computational stylometry, it appears that the disputed plays were in large part, but not wholly, written by Seneca.
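The core technique named in the abstract can be sketched as follows: build character n-gram frequency profiles for each text and project them into two dimensions with PCA (computed here via SVD with NumPy). The texts are short Latin stand-ins, not the Senecan corpus, and the parameters (n=3, top-50 trigrams) are illustrative choices rather than the study's settings.

```python
# Character trigram profiles + PCA-by-SVD: a minimal stylometry sketch.
# Texts are toy stand-ins for two hypothetical authorial groups A and B.
from collections import Counter
import numpy as np

def char_ngrams(text, n=3):
    """Counter of overlapping character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

texts = {
    "A1": "arma virumque cano troiae qui primus ab oris",
    "A2": "italiam fato profugus laviniaque venit litora",
    "B1": "gallia est omnis divisa in partes tres quarum",
    "B2": "unam incolunt belgae aliam aquitani tertiam",
}

# Shared vocabulary: the most frequent trigrams across the whole corpus.
total, profiles = Counter(), {}
for name, t in texts.items():
    profiles[name] = char_ngrams(t)
    total += profiles[name]
vocab = [g for g, _ in total.most_common(50)]

# Relative-frequency matrix: one row per text, one column per trigram.
X = np.array([[profiles[n][g] / sum(profiles[n].values()) for g in vocab]
              for n in texts])

# PCA via SVD of the mean-centered matrix; keep the first two components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U[:, :2] * S[:2]
for name, (pc1, pc2) in zip(texts, coords):
    print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

In a real stylometric setting, outliers in this PC1/PC2 plane (as Octavia and Phoenissae are reported to be) are texts whose n-gram profiles diverge from the rest of the corpus.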
Computational studies of literature use proxies such as sales numbers, human judgments, or canonicity to estimate literary quality. However, many quantitative studies use one such measure as a gold standard without fully reflecting on what it represents. We examine the interrelation of 14 proxies of literary quality in novels published in the US from the late 19th to the 20th century, distinguishing between expert-based judgments (e.g., syllabi, anthologies) and crowd-based ones (e.g., GoodReads ratings). We show that works favored in expert-based judgments often score lower on GoodReads, while award-nominated works tend to circulate more widely in libraries. As we map the literary judgment landscape, two main kinds of 'quality perception' emerge: one associated with canonical literature and one with more popular literature. Additionally, prestige in genre literature, reflected in awards like the Hugo Award, forms a distinct category, more aligned with popular than with canonical proxies.
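Measuring the interrelation of quality proxies amounts to correlating rankings of the same novels under different measures. A minimal sketch, using Spearman rank correlation implemented in pure Python, with invented scores (the proxy names and numbers are illustrative, not the study's data):

```python
# Spearman rank correlation between two hypothetical quality proxies
# for the same five novels. All values are invented for illustration.
def ranks(values):
    """Rank positions (1 = smallest), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

expert = [9.1, 8.7, 6.2, 5.0, 7.8]   # e.g. expert-based score (invented)
crowd  = [3.4, 3.6, 4.5, 4.8, 3.9]   # e.g. crowd-based rating (invented)
print(spearman(expert, crowd))  # → -1.0 (perfect inverse ranking)
```

The invented numbers here are chosen to produce a perfectly inverted ranking, echoing the abstract's finding that expert-favored works often score lower with crowds; real proxy correlations would of course be weaker and computed over many novels.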