Later today at #CHR2024, we are going to present our work on #Multilingual #Stylometry!

We isolated the influence of #language on #authorship #attribution #accuracy by translating multiple #corpora into each other's languages while keeping #corpus composition stable.

Interactive showcase: https://showcases.clsinfra.io/stylometry

Full paper: https://ceur-ws.org/Vol-3834/paper9.pdf

This work was developed within the @CLSinfra project in #Trier, #Krakow and #Prague with Artjoms Šeļa, Evgeniia Fileva and Julia Dudar.


My lab, Computational Linguistics at Manitoba, is seeking motivated PhD students for #AI and #NLProc research in computational humour, historical born-digital #corpora, and #Indigenous language technology: https://clam.cs.umanitoba.ca/open-positions

So you wanna parse/manipulate some #PDFs, huh!?

Well, you better #test your #software thoroughly or bad things will happen!🧪

So how about "this corpus [which] contains nearly 8 million PDFs gathered from across the web in July/August of 2021":
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

The entire corpus when uncompressed takes up nearly 8 TB!

You can find some more links to different #corpora (even to ones deemed #unsafe!😬) at pdf-association's GitHub:

https://github.com/pdf-association/pdf-corpora

#Parsing #Testing


👋 Greetings! 👋

We wanted to remind all #fakespeakers that the #Fakespeak project is still alive and kicking, especially after a long and #fakenews-filled summer vacation.

We have some great events and research output coming out in the next few months, including a #linguistics conference, #multilingual fake news #corpora, publications bringing together advanced linguistic features and #transformermodels, and a special issue in Linguistics Vanguard on the language of fake news.

Follow along!

Interestingly, very few psychologists are aware of #linguistic #corpora 📊 and their immense research potential. Platforms like CLARIN-PL offer invaluable data that can significantly enhance our understanding of human behaviour and social interactions. 🤝🗣️ It's time more of us psych folk tapped into these resources to advance our field! 🌟🔍

And another one for fellow linguists interested in compiling #corpora of digital discourse: MastoScraper takes advantage of the Mastodon API to collect toots based on a keyword search.
Here goes, feedback welcome!
#linguistics @linguistics
https://fmoncomble.github.io/mastoscraper/

Finally a corpus containing foul language.

The Lexical Tutor concordancer now has a corpus of movie language, COCA Movies 1.6m, so we can see how language is actually used therein.

A potential game changer for corpus linguistics, considering the vast number of humans who only use dictionaries to look up swear words?

#corpora

https://www.lextutor.ca/cgi-bin/conc/wwwassocwords.pl?lingo=English&KeyWordFormat=&Gaps=no_gaps&blockers=&store_dic=Eng_Eng&is_refire=true&Fam_or_Word=family&Source=https%3A%2F%2Fwww.lextutor.ca%2Fconc%2Feng%2F&unframed=true&SearchType=lemma&SearchStr=fuck&Corpus=coca_movies.txt&ColloSize=1&SortType=right&AssocWord=nil&AssocSide=either&Maximum=50000&LineWidth=100

Next week, we'll be discussing how to archive and research social media data on a large scale "After Twitter". Very excited to see what comes out of this conference, and also the following data sprint delving into huge German Twitter corpora.
https://www.dnb.de/twittertagung
#AfterTwitter #corpora #research

Interesting publication on medieval Latin text corpora by @TimGeelhaar: 🔖 Geelhaar, Tim. „Hamsterrad oder Himmelsleiter? Oder warum die Digitalisierung so endlos scheint“. 2024. https://doi.org/10.15499/KDS-005-016.

#Latin #Neolatin #Corpora #OpenAccess

#Eduhub days 2024 at #ZHAW and I cannot be there 😢 🩼

If you go, stop by at the marketplace—in the afternoon my colleague Maren Runte will show our work on creating a learning space for working with linguistic #corpora (to be released later this year)

#EduhubDays24 #DigitalLinguistics

https://eduhubdays2024.events.switch.ch
