#Geoparsing is about identifying and locating place references in texts. Does it matter which language is being geoparsed? In a new @ijgis article, we examine that question by building a geoparser for Finnish, a morphologically complex language. Co-authored with my wonderful supervisors @tuuli & @tuomo

The article: https://doi.org/10.1080/13658816.2024.2369539
🧵

We argue that languages are not alike in terms of their structure and the availability of (labelled) data. For example, Finnish place names may need to be lemmatised (Helsingissä -> Helsinki) to query databases effectively.
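
To make the lemmatisation point concrete, here is a toy Python sketch. The gazetteer, the lookup table and the function names are all made up for illustration, and the coordinates are approximate. A real geoparser would use a trained lemmatiser, not a hand-written table.

```python
# Toy gazetteer keyed by base-form (lemma) place names.
# Values are approximate (lat, lon) pairs, for illustration only.
GAZETTEER = {
    "Helsinki": (60.17, 24.94),
    "Turku": (60.45, 22.27),
}

# Hypothetical hand-written table standing in for a real lemmatiser.
TOY_LEMMAS = {
    "Helsingissä": "Helsinki",  # inessive case, "in Helsinki"
    "Turussa": "Turku",         # inessive case, "in Turku"
}


def lemmatise(token: str) -> str:
    """Return the base form if the toy table knows it, else the token unchanged."""
    return TOY_LEMMAS.get(token, token)


def lookup(token: str, use_lemma: bool = True):
    """Look a token up in the gazetteer, optionally lemmatising it first."""
    key = lemmatise(token) if use_lemma else token
    return GAZETTEER.get(key)


print(lookup("Helsingissä", use_lemma=False))  # None: the inflected form is not in the gazetteer
print(lookup("Helsingissä"))                   # (60.17, 24.94)
```

The exact-match query fails on the inflected form and only succeeds once the token is reduced to its lemma.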

Check out this geoparsing pipeline of three tasks and three possible outcomes for each place name. Note that lemmatisation errors are a major source of noise and failures.

To tackle geoparsing for Finnish, we draw from existing resources where possible. We fine-tune a Finnish language model with a modified named entity corpus and set up an instance of a Pelias geocoder.
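
For the curious, a minimal sketch of what querying a Pelias instance could look like. It only builds the request URL for the Pelias `/v1/search` endpoint with its `text` and `size` parameters. The host, port and function name are my assumptions here, not the project's actual setup.

```python
from urllib.parse import urlencode

# Assumption: a Pelias instance running locally; host and port are placeholders.
PELIAS_BASE = "http://localhost:4000"


def build_search_url(place_name: str, size: int = 1) -> str:
    """Build a Pelias forward-geocoding URL for a (lemmatised) place name."""
    params = {"text": place_name, "size": size}
    return f"{PELIAS_BASE}/v1/search?{urlencode(params)}"


print(build_search_url("Helsinki"))
# http://localhost:4000/v1/search?text=Helsinki&size=1
```

Sending the request and reading the GeoJSON response is left out so that the sketch runs without a live geocoder.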

And the resources that are missing, such as a labelled Finnish corpus for evaluating geoparsers? We annotated tweets and news articles – both corpora are shared openly.

We also ran into some problems with existing evaluation measures, which tend to rely on the distance between the ground-truth location and the prediction. However, that distance may also be caused by reasons unrelated to the geoparser, such as database mismatches. We propose using polygon / bbox intersection instead.
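
A rough sketch of the bbox intersection idea in Python. This is a simplified illustration, not the exact measure from the article: the `BBox` type, the function name and the coordinates are all made up, and it ignores boxes that cross the antimeridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in lon/lat degrees (hypothetical helper type)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch on both axes."""
    return (
        a.min_lon <= b.max_lon and b.min_lon <= a.max_lon
        and a.min_lat <= b.max_lat and b.min_lat <= a.max_lat
    )


# Illustrative boxes only, not real database extents.
gold = BBox(24.8, 60.1, 25.3, 60.3)         # annotated ground-truth area
pred_inside = BBox(24.9, 60.15, 25.0, 60.2)  # prediction within the same area
pred_far = BBox(22.1, 60.3, 22.5, 60.6)      # prediction somewhere else entirely

print(bboxes_intersect(gold, pred_inside))  # True
print(bboxes_intersect(gold, pred_far))     # False
```

A prediction counts as a match when its box overlaps the ground truth, so a small offset between two databases' point coordinates for the same place no longer registers as an error.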

Finally, the big question: does it work? Yeah, kinda. Despite the simple architecture, the geoparser performed well enough on the two corpora.

If you want to check out the geoparser: https://github.com/DigitalGeographyLab/Finger-geoparser
I also put out a blog post: https://blogs.helsinki.fi/digital-geography/2024/07/05/new-article-out-geographical-and-linguistic-perspectives-on-developing-geoparsers-with-generic-resources/

This work wouldn’t exist without a host of #FOSS tools and the hard work that goes into them. To name a few: #spaCy, Pelias, @qgis, (Geo)Pandas, LibreOffice suite, and @SankeyMATIC. Kudos to the developers and other contributors to these projects!

Open data and the infrastructure to share and update it are likewise crucial. My thanks to OpenStreetMap and @whosonfirst contributors and developers, as well as the good folks at TurkuNLP for Finnish BERT and the NER & TDT corpora. Thanks also to CSC for computational resources, Kone Foundation for funding project #MOBICON and folks at @digigeolab for the good times.

On a personal note, it’s a relief and a joy to have the first paper of my PhD out. Got to keep feeding the machine – another one coming hopefully later this year – but now I’m off for holidays! 😎