I am usually full of praise for the mass digitisation of cultural heritage led by the #NationalLibraryOfNorway.

However, the #OCR of the newspapers' full text is abysmal.

When I search for "ex libris", I have learned to do it by looking for "ex" – a seldom-used word in Norwegian – & manually checking.

Some examples of what a simple search for "ex" yields. When "ex" does appear together with "libris", the latter word is almost always misread.

80% of the time, "ex" is a misreading of "er" (= "is").

🧵 1/

An example:

In the newspaper Aftenposten, between 1.1.1935 and 31.12.1935, the search term "ex" yields 393 hits. I have to go through these manually, i.e., I look at the "hit in context" that nb.no provides and judge, based on the following sequence of letters or characters, whether this might be a true hit for "exlibris". Within the 1-year timeframe, there was only one hit with sufficient OCR quality.
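The manual check described above could be pre-sorted with a small script. This is only a sketch, assuming the "hit in context" snippets have already been copied out as plain strings; the function name `score_after_ex` and the sample snippets are made up for illustration and are not anything nb.no provides. It scores the letters following each "ex" against "libris" with Python's standard `difflib`, so that likely candidates float to the top.

```python
from difflib import SequenceMatcher
import re

TARGET = "libris"


def score_after_ex(snippet: str, target: str = TARGET) -> float:
    """Best similarity (0.0 to 1.0) between `target` and the letters
    that follow any word-initial "ex" in the snippet."""
    best = 0.0
    for match in re.finditer(r"\bex", snippet, flags=re.IGNORECASE):
        rest = snippet[match.end():]
        # Keep only letters, so "ex libris", "ex-libris" and "exlibris"
        # are all compared the same way.
        letters = "".join(ch for ch in rest if ch.isalpha())
        following = letters[: len(target)].lower()
        if following:
            ratio = SequenceMatcher(None, following, target).ratio()
            best = max(best, ratio)
    return best


# Hypothetical snippets, invented for this example (not real search results).
snippets = [
    "det ex en god dag",   # "ex" as a misread "er"
    "et ex lihris fra",    # "libris" with one misread letter
]

# Highest-scoring snippets first; these are the ones worth opening.
for text in sorted(snippets, key=score_after_ex, reverse=True):
    print(f"{score_after_ex(text):.2f}  {text}")
```

A ranking like this does not replace looking at the scan; it only decides which of the few hundred hits to look at first. Any cut-off score would have to be chosen by trial and error.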

#OCR #NationalLibraryOfNorway #Aftenposten #Search

🧵 2/

Will I catch all "exlibris" with this technique?
What if "ex" is misread as "er" just as often as the other way around? That is hard to tell, since there are millions of occurrences of "er" (the present tense of the Norwegian verb "to be").

What if I look for "libris" instead?

Let's do the test. There were only two hits this time, one of which was the same as the one I had detected manually.

The other, however, escaped my eagle eyes.

🧵 3/

This means I have to double-check with "libris" every time.

That will, of course, not help with occasions where the OCR misreads both words.
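The double-check boils down to comparing two sets of hits. A minimal sketch, assuming each confirmed hit is recorded under some identifier of my own choosing; the identifiers and the function name `merge_hits` below are placeholders, not real nb.no references.

```python
def merge_hits(ex_hits: set[str], libris_hits: set[str]) -> dict[str, set[str]]:
    """Compare the hits confirmed via the "ex" search with those
    found via the "libris" search."""
    return {
        "both": ex_hits & libris_hits,         # found by either route
        "only_ex": ex_hits - libris_hits,      # "libris" was misread
        "only_libris": libris_hits - ex_hits,  # "ex" was misread or overlooked
        "all": ex_hits | libris_hits,          # everything worth reading
    }


# Placeholder identifiers mirroring the 1935 test: one hit confirmed
# via "ex", two hits for "libris", one of them the same.
result = merge_hits({"hit-A"}, {"hit-A", "hit-B"})
print(sorted(result["only_libris"]))
print(sorted(result["all"]))
```

Pages where both words are misread appear in neither set, so the union is still only a lower bound on what is actually in the newspaper.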

Exhausting, but still the reality of much historical research.

I can at least be glad that I can search the entire corpus of Aftenposten in full text in one place! Otherwise, the time required to browse 50+ years of newspapers for articles on exlibris manually would not be defensible.

🧵 4/

Last example:

Potential hit from Aftenposten 1934: This time, the search for "libris" yielded one result, and the search for "ex" yielded 381.

The image shows a good candidate for "ex libris", where "libris" is misread.