Exciting! Overnight, the text recognition job was completed and now I get to see how good the transcription is! #EthicaComplementoria #Transkribus #HTR
Observation so far (manually checked 4 pages): German text looks good, some issues w/ Latin phrases (they're in cursive font); issues w/ uppercase letters; almost all numbers are erroneous; some issues w/ superscript letters not being recognised, some letters missing or doubled.
The issues w/ Latin phrases, uppercase letters & numbers are due to underrepresentation in the training set, I suppose. Superscripts sit far above the letter, so they might not be recognised; doubled/missing letters? Don't know yet!
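One way to test the underrepresentation guess would be to count how much of the training transcription each character class makes up. A minimal sketch, assuming the ground truth is available as plain text; the function names and the sample string are made up for illustration and are not part of Transkribus:

```python
from collections import Counter


def char_class(ch: str) -> str:
    """Sort a character into a coarse class (hypothetical helper)."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    if ch.isspace():
        return "space"
    return "other"


def class_shares(text: str) -> dict[str, float]:
    """Return the share of each character class in the text."""
    counts = Counter(char_class(ch) for ch in text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


# Illustrative sample only, not taken from the actual training set.
sample = "Anno 1674. Ethica Complementoria"
for cls, share in sorted(class_shares(sample).items()):
    print(f"{cls:10s} {share:.1%}")
```

If digits and uppercase letters turn out to be only a tiny fraction of the ground truth, that would fit the guess; it would not prove it.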
Doing the extra round of manual quality checking turns out to be quite fun!
The numbers are just ridiculously wrong: a 38 is read as 24 and 42 as 25; I don't know what's happening. Interestingly, it always identifies numbers as numbers, not letters!
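To put numbers on impressions like these, the raw model output can be compared with the corrected text and the errors tallied per character class. A rough sketch using Python's standard `difflib`, assuming both versions of a line are available as plain strings; `kind` and `errors_by_class` are hypothetical names, and the alignment is a simple diff, not an official error-rate metric:

```python
from collections import Counter
from difflib import SequenceMatcher


def kind(ch: str) -> str:
    """Coarse character class (hypothetical helper)."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "uppercase"
    return "other"


def errors_by_class(htr: str, truth: str) -> Counter:
    """Count corrected-text characters the model misread or dropped, per class.

    Characters the model added without a counterpart are counted as "inserted".
    """
    errors = Counter()
    matcher = SequenceMatcher(None, truth, htr, autojunk=False)
    for tag, t1, t2, h1, h2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        for ch in truth[t1:t2]:
            errors[kind(ch)] += 1
        if tag == "insert":
            errors["inserted"] += h2 - h1
    return errors


# The 38 read as 24, embedded in a made-up line.
print(errors_by_class(htr="Anno 24", truth="Anno 38"))
```

Summed over all proofread pages, a tally like this would show whether digits really are the worst class.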
Generally, the text is recognised very well; there are minor issues, mainly when #Transkribus occasionally draws lines that are too short, so slim letters and punctuation are missed.
I worked on and off today, but I almost got another 25 pages proofread!
While proofreading, I observed 2 distinct ways #Typography is realised in the 1674 #EthicaComplementoria print from Copenhagen. This can happen when more than one worker is typesetting the print. Let's call them Villads & Emil. Villads has been doing this job for a while now; he routinely typesets all #Latin words w/ a beautiful cursive #Antiqua font. Emil, who's already having difficulty reading the #German text, uses #Fraktur for everything & only sets Latin proverbs in cursive. #BookHistory
I notice this so clearly because the #HTR model I've been training to recognize the text had too few cursive characters in its training set. So they get misread, and I have to correct them, which makes me hyperaware of these differences in the typesetting.
When I'm done, I will reconstruct the sheets and printing order to look at the distribution of spelling and other errors and typographical conventions. Exciting! #BookHistory #PrintHistory #AnalyticBibliography
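A first step for that could be grouping per-page correction counts by gathering. A small sketch under loud assumptions: the number of leaves per gathering is a parameter (8, as in an octavo, is only a placeholder, since the format of the print is not stated here), each gathering is treated as one sheet, and page 1 is taken to be the first page of the first gathering. All names and counts are hypothetical:

```python
from collections import defaultdict


def gathering_of(page: int, leaves_per_gathering: int = 8) -> int:
    """1-based gathering index for a 1-based page number (2 pages per leaf)."""
    pages_per_gathering = 2 * leaves_per_gathering
    return (page - 1) // pages_per_gathering + 1


def corrections_per_gathering(
    per_page: dict[int, int], leaves_per_gathering: int = 8
) -> dict[int, int]:
    """Sum per-page correction counts for each gathering."""
    totals: dict[int, int] = defaultdict(int)
    for page, count in per_page.items():
        totals[gathering_of(page, leaves_per_gathering)] += count
    return dict(totals)


# Made-up counts, for illustration only.
print(corrections_per_gathering({1: 3, 16: 2, 17: 5}))
```

Unnumbered preliminaries or irregular gatherings would shift the mapping, so it would need checking against the signatures in the book itself.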
Moving back to the #HTR issues, we have phenomena like these (shown in the image): The layout detection model draws 'short' lines when the text is warped in the book fold. As a result, slim letters and punctuation in particular do not get recognized. When I do the corrections, I will also extend the lines to include these characters. Hopefully, that will improve the recognition! #Transkribus #EthicaComplementoria
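I extend the lines by hand in the editor, but the idea can be sketched in code. This assumes the baseline is stored the way PAGE XML writes coordinates, as a string of `x,y` pairs separated by spaces; the function names, margin and page width are invented for illustration, and the surrounding line polygon would need widening too:

```python
def parse_points(points: str) -> list[tuple[int, int]]:
    """Parse 'x1,y1 x2,y2 ...' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def format_points(pts: list[tuple[int, int]]) -> str:
    """Inverse of parse_points."""
    return " ".join(f"{x},{y}" for x, y in pts)


def extend_baseline(points: str, margin: int, page_width: int) -> str:
    """Move the leftmost and rightmost baseline points outwards by `margin`
    pixels, clamped to the page, keeping their vertical position."""
    pts = sorted(parse_points(points))  # order left to right
    if len(pts) < 2:
        return points
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    pts[0] = (max(0, x0 - margin), y0)
    pts[-1] = (min(page_width, x1 + margin), y1)
    return format_points(pts)


# Hypothetical baseline that stops short of the page edge.
print(extend_baseline("100,200 500,210", margin=30, page_width=520))
```

A purely horizontal shift ignores the curvature in the fold, so this is only a starting point that still needs a visual check.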
And that's it! I've sent off the new training set; hopefully it will fix the issues I have encountered!
I will also use this model for the 1728 print. It's a different printing press, but overall, the two prints are very much alike, and so is the #Fraktur type they use. #Transkribus #EthicaComplementoria #HTR #EarlyModern #PrintHistory