In a methods / #DigitalHumanities class next semester, I want to cover basic corpus creation. In particular, I’ll probably focus on #OCR/#HTR/#ATR and #WebScraping. I find it incredibly hard to find good papers that can serve as a general introduction to these topics. All I find are either practical tutorials or very specialized papers about specific approaches. Do you have any favorite readings about how to get to a text corpus in DH in the first place? Please share!
@felwert I gather information on web texts in #DigitalHumanities contexts on this page, which could be another starting point with references:
https://trafilatura.readthedocs.io/en/latest/compendium.html
Compendium: Web texts in linguistics and humanities — trafilatura 1.12.2 documentation

This page summarizes essential information about building and operation of web text collections. It primarily addresses concerns in linguistics and humanities.