Help needed. Looking for a nice (mixed and possibly messy) set of files to download to test some dp tools against.
#digipress

@Bryony_Hooper

We've got files!

Bug-tracker corpus? (Attachments we crawled from bug trackers for open source parsers)

https://corpora.tika.apache.org/base/docs/bug_trackers/

SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

iPres2023_Data-Set – Google Drive

Index of /format-corpus/pdfCabinetOfHorrors

@tallison @Bryony_Hooper perhaps you can consider a PR here, Tim?

Also, the digital preservation awesome list may be of use, Bryony:

https://github.com/digipres/awesome-digital-preservation#find-test-files

GitHub - digipres/awesome-digital-preservation: Carefully curated list of awesome digital preservation resources.

Add some corpora by tballison · Pull Request #10 · digipres/awesome-digital-preservation
