Got #PDF? 8 million PDFs/8TB. Derived from #CommonCrawl. We refetched 2 million truncated files.

https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/

New large-scale PDF corpus now publicly available – PDF Association

Many thanks to @xchatty, digitalcorpora.org, and #AWSOpenDataSets for publishing this set!
@xchatty many thanks to Peter Wyatt (#PDFAssociation) for collaborating on the development of this corpus.
SAFEDOCS (CC-MAIN-2021-31-PDF-UNTRUNCATED) – Digital Corpora

@xchatty and of course many thanks to #CommonCrawl! Cc @sebnagel