Got #PDF? 8 million PDFs/8TB. Derived from #CommonCrawl. We refetched 2 million truncated files.

https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/

New large-scale PDF corpus now publicly available – PDF Association

@tallison what did you use for refetching?
file-observatory/commoncrawl-fetcher at main · tballison/file-observatory

Single server/laptop grade file-observatory.
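
For anyone wanting to try the general idea on a handful of files, here is a minimal Python sketch of "detect a truncated PDF, then refetch it from the original URL". It is not taken from commoncrawl-fetcher. The `%%EOF` tail check is only a heuristic I'm assuming for illustration, and the names `looks_truncated`, `refetch`, and `repair` are hypothetical.

```python
import urllib.request

# Assumption: a complete PDF normally ends with an %%EOF marker,
# so a missing marker near the end suggests the file was cut off.
EOF_MARKER = b"%%EOF"


def looks_truncated(data: bytes, tail_bytes: int = 1024) -> bool:
    """Heuristic check: True if no %%EOF marker appears in the last tail_bytes."""
    return EOF_MARKER not in data[-tail_bytes:]


def refetch(url: str, timeout: float = 30.0) -> bytes:
    """Download the whole file again from its original URL (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def repair(url: str, crawled: bytes) -> bytes:
    """Return the crawled bytes if they look complete, otherwise a fresh download."""
    if looks_truncated(crawled):
        return refetch(url)
    return crawled
```

A real pipeline would also need rate limiting, retries, and handling for URLs that have since disappeared, none of which is shown here.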
