
Tim Allison May 16, 2023

Got #PDF? 8 million PDFs/8TB. Derived from #CommonCrawl. We refetched 2 million truncated files.

https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/


DigitalPebble Ltd

@tallison what did you use for refetching?

Tim Allison May 21, 2023

@digitalpebble sadly, a homegrown one-off: https://github.com/tballison/file-observatory/tree/main/commoncrawl-fetcher

If I were to do it again, I’d use #ApacheNutch or #StormCrawler
