Got #PDF? 8 million PDFs/8TB. Derived from #CommonCrawl. We refetched 2 million truncated files.
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
@digitalpebble sadly, a homegrown one-off: https://github.com/tballison/file-observatory/tree/main/commoncrawl-fetcher
If I were to do it again, I’d use #ApacheNutch or #StormCrawler