How to download large datasets of (mostly legit) files to test malware detection tools thanks to CommonCrawl - A short presentation I gave last week at Pass-the-SALT:
https://passthesalt.ubicast.tv/videos/2024-rump-11-how-to-download-large-datasets-of-files-using-commoncrawl/
Slides: https://archives.pass-the-salt.org/Pass%20the%20SALT/2024/slides/PTS2024-RUMP-11-CommonCrawl_Lagadec.pdf
In short, I used @tallison's tool commoncrawl-fetcher-lite:
https://github.com/tballison/commoncrawl-fetcher-lite
The first step is to pick the right mimetypes to get the files you need:
Get mimetypes-detected.csv from https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes
Also pick the crawl ids from that CSV file.
The second step is to edit the config file with those parameters. For example this one will download EXE files from the crawl CC-MAIN-2018-34:
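As a rough idea of what such a config looks like, here is a hypothetical JSON sketch: the field names below are illustrative assumptions, not the tool's actual schema, so consult the example configs in the commoncrawl-fetcher-lite repository for the real field names.

```json
{
  "numThreads": 10,
  "crawls": ["CC-MAIN-2018-34"],
  "recordSelector": {
    "must": {
      "mime_detected": [
        { "match": "application/x-dosexec" }
      ]
    }
  }
}
```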
Then you can run the tool for a few hours, and it will download thousands of files that (mostly) match the requested mimetypes.
Some post-processing is needed after the download, as shown in the slides.
There are other issues to keep in mind when using CommonCrawl; see the slides for details.