How to download large datasets of (mostly legit) files to test malware detection tools thanks to CommonCrawl - A short presentation I gave last week at Pass-the-SALT:
https://passthesalt.ubicast.tv/videos/2024-rump-11-how-to-download-large-datasets-of-files-using-commoncrawl/
Slides: https://archives.pass-the-salt.org/Pass%20the%20SALT/2024/slides/PTS2024-RUMP-11-CommonCrawl_Lagadec.pdf
In short, I used @tallison's tool commoncrawl-fetcher-lite:
https://github.com/tballison/commoncrawl-fetcher-lite
The first step is to pick the right mimetypes to get the files you need:
Get mimetypes-detected.csv from https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes
Also pick the crawl ids from that CSV file.
The second step is to edit the config file with those parameters. For example this one will download EXE files from the crawl CC-MAIN-2018-34:
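As a rough idea of what such a config looks like, here is a hypothetical JSON sketch: the field names below are illustrative assumptions, not the tool's actual schema, so consult the example configs in the commoncrawl-fetcher-lite repository for the real field names.

```json
{
  "numThreads": 10,
  "crawls": ["CC-MAIN-2018-34"],
  "recordSelector": {
    "must": {
      "mime_detected": [
        { "match": "application/x-dosexec" }
      ]
    }
  }
}
```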
Then you can run the tool for a few hours, and it will download thousands of files that (mostly) match the requested mimetypes.
Some post-processing is needed after the download, as shown in the slides.
There are other issues to keep in mind when using CommonCrawl; see the slides for details.