What say we run 'file' and #siegfried against #ApacheTika's 600k 'application/octet-stream's in the most recent #CommonCrawl crawl?
Anyone else want to join in the fun?
3) I reran Tika, 'file' and #siegfried on all the files.
You can explore the mimes via datasette: https://corpora.tika.apache.org/datasette
Or, download the whole sqlite db: https://corpora.tika.apache.org/base/share/tika-mimes-20230714.db.gz
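If you grab the db, a quick way to get oriented is to list its tables and row counts before writing any real queries. This is just a rough sketch using Python's standard library: it assumes you have already downloaded the .gz file to the current directory, and it makes no assumptions about table or column names since the schema isn't described here. The helper names `decompress` and `summarize_tables` are made up for illustration.

```python
import gzip
import shutil
import sqlite3
from pathlib import Path


def decompress(gz_path, db_path):
    """Gunzip gz_path into db_path, streaming so the whole file is never held in memory."""
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def summarize_tables(db_path):
    """Return {table_name: row_count} for every table in the SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from sqlite_master itself, so quoting them is enough here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    # Assumed local file name, matching the download link above.
    gz_file = Path("tika-mimes-20230714.db.gz")
    db_file = Path("tika-mimes-20230714.db")
    if gz_file.exists():
        decompress(gz_file, db_file)
        for table, count in summarize_tables(db_file).items():
            print(f"{table}: {count} rows")
```

From there, `PRAGMA table_info(<table>)` will show the columns of whichever table looks interesting.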
I mean, who wouldn't want to spend the weekend looking for differences between #siegfried and #file and #ApacheTika?!