@Ange

Reverse engineer, file formats expert.
Corkami, CPS2Shock, PoC||GTFO, Sha1tered, Magika...
Security engineer @ Google. He/him.
GitHub: https://github.com/angea
GitHub: https://github.com/corkami
Pronouns: he/him
On my way #39c3 #ICE612

Oh dear the entire https://www.lyonlabs.org site is offline *and* excluded from archive.org.

It's a massive archive of vintage and modern GEOS and C64 material, much of it seemingly not found elsewhere.

My relative is looking for a 39C3 ticket.
No scammers please ;)

To check if a file starts with MZ or GIF, just use file/libmagic.
You don't need AI or Magika for that.
TrID has a lot of heuristics, but a lot of false positives.

Magika is useful in different ways, across binary and source types, and is quite fast. But it's not useful against weird or adversarial files.

Magika is a fast file type identifier that covers many file types, binary formats or source texts.
It's not made to detect adversarial attacks.
It's useful for different things that classic binary scanning can't do at this speed.

Magika was trained on all the file types with enough available samples.

Weird files are out of scope of Magika. It just wasn't trained on them.

It's trivial to inject some data in a file and keep it functional (w/ my tool Mitra, for example).
So take a JPG, inject a lot of JavaScript data, and... guess what?

Check it out: https://github.com/corkami/mitra

GitHub - corkami/mitra: A generator of weird files (binary polyglots, near polyglots, polymocks...)


Of course, it's possible to create weird files that will fool Magika and other tools.
Polymocks, polyglots...

I made quite a few - check my CCC talk last year:
https://speakerdeck.com/ange/fearsome-file-formats-18374bc4-b3f2-429f-862e-2177ab4d7aae

Fearsome File Formats

Presented at 38C3 in Hamburg on the 28th December 2024. Video recording: https://media.ccc.de/v/38c3-fearsome-file-formats With so many open-sou…

Magika uses the first and last kilobytes of the files.
That way, if the file is slightly corrupted, the filetype might still be properly identified.
Magika returns several file types if needed.
It's one of its advantages, but a double-edged sword.
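
A minimal sketch of the "first and last kilobytes" idea, in plain Python. This is not Magika's actual code: the function name `head_and_tail` and the 1024-byte default are assumptions made for illustration, and what Magika does with those bytes afterwards is not shown.

```python
import io


def head_and_tail(stream, size=1024):
    """Return the first and last `size` bytes of a seekable binary stream.

    Hypothetical helper: the 1024-byte window is an assumption based on
    the "first and last kilobytes" wording, not a documented constant.
    """
    stream.seek(0, io.SEEK_END)
    length = stream.tell()

    # Beginning of the file.
    stream.seek(0)
    head = stream.read(min(size, length))

    # End of the file; for small files this overlaps with the head.
    stream.seek(max(0, length - size))
    tail = stream.read(size)
    return head, tail


if __name__ == "__main__":
    sample = io.BytesIO(b"GIF89a" + b"\x00" * 5000 + b";")
    head, tail = head_and_tail(sample)
    print(len(head), len(tail))
```

In this sketch, bytes in the middle of a large file are never read, which is why slight corruption there would not change the result.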

So file contents are used to determine the file type.
To check if the file starts with '\x7FELF', 'MZ' or 'GIF', you don't need AI.
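
For illustration, a signature check like this fits in a few lines of plain Python. This is only a sketch: the table `SIGNATURES` and the function `sniff_magic` are hypothetical names, and the table holds just the three signatures mentioned above, far fewer than what file/libmagic knows.

```python
from typing import Optional

# Signatures expected at offset zero (only the three examples above).
SIGNATURES = {
    b"\x7fELF": "ELF",
    b"MZ": "DOS/PE executable",
    b"GIF": "GIF",
}


def sniff_magic(data: bytes) -> Optional[str]:
    """Return a format name if `data` starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None  # no known signature at offset zero


if __name__ == "__main__":
    print(sniff_magic(b"GIF89a"))           # GIF
    print(sniff_magic(b"fn main() {}\n"))   # None: source text has no magic
```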

But some file formats don't start with a clear 'magic' signature at offset zero.
And what if you also want to tell apart C++, Rust and HTML?
No magic for source files.

The worst way to identify file types is by file extension:
the extension is stored in the filesystem entry, not in the file content.
It can be lost, modified, variable...

Almost all file formats are known under several file extensions:
.JPG/.JPEG, .ZIP/.APK/.DOCX, .EXE/.DLL, .ELF/.SO ...
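
A small sketch of why the extension is a weak signal, using only the extensions listed above. The names `FORMAT_BY_EXTENSION` and `guess_from_name` are made up for this example; the lookup only ever sees the file name, never the content.

```python
from pathlib import PurePath

# Several extensions map to the same underlying format.
FORMAT_BY_EXTENSION = {
    ".jpg": "JPEG", ".jpeg": "JPEG",
    ".zip": "ZIP", ".apk": "ZIP", ".docx": "ZIP",
    ".exe": "PE", ".dll": "PE",
    ".elf": "ELF", ".so": "ELF",
}


def guess_from_name(filename: str):
    """Guess a format from the file name alone (content is never read)."""
    suffix = PurePath(filename).suffix.lower()
    return FORMAT_BY_EXTENSION.get(suffix)


if __name__ == "__main__":
    print(guess_from_name("app.APK"))       # ZIP
    print(guess_from_name("report.docx"))   # ZIP
    print(guess_from_name("report"))        # None: extension lost
```

Rename or strip the extension and the guess changes or disappears, while the bytes of the file stay exactly the same.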