Long thread ahead about training a "good/bad match" classifier for #Diaphora.

So, the idea I have been working on for quite some time to, somehow, improve matching in Diaphora is the following: train a model to better determine whether a pair of functions in two binaries (i.e., a match between a function A in binary X and a function B in binary Y) is correct or not.

This is not at all my own idea; it is, basically, the only thing academia researches today: almost every academic paper on binary diffing (or, as academia calls it, "Binary Code Similarity Analysis") published in recent years is based on "machine learning" techniques.

Some popular academic examples: DeepBinDiff or BindiffNN. Don't worry if you don't know them. Nobody uses them. At all.

#BinDiff #BinaryDiffing #BinaryCodeSimilarityAnalysis

So, the idea is "try to use ML techniques to improve a real-world tool, #Diaphora". But how? Basically, binary diffing (ignoring the part about extracting whatever features you want to use for comparison) can be summarised in the following steps:

1) Find candidate matches.
2) Rate the matches into "good" or "bad".
3) Choose the best matches and generate a final output.

I haven't found any real-world ML help for #1 or #3. But maybe it can help with #2?
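
To make step #2 concrete, here's a rough sketch (not Diaphora's actual code; field names like "bytes_hash" and "ea" are made up for illustration) of where a learned match rater would slot into those three steps:

```python
# Toy pipeline: the only piece a trained model would replace is rate_match().
def diff(funcs_x, funcs_y, rate_match, threshold=0.5):
    # 1) Find candidate matches: naively, every pair sharing a cheap heuristic key.
    candidates = [(a, b) for a in funcs_x for b in funcs_y
                  if a["bytes_hash"] == b["bytes_hash"] or a["name"] == b["name"]]

    # 2) Rate each candidate; rate_match() returns a score in [0, 1].
    rated = [(a, b, rate_match(a, b)) for a, b in candidates]

    # 3) Choose the best match per function and build the final output.
    best = {}
    for a, b, score in sorted(rated, key=lambda t: t[2], reverse=True):
        if score >= threshold and a["ea"] not in best:
            best[a["ea"]] = (b["ea"], score)
    return best
```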

OK, so now that the idea is clear, "rate function matches between binaries to try to improve matching in #Diaphora", what can I do to solve this problem? The idea:

* Take a big dataset of binaries.
* Export, with Diaphora, every single binary.
* Generate a dataset using the binaries' symbols as ground truth (see the sketch below).
* Train a model with the generated dataset.

Easy, isn't it? How Hard Can It Be (TM)?
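
To show what I mean by "using the symbols as ground truth", here's a very rough sketch of the labelling step. I'm assuming each exported function is a dict with a "name" key, which is not the real Diaphora export format:

```python
import random

def label_pairs(export_a, export_b):
    """Yield (func_a, func_b, label) rows: 1 when the symbol names match
    (ground truth "good" match), 0 for a randomly sampled wrong pairing."""
    by_name = {f["name"]: f for f in export_b}
    for fa in export_a:
        fb = by_name.get(fa["name"])
        if fb is None:
            continue
        yield fa, fb, 1                    # positive: same symbol name
        wrong = random.choice(export_b)    # negative: random non-match
        if wrong["name"] != fa["name"]:
            yield fa, wrong, 0
```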

@joxean The hard part is deciding which features to extract. The rest is easy.
@robbje I already have that part, I think, more or less. Basically, comparison data of the control flow graphs, cyclomatic complexity, pseudo-code (both textual and AST representations), strongly connected components, loops, constants (text and numbers), as well as compilation unit information.
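@robbje Roughly, per candidate pair, something like this (attribute names invented for the sketch; the real exported fields differ):

```python
from difflib import SequenceMatcher

def pair_features(fa, fb):
    """Turn one candidate pair into a numeric vector for the classifier."""
    def ratio(x, y):
        return SequenceMatcher(None, x, y).ratio()
    return [
        abs(fa["cfg_nodes"] - fb["cfg_nodes"]),            # CFG size difference
        abs(fa["cyclomatic"] - fb["cyclomatic"]),          # cyclomatic complexity
        abs(fa["scc"] - fb["scc"]),                        # strongly connected components
        abs(fa["loops"] - fb["loops"]),                    # loops
        ratio(fa["pseudocode"], fb["pseudocode"]),         # textual pseudo-code similarity
        ratio(fa["ast"], fb["ast"]),                       # AST representation similarity
        len(set(fa["constants"]) & set(fb["constants"])),  # shared constants
    ]
```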
@robbje My opinion: the hardest part is not feature selection, but training a realistic model on a realistic dataset with my resources.
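@robbje For what it's worth, a toy version of the "train a model" step is just a few lines with scikit-learn, assuming the label_pairs() and pair_features() sketches above and a pair of actual exports; doing it with a realistic dataset at a realistic scale is another story:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rows = list(label_pairs(export_a, export_b))   # export_a/export_b: two Diaphora exports
X = [pair_features(fa, fb) for fa, fb, _ in rows]
y = [label for _, _, label in rows]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```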