Long thread ahead about training a "good/bad match" classifier for #Diaphora.

So, the idea I have been working on for quite some time to, somehow, improve matching in Diaphora is the following: train a model to better determine whether a pair of functions in two binaries (i.e., a match between a function A in binary X and a function B in binary Y) is correct or not.

This is not at all my own idea; it is, basically, the only thing academia researches today: almost every academic paper on binary diffing (or, as academia calls it, "Binary Code Similarity Analysis") published in recent years is based on "machine learning" techniques.

Some popular academic examples: DeepBinDiff or BindiffNN. Don't worry if you don't know them. Nobody uses them. At all.

#BinDiff #BinaryDiffing #BinaryCodeSimilarityAnalysis

So, the idea is "try to use ML techniques to improve a real-world tool, #Diaphora". But how? Basically, binary diffing (ignoring the part about extracting whatever features you want to use for comparison) can be summarised in the following steps:

1) Find candidate matches.
2) Rate the matches into "good" or "bad".
3) Choose the best matches and generate a final output.

I haven't found any real-world ML help for #1 or #3. But maybe it can help with #2?
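
To make step #2 concrete, here's a rough sketch (not Diaphora's actual code; field names like "bytes_hash" and "ea" are made up for illustration) of where a learned match rater would slot into those three steps:

```python
# Toy pipeline: the only piece a trained model would replace is rate_match().
def diff(funcs_x, funcs_y, rate_match, threshold=0.5):
    # 1) Find candidate matches: naively, every pair sharing a cheap heuristic key.
    candidates = [(a, b) for a in funcs_x for b in funcs_y
                  if a["bytes_hash"] == b["bytes_hash"] or a["name"] == b["name"]]

    # 2) Rate each candidate; rate_match() returns a score in [0, 1].
    rated = [(a, b, rate_match(a, b)) for a, b in candidates]

    # 3) Choose the best match per function and build the final output.
    best = {}
    for a, b, score in sorted(rated, key=lambda t: t[2], reverse=True):
        if score >= threshold and a["ea"] not in best:
            best[a["ea"]] = (b["ea"], score)
    return best
```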

OK, so now that the idea is clear, "rate function matches between binaries to try to improve matching in #Diaphora", what can I do to solve this problem? The idea:

* Take a big dataset of binaries.
* Export, with Diaphora, every single binary.
* Generate a dataset using the binaries' symbols as ground truth (see the sketch below).
* Train a model with the generated dataset.

Easy, isn't it? How Hard Can It Be (TM)?
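
To show what I mean by "using the symbols as ground truth", here's a very rough sketch of the labelling step. I'm assuming each exported function is a dict with a "name" key, which is not the real Diaphora export format:

```python
import random

def label_pairs(export_a, export_b):
    """Yield (func_a, func_b, label) rows: 1 when the symbol names match
    (ground truth "good" match), 0 for a randomly sampled wrong pairing."""
    by_name = {f["name"]: f for f in export_b}
    for fa in export_a:
        fb = by_name.get(fa["name"])
        if fb is None:
            continue
        yield fa, fb, 1                    # positive: same symbol name
        wrong = random.choice(export_b)    # negative: random non-match
        if wrong["name"] != fa["name"]:
            yield fa, wrong, 0
```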

@joxean The hard part is deciding which features to extract. The rest is easy.
@robbje I already have that part, I think, more or less. Basically, comparison data of the control flow graphs, cyclomatic complexity, pseudo-code (both textual and AST representations), strongly connected components, loops, constants (text and numbers), as well as compilation unit information.
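@robbje Roughly, per candidate pair, something like this (attribute names invented for the sketch; the real exported fields differ):

```python
from difflib import SequenceMatcher

def pair_features(fa, fb):
    """Turn one candidate pair into a numeric vector for the classifier."""
    def ratio(x, y):
        return SequenceMatcher(None, x, y).ratio()
    return [
        abs(fa["cfg_nodes"] - fb["cfg_nodes"]),            # CFG size difference
        abs(fa["cyclomatic"] - fb["cyclomatic"]),          # cyclomatic complexity
        abs(fa["scc"] - fb["scc"]),                        # strongly connected components
        abs(fa["loops"] - fb["loops"]),                    # loops
        ratio(fa["pseudocode"], fb["pseudocode"]),         # textual pseudo-code similarity
        ratio(fa["ast"], fb["ast"]),                       # AST representation similarity
        len(set(fa["constants"]) & set(fb["constants"])),  # shared constants
    ]
```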
@robbje My opinion: the hardest part is not feature selection, but training a realistic model on a realistic dataset with my resources.
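@robbje For what it's worth, a toy version of the "train a model" step is just a few lines with scikit-learn, assuming the label_pairs() and pair_features() sketches above and a pair of actual exports; doing it with a realistic dataset at a realistic scale is another story:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rows = list(label_pairs(export_a, export_b))   # export_a/export_b: two Diaphora exports
X = [pair_features(fa, fb) for fa, fb, _ in rows]
y = [label for _, _, label in rows]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```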