Long thread ahead about training a classifier of "good/bad matches" for #Diaphora.

So, the idea I have been working on for quite some time now to try to, somehow, improve matching in Diaphora is the following: train a model to better determine whether a pair of functions in two binaries (i.e., a match between a function A in binary X and a function B in binary Y) is correct or not.

This is not my own idea at all; it is, basically, the only thing academia researches as of today: almost every single academic paper published in recent years about binary diffing (or, as academia calls it, "Binary Code Similarity Analysis") is based on "machine learning" techniques.

Some popular academic examples: DeepBinDiff or BindiffNN. Don't worry if you don't know them. Nobody uses them. At all.

#BinDiff #BinaryDiffing #BinaryCodeSimilarityAnalysis

So, the idea is "try to use ML techniques to improve a real world tool, #Diaphora". But how? Basically, binary diffing (ignoring the part about extracting whatever features you want to use for comparison) can be summarised in the following steps:

1) Find candidate matches.
2) Rate the matches into "good" or "bad".
3) Choose the best matches and generate a final output.
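Steps 2 and 3 above can be sketched in a few lines of Python; the scoring function is exactly the place where a trained classifier could plug in. This is a toy illustration under my own assumptions, not Diaphora's actual code:

```python
# Toy sketch of steps 2 and 3: rate candidate pairs with a scoring
# function, then greedily keep the best one-to-one matches.
def choose_best_matches(candidates, score):
    # Rate every candidate pair (step 2) and sort best-first.
    rated = sorted(((a, b, score(a, b)) for a, b in candidates),
                   key=lambda t: t[2], reverse=True)
    used_a, used_b, result = set(), set(), []
    # Step 3: each function may appear in at most one final match.
    for a, b, s in rated:
        if a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            result.append((a, b, s))
    return result

# Toy example: functions named by letters; the scorer is a stand-in
# for whatever heuristic or trained model rates the pair.
pairs = [("f", "f"), ("f", "g"), ("g", "g"), ("h", "g")]
print(choose_best_matches(pairs, lambda a, b: 1.0 if a == b else 0.1))
```

The interesting point is that the greedy selection (step 3) is unchanged whether the score comes from hand-written heuristics or from a model.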

I haven't found any real world ML help for #1 or #3. But maybe it can help for #2?

OK, so now that the idea is clear, "rate function matches between binaries to try to improve matching in #Diaphora", what can I do to solve this problem? The idea:

* Take a big binaries dataset.
* Export, with Diaphora, every single binary.
* Generate a dataset using as ground truth the symbols of the binaries.
* Train a model with the generated dataset.
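The ground-truth step above can be sketched very simply, assuming each export boils down to a mapping from symbol name to extracted features. The names and shapes here are illustrative assumptions, not Diaphora's actual schema:

```python
# Toy sketch: use symbols as ground truth. If the same symbol name
# appears in both binaries, that pair of functions is a good match
# and gets label 1.
def label_good_matches(funcs_x, funcs_y):
    # funcs_x / funcs_y: {symbol_name: feature_dict} (assumed shape)
    rows = []
    for name, feats_x in funcs_x.items():
        if name in funcs_y:
            rows.append((feats_x, funcs_y[name], 1))  # 1 = good match
    return rows

# Toy data: only "main" exists in both binaries.
x = {"main": {"cc": 4}, "parse": {"cc": 9}}
y = {"main": {"cc": 5}, "dump": {"cc": 2}}
print(label_good_matches(x, y))
```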

Easy, isn't it? How Hard Can It Be (TM)?

So, first problem: which dataset? This is not an easy problem to begin with and this is the very first stone in the path.

Long story short: I ended up using a dataset created by Cisco Talos.

https://github.com/Cisco-Talos/binary_function_similarity

PS: Picture related to this and other problems in academic research in this field.

PS2: The paper https://www.researchgate.net/publication/361691168_Revisiting_Binary_Code_Similarity_Analysis_using_Interpretable_Feature_Engineering_and_Lessons_Learned

#Cisco #Talos #BinaryFunctionSimilarity #Datasets


Now, it's as easy as telling #Diaphora to analyse all the binaries in the dataset with IDA and export the features Diaphora uses. After a couple of days, it finished and I had a bunch of Diaphora-exported .sqlite files.

Cool. Next step? Generate a dataset of good and bad matches. And, oh boy... I think I'm going to just ignore the remaining space in this toot to go for the next one to explain...

Let's suppose that the dataset I mentioned had just 1,000 binaries, and that each binary has only 100 functions. To generate the good matches, I would need to take the binaries and, using the symbols, generate a row with comparison data of the functions' features extracted by #Diaphora, plus a new column with the value '1' to say "this row has the data for a good match".

Cool, this "just" means that I have to do 100 x 1,000 x 1,000 operations for that fictional dataset.
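As a quick back-of-the-envelope check of that count: every binary gets compared against every other binary, with up to 100 symbol-matched rows per comparison.

```python
# Cost of good-match generation for the fictional dataset above.
binaries = 1_000
functions_per_binary = 100

# Every binary compared against every other binary, up to
# `functions_per_binary` symbol-matched rows per comparison.
operations = functions_per_binary * binaries * binaries
print(f"{operations:,} operations")  # 100,000,000
```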

Now, I also need some bad matches. How many? Lol, like I know. At first I was generating 10 bad matches for every single good match. For a very small subset of the Cisco Talos dataset, it was taking (parallelised) some days to process just 256 binaries; the ETA the tool calculated was something along the lines of 770 days. Then I reduced the number of bad matches per good match to 5, performed some optimisations here and there, and now it takes "just" ~24 hours for 256 binaries.
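The negative sampling can be sketched like this (a toy illustration with hypothetical names, not the actual tool): for each good pair, draw N functions with a *different* symbol from the other binary and label those rows '0'.

```python
import random

# Toy sketch of negative sampling: for a good (symbol-matched) pair,
# pick up to n functions from the other binary whose symbol differs,
# and label each such row 0 ("bad match"). n was 10 at first, later
# reduced to 5 to cut generation time.
def sample_bad_matches(good_name, funcs_y, n, rng):
    others = [name for name in funcs_y if name != good_name]
    return [(good_name, bad, 0)
            for bad in rng.sample(others, min(n, len(others)))]

rng = random.Random(42)  # seeded for reproducibility
y = {"main": {}, "parse": {}, "dump": {}, "init": {}}
print(sample_bad_matches("main", y, 2, rng))
```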

Now, I have a dataset that I can use to train a model that I can use for #Diaphora. But this dataset is:

* A rather small subset of the Cisco Dataset.
* Only has Linux binaries, for 3 architectures (x86, arm and mips, both 32 and 64 bit) and 1 compiler (gcc).

In order to "properly" create a good dataset, it would need to...

* Include at least the 3 most used compilers: gcc, clang and msvc.
* Include binaries for the 3 most used operating systems.

Which means... a lot of binaries.

So, how can I do this? Me, just a random open source developer? I have no way to do this at all: I have neither the required hardware resources at my "office" (my home, my house), nor the money to rent the infrastructure from some big company like Google Cloud, AWS, Azure, etc. Which means that, for non big companies, this project is not practical at all.

So, will I be able to finally provide trained models for #Diaphora? I really doubt it. But I will provide the tools.

I will explain in much more detail what I'm working on, and I will also release the tools, during the upcoming @44CON conference in London.
#44con #Diaphora
@joxean The hard part is deciding which features to extract. The rest is easy.
@robbje I already have that part, I think, more or less. Basically: comparison data of the control flow graphs, cyclomatic complexity, pseudo-code (both textual and AST representations), strongly connected components, loops, constants (text and numbers), as well as compilation unit information.
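A hedged sketch of what one comparison row built from such features could look like: numeric features become absolute differences or ratios, set-valued features (like constants) become a Jaccard similarity. Field names here are illustrative assumptions, not Diaphora's schema:

```python
# Jaccard similarity of two collections, used for set-like features.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Turn two functions' extracted features into one comparison row
# (illustrative field names, not the actual exported schema).
def comparison_row(f1, f2):
    return {
        "cc_diff": abs(f1["cyclomatic"] - f2["cyclomatic"]),
        "bb_ratio": min(f1["basic_blocks"], f2["basic_blocks"])
                    / max(f1["basic_blocks"], f2["basic_blocks"]),
        "consts_jaccard": jaccard(f1["constants"], f2["constants"]),
        "loops_diff": abs(f1["loops"] - f2["loops"]),
    }

f1 = {"cyclomatic": 8, "basic_blocks": 12, "constants": {404, 500}, "loops": 2}
f2 = {"cyclomatic": 9, "basic_blocks": 10, "constants": {404}, "loops": 2}
print(comparison_row(f1, f2))
```

A row like this, plus the 0/1 label, is the kind of thing a binary classifier would train on.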
@robbje My opinion: the hardest part is not feature selection, but training a realistic model with a realistic dataset with my resources.