Marcus Botacin

CS Assistant Professor at Texas A&M @TAMUEngineering; PhD @SECRET_UFPR
@UFPR; CE/CS Master @Unicamp_IC; Interested in #Malware Analysis & Reverse Engineering. More about me: marcusbotacin.github.io/
Proud advisor moment! Congrats Nhat Nguyen for successfully defending his MSc thesis! Nhat is my first advised student to graduate! Thx Dr. Peeples and Dr. Hamilton for participating in the committee. Wait for some cool papers to come on automatic YARA rule generation!
Once again, pseudo-labels help to mitigate the effects of limited queue sizes, which is a constraint of many real-world pipelines!
Where do the delays come from? In addition to sandbox delays, another source is limited buffer sizes. When the buffer is limited, not all samples are processed, and limiting the number of samples considered in retraining causes the same effect as label delays.
A drawback is that pseudo-labels must be short-lived to remain beneficial. The secondary classifier has to be updated when drift is detected; otherwise, the outdated pseudo-labels start to poison the main classifier, degrading its performance.
Using pseudo-labels is beneficial not because it increases the detection rate, but because it changes the drift dynamics: different drift points are observed when pseudo-labels vs. delayed labels are used.
The key to mitigating the impact of truly delayed labels (e.g., from a sandbox) is to have a mechanism (e.g., a more powerful, cloud-based static classifier) that provides temporary labels (pseudo-labels). These help reduce the response time, which has significant effects.
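A minimal sketch of that idea (all names are hypothetical, not from the paper's code): each sample gets an immediate pseudo-label from a fast static classifier, and the sandbox's true label replaces it once the verdict arrives after a fixed delay.

```python
def current_labels(samples, static_clf, sandbox_clf, delay, now):
    """Labels visible at time `now`: the true (sandbox) label for samples
    whose verdict has already arrived, a pseudo-label otherwise."""
    out = []
    for t, s in enumerate(samples):
        if t + delay <= now:
            out.append((sandbox_clf(s), "true"))    # sandbox verdict arrived
        else:
            out.append((static_clf(s), "pseudo"))   # temporary pseudo-label
    return out

# Toy example: 4 samples, sandbox verdicts take 2 time steps to arrive.
labels = current_labels([0, 1, 2, 3],
                        static_clf=lambda s: 0,
                        sandbox_clf=lambda s: 1,
                        delay=2, now=3)
print(labels)  # first two samples have true labels, the rest are pseudo
```

Retraining on `labels` at each step, instead of waiting for all true labels, is what shortens the response time.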

Another unrealistic assumption is that the true labels (ground truth) are immediately available, which is not the case in the real world. If labels are delayed, retraining is delayed, and the exposure grows.

If the labels are delayed for a long time, drift-triggered retraining is no longer effective, and performance degrades to the level of no drift detection at all.

The problem with most evaluations is that they assume ideal conditions, while the real world is much harder. With a traditional metric, we cannot understand the real impact these restrictions cause, but with the new metric we can highlight their effect.

A first restriction is the amount of data one can keep in the history queue used to retrain the models. If we consider a limited queue (triggered only by drift warnings) rather than the entire history, there is a noticeable long-term impact.
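The bounded-queue effect can be illustrated in a few lines (a toy sketch, not the paper's pipeline): a fixed-size buffer silently evicts the oldest samples, so any retraining only ever sees a recent window of the stream.

```python
from collections import deque

# A bounded history queue of capacity 3: older samples are dropped
# before retraining can use them, mimicking a limited real-world buffer.
buffer = deque(maxlen=3)
for sample in range(5):
    buffer.append(sample)

print(list(buffer))  # [2, 3, 4] -- samples 0 and 1 were evicted
```

Samples evicted before a retraining event behave exactly like samples whose labels never arrived, which is why limited buffers mimic label delays.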

We believe we should take a cumulative view because the same sample keeps threatening users over time if detection is missed. We propose the exposure metric, which highlights the effect of FNs. It makes clear that concept drift detection is a very efficient approach in the long term.
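One plausible reading of that metric (a sketch under my own assumptions, not the paper's exact definition): exposure as the cumulative count of false negatives over the stream, so every missed malware sample keeps contributing until the end.

```python
from itertools import accumulate

def exposure(y_true, y_pred):
    """Cumulative false-negative count at each time step:
    a missed malware sample (true=1, pred=0) adds 1 from then on."""
    fn = (1 if t == 1 and p == 0 else 0 for t, p in zip(y_true, y_pred))
    return list(accumulate(fn))

# Malware at steps 1-4; the classifier misses steps 2 and 4.
print(exposure([0, 1, 1, 1, 1], [0, 1, 0, 1, 0]))  # [0, 0, 1, 1, 2]
```

Unlike point-in-time precision, this curve never "forgets" a miss, which is what makes long-term differences between retraining strategies visible.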
It is known that malware classifiers drift over time and need retraining, but what is a good way to visualize this phenomenon? Just looking at how traditional metrics (e.g., precision) vary over time does not tell us the whole story.