In #AdversarialML, targeted training set attacks are one of the biggest threats to #MachineLearning -- highly effective and hard to detect!

In a #NewPaper at #CCS2022 this week, Zayd Hammoudeh and I show how you can use #InfluenceEstimation to detect, understand, and stop these attacks!

Our methods work against backdoor and poisoning attacks, in vision/text/audio domains, and against adaptive attackers.

https://dl.acm.org/doi/10.1145/3548606.3559335

Title: "Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation"

Key ideas:
💡 Training attacks are *highly influential* to their targets
💡 Targets have *anomalous influence distributions*
💡 Attacks are the targets’ *top influences*

In other words: Stopping training set attacks is an influence estimation problem!

Unfortunately, standard influence estimation methods didn't work!

We set up a toy task where finding influential points should be easy:

Class 1: Frogs (CIFAR)
Class 2: Airplanes (CIFAR) + zero digits (MNIST)

Test images: zero digits

Most influential training instances should be zero digits, right? But current methods select frogs and airplanes instead!

Why do standard methods fail? Because those frog/airplane images have very large gradients, so most methods overestimate their influence.

How do we fix this? If we first *normalize* by gradient magnitude, then we find images specific to the target instead of general outliers.
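A minimal NumPy sketch of the idea (a hypothetical simplification, not the paper's exact estimator): treat each training point's influence as the dot product between its gradient and the target's gradient, then divide by the training gradient's magnitude so large-gradient outliers stop dominating.

```python
import numpy as np

def renormalized_influence(train_grads, target_grad, eps=1e-12):
    """Dot-product influence of each training gradient on a target,
    renormalized by the training gradient's magnitude.
    (Hypothetical sketch of the renormalization idea.)"""
    raw = train_grads @ target_grad              # standard dot-product influence
    norms = np.linalg.norm(train_grads, axis=1)  # per-example gradient magnitude
    return raw / (norms + eps)                   # big gradients no longer dominate

# Toy check mirroring the frog/zero setup:
target = np.array([1.0, 0.0])
grads = np.array([
    [10.0, 10.0],  # huge gradient, only partly aligned (a "frog")
    [0.5, 0.0],    # small gradient, perfectly aligned (a "zero digit")
])
raw = grads @ target                              # raw influence: frog wins
renorm = renormalized_influence(grads, target)    # renormalized: zero digit wins
```

After renormalization, the small-but-aligned "zero digit" outranks the large-gradient "frog", even though its raw influence is far smaller.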

Under this *renormalized* influence estimation, non-attack instances (blue) follow a normal distribution but attack instances (red) show up as extreme outliers in influencing their target.

Do attack instances also influence non-targets (e.g., as collateral damage)? Yes, but not nearly as much.

Against non-targets, attack instances (red) are influential, but within the range of normal influence values (blue).

Therefore, the trick to stopping attacks is to find the targets, and the trick to finding targets is to look for unusual influence patterns!
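That "unusual influence pattern" test can be sketched as a simple z-score outlier check (an illustrative detector, not the paper's full method; the threshold is an assumption): fit a normal distribution to a test point's influence scores and flag the extreme tail.

```python
import numpy as np

def flag_anomalous_influences(scores, z_thresh=4.0):
    """Return indices of training points whose (renormalized) influence
    on a test point is an extreme outlier under a normal fit.
    (Illustrative detector sketch; z_thresh is an assumed cutoff.)"""
    mu, sigma = scores.mean(), scores.std()
    z = (scores - mu) / (sigma + 1e-12)   # standardize the influence scores
    return np.where(z > z_thresh)[0]      # only the extreme upper tail

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 1.0, size=1000)  # benign influences: roughly normal
scores[:5] += 20.0                        # planted "attack" points: extreme outliers
flagged = flag_anomalous_influences(scores)
```

On this toy data, only the five planted outliers exceed the threshold, matching the picture of benign influences following a normal distribution while attack instances sit far in the tail.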

When we apply this idea to multiple attacks in multiple domains, our methods (red and blue) consistently identify the most attack targets (first image) and the most attack instances (second image).

Check out the paper for additional results on attack mitigation, adaptive adversaries, and more.

(Spoiler: it works great!)

So, what’s the catch?
— Target must exist: We can’t identify an attack until we see its target.
— Compute cost: It's expensive to compute so many pairwise influences.

Nonetheless, we hope this establishes a new baseline for stopping training set attacks.

Questions? Comments? Ideas for future work? Collaborations? Let us know!

https://arxiv.org/abs/2201.10055