What does it mean to evaluate ML model robustness for a security model?

It means measuring how the model performs against an adversary and under distribution shift, not just on a held-out test set. A malware classifier can hit 99% test accuracy and still be flipped on nearly every sample by a small, feasible perturbation. Robustness evaluation reports robust accuracy (accuracy on adversarial inputs within a stated perturbation budget) and the attack success rate (fraction of correctly classified inputs an attacker can flip), against a named attack. It maps to MITRE ATLAS evade-ML-model behavior (AML.T0015, crafting in AML.T0043) and the threat classes in NIST AI 100-2.

Which tools do I use to evaluate adversarial robustness?

The Adversarial Robustness Toolbox (ART) from Trusted-AI is the most complete: it works across scikit-learn, PyTorch, TensorFlow, and XGBoost, and covers evasion, poisoning, extraction, and inference attacks. For neural-network evasion specifically, AutoAttack is the standard parameter-free ensemble for reporting robust accuracy, and RobustBench publishes standardized leaderboards. Foolbox and CleverHans focus on evasion attacks, mostly gradient-based. For non-differentiable models like random forests, use black-box query attacks: HopSkipJump and Boundary are decision-based and need only the predicted label, while ZOO is score-based and needs the model's output probabilities.

Why is clean test accuracy not enough for a security model?

Because an attacker chooses the input. A held-out test set samples the same distribution the model trained on, so high test accuracy only tells you the model works when nobody is trying to defeat it. Security models face an adversary who searches for the specific input that breaks the decision, which is exactly what an evasion attack does. A model with 99% clean accuracy and near-zero robust accuracy at a feasible perturbation budget is, for security purposes, not working.

What is the difference between feature-space and problem-space attacks?

A feature-space attack perturbs the numeric feature vector directly. That is fine for measuring a decision boundary, but the result may not correspond to a real input an attacker can produce. A problem-space attack perturbs the actual artifact (a PE file, a URL, a network flow) under domain constraints: the malware still has to parse and execute, the URL still has to resolve. Robustness numbers from unconstrained feature-space attacks overstate the threat in some domains and understate it in others. Pierazzi et al. (IEEE S&P 2020) formalized the gap; evaluate in the problem space when your threat model is real malware.

Can a model look robust when it is not?

Yes. A single weak attack, or a model that masks gradients, can produce inflated robust accuracy. Athalye et al. (2018) showed several published defenses gave a false sense of security because the evaluation attack could not find adversarial examples that existed. Defend against this by using a strong ensemble like AutoAttack, adding a black-box decision-based attack as a sanity check (it does not rely on gradients), and confirming attack success climbs toward 100% as you raise the perturbation budget. If it does not, suspect the evaluation, not the model.

How to Evaluate ML Model Robustness for Security Use Cases

A malware classifier that scores 99% on a held-out test set tells you one thing: the model works when nobody is trying to defeat it. For a security model, that is the uninteresting case. The attacker picks the input. The right question is not “how accurate is it on the test set,” but “how much accuracy survives an adversary who is searching for the input that breaks it.”

That number is almost always lower, and for many deployed detection models it can be close to zero. Evaluating robustness means measuring performance under adversarial perturbation and under distribution shift, then reporting it honestly. Here is how to do it.

Robustness Is Performance Under an Adversary, Not on a Holdout

Standard evaluation samples the same distribution the model trained on. A robustness evaluation assumes an adversary who optimizes the input against your model. The two metrics that matter:

Robust accuracy: accuracy on adversarial examples constrained to a stated perturbation budget (for example, L-infinity epsilon = 0.03 on normalized features). Report the budget. Robust accuracy with no epsilon is meaningless.
Attack success rate: of the inputs the model originally classified correctly, the fraction an attacker can flip within that budget. This is the operational number: it is the share of detections an evader defeats.
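As a rough sketch of how the two metrics relate, the snippet below computes both from prediction arrays at a single budget. The function name, argument layout, and toy numbers are assumptions made for illustration; it also assumes the attack was run only on inputs the model originally classified correctly, so inputs that were already wrong simply stay wrong.

```python
import numpy as np

def robustness_metrics(y_true, clean_pred, adv_pred, linf, eps):
    """Return (clean_acc, attack_success_rate, robust_acc) at budget eps.

    adv_pred and linf cover only the correctly classified inputs,
    in their original order.
    """
    y_true = np.asarray(y_true)
    clean_pred = np.asarray(clean_pred)
    correct = clean_pred == y_true
    y_eval = y_true[correct]
    # A flip counts only if the perturbation stays inside the budget.
    evaded = (np.asarray(adv_pred) != y_eval) & (np.asarray(linf) <= eps)
    clean_acc = correct.mean()
    success_rate = evaded.mean() if evaded.size else 0.0
    # Survivors are the correct inputs the attacker failed to flip.
    robust_acc = (~evaded).sum() / len(y_true)
    return float(clean_acc), float(success_rate), float(robust_acc)

# Toy numbers, invented for the example: 10 samples, 8 classified correctly.
y_true     = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
clean_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
adv_pred   = np.array([0, 0, 1, 0, 0, 0, 1, 0])   # predictions after the attack
linf       = np.array([0.01, 0.02, 0.00, 0.10, 0.00, 0.00, 0.03, 0.00])

clean_acc, asr, robust_acc = robustness_metrics(
    y_true, clean_pred, adv_pred, linf, eps=0.03
)
print(f"clean={clean_acc:.3f} success={asr:.3f} robust={robust_acc:.3f}")
```

Under that assumption, robust accuracy equals clean accuracy times (1 - attack success rate), which is a handy consistency check on any reported pair of numbers.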

These map directly onto a threat model. MITRE ATLAS tracks evading a deployed model as AML.T0015 with adversarial input crafted in AML.T0043. NIST AI 100-2 gives the taxonomy and vocabulary. Decide which of those your model has to withstand before you run anything.

Pick the Right Attack for Your Model

The attack you can run depends on what access you assume and whether the model is differentiable.

White-box, differentiable (neural nets): gradient attacks. Fast Gradient Sign Method (FGSM) for a one-step baseline, Projected Gradient Descent (PGD) for an iterative one. For a defensible robust-accuracy number, use AutoAttack, a parameter-free ensemble (APGD-CE, APGD-DLR, FAB, Square) built specifically to avoid the weak-attack inflation discussed below.
Black-box, query access only (random forests, gradient-boosted trees, anything behind an API): query attacks that never see gradients. HopSkipJump and Boundary are decision-based and need only the predicted label, estimating a direction to the boundary from query responses; ZOO is score-based and needs the returned probabilities to estimate gradients. This is the realistic setting for most production security models, where an attacker hits an inference endpoint and never sees gradients.

Many security classifiers in the field are tree ensembles on tabular features, which are not differentiable, so the black-box path is usually the honest one.
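For the white-box case, the one-step gradient baseline is small enough to write by hand. The sketch below applies FGSM to a logistic-regression detector in plain NumPy; the weights, bias, and sample are hypothetical values chosen so the effect is visible, and a real evaluation would use a library implementation against the actual model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, eps):
    """One-step FGSM against p(y=1|x) = sigmoid(w.x + b) with cross-entropy loss."""
    p = sigmoid(x @ w + b)
    # For binary cross-entropy on a linear model, d(loss)/dx = (p - y) * w.
    grad = (p - y)[:, None] * w[None, :]
    # Step eps along the sign of the gradient, then stay in the feature range.
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Hypothetical 3-feature detector; these weights are invented for the example.
w = np.array([2.0, -1.0, 0.5])
b = -0.5
x = np.array([[0.6, 0.2, 0.4]])   # scored as malicious (class 1)
y = np.array([1.0])

x_adv = fgsm_logistic(x, y, w, b, eps=0.25)
score_before = sigmoid(x @ w + b)[0]
score_after = sigmoid(x_adv @ w + b)[0]
print(f"score before: {score_before:.3f}, after: {score_after:.3f}")
```

With these made-up weights the score moves from above 0.5 to below it, so the sample is flipped inside an L-infinity budget of 0.25. A tree ensemble has no such gradient to follow, which is why it needs the query-based path.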

A Minimal Evasion Evaluation with ART

The Adversarial Robustness Toolbox (ART) wraps your trained model and runs the attacks. Here is a complete evasion evaluation against a scikit-learn classifier using the decision-based HopSkipJump attack, which needs only predict():

import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import HopSkipJump

# clf: an already-trained sklearn classifier (e.g. RandomForestClassifier)
# X_test, y_test: held-out evaluation data as NumPy arrays, features scaled to [0, 1]
# clip_values keeps adversarial features inside the valid [0, 1] range
classifier = SklearnClassifier(model=clf, clip_values=(0.0, 1.0))

clean_pred = clf.predict(X_test)
clean_acc = (clean_pred == y_test).mean()

# only attack inputs the model currently gets right
correct = clean_pred == y_test
X_eval, y_eval = X_test[correct], y_test[correct]

# HopSkipJump searches for the closest adversarial point, so a flip only
# counts as a success if it lands inside the stated budget
eps = 0.03
attack = HopSkipJump(classifier=classifier, targeted=False,
                     norm=np.inf, max_iter=50, max_eval=1000)
X_adv = attack.generate(x=X_eval)

adv_pred = clf.predict(X_adv)
linf = np.abs(X_adv - X_eval).max(axis=1)         # budget actually used
evaded = (adv_pred != y_eval) & (linf <= eps)     # flipped within the budget
success_rate = evaded.mean()                      # fraction of correct inputs flipped
robust_acc = (~evaded).sum() / len(y_test)        # accuracy under attack, full test set

print(f"clean accuracy:   {clean_acc:.3f}")
print(f"robust accuracy:  {robust_acc:.3f} at Linf <= {eps}")
print(f"attack success:   {success_rate:.3f} at Linf <= {eps}")
print(f"median Linf of flipped inputs: {np.median(linf[adv_pred != y_eval]):.3f}")

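Repeat the measurement at several perturbation budgets and plot robust accuracy against epsilon. Because HopSkipJump searches for the smallest perturbation that flips each input, one attack run is enough: count a flip as a success only when its distance is within epsilon, and move that threshold to trace the curve. A useful model degrades gracefully. A model whose robust accuracy collapses to near zero at a perturbation an attacker can trivially produce is not providing the protection its test accuracy implies.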
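One way to build that curve from a single minimum-norm attack run is sketched below. The helper name and the toy distances are assumptions for illustration; `flipped` and `linf` stand in for the per-sample results of the attack, and `n_total` is the size of the full test set, including inputs the model got wrong before any attack.

```python
import numpy as np

def robust_accuracy_curve(flipped, linf, n_total, budgets):
    """Robust accuracy at each budget from one minimum-norm attack run.

    flipped: bool array, True where the attack changed the prediction
    linf:    perturbation size per attacked input (same length as flipped)
    n_total: size of the full test set, including inputs already misclassified
    """
    flipped = np.asarray(flipped, dtype=bool)
    linf = np.asarray(linf, dtype=float)
    curve = []
    for eps in budgets:
        # An input is evaded at this budget only if the flip fits inside it.
        evaded = flipped & (linf <= eps)
        curve.append((len(flipped) - evaded.sum()) / n_total)
    return np.array(curve)

# Made-up attack results for five correctly classified inputs out of six total.
flipped = np.array([True, True, True, False, True])
linf = np.array([0.01, 0.04, 0.08, 0.00, 0.20])
budgets = [0.0, 0.02, 0.05, 0.10, 0.25]

curve = robust_accuracy_curve(flipped, linf, n_total=6, budgets=budgets)
for eps, acc in zip(budgets, curve):
    print(f"eps={eps:<5} robust accuracy={acc:.3f}")
```

The returned array can go straight into a plotting call with `budgets` on the x-axis. At a budget of zero the value is the clean accuracy, which is a quick check that the bookkeeping is right.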

Report Robust Accuracy, Not the Strongest Attack You Could Beat

The most common evaluation failure is picking a weak attack and reporting the survivable number. Athalye, Carlini, and Wagner showed in Obfuscated Gradients Give a False Sense of Security (ICML 2018) that several published defenses were not robust at all; their evaluations simply used attacks that could not find the adversarial examples that existed. Gradient masking does the same thing quietly.

Three habits prevent this:

Use a strong ensemble (AutoAttack) for the headline robust-accuracy number, not a single FGSM pass.
Add a black-box decision-based attack as an independent check. It does not use gradients, so it cannot be fooled by gradient masking. If the black-box attack succeeds where the white-box attack failed, your white-box evaluation was broken.
Sanity check the curve: attack success should approach 100% as the budget grows. If raising epsilon does not raise success, suspect the evaluation.
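The third habit can be automated with a few lines. This is a minimal sketch, not an established test: the function name is hypothetical and the 0.95 saturation threshold is an arbitrary assumption that should be set to match the largest budget you sweep.

```python
import numpy as np

def check_success_curve(budgets, success, saturation=0.95):
    """Return a list of warnings for an attack-success-vs-budget curve."""
    budgets = np.asarray(budgets, dtype=float)
    success = np.asarray(success, dtype=float)
    # Evaluate the curve in order of increasing budget.
    success = success[np.argsort(budgets)]
    warnings = []
    if np.any(np.diff(success) < 0):
        warnings.append("success drops as the budget grows")
    if success[-1] < saturation:
        warnings.append("success does not saturate at the largest budget")
    return warnings

# Illustrative curves, not measured results.
healthy = check_success_curve([0.01, 0.05, 0.1, 0.5], [0.10, 0.45, 0.80, 1.00])
suspect = check_success_curve([0.01, 0.05, 0.1, 0.5], [0.10, 0.12, 0.11, 0.15])
print("healthy:", healthy)
print("suspect:", suspect)
```

A flat or non-monotone curve like the second one is the pattern that points at a broken attack or masked gradients rather than at a robust model.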

For image models there is a shortcut to a credible number: RobustBench publishes standardized AutoAttack leaderboards and pretrained robust models you can compare against.

Constraints: Feature-Space Wins That Don’t Survive Contact

A feature-space attack perturbs the numeric vector directly. That measures the decision boundary cleanly, but the perturbed vector may not correspond to any artifact an attacker can actually build. You cannot flip arbitrary bytes in a PE file and still have it parse and execute; you cannot edit a URL’s entropy without changing the URL.

If your threat model is real malware or real network traffic, evaluate in the problem space: perturb the artifact under domain constraints (append-only sections, functionality-preserving transforms) and re-extract features, rather than editing the feature vector. Pierazzi et al. formalized this gap in Intriguing Properties of Adversarial ML Attacks in the Problem Space (IEEE S&P 2020). Unconstrained feature-space numbers can understate robustness (they count perturbations the attacker can never realize) or overstate it (a small norm budget ignores realizable transformations that move features a long way), so state which space you measured in.
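The perturb-then-re-extract loop looks roughly like this for a URL classifier. Everything here is a toy assumption: the three features, the `pad_query` transform, and the premise that appending a query parameter leaves the destination unchanged, which you would need to verify against real targets. The address is from a documentation-only range.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

def url_features(url):
    """Toy feature extractor: length, digit ratio, character entropy."""
    counts = Counter(url)
    n = len(url)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    digits = sum(ch.isdigit() for ch in url)
    return {"length": n, "digit_ratio": digits / n, "entropy": entropy}

def pad_query(url, padding):
    """Problem-space transform: append a benign-looking query parameter.

    Assumed to preserve where the URL resolves; check that for your targets.
    """
    sep = "&" if urlsplit(url).query else "?"
    return f"{url}{sep}ref={padding}"

original = "http://198.51.100.7/login.php"
perturbed = pad_query(original, "newsletter")

# Re-extract features from the modified artifact instead of editing the vector.
before, after = url_features(original), url_features(perturbed)
for name in before:
    print(f"{name:12s} {before[name]:.3f} -> {after[name]:.3f}")
```

The transform changes all three features at once, which is the point: the attacker moves through feature space only along directions that a valid artifact allows, not along whichever axis the gradient prefers.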

Don’t Stop at Evasion: Poisoning and Drift

Evasion is the loudest failure mode, but two others belong in any honest robustness evaluation of a security model:

Poisoning and backdoors (MITRE ATLAS AML.T0020, AML.T0018): if your model retrains on analyst-labeled or feedback-loop data, an attacker who can influence that data can plant a backdoor trigger. ART implements backdoor attacks for red-team testing and defenses such as activation clustering and spectral signatures to detect poisoned samples. Test the retraining pipeline, not just the deployed weights.
Distribution shift over time: malware, phishing kits, and C2 frameworks change. A model evaluated on a random train/test split looks far better than it performs on next month’s samples, because the random split leaks future-distribution information into training. Split by time and evaluate on a later period. The TESSERACT work (USENIX Security 2019) shows how temporal and spatial bias inflates malware-classifier results, and gives a metric, AUT (Area Under Time), for measuring performance decay honestly.
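A time-based split takes only a few lines. The sketch below assumes each sample carries a first-seen date; the function name, dates, and cutoff are invented for illustration, and it covers only the time-ordering constraint, not AUT or the other TESSERACT constraints.

```python
import numpy as np

def temporal_split(timestamps, cutoff):
    """Return (train_idx, test_idx): train strictly before cutoff, test on or after."""
    timestamps = np.asarray(timestamps, dtype="datetime64[D]")
    cutoff = np.datetime64(cutoff, "D")
    train_idx = np.flatnonzero(timestamps < cutoff)
    test_idx = np.flatnonzero(timestamps >= cutoff)
    return train_idx, test_idx

# Hypothetical first-seen dates for five samples.
first_seen = ["2023-01-10", "2023-02-03", "2023-03-21", "2023-04-02", "2023-05-15"]
train_idx, test_idx = temporal_split(first_seen, "2023-04-01")

print("train on:", train_idx.tolist())
print("test on: ", test_idx.tolist())
```

Sliding the cutoff forward month by month and re-scoring the same trained model gives the decay curve that a single random split hides.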

A robustness report that covers only clean accuracy is marketing. One that states the attack used, the perturbation budget, the resulting robust accuracy and attack success rate, the evaluation space, and the temporal split is something you can make a security decision on.

This kind of evaluation is squarely where data science and adversarial thinking meet, which is the gap GTK Cyber’s applied data science and AI training and AI red-teaming course are built to close: teaching security practitioners to attack and measure the models their organizations depend on, not just to read their accuracy scores.
