A malware classifier that scores 99% on a held-out test set tells you one thing: the model works when nobody is trying to defeat it. For a security model, that is the uninteresting case. The attacker picks the input. The right question is not “how accurate is it on the test set,” but “how much accuracy survives an adversary who is searching for the input that breaks it.”
That number is almost always lower, and for a lot of deployed detection models it is close to zero. Evaluating robustness means measuring performance under adversarial perturbation and under distribution shift, then reporting it honestly. Here is how to do it.
Robustness Is Performance Under an Adversary, Not on a Holdout
Standard evaluation samples the same distribution the model trained on. A robustness evaluation assumes an adversary who optimizes the input against your model. The two metrics that matter:
- Robust accuracy: accuracy on adversarial examples constrained to a stated perturbation budget (for example, L-infinity epsilon = 0.03 on normalized features). Report the budget. Robust accuracy with no epsilon is meaningless.
- Attack success rate: of the inputs the model originally classified correctly, the fraction an attacker can flip within that budget. This is the operational number: it is the share of detections an evader defeats.
These map directly onto a threat model. MITRE ATLAS tracks evading a deployed model as AML.T0015 with adversarial input crafted in AML.T0043. NIST AI 100-2 gives the taxonomy and vocabulary. Decide which of those your model has to withstand before you run anything.
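The two metrics are tied together, and a small helper makes the relationship explicit. This is a sketch, not part of any library: `robustness_metrics` and its argument names are hypothetical, and it assumes integer class labels and that `adv_pred` holds the model's predictions on the attacked versions of the same inputs.

```python
import numpy as np

def robustness_metrics(y_true, clean_pred, adv_pred):
    """Clean accuracy, robust accuracy, and attack success rate (hypothetical helper)."""
    y_true, clean_pred, adv_pred = map(np.asarray, (y_true, clean_pred, adv_pred))
    clean_correct = clean_pred == y_true
    # An input counts as robust only if it was right before AND after the attack.
    still_correct = clean_correct & (adv_pred == y_true)
    n_correct = clean_correct.sum()
    # Success rate is measured over the inputs that were correct to begin with.
    asr = 1.0 - still_correct.sum() / n_correct if n_correct else float("nan")
    return {
        "clean_accuracy": float(clean_correct.mean()),
        "robust_accuracy": float(still_correct.mean()),
        "attack_success_rate": float(asr),
    }

m = robustness_metrics(
    y_true=[1, 1, 1, 0, 0],
    clean_pred=[1, 1, 0, 0, 0],
    adv_pred=[0, 1, 0, 0, 1],
)
print(m)
```

Measured this way, robust accuracy equals clean accuracy times (1 - attack success rate), so reporting any two of the three numbers pins down the third.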
Pick the Right Attack for Your Model
The attack you can run depends on what access you assume and whether the model is differentiable.
- White-box, differentiable (neural nets): gradient attacks. Fast Gradient Sign Method (FGSM) for a one-step baseline, Projected Gradient Descent (PGD) for an iterative one. For a defensible robust-accuracy number, use AutoAttack, a parameter-free ensemble (APGD-CE, APGD-DLR, FAB, Square) built specifically to avoid the weak-attack inflation discussed below.
- Black-box, query access only (random forests, gradient-boosted trees, anything behind an API): query-based attacks. HopSkipJump and the Boundary attack are decision-based and need only the predicted label; ZOO is score-based and needs the model's confidence scores to estimate gradients. All three infer a direction toward the decision boundary from query responses. This is the realistic setting for most production security models, where an attacker hits an inference endpoint and never sees gradients.
Most security classifiers in the field are tree ensembles on tabular features, which are not differentiable, so the black-box path is usually the honest one.
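The difference between the two access models is easier to see on a toy model. The sketch below assumes a hand-written logistic-regression scorer (`w`, `b`, `predict`, `pgd_linf`, and `boundary_bisect` are all hypothetical names for illustration, not ART APIs). The white-box function uses the analytic gradient; the label-only function is not HopSkipJump itself, just the bisection primitive that decision-based attacks build on, and it only ever calls `predict`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b):
    """Hard labels only: this is all a query-only attacker observes."""
    return (X @ w + b > 0).astype(int)

def pgd_linf(X, y, w, b, eps, steps=20):
    """White-box PGD: ascend the cross-entropy loss, then project into the eps-ball."""
    alpha = eps / 4  # step size; an assumption, tune per model
    X_adv = X.copy()
    for _ in range(steps):
        # For logistic regression, d(loss)/dx = (p - y) * w.
        grad = (sigmoid(X_adv @ w + b) - y)[:, None] * w[None, :]
        X_adv = X_adv + alpha * np.sign(grad)
        X_adv = np.clip(X_adv, X - eps, X + eps)  # L-infinity projection
        X_adv = np.clip(X_adv, 0.0, 1.0)          # stay in the valid feature range
    return X_adv

def boundary_bisect(x, x_other, w, b, tol=1e-4):
    """Label-only: bisect the segment from x to a point of a different class."""
    y0 = predict(x[None], w, b)[0]
    assert predict(x_other[None], w, b)[0] != y0, "x_other must have a different label"
    lo, hi = 0.0, 1.0  # lo keeps the original label, hi has the flipped label
    while hi - lo > tol:
        mid = (lo + hi) / 2
        x_mid = (1 - mid) * x + mid * x_other
        if predict(x_mid[None], w, b)[0] == y0:
            lo = mid
        else:
            hi = mid
    return (1 - hi) * x + hi * x_other  # adversarial point just past the boundary
```

Both routes end at an input on the wrong side of the boundary; what changes is the information the attacker needed to get there, which is exactly what the threat model has to state.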
A Minimal Evasion Evaluation with ART
The Adversarial Robustness Toolbox (ART) wraps your trained model and runs the attacks. Here is a complete evasion evaluation against a scikit-learn classifier using the decision-based HopSkipJump attack, which needs only predict():
import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import HopSkipJump
# clf: an already-trained sklearn classifier (e.g. RandomForestClassifier)
# X_test, y_test: held-out evaluation data, features scaled to [0, 1]
# clip_values keeps adversarial features inside the valid [0, 1] range
classifier = SklearnClassifier(model=clf, clip_values=(0.0, 1.0))
clean_pred = clf.predict(X_test)
clean_acc = (clean_pred == y_test).mean()
# only attack inputs the model currently gets right
correct = clean_pred == y_test
X_eval, y_eval = X_test[correct], y_test[correct]
attack = HopSkipJump(classifier=classifier, targeted=False,
                     norm=np.inf, max_iter=50, max_eval=1000)
X_adv = attack.generate(x=X_eval)
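adv_pred = clf.predict(X_adv)
flipped = adv_pred != y_eval
success_rate = flipped.mean()  # fraction of originally correct inputs flipped
# robust accuracy is over the whole test set: inputs already misclassified stay wrong
robust_acc = (~flipped).sum() / len(y_test)
linf = np.abs(X_adv - X_eval).max(axis=1)  # distance the attack actually needed
# summarize the distance over successful attacks only
median_linf = np.median(linf[flipped]) if flipped.any() else float("nan")
print(f"clean accuracy: {clean_acc:.3f}")
print(f"robust accuracy: {robust_acc:.3f}")
print(f"attack success: {success_rate:.3f} at median Linf {median_linf:.3f}")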
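HopSkipJump is a minimum-norm attack: it takes no epsilon and instead searches for the closest adversarial example it can find. To get robust accuracy at a given budget, count an input as broken only if its adversarial example lies within epsilon, then sweep epsilon and plot robust accuracy against it. A useful model degrades gracefully. A model whose robust accuracy collapses to near zero at a perturbation an attacker can trivially produce is not providing the protection its test accuracy implies.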
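With a minimum-norm attack, one run is enough to draw the whole curve, because each sample's adversarial distance can be thresholded afterwards. A minimal sketch, assuming per-sample arrays over the full test set (`robust_accuracy_curve` and its argument names are hypothetical):

```python
import numpy as np

def robust_accuracy_curve(clean_correct, flipped, dist, epsilons):
    """Robust accuracy at each budget, from one minimum-norm attack run.

    clean_correct: model was right on the unperturbed input
    flipped:       the attack found an input with a different predicted label
    dist:          distance between the original and the adversarial input
    """
    clean_correct = np.asarray(clean_correct, dtype=bool)
    flipped = np.asarray(flipped, dtype=bool)
    dist = np.asarray(dist, dtype=float)
    curve = []
    for eps in epsilons:
        # Broken at this budget: correct originally, flipped, and within eps.
        broken = clean_correct & flipped & (dist <= eps)
        curve.append((clean_correct & ~broken).mean())
    return np.array(curve)

eps_grid = [0.0, 0.02, 0.1, 0.5]
curve = robust_accuracy_curve(
    clean_correct=[True, True, True, True, False],
    flipped=[True, True, True, False, False],
    dist=[0.01, 0.05, 0.2, 0.0, 0.0],
    epsilons=eps_grid,
)
print(dict(zip(eps_grid, curve.round(2))))
```

Because the attack only approximates the true minimum distance, each point on this curve is an upper bound on robust accuracy: a stronger attack can only push it down.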
Report Robust Accuracy, Not the Strongest Attack You Could Beat
The most common evaluation failure is picking a weak attack and reporting the survivable number. Athalye, Carlini, and Wagner showed in Obfuscated Gradients Give a False Sense of Security (ICML 2018) that several published defenses were not robust at all; their evaluations simply used attacks that could not find the adversarial examples that existed. Gradient masking produces the same illusion: a defense that yields uninformative gradients stalls gradient-based attacks without removing the adversarial examples.
Three habits prevent this:
- Use a strong ensemble (AutoAttack) for the headline robust-accuracy number, not a single FGSM pass.
- Add a black-box decision-based attack as an independent check. It does not use gradients, so it cannot be fooled by gradient masking. If the black-box attack succeeds where the white-box attack failed, your white-box evaluation was broken.
- Sanity-check the curve: attack success should approach 100% as the budget grows. If raising epsilon does not raise success, suspect the evaluation.
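The last two habits are mechanical enough to automate. The checks below are a sketch under stated assumptions: the function names, the 0.95 saturation threshold, and the 0.05 margin are all arbitrary illustrative choices, not values from any paper or library.

```python
def check_attack_curve(epsilons, success_rates, saturation=0.95, tol=1e-9):
    """Return warnings if attack success does not behave sanely as the budget grows."""
    warnings = []
    pairs = sorted(zip(epsilons, success_rates))
    for (e0, s0), (e1, s1) in zip(pairs, pairs[1:]):
        if s1 < s0 - tol:
            warnings.append(
                f"success fell from {s0:.2f} to {s1:.2f} as eps grew {e0} -> {e1}"
            )
    if pairs and pairs[-1][1] < saturation:
        warnings.append(
            f"success is only {pairs[-1][1]:.2f} at the largest budget {pairs[-1][0]}"
        )
    return warnings

def gradient_masking_suspected(white_box_success, black_box_success, margin=0.05):
    """A gradient-free attack beating the gradient attack suggests masked gradients."""
    return black_box_success > white_box_success + margin

print(check_attack_curve([0.01, 0.1, 1.0], [0.30, 0.20, 0.25]))
print(gradient_masking_suspected(white_box_success=0.10, black_box_success=0.60))
```

A warning from either check says the evaluation needs a stronger or better-configured attack; it says nothing about the model being robust.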
For image models there is a shortcut to a credible number: RobustBench publishes standardized AutoAttack leaderboards and pretrained robust models you can compare against.
Constraints: Feature-Space Wins That Don’t Survive Contact
A feature-space attack perturbs the numeric vector directly. That measures the decision boundary cleanly, but the perturbed vector may not correspond to any artifact an attacker can actually build. You cannot flip arbitrary bytes in a PE file and still have it parse and execute; you cannot edit a URL’s entropy without changing the URL.
If your threat model is real malware or real network traffic, evaluate in the problem space: perturb the artifact under domain constraints (append-only sections, functionality-preserving transforms) and re-extract features, rather than editing the feature vector. Pierazzi et al. formalized this gap in Intriguing Properties of Adversarial ML Attacks in the Problem Space (IEEE S&P 2020). Unconstrained feature-space numbers can understate robustness (the attack uses perturbations the attacker can never realize) and overstate it (a small feature-space budget excludes realizable transformations that move the features much further), so state which space you measured in.
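The URL case shows the perturb-then-re-extract loop in a few lines. This is a toy sketch: the feature set, the `append_benign_param` transform, and the example URL are all made up for illustration, and the transform assumes a URL without a fragment.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

def shannon_entropy(s):
    """Character-level Shannon entropy in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def extract_features(url):
    """Hypothetical feature extractor for a URL classifier."""
    parts = urlsplit(url)
    return {
        "length": len(url),
        "entropy": shannon_entropy(url),
        "host_digits": sum(ch.isdigit() for ch in parts.netloc),
        "num_params": len(parts.query.split("&")) if parts.query else 0,
    }

def append_benign_param(url, key="ref", value="newsletter"):
    """Problem-space transform: add a query parameter, leaving host and path intact."""
    sep = "&" if urlsplit(url).query else "?"
    return f"{url}{sep}{key}={value}"

url = "http://login-secure.example.com/verify"
adv_url = append_benign_param(url)
before, after = extract_features(url), extract_features(adv_url)
for name in before:
    print(f"{name:12s} {before[name]:8.3f} -> {after[name]:8.3f}")
```

One realizable edit moves several features at once and leaves others fixed, which is the constraint a feature-space attack ignores when it adjusts entropy alone. The attacked feature vector is whatever the extractor returns for the modified artifact, and that vector is what goes to the classifier.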
Don’t Stop at Evasion: Poisoning and Drift
Evasion is the loudest failure mode, but two others belong in any honest robustness evaluation of a security model:
- Poisoning and backdoors (MITRE ATLAS AML.T0020, AML.T0018): if your model retrains on analyst-labeled or feedback-loop data, an attacker who can influence that data can plant a backdoor trigger. ART implements backdoor attacks for red-team testing and defenses such as activation clustering and spectral signatures to detect poisoned samples. Test the retraining pipeline, not just the deployed weights.
- Distribution shift over time: malware, phishing kits, and C2 frameworks change. A model evaluated on a random train/test split looks far better than it performs on next month’s samples, because the random split leaks future-distribution information into training. Split by time and evaluate on a later period. The TESSERACT work (USENIX Security 2019) shows how temporal and spatial bias inflates malware-classifier results, and gives metrics (AUT) for measuring performance decay honestly.
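For the temporal evaluation, the two pieces are a time-ordered split and a summary of how a metric decays across later test periods. The sketch below assumes per-sample timestamps and a per-period score such as F1; `temporal_split` and `aut` are hypothetical helper names, and `aut` follows the trapezoidal area-under-time definition as I understand it from the TESSERACT paper, so check it against the paper before reporting numbers.

```python
import numpy as np

def temporal_split(timestamps, cutoff):
    """Indices for training (strictly before cutoff) and testing (cutoff onward)."""
    timestamps = np.asarray(timestamps)
    train_idx = np.flatnonzero(timestamps < cutoff)
    test_idx = np.flatnonzero(timestamps >= cutoff)
    return train_idx, test_idx

def aut(scores):
    """Area under time: normalized trapezoidal area of a metric over test periods."""
    scores = np.asarray(scores, dtype=float)
    if scores.size < 2:
        raise ValueError("need scores for at least two test periods")
    return float(((scores[1:] + scores[:-1]) / 2).sum() / (scores.size - 1))

stamps = np.array(
    ["2023-01-15", "2023-03-02", "2023-06-20", "2023-07-01"], dtype="datetime64[D]"
)
train_idx, test_idx = temporal_split(stamps, np.datetime64("2023-06-01"))
print(train_idx, test_idx)

# Illustrative monthly F1 scores for a model that decays after training.
print(aut([0.9, 0.7, 0.5]))
```

A score that never decays gives an AUT of 1.0 times that score, so the gap between the first-period score and the AUT is a compact way to report drift.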
A robustness report that covers only clean accuracy is marketing. One that states the attack used, the perturbation budget, the resulting robust accuracy and attack success rate, the evaluation space, and the temporal split is something you can make a security decision on.
This kind of evaluation is squarely where data science and adversarial thinking meet, which is the gap GTK Cyber’s applied data science and AI training and AI red-teaming course are built to close: teaching security practitioners to attack and measure the models their organizations depend on, not just to read their accuracy scores.