A SOC does not have an alert problem. It has a false-positive problem. The detections fire, the queue fills, and analysts spend their day closing tickets that were never going to be incidents. Tuning rules helps, but rules are blunt: tighten one and you lose coverage, loosen it and the noise comes back.
Machine learning fits here, but not the way most vendors pitch it. You are not replacing your detection rules. You are adding a layer that ranks what they produce, so the alert most likely to be real sits at the top of the queue and the obvious noise sorts itself to the bottom. Framed correctly, this is a supervised learning problem with training data you already have.
The Data You Already Have
Every alert an analyst closed is a label. A ticket closed as a false positive is a negative example. An escalated or confirmed incident is a positive. Your SIEM, SOAR, or ticketing system has been generating this dataset for years.
The first task is extracting it. Pull historical alerts with their dispositions and build a feature table. Useful features are mostly metadata, not packet contents:
- Rule identity: which detection fired, and that rule’s historical false-positive rate
- Asset context: criticality of the destination host, whether the account is privileged
- Temporal: hour of day, day of week, time since last alert on this entity
- Correlation: count of related alerts on the same host or user in the last hour
- Reputation: source IP ASN, whether the domain is newly registered
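A minimal sketch of how such a table might be assembled with pandas. The column names (`ts`, `host`, `disposition`), the disposition values, and the five sample rows are hypothetical stand-ins for whatever your ticketing export contains. The rule's false-positive rate and the related-alert count are computed from earlier alerts only, so no row sees its own label or the future. The neutral prior of 0.5 for a rule with no history is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical export: one row per historical alert with its analyst disposition.
alerts = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-03-01 09:00", "2024-03-01 09:20", "2024-03-01 09:50",
        "2024-03-01 11:30", "2024-03-02 02:10",
    ]),
    "rule_name": ["brute_force", "brute_force", "new_admin",
                  "brute_force", "new_admin"],
    "host": ["web01", "web01", "web01", "db02", "db02"],
    "disposition": ["false_positive", "false_positive", "escalated",
                    "false_positive", "escalated"],
}).sort_values("ts").reset_index(drop=True)

# Label: 1 = escalated/confirmed, 0 = closed as false positive
alerts["label"] = (alerts["disposition"] == "escalated").astype(int)
alerts["hour"] = alerts["ts"].dt.hour

# Rule FP rate over *earlier* alerts of the same rule (excludes the current row)
is_fp = 1 - alerts["label"]
prior_fp = is_fp.groupby(alerts["rule_name"]).cumsum() - is_fp
prior_n = alerts.groupby("rule_name").cumcount()
alerts["rule_historical_fp_rate"] = (prior_fp / prior_n).fillna(0.5)  # assumed prior

# Earlier alerts on the same host in the preceding hour.
# Quadratic per host: fine for a sketch, too slow for a large export.
alerts["related_alerts_1h"] = 0
for idx in alerts.groupby("host").groups.values():
    ts = alerts.loc[idx, "ts"]
    for i, t in ts.items():
        in_window = (ts < t) & (ts >= t - pd.Timedelta(hours=1))
        alerts.loc[i, "related_alerts_1h"] = int(in_window.sum())

print(alerts[["rule_name", "hour", "rule_historical_fp_rate",
              "related_alerts_1h", "label"]])
```

Computing the rule-level rate over the whole history would be simpler, but it would fold each alert's own disposition into its features, which tends to inflate offline scores.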
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# label: 1 = true positive (escalated/confirmed), 0 = false positive (closed benign)
cat = ['rule_name', 'asset_criticality', 'src_asn']
num = ['hour', 'related_alerts_1h', 'rule_historical_fp_rate', 'account_is_priv']

# One-hot encode categoricals (categories unseen in training encode as all zeros)
# and scale the numeric columns
pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat),
    ('num', StandardScaler(), num),
])

clf = Pipeline([
    ('pre', pre),
    ('model', GradientBoostingClassifier(n_estimators=300, max_depth=3,
                                         learning_rate=0.05)),
])

# X_train: DataFrame holding the columns listed above; y_train: the matching labels
clf.fit(X_train, y_train)
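The fit call assumes `X_train` and `y_train` already exist. One way to produce them, sketched here as an assumption rather than the only option, is a chronological split: train on the oldest alerts and test on the newest, which mirrors how the model will be used. The helper name `chronological_split`, the `ts` and `label` columns, and the sample table are hypothetical.

```python
import pandas as pd

def chronological_split(df, feature_cols, ts_col="ts", label_col="label",
                        test_frac=0.2):
    """Train on the oldest alerts, hold out the newest test_frac for testing."""
    df = df.sort_values(ts_col)
    n_test = max(1, round(len(df) * test_frac))
    train, test = df.iloc[:-n_test], df.iloc[-n_test:]
    return (train[feature_cols], test[feature_cols],
            train[label_col], test[label_col])

# Tiny hypothetical history, shuffled to show that time order is restored
history = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=10, freq="D"),
    "hour": [9, 14, 2, 23, 11, 8, 16, 3, 12, 20],
    "label": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
}).sample(frac=1.0, random_state=0)

X_train, X_test, y_train, y_test = chronological_split(history, ["hour"])
print(len(X_train), len(X_test))
```

A random split would let the model train on alerts that arrived after the ones it is tested on, so the test score can look better than what the queue will see in production.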
Optimize for the Right Thing
Accuracy is the wrong metric. If 95% of alerts are false positives, a model that calls everything a false positive is 95% accurate and operationally useless, because it closes real attacks. The metric that matters is recall on true positives: of the alerts that were real, how many did the model keep?
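The arithmetic is easy to check. The sketch below uses a made-up queue of 1,000 alerts with the 95% false-positive share from the example and scores a model that calls everything benign. The helper functions are written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of alerts where the prediction matches the disposition."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of real incidents (label 1) that the model kept."""
    kept = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    real = sum(y_true)
    return kept / real if real else 0.0

# Hypothetical queue: 50 real incidents, 950 false positives
y_true = [1] * 50 + [0] * 950
always_benign = [0] * 1000

print(accuracy(y_true, always_benign))  # 0.95
print(recall(y_true, always_benign))    # 0.0
```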
Set the decision threshold deliberately. The default 0.5 cutoff is arbitrary. Use the precision-recall curve to find the probability threshold that holds recall at the level you can defend, then accept whatever precision that buys you:
from sklearn.metrics import precision_recall_curve
import numpy as np

probs = clf.predict_proba(X_test)[:, 1]
# precision and recall have one more entry than thresholds; [:-1] aligns them
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# Highest threshold that still keeps 99% of true positives
target_recall = 0.99
ok = recall[:-1] >= target_recall
best = np.argmax(thresholds[ok])
chosen = thresholds[ok][best]
print(f"threshold={chosen:.3f}, precision={precision[:-1][ok][best]:.3f}")
In practice you do not auto-close everything that falls below the threshold. You rank the queue by probability and auto-close only the lowest-risk band, with every auto-closure logged and a weekly sample audited. The model reorders work; analyst judgment still owns the high-confidence detections.
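A sketch of that policy, under assumptions: the function name `triage`, the `auto_close_below` cutoff, and the `audit_rate` are all hypothetical, and the cutoff is meant to sit well below the recall-preserving threshold rather than at it.

```python
import random

def triage(alert_ids, probs, auto_close_below, audit_rate=0.1, seed=0):
    """Rank alerts by model probability and auto-close only the lowest band.

    Returns (queue, auto_closed, audit_sample) as lists of (alert_id, prob).
    """
    ranked = sorted(zip(alert_ids, probs), key=lambda pair: pair[1], reverse=True)
    queue = [pair for pair in ranked if pair[1] >= auto_close_below]
    auto_closed = [pair for pair in ranked if pair[1] < auto_close_below]

    # Audit at least one auto-closed alert whenever any were closed
    n_audit = min(len(auto_closed), max(1, round(len(auto_closed) * audit_rate)))
    audit_sample = random.Random(seed).sample(auto_closed, n_audit)
    return queue, auto_closed, audit_sample

queue, auto_closed, audit_sample = triage(
    ["a1", "a2", "a3", "a4"], [0.90, 0.02, 0.40, 0.01], auto_close_below=0.05
)
print(queue)        # [('a1', 0.9), ('a3', 0.4)]
print(auto_closed)  # [('a2', 0.02), ('a4', 0.01)]
```

The `auto_closed` list is what gets logged, and `audit_sample` is what an analyst reviews. Anything the audit overturns is a fresh positive label for the next retrain.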
Collapse the Storms First
Before any classifier, the fastest win is deduplication, and it needs no labels. A vulnerability scanner tripping one rule across 400 hosts is one event presented as 400 tickets. Cluster the alert stream and the storm collapses into a single incident to review.
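from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Describe each alert by rule name plus command line, then group near-duplicates
text = (alerts['rule_name'] + ' ' + alerts['process_cmdline'].fillna('')).str.lower()
X = TfidfVectorizer(min_df=2).fit_transform(text)
# DBSCAN labels unclustered alerts -1: review those individually, not as one incident
alerts['incident'] = DBSCAN(eps=0.4, min_samples=3, metric='cosine').fit_predict(X)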
Clustering does not judge true versus false positive. It removes the duplicate volume that drives most of the fatigue, which is why it is worth doing even before you train anything.
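Turning cluster labels into a review queue takes one more step, sketched below. DBSCAN assigns the label -1 to points it treats as noise, so those alerts should stay individual items rather than be merged into a single bogus incident. The function name `collapse`, the alert IDs, and the label values are hypothetical.

```python
from collections import defaultdict

def collapse(alert_ids, labels):
    """Group alerts by cluster label; keep noise (-1) as single-alert incidents."""
    clusters = defaultdict(list)
    singles = []
    for alert_id, label in zip(alert_ids, labels):
        if label == -1:
            singles.append([alert_id])
        else:
            clusters[label].append(alert_id)
    return list(clusters.values()) + singles

# Eight alerts: two storms of three, plus two unrelated alerts
ids = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]
labels = [0, 0, 0, -1, 1, 1, 1, -1]

incidents = collapse(ids, labels)
print(len(ids), "alerts ->", len(incidents), "items to review")
```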
Don’t Create Blind Spots
The failure mode of automated suppression is silent loss of coverage. A model that learns to down-rank a noisy rule may be down-ranking the one technique an attacker is about to use. Guard against it by mapping every detection you suppress or de-prioritize back to the MITRE ATT&CK technique it covers (T1110 Brute Force, T1059 Command and Scripting Interpreter, and so on). If suppressing a rule means a technique now has no high-priority coverage, that is a decision a human makes, not a side effect of a probability score.
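A coverage check of this kind can be a few lines. In the sketch below, the rule names and the rule-to-technique mapping are hypothetical. A real mapping would come from your own detection catalog.

```python
# Hypothetical mapping from detection rule to the ATT&CK techniques it covers
RULE_TECHNIQUES = {
    "ssh_failed_logins": {"T1110"},
    "vpn_password_spray": {"T1110"},
    "powershell_encoded_cmd": {"T1059"},
}

def uncovered_techniques(rule_techniques, suppressed):
    """Techniques left with no active rule once `suppressed` rules are down-ranked."""
    all_techniques, still_covered = set(), set()
    for rule, techniques in rule_techniques.items():
        all_techniques |= techniques
        if rule not in suppressed:
            still_covered |= techniques
    return all_techniques - still_covered

# Suppressing one of two brute-force rules leaves T1110 covered
print(uncovered_techniques(RULE_TECHNIQUES, {"ssh_failed_logins"}))

# Suppressing the only T1059 rule opens a gap that needs human sign-off
print(uncovered_techniques(RULE_TECHNIQUES, {"powershell_encoded_cmd"}))
```

Running the check whenever the auto-close band or the suppression list changes turns a silent gap into an explicit review item.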
Models also drift. New rules, new infrastructure, and new normal behavior shift the input distribution, and precision quietly degrades. Monitor precision and recall on a rolling window of fresh dispositions, watch feature distributions for drift, and retrain on a schedule. Because every new analyst disposition is a new label, the feedback loop that produced your training data keeps producing it.
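Both checks can be sketched in plain Python. The population stability index (PSI) used here is one common drift measure, swapped in as an assumption since no specific measure is prescribed. The bin edges, the sample values, and the function names are hypothetical, and the alerting level is yours to choose.

```python
import math

def precision_recall(y_true, y_pred):
    """Precision and recall on one window of fresh analyst dispositions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def psi(baseline, recent, edges, eps=1e-6):
    """Population stability index of one numeric feature over fixed bin edges.

    0.0 means identical bin shares; larger values mean a larger shift.
    Both samples are assumed to be non-empty and `edges` sorted ascending.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        return [max(c / len(values), eps) for c in counts]

    base, new = shares(baseline), shares(recent)
    return sum((n - b) * math.log(n / b) for b, n in zip(base, new))

# Hour-of-day feature: training data spread evenly, recent alerts piling up at night
edges = [6, 12, 18]
train_hours = [3] * 25 + [9] * 25 + [15] * 25 + [21] * 25
recent_hours = [3] * 70 + [9] * 10 + [15] * 10 + [21] * 10

print(round(psi(train_hours, train_hours, edges), 3))
print(round(psi(train_hours, recent_hours, edges), 3))
print(precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))
```

Feature drift shows up before new dispositions arrive, while windowed precision and recall confirm whether the shift actually hurt the model.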
Where to Learn This
This is applied data science, not a product you buy. The skills are concrete: building feature tables from alert metadata, choosing thresholds from a precision-recall curve, and validating that a model is not hiding an attack technique behind a low score. They transfer across whatever SIEM and SOAR you run.
GTK Cyber’s applied data science and AI training teaches exactly this workflow hands-on, with labs that build alert-triage and clustering models against realistic SOC data, including the threshold tuning and drift monitoring that separate a useful model from one that quietly closes real incidents.