Malware that relies on a hardcoded C2 address dies the moment that address is blocked. Domain Generation Algorithms (T1568.002) solve that for the attacker: the implant and the operator both generate the same large set of pseudo-random domains from a shared seed, the operator registers one, and the implant finds it by trying them. Blocklists cannot keep up with thousands of throwaway domains a day.
You cannot blocklist your way out of this, but you can classify it. A domain like kq3v9z7r1xw8.com does not look like a domain a human registered, and that difference is measurable. This is a textbook supervised-learning problem, and it is one of the cleanest demonstrations of machine learning for security.
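To make the shared-seed idea concrete, here is a minimal sketch of a toy generator. It is not any real malware family's algorithm: the function name `toy_dga`, the SHA-256 construction, and the seed string are all assumptions chosen for illustration. It shows only why both sides can derive the same list independently, and it can double as a source of synthetic samples for experimenting with lexical features.

```python
import hashlib
from datetime import date

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def toy_dga(seed, day, count=5, length=12, tld=".com"):
    """Derive `count` pseudo-random domains from a shared seed and a date.

    Illustrative only: real families use their own algorithms.
    `length` must be at most 32, the size of a SHA-256 digest.
    """
    domains = []
    for i in range(count):
        # Both sides hash the same inputs, so they get the same bytes.
        digest = hashlib.sha256(f"{seed}|{day.isoformat()}|{i}".encode()).digest()
        # Map each byte onto the label alphabet (slightly biased, fine for a toy).
        label = "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])
        domains.append(label + tld)
    return domains

# Implant and operator run the same call and get the same list.
implant_view = toy_dga("example-seed", date(2024, 1, 1))
operator_view = toy_dga("example-seed", date(2024, 1, 1))
print(implant_view == operator_view)  # True: the function is deterministic
```

Changing the date or the seed changes every domain in the list, which is what makes a static blocklist stale by the next day.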
What Makes a DGA Domain Look Different
Human-chosen domains are pronounceable and reuse common letter patterns. Algorithmically generated ones tend to have high character entropy, odd consonant-to-vowel ratios, and digit patterns that real brands avoid. Those properties survive across most DGA families, which is why a model trained on lexical features generalizes.
You also get a behavioral tell for free. Because only a few generated domains are ever registered, an infected host produces a burst of failed lookups, which show up as NXDOMAIN responses in DNS logs.
Engineering Lexical Features
Extract features from the domain string itself. There are no external lookups, so the only cost is in-memory string processing:
import math
import pandas as pd
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(s):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def features(domain):
    # Takes the first label, so pass a bare registrable domain ("example.com").
    # A name with a subdomain ("www.example.com") would yield "www" instead.
    label = domain.split(".")[0].lower()
    n = len(label) or 1  # avoid division by zero on an empty label
    longest = run = 0
    for ch in label:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # vowels, digits, and hyphens all break a consonant run
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(c.isdigit() for c in label) / n,
        "vowel_ratio": sum(c in VOWELS for c in label) / n,
        "longest_consonant_run": longest,
    }
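Entropy and the longest consonant run do most of the work. google has an entropy of about 1.9 and a consonant run of 2; kq3v9z7r1xw8 has entropy near 3.6, no vowels at all, and a digit ratio above 0.4. Its consonant run is also only 2, because the interleaved digits reset the count, so that feature earns its keep on all-letter DGA output rather than on this sample.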
Training the Classifier
Label a corpus and train. The standard approach uses a top-domains list (Tranco or Cisco Umbrella) as the benign class and a DGA feed (DGArchive or generated samples from known algorithms) as the malicious class. A RandomForest handles the nonlinear feature interactions without much tuning:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# domains: list of domain strings; labels: parallel list of class labels
# (for example 1 = DGA, 0 = benign), built from the corpora described above
feat = pd.DataFrame([features(d) for d in domains])
X_train, X_test, y_train, y_test = train_test_split(
    feat, labels, test_size=0.2, random_state=42, stratify=labels
)
clf = RandomForestClassifier(n_estimators=200, max_depth=12,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
With these five features you can expect accuracy in the mid-90s on arithmetic DGAs. Adding character bigram frequencies scored against an English corpus pushes it higher.
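One way the bigram idea could look in code is sketched below. This is an assumption-laden illustration, not a tuned model: `TINY_CORPUS` is a deliberately small stand-in word list, and `train_bigram_model` and `bigram_score` are hypothetical helper names. The score is the mean log-probability of a label's character bigrams under add-one smoothing, so higher (less negative) values mean the label looks more like the corpus.

```python
import math
from collections import Counter

# Characters allowed in a DNS label; used only to size the smoothing term.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

# Stand-in corpus for illustration. In practice use a large English word list.
TINY_CORPUS = [
    "the", "and", "that", "have", "with", "this", "from", "they", "would",
    "there", "their", "what", "about", "which", "when", "make", "like",
    "time", "just", "know", "people", "into", "year", "good", "some",
    "could", "them", "other", "than", "then", "look", "only", "come",
    "over", "think", "also", "back", "after", "work", "first",
]

def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]

def train_bigram_model(words):
    """Count character bigrams across the corpus."""
    counts = Counter(bg for w in words for bg in bigrams(w.lower()))
    total = sum(counts.values())
    vocab = len(ALPHABET) ** 2  # number of possible bigrams
    return counts, total, vocab

def bigram_score(label, model):
    """Mean log-probability of the label's bigrams, add-one smoothed."""
    counts, total, vocab = model
    floor = math.log(1 / (total + vocab))  # score of a never-seen bigram
    grams = bigrams(label.lower())
    if not grams:
        return floor  # labels shorter than 2 chars get the most suspicious score
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)

model = train_bigram_model(TINY_CORPUS)
for name in ("google", "kq3v9z7r1xw8"):
    print(name, round(bigram_score(name, model), 3))
```

The returned value can be added as a sixth column next to the five lexical features. With a corpus this small the absolute numbers mean little; only the ordering between word-like and random-looking labels is informative.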
Be honest about the limit: dictionary-based DGAs like suppobox, which stitch real words together (shippingfuture.net), defeat lexical features because the output is pronounceable. Catching those needs word-list and n-gram modeling, and even then it is hard. Say so rather than claiming the model catches everything.
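As a rough sketch of what a word-list feature might look like, the function below tries to split a label into the fewest words from a dictionary. The name `segment` and the four-word vocabulary are assumptions for illustration; a real word list would have tens of thousands of entries.

```python
def segment(label, vocab):
    """Split label into the fewest vocab words, or return None if impossible.

    Dynamic programming over prefixes: best[i] holds the shortest word list
    that exactly covers label[:i].
    """
    n = len(label)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = label[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Hypothetical miniature word list.
vocab = {"ship", "ping", "shipping", "future"}
print(segment("shippingfuture", vocab))  # ['shipping', 'future']
print(segment("kq3v9z7r1xw8", vocab))    # None
```

A clean split into a few words is a feature, not a verdict: plenty of legitimate domains are also concatenated dictionary words, which is exactly why this family of DGAs is hard to catch.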
Operationalizing Without Labels
In production you rarely have labels for live traffic. Combine the classifier score with the behavioral signal. Pull DNS logs, isolate the failed lookups, and find hosts generating many distinct NXDOMAIN queries:
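# load_zeek is not defined here: substitute your own loader that returns a
# DataFrame with at least the Zeek dns.log fields id.orig_h, query, rcode_name
dns = load_zeek("dns.log")
nx = dns[dns["rcode_name"] == "NXDOMAIN"]
# Count distinct failed names per source host
burst = (nx.groupby("id.orig_h")["query"]
           .nunique().sort_values(ascending=False))
suspects = burst[burst > 50]  # tune to your environment baseline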
A workstation throwing hundreds of distinct NXDOMAIN lookups in a short window is behaving like a host hunting for its C2 rendezvous. Run the classifier over those failed domains: a host that is both generating an NXDOMAIN burst and querying high-entropy names is a strong detection, and the two signals together cut the false positives that either produces alone (some CDNs and telemetry endpoints use random-looking names, but they resolve).
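The combination of the two signals, including the short time window, could be sketched as follows. Everything here is an assumption for illustration: the record layout, the function names, the 300-second window, and the thresholds. The `high_entropy` check is only a stand-in for the trained classifier; pass your own callable as `looks_generated` to use real model scores. The sketch uses plain Python so it runs without any log loader.

```python
import math
from collections import Counter, defaultdict

def entropy(s):
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values()) if n else 0.0

def high_entropy(query, threshold=3.0):
    """Stand-in for the classifier: flags first labels with high character entropy."""
    return entropy(query.split(".")[0].lower()) >= threshold

def suspicious_hosts(records, window_s=300, min_distinct=50,
                     min_fraction=0.5, looks_generated=high_entropy):
    """Return {host: fraction} for hosts with an NXDOMAIN burst of generated-looking names.

    records: iterable of (ts_seconds, host, query, rcode_name) tuples.
    """
    # (host, window index) -> distinct failed queries seen in that window
    windows = defaultdict(set)
    for ts, host, query, rcode in records:
        if rcode == "NXDOMAIN":
            windows[(host, int(ts // window_s))].add(query.lower())

    flagged = {}
    for (host, _), queries in windows.items():
        if len(queries) < min_distinct:
            continue  # behavioral signal absent: no burst in this window
        fraction = sum(looks_generated(q) for q in queries) / len(queries)
        if fraction >= min_fraction:
            # Keep the worst window per host
            flagged[host] = max(flagged.get(host, 0.0), fraction)
    return flagged
```

Fixed windows are the simplest choice but will split a burst that straddles a boundary; a sliding window avoids that at the cost of more bookkeeping.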
Classify the Pattern, Not the Domain
Domains are disposable, so an indicator feed of known-bad domains is always behind. A model that scores the string and a query that counts failed lookups both keep working against domains nobody has seen yet. That is the recurring theme of hunting with data: catch the generative behavior, not yesterday’s indicators.
This is the kind of applied ML we teach in GTK Cyber’s Threat Hunting with Data Science course, where students build classifiers like this on real security data. The T1568.002 reference page has the ATT&CK detail, and the beaconing detection post covers the C2 channel these domains are used to reach.