What lexical features best distinguish DGA domains from legitimate ones?

Shannon entropy of the registrable label and the longest consonant run do most of the work. An algorithmically generated name like kq3v9z7r1xw8 has entropy near 3.6, no vowels, and a digit ratio above 0.4, while google sits around 1.9 entropy with a consonant run of 2. All-letter DGA output adds consonant runs longer than almost any English word; in mixed names the digits interrupt the run, so kq3v9z7r1xw8 itself only reaches 2. Digit ratio, vowel ratio, and label length add signal. These string-only features need no external lookups, so extraction is plain in-memory string processing.

Why use a Random Forest classifier for DGA detection?

A RandomForest handles the nonlinear interactions between lexical features (entropy, consonant runs, digit ratio) without much tuning, and it gives you feature_importances_ to confirm the model keys on structure. With five lexical features you can expect accuracy in the mid-90s on arithmetic DGAs. Use n_estimators=200, max_depth=12, and class_weight='balanced' to handle class imbalance in the training corpus.

Can a classifier detect dictionary-based DGAs like suppobox?

Not reliably with lexical features alone. Dictionary DGAs stitch real words together (shippingfuture.net), so the output is pronounceable and has normal entropy and consonant runs. Catching those requires word-list and n-gram modeling against an English corpus, and even then it is hard. Be honest about this limit rather than claiming the model catches every family.

How do NXDOMAIN bursts help detect DGA activity without labels?

Because only a few generated domains are ever registered, an infected host produces a burst of failed DNS lookups that show up as NXDOMAIN responses. Group DNS logs by source host, count distinct NXDOMAIN queries, and flag hosts above your environment baseline (start around 50 distinct failures in a window). A host that is both generating an NXDOMAIN burst and querying high-entropy names is a strong detection, and combining the two signals cuts the false positives either produces alone.

What datasets can I use to train a DGA domain classifier?

Use a top-domains list such as Tranco or Cisco Umbrella for the benign class, and a DGA feed such as DGArchive or samples generated from known algorithms for the malicious class. Stratify your train/test split on the labels so both classes are represented proportionally in the training and test sets.

Detecting DGA Domains with a Classifier in Python

Malware that relies on a hardcoded C2 address dies the moment that address is blocked. Domain Generation Algorithms (T1568.002) solve that for the attacker: the implant and the operator both generate the same large set of pseudo-random domains from a shared seed, the operator registers one, and the implant finds it by trying them. Blocklists cannot keep up with thousands of throwaway domains a day.
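
To make the shared-seed idea concrete, here is a toy generator. It is a minimal sketch for illustration, not the algorithm of any real malware family: the name toy_dga, the seed string, and the SHA-256 construction are all assumptions chosen for readability.

```python
import hashlib
from datetime import date

def toy_dga(seed, day, count=5, length=12, tld=".com"):
    """Illustrative generator only; not the algorithm of any real malware family."""
    domains = []
    for i in range(count):
        # Hash the shared seed, the date, and a counter to get repeatable bytes.
        digest = hashlib.sha256(f"{seed}:{day.isoformat()}:{i}".encode()).hexdigest()
        # Map each hex digit to a letter a-p so the label is purely alphabetic.
        label = "".join(chr(ord("a") + int(h, 16)) for h in digest[:length])
        domains.append(label + tld)
    return domains

# Both sides run the same function with the same inputs and get the same list.
implant_view = toy_dga("shared-secret", date(2024, 1, 15))
operator_view = toy_dga("shared-secret", date(2024, 1, 15))
next_day = toy_dga("shared-secret", date(2024, 1, 16))

print(implant_view == operator_view)
print(implant_view == next_day)
```

Neither side has to communicate to agree on the list, and the list changes when the date does, which is why yesterday's blocklist entries are already stale.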

You cannot blocklist your way out of this, but you can classify it. A domain like kq3v9z7r1xw8.com does not look like a domain a human registered, and that difference is measurable. This is a textbook supervised-learning problem, and it is one of the cleanest demonstrations of machine learning for security.

What Makes a DGA Domain Look Different

Human-chosen domains are pronounceable and reuse common letter patterns. Algorithmically generated ones tend to have high character entropy, odd consonant-to-vowel ratios, and digit patterns that real brands avoid. Those properties survive across most DGA families, which is why a model trained on lexical features generalizes.

You also get a behavioral tell for free. Because only a few generated domains are ever registered, an infected host produces a burst of failed lookups, which show up as NXDOMAIN responses in DNS logs.

Engineering Lexical Features

Extract features from the domain string itself. No external lookups, so this runs at the speed of pandas:

import math
import pandas as pd
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(s):
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def features(domain):
    # Takes the leftmost label, so pass the registrable domain
    # (google.com, not www.google.com) or strip subdomains first.
    label = domain.split(".")[0].lower()
    n = len(label) or 1
    longest = run = 0
    for ch in label:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(c.isdigit() for c in label) / n,
        "vowel_ratio": sum(c in VOWELS for c in label) / n,
        "longest_consonant_run": longest,
    }

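Entropy and the longest consonant run do most of the work. google has an entropy around 1.9 and a consonant run of 2; kq3v9z7r1xw8 has entropy near 3.6, no vowels, and a digit ratio of about 0.42. Its longest consonant run is also only 2, because the digits interrupt each run; in all-letter DGA output the runs grow longer than almost any English word.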
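
If you want to check entropy values for your own names, the calculation is short. This sketch repeats shannon_entropy so it runs on its own, and adds one thing that is not part of the feature set: the ceiling log2(len(label)), which a label reaches only when every character is distinct. The ratio column is an illustrative extra, not one of the five features.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy of the string s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

for label in ("google", "kq3v9z7r1xw8"):
    h = shannon_entropy(label)
    ceiling = math.log2(len(label))   # reached only when no character repeats
    print(f"{label:14s} entropy={h:.2f}  ceiling={ceiling:.2f}  ratio={h / ceiling:.2f}")
```

Because the ceiling grows with length, raw entropy partly tracks how long the label is, which is one reason to keep length as a separate feature.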

Training the Classifier

Label a corpus and train. The standard approach uses a top-domains list (Tranco or Cisco Umbrella) as the benign class and a DGA feed (DGArchive or generated samples from known algorithms) as the malicious class. A RandomForest handles the nonlinear feature interactions without much tuning:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# domains: list of registrable domain strings; labels: parallel list of class labels
feat = pd.DataFrame([features(d) for d in domains])
X_train, X_test, y_train, y_test = train_test_split(
    feat, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = RandomForestClassifier(n_estimators=200, max_depth=12,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
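
After fitting, it is worth checking which features the forest actually leans on. The sketch below does that with feature_importances_ on synthetic stand-in data, since a real corpus cannot be bundled here: random alphanumeric strings play the DGA class and pairs of common words play the benign class. The word list, the sample sizes, and the feature_row helper are assumptions for illustration, and the resulting numbers say nothing about real traffic.

```python
import math
import random
from collections import Counter

from sklearn.ensemble import RandomForestClassifier

VOWELS = set("aeiou")
FEATURES = ["length", "entropy", "digit_ratio", "vowel_ratio", "longest_consonant_run"]

def feature_row(label):
    """The five lexical features, returned as a list in FEATURES order."""
    n = len(label) or 1
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(label).values())
    longest = run = 0
    for ch in label:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return [len(label), entropy,
            sum(c.isdigit() for c in label) / n,
            sum(c in VOWELS for c in label) / n,
            longest]

# Synthetic stand-in data, NOT a real corpus.
rng = random.Random(42)
ALNUM = "abcdefghijklmnopqrstuvwxyz0123456789"
WORDS = ["cloud", "shop", "news", "mail", "book", "travel", "photo", "music",
         "garden", "market", "home", "web", "data", "health", "sport", "game"]
dga = ["".join(rng.choice(ALNUM) for _ in range(rng.randint(10, 16))) for _ in range(300)]
benign = [rng.choice(WORDS) + rng.choice(WORDS) for _ in range(300)]

X = [feature_row(s) for s in dga + benign]
y = [1] * len(dga) + [0] * len(benign)

clf = RandomForestClassifier(n_estimators=200, max_depth=12,
                             class_weight="balanced", random_state=42)
clf.fit(X, y)

# feature_importances_ follows the column order of X and sums to 1.
ranked = sorted(zip(FEATURES, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:24s}{score:.3f}")
```

On a real corpus, a ranking dominated by length alone would be a warning sign that the model is keying on a quirk of the training feed instead of on structure.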

With these five features you can expect accuracy in the mid-90s on arithmetic DGAs. Adding character bigram frequencies scored against an English corpus pushes it higher.
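
One way to build that bigram feature is sketched below, under loud assumptions: the twenty-word CORPUS is a toy stand-in for a real English word list, and add-one smoothing is just the simplest choice. The function names train_bigrams and bigram_score are made up for this example.

```python
import math
from collections import Counter

# Tiny illustrative word list standing in for a real English corpus (assumption).
CORPUS = ["google", "shopping", "weather", "network", "future", "online",
          "travel", "market", "school", "garden", "house", "mail", "cloud",
          "news", "photo", "music", "health", "sport", "bank", "store"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

def train_bigrams(words):
    """Count adjacent character pairs, with ^ and $ marking word boundaries."""
    pair_counts, first_counts = Counter(), Counter()
    for w in words:
        padded = "^" + w + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def bigram_score(label, pair_counts, first_counts):
    """Mean log2 probability per transition; higher means more word-like."""
    padded = "^" + label.lower() + "$"
    vocab = len(ALPHABET) + 1            # possible next symbols: alphabet plus "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen pairs from producing log(0).
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + vocab)
        total += math.log2(p)
    return total / (len(padded) - 1)

pairs, firsts = train_bigrams(CORPUS)
for name in ("mailstore", "kq3v9z7r1xw8"):
    print(f"{name:14s}{bigram_score(name, pairs, firsts):.2f}")
```

The score would go into the feature dictionary as a sixth column. Averaging per transition keeps it from simply tracking label length.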

Be honest about the limit: dictionary-based DGAs like suppobox, which stitch real words together (shippingfuture.net), defeat lexical features because the output is pronounceable. Catching those needs word-list and n-gram modeling, and even then it is hard. Say so rather than claiming the model catches everything.
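
A word-list feature for the dictionary case might look like the sketch below. It tries to split a label into known words and reports the split with the fewest pieces. The WORDS set is a tiny illustrative list, and the segment function with its min_len and 20-character cap are assumptions for this example.

```python
# Tiny illustrative word list; a real one would be far larger (assumption).
WORDS = {"shipping", "future", "ship", "ping", "weather", "table",
         "garden", "fact", "book", "face"}

def segment(label, words=WORDS, min_len=3):
    """Split label into dictionary words using the fewest pieces, or return None."""
    best = {0: []}                      # best[i] = fewest-word split of label[:i]
    for end in range(1, len(label) + 1):
        # Only consider pieces up to 20 characters long.
        for start in range(max(0, end - 20), end):
            piece = label[start:end]
            if start in best and len(piece) >= min_len and piece in words:
                candidate = best[start] + [piece]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(label))

for name in ("shippingfuture", "facebook", "kq3v9z7r1xw8"):
    print(name, segment(name))
```

The catch is visible in the second example: a legitimate compound name segments just as cleanly as the suppobox-style one. That is why this works only as one more feature to combine with others, and why dictionary DGAs stay hard.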

Operationalizing Without Labels

In production you rarely have labels for live traffic. Combine the classifier score with the behavioral signal. Pull DNS logs, isolate the failed lookups, and find hosts generating many distinct NXDOMAIN queries:

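# load_zeek is a placeholder for your own loader; it should return a DataFrame
# with the Zeek dns.log columns id.orig_h, query, and rcode_name.
dns = load_zeek("dns.log")
nx = dns[dns["rcode_name"] == "NXDOMAIN"]

# Distinct failed names per source host, counted over the whole log.
burst = (nx.groupby("id.orig_h")["query"]
           .nunique().sort_values(ascending=False))
suspects = burst[burst > 50]           # tune to your environment baseline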

A workstation throwing hundreds of distinct NXDOMAIN lookups in a short window is behaving like a host hunting for its C2 rendezvous. Run the classifier over those failed domains: a host that is both generating an NXDOMAIN burst and querying high-entropy names is a strong detection, and the two signals together cut the false positives that either produces alone (some CDNs and telemetry endpoints use random-looking names, but they resolve).
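
The "short window" part can be made explicit by bucketing on the timestamp before counting. This is a minimal sketch on synthetic rows, assuming the log carries an epoch-seconds ts column as Zeek's dns.log does; the 10-minute window, the host addresses, and the query names are invented for the example.

```python
import pandas as pd

# Synthetic rows standing in for a parsed Zeek dns.log (ts, id.orig_h, query, rcode_name).
rows = [(1_700_000_000 + 5 * i, "10.0.0.5", f"x{i:03d}qzkv.example", "NXDOMAIN")
        for i in range(60)]                       # one host failing 60 distinct names in ~5 minutes
rows += [(1_700_000_000 + 60 * i, "10.0.0.9", f"typo{i}.example", "NXDOMAIN")
         for i in range(3)]                       # a quiet host with a few typos
rows.append((1_700_000_030, "10.0.0.9", "example.com", "NOERROR"))

dns = pd.DataFrame(rows, columns=["ts", "id.orig_h", "query", "rcode_name"])
dns["ts"] = pd.to_datetime(dns["ts"], unit="s")
nx = dns[dns["rcode_name"] == "NXDOMAIN"]

WINDOW = "10min"     # assumed window size
THRESHOLD = 50       # the 50-failure starting point from the text; tune to your baseline

# Distinct failed names per host within each fixed window.
per_window = (nx.groupby(["id.orig_h", pd.Grouper(key="ts", freq=WINDOW)])["query"]
                .nunique())
suspects = per_window[per_window > THRESHOLD]
print(suspects)
```

Fixed windows can split one burst across a boundary, so a threshold somewhat below the expected burst size, or overlapping windows, may be worth testing. The failed names from flagged host-window pairs are the ones to pass through the classifier.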

Classify the Pattern, Not the Domain

Domains are disposable, so an indicator feed of known-bad domains is always behind. A model that scores the string and a query that counts failed lookups both keep working against domains nobody has seen yet. That is the recurring theme of hunting with data: catch the generative behavior, not yesterday’s indicators.

This is the kind of applied ML we teach in GTK Cyber’s Threat Hunting with Data Science course, where students build classifiers like this on real security data. The T1568.002 reference page has the ATT&CK detail, and the beaconing detection post covers the C2 channel these domains are used to reach.

