# Building an ML Pipeline for Phishing URL Detection in Python

By Charles Givre · 2026-06-01

> Build a phishing URL classifier in Python: lexical and host features, a RandomForest model, threshold tuning for precision, and where lexical features break.

Phishing is still the most common way attackers get their first foothold ([Phishing, T1566](/mitre/T1566)). Block one campaign's domains and the next batch is registered an hour later, so an indicator feed of known-bad URLs is always a step behind. The structural tells of a phishing link, though, survive across campaigns, and those tells are measurable. That makes phishing URL detection a clean supervised-learning problem you can build and run with `pandas` and scikit-learn.

This is the pipeline: parse the URL, turn it into features, train a classifier, and tune the threshold for the metric that actually matters in a SOC. None of it requires a GPU or a deep-learning framework.

## What a Phishing URL Gives Away

A credential-harvesting link (Spearphishing Link, [T1566.002](/mitre/T1566.002)) has to do two things at once: look plausible to a human glancing at it, and resolve to infrastructure the attacker controls. That tension leaves fingerprints.

- Brand impersonation in the path or subdomain rather than the registrable domain: `paypal.com.account-verify.ru` puts the trusted name where it is not authoritative.
- Tokens that nudge urgency or trust: `login`, `verify`, `secure`, `update`, `confirm`, `account`.
- Raw IPs as the host, excessive subdomain depth, long paths, and high digit counts in the domain.
- A registrable domain that has nothing to do with the brand being spoofed.

These are exactly the features a model can score. The point is to classify the structure, not memorize the string.
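
As a minimal sketch of the first tell, the check below flags a brand name that appears in the subdomain or path but not in the registrable domain. The `BRANDS` watchlist and the `brand_outside_registrable` helper are hypothetical names for illustration, and the inputs are assumed to be already split into subdomain and registrable domain.

```python
# Hypothetical brand watchlist; a real deployment would maintain its own.
BRANDS = ("paypal", "microsoft", "apple")

def brand_outside_registrable(subdomain, domain, path="", brands=BRANDS):
    """Return brands named in the subdomain or path but not in the registrable domain."""
    surrounding = f"{subdomain}/{path}".lower()
    return [b for b in brands if b in surrounding and b not in domain.lower()]

# paypal.com.account-verify.ru splits into subdomain "paypal.com", domain "account-verify"
print(brand_outside_registrable("paypal.com", "account-verify"))  # ['paypal']
# www.paypal.com splits into subdomain "www", domain "paypal"
print(brand_outside_registrable("www", "paypal"))                 # []
```

Substring matching is deliberately crude here; it is meant to show the shape of the signal, not to replace the learned features.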

## Engineering URL Features

Use [`tldextract`](https://github.com/john-kurkowski/tldextract) to split the URL into subdomain, registrable domain, and suffix correctly, then derive features from each part. Everything here is plain string processing with no per-URL network lookups (by default `tldextract` fetches the Public Suffix List once and caches it):

```python
import ipaddress
import math
import tldextract
from urllib.parse import urlparse
from collections import Counter

SUSPICIOUS = ("login", "verify", "secure", "update", "confirm",
              "account", "signin", "webscr", "ebayisapi")

def shannon_entropy(s):
    """Bits per character of the string's character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def is_ip(host):
    """True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def features(url):
    # urlparse only fills in the host when a scheme is present
    parsed = urlparse(url if "://" in url else "http://" + url)
    ext = tldextract.extract(url)
    # .hostname drops any userinfo ("user@") and port, which .netloc keeps
    host = (parsed.hostname or "").lower()
    domain = ext.domain
    n = len(domain) or 1  # avoid dividing by zero when no domain is found
    return {
        "url_length": len(url),
        "host_length": len(host),
        "subdomain_depth": ext.subdomain.count(".") + 1 if ext.subdomain else 0,
        "path_depth": parsed.path.count("/"),
        "num_query_params": parsed.query.count("=") if parsed.query else 0,
        "has_ip_host": 1 if is_ip(host) else 0,
        "has_at_symbol": 1 if "@" in url else 0,
        "num_hyphens": host.count("-"),
        "digit_ratio": sum(c.isdigit() for c in domain) / n,
        "domain_entropy": shannon_entropy(domain),
        "suspicious_tokens": sum(t in url.lower() for t in SUSPICIOUS),
        "is_https": 1 if parsed.scheme == "https" else 0,
    }
```

`subdomain_depth`, `suspicious_tokens`, and `has_ip_host` carry a lot of the signal. `is_https` matters less than it used to now that free certificates are universal, but it is still mildly informative and costs nothing to keep.
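
To get a feel for what `domain_entropy` measures, here is the same entropy function applied to a few strings. The sample strings are illustrative only.

```python
import math
from collections import Counter

def shannon_entropy(s):
    # Same definition as in the feature extractor above.
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

print(f"{shannon_entropy('abcd'):.3f}")      # 2.000 (4 distinct characters)
print(f"{shannon_entropy('paypal'):.3f}")    # 1.918 (repeated p and a lower it)
print(f"{shannon_entropy('x7k2q9zt'):.3f}")  # 3.000 (8 distinct characters)
```

Entropy is capped at `log2(len(s))`, so it rises with length as well as randomness. That is one reason to read it alongside the length features rather than on its own.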

## Training the Classifier

Label a corpus. The standard setup uses a live phishing feed ([PhishTank](https://phishtank.org/) or [OpenPhish](https://openphish.com/)) as the malicious class and a top-sites list ([Tranco](https://tranco-list.eu/) or Cisco Umbrella) as the benign class. Top-sites lists contain bare domains rather than full URLs, so add benign URLs with real paths where you can; otherwise length and path features separate the classes trivially. A [RandomForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) handles the nonlinear interactions between these features without much tuning:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

feat = pd.DataFrame([features(u) for u in urls])
X_train, X_test, y_train, y_test = train_test_split(
    feat, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = RandomForestClassifier(n_estimators=200, max_depth=14,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```
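
The training snippet assumes `urls` and `labels` already exist. One way to assemble them from two sources is sketched below; `build_corpus` is a hypothetical helper, and keeping the malicious label when a URL appears in both sources is a design assumption, not a rule from either feed.

```python
def build_corpus(phish_urls, benign_urls):
    """Merge two URL collections into parallel urls/labels lists (1 = phishing)."""
    seen = set()
    urls, labels = [], []
    # Phishing first, so a URL present in both sources keeps the malicious label.
    for source, label in ((phish_urls, 1), (benign_urls, 0)):
        for raw in source:
            url = raw.strip()
            if url and url not in seen:  # skip blanks and exact duplicates
                seen.add(url)
                urls.append(url)
                labels.append(label)
    return urls, labels

urls, labels = build_corpus(
    ["http://paypal.com.account-verify.ru/login",
     "http://paypal.com.account-verify.ru/login"],   # duplicate feed entry
    ["example.com", "wikipedia.org"],
)
print(len(urls), sum(labels))  # 3 1
```

Deduplicating before the split keeps identical URLs from landing in both train and test sets.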

With this feature set you can expect accuracy in the mid-90s on PhishTank-versus-Tranco splits. Do not report accuracy alone: a benign-heavy stream makes accuracy look great while the model quietly misses the rare positive. Read precision and recall on the phishing class, and pull `clf.feature_importances_` to confirm the model is keying on structure (subdomain depth, tokens) rather than overfitting to the entropy of one DGA-like campaign in the training feed.
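
A quick worked example shows why accuracy misleads on a benign-heavy stream. The counts are invented for illustration: 10,000 URLs of which 1% are phishing.

```python
# Hypothetical confusion-matrix counts for a stream of 10,000 URLs, 1% phishing.
tp, fn = 50, 50      # half of the 100 phishing URLs are caught
fp, tn = 10, 9890    # 10 of the 9,900 benign URLs are flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.994 precision=0.833 recall=0.500
```

The model misses half of the phishing links and still reports 99.4% accuracy.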

If you want more headroom, gradient boosting ([LightGBM](https://lightgbm.readthedocs.io/) or [XGBoost](https://xgboost.readthedocs.io/)) on the same features usually buys a few points of recall at the same precision.

## Tuning for Precision, Not Accuracy

In production the cost of errors is asymmetric. A false negative lets one link through; a false positive blocks a legitimate business email and lands on an analyst's queue or, worse, breaks a customer workflow. Tune the decision threshold instead of accepting the default 0.5:

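```python
from sklearn.metrics import precision_recall_curve
import numpy as np

probs = clf.predict_proba(X_test)[:, 1]
prec, rec, thr = precision_recall_curve(y_test, probs)

# pick the lowest threshold whose precision is >= 0.98
# (thr is sorted ascending; prec and rec carry one extra trailing element)
target = 0.98
meets = prec[:-1] >= target
if meets.any():
    idx = np.argmax(meets)  # first True = lowest qualifying threshold
    print(f"threshold={thr[idx]:.3f}  precision={prec[idx]:.3f}  recall={rec[idx]:.3f}")
else:
    # without this guard, argmax returns 0 and silently picks the lowest threshold
    print(f"no threshold reaches precision {target}")
```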

Set the operating point against analyst bandwidth and tolerance for blocking, not against a leaderboard number. A model running at 0.98 precision and 0.80 recall is usually more useful at a mail gateway than one at 0.94 precision and 0.92 recall, because the second one's false positives erode trust in the whole control.
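
To put numbers on that trade-off, the sketch below converts a precision/recall pair into the false positives it implies. `expected_false_positives` is a hypothetical helper, and the figure of 1,000 true phishing URLs is an arbitrary assumption chosen to make the arithmetic easy to read.

```python
def expected_false_positives(phish_count, precision, recall):
    """False positives implied by a precision/recall pair for a given number of phishing URLs."""
    true_positives = recall * phish_count
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    return true_positives * (1 - precision) / precision

for precision, recall in ((0.98, 0.80), (0.94, 0.92)):
    fp = expected_false_positives(1000, precision, recall)
    print(f"precision={precision} recall={recall}: about {fp:.0f} false positives")
# precision=0.98 recall=0.8: about 16 false positives
# precision=0.94 recall=0.92: about 59 false positives
```

Under these assumptions the second operating point catches 120 more phishing URLs per thousand but blocks more than three times as many legitimate messages.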

## Where Lexical Features Break

Be honest about the failure modes, because attackers read the same playbook:

- **Compromised legitimate sites.** When a phishing kit is hosted on a hacked WordPress install at a normal-looking domain, every lexical feature says benign. You need URL reputation, page content analysis, or the absence of the brand's real login flow to catch it.
- **URL shorteners and open redirects.** `bit.ly/x7k2` carries no signal until it is expanded. Resolve shorteners and follow redirects before scoring.
- **Homograph and IDN spoofing.** Punycode domains like `xn--pypal-4ve.com` render as a trusted brand. Decode to Unicode and add a confusable-character check rather than relying on raw ASCII features.
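
The homograph case can be sketched with the standard library alone: decode the punycode label, map lookalike characters to ASCII, and compare the result with a brand list. The `CONFUSABLES` table below is a tiny hand-picked assumption covering a few Cyrillic letters, and `decode_label` and `looks_like_brand` are hypothetical helper names; a production check would use a full confusables dataset.

```python
# Tiny, incomplete lookalike table (Cyrillic -> Latin), for illustration only.
CONFUSABLES = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0443": "y",  # Cyrillic u
})

def decode_label(label):
    """Decode one DNS label from punycode if it carries the xn-- prefix."""
    label = label.lower()
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

def looks_like_brand(label, brands=("paypal",)):
    """True if the label only matches a brand after lookalikes are replaced."""
    decoded = decode_label(label)
    skeleton = decoded.translate(CONFUSABLES)
    return decoded != skeleton and skeleton in brands

print(looks_like_brand("xn--pypal-4ve"))  # True: decodes to p + U+0430 + ypal
print(looks_like_brand("paypal"))         # False: the genuine ASCII label
```

The result can feed the classifier as one more binary feature or drive a separate block rule.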

Lexical URL features are a fast, cheap first layer, not the whole control. Pair the classifier with sender reputation, DMARC/SPF results, and content features (a [TF-IDF](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) model over the email body catches campaigns that the URL alone does not). And retrain on a schedule: phishing structure drifts as kits evolve, so a model frozen six months ago will slowly bleed recall.

## Classify the Pattern, Not the Indicator

A blocklist of known-bad URLs is obsolete the moment the next domain is registered. A model that scores structure keeps working against links nobody has seen yet, which is the same idea behind [detecting DGA domains](/blog/detecting-dga-domains-python) and [hunting C2 beaconing](/blog/hunting-c2-beaconing-python): catch the generative behavior, not yesterday's indicators.

This is the kind of applied ML we teach in GTK Cyber's [Applied Data Science and AI](/courses/applied-data-science-ai) and [Threat Hunting with Data Science](/courses/threat-hunting-data-science) courses, where students build and tune classifiers like this on real security data. The [T1566 reference page](/mitre/T1566) has the ATT&CK detail on the phishing techniques these URLs are used to deliver.

## FAQ

### What URL features are most useful for detecting phishing links?

Subdomain depth, count of suspicious tokens (login, verify, secure, update, confirm, account), and whether the host is a raw IP carry a lot of the signal. Also useful: URL and host length, path depth, number of hyphens, digit ratio in the registrable domain, domain entropy, and presence of an @ symbol. Use tldextract to split the URL into subdomain, registrable domain, and suffix correctly before deriving features. is_https matters less than it used to now that free certificates are universal.

### Why tune the decision threshold instead of using the default 0.5?

The cost of errors is asymmetric in production. A false negative lets one link through, but a false positive blocks a legitimate business email and erodes trust in the whole control. Use precision_recall_curve to pick the lowest threshold that holds your target precision (for example 0.98). A model at 0.98 precision and 0.80 recall is usually more useful at a mail gateway than one at 0.94 precision and 0.92 recall, because the second one's false positives undermine confidence in the control.

### Why shouldn't I report accuracy alone for a phishing URL classifier?

A benign-heavy stream makes accuracy look great while the model quietly misses the rare positive. Read precision and recall on the phishing class instead, and pull clf.feature_importances_ to confirm the model keys on structure like subdomain depth and tokens rather than overfitting to one DGA-like campaign in the training feed.

### Where do lexical URL features fail to detect phishing?

Three main failure modes. Compromised legitimate sites: a phishing kit hosted on a hacked WordPress install at a normal-looking domain makes every lexical feature say benign. URL shorteners and open redirects: bit.ly/x7k2 carries no signal until expanded, so resolve and follow redirects before scoring. Homograph and IDN spoofing: punycode domains like xn--pypal-4ve.com render as a trusted brand, so decode to Unicode and add a confusable-character check. Pair the classifier with sender reputation, DMARC/SPF results, and content features like a TF-IDF model over the email body.

### What datasets train a phishing URL classifier?

Use a live phishing feed such as PhishTank or OpenPhish for the malicious class and a top-sites list such as Tranco or Cisco Umbrella for the benign class. Expect accuracy in the mid-90s on a PhishTank-versus-Tranco split with a RandomForest. Retrain on a schedule, because phishing structure drifts as kits evolve and a model frozen six months ago will slowly bleed recall.


---

Canonical: https://gtkcyber.com/blog/building-ml-phishing-detection-pipeline/