What URL features are most useful for detecting phishing links?

Subdomain depth, count of suspicious tokens (login, verify, secure, update, confirm, account), and whether the host is a raw IP carry a lot of the signal. Also useful: URL and host length, path depth, number of hyphens, digit ratio in the registrable domain, domain entropy, and presence of an @ symbol. Use tldextract to split the URL into subdomain, registrable domain, and suffix correctly before deriving features. is_https matters less than it used to now that free certificates are universal.

Why tune the decision threshold instead of using the default 0.5?

The cost of errors is asymmetric in production. A false negative lets one link through, but a false positive blocks a legitimate business email and erodes trust in the whole control. Use precision_recall_curve to pick the lowest threshold that holds your target precision (for example 0.98). A model at 0.98 precision and 0.80 recall is usually more useful at a mail gateway than one at 0.94 precision and 0.92 recall, because the second one's false positives undermine confidence in the control.

Why shouldn't I report accuracy alone for a phishing URL classifier?

A benign-heavy stream makes accuracy look great while the model quietly misses the rare positive. Read precision and recall on the phishing class instead, and pull clf.feature_importances_ to confirm the model keys on structure like subdomain depth and tokens rather than overfitting to one DGA-like campaign in the training feed.

Where do lexical URL features fail to detect phishing?

Three main failure modes. Compromised legitimate sites: a phishing kit hosted on a hacked WordPress install at a normal-looking domain makes every lexical feature say benign. URL shorteners and open redirects: bit.ly/x7k2 carries no signal until expanded, so resolve and follow redirects before scoring. Homograph and IDN spoofing: punycode domains like xn--pypal-4ve.com render as a trusted brand, so decode to Unicode and add a confusable-character check. Pair the classifier with sender reputation, DMARC/SPF results, and content features like a TF-IDF model over the email body.

What datasets train a phishing URL classifier?

Use a live phishing feed such as PhishTank or OpenPhish for the malicious class and a top-sites list such as Tranco or Cisco Umbrella for the benign class. Expect accuracy in the mid-90s on a PhishTank-versus-Tranco split with a RandomForest. Retrain on a schedule, because phishing structure drifts as kits evolve and a model frozen six months ago will slowly bleed recall.

Building an ML Pipeline for Phishing URL Detection in Python

Phishing is still the most common way attackers get their first foothold (Phishing, T1566). Block one campaign’s domains and the next batch is registered an hour later, so an indicator feed of known-bad URLs is always a step behind. The structural tells of a phishing link, though, survive across campaigns, and those tells are measurable. That makes phishing URL detection a clean supervised-learning problem you can build and run with pandas and scikit-learn.

This is the pipeline: parse the URL, turn it into features, train a classifier, and tune the threshold for the metric that actually matters in a SOC. None of it requires a GPU or a deep-learning framework.

What a Phishing URL Gives Away

A credential-harvesting link (Spearphishing Link, T1566.002) has to do two things at once: look plausible to a human glancing at it, and resolve to infrastructure the attacker controls. That tension leaves fingerprints.

Brand impersonation in the path or subdomain rather than the registrable domain: paypal.com.account-verify.ru puts the trusted name where it is not authoritative.
Tokens that nudge urgency or trust: login, verify, secure, update, confirm, account.
Raw IPs as the host, excessive subdomain depth, long paths, and high digit counts in the domain.
A registrable domain that has nothing to do with the brand being spoofed.

These are exactly the features a model can score. The point is to classify the structure, not memorize the string.
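
The first tell in the list, a brand name sitting outside the registrable domain, can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: brand_in_wrong_place and the BRANDS tuple are hypothetical names, and the function naively treats the last two labels as the registrable domain, which is wrong for suffixes such as co.uk. Real code would take the split from tldextract.

```python
# Hypothetical watch list; in practice this would hold the brands you protect.
BRANDS = ("paypal", "microsoft", "apple")

def brand_in_wrong_place(host, brands=BRANDS):
    """Return brands found in the subdomain but not in the registrable domain.

    Assumption: the registrable domain is the last two labels. This is a
    simplification for illustration only.
    """
    labels = host.lower().split(".")
    registrable = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return [b for b in brands if b in subdomain and b not in registrable]

print(brand_in_wrong_place("paypal.com.account-verify.ru"))  # ['paypal']
print(brand_in_wrong_place("www.paypal.com"))                # []
```

A non-empty result means the trusted name appears only where it is not authoritative, which is the pattern described above.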

Engineering URL Features

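Use tldextract to split the URL into subdomain, registrable domain, and suffix correctly, then derive features from each part. Everything here is plain string processing with no per-URL network lookups (by default tldextract may download the Public Suffix List once and cache it):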

import math
import tldextract
from urllib.parse import urlparse
from collections import Counter

# Tokens that commonly show up in credential-harvesting URLs
SUSPICIOUS = ("login", "verify", "secure", "update", "confirm",
              "account", "signin", "webscr", "ebayisapi")

def shannon_entropy(s):
    """Shannon entropy of the characters in s, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def features(url):
    """Lexical features for a single URL."""
    # urlparse only fills in the host when a scheme is present
    parsed = urlparse(url if "://" in url else "http://" + url)
    ext = tldextract.extract(url)
    # hostname drops any port or userinfo and is already lowercased,
    # so "1.2.3.4:8080" is still recognised as an IP host below
    host = parsed.hostname or ""
    domain = ext.domain
    n = len(domain) or 1  # avoid dividing by zero when no domain is found
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "subdomain_depth": ext.subdomain.count(".") + 1 if ext.subdomain else 0,
        "path_depth": parsed.path.count("/"),
        "num_query_params": parsed.query.count("=") if parsed.query else 0,
        "has_ip_host": 1 if host.replace(".", "").isdigit() else 0,
        "has_at_symbol": 1 if "@" in url else 0,
        "num_hyphens": host.count("-"),
        "digit_ratio": sum(c.isdigit() for c in domain) / n,
        "domain_entropy": shannon_entropy(domain),
        # number of distinct suspicious tokens present anywhere in the URL
        "suspicious_tokens": sum(t in lowered for t in SUSPICIOUS),
        "is_https": 1 if parsed.scheme == "https" else 0,
    }

subdomain_depth, suspicious_tokens, and has_ip_host carry a lot of the signal. is_https matters less than it used to now that free certificates are universal, but it is still mildly informative and costs nothing to keep.
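
The has_ip_host heuristic above treats any host made only of digits and dots as an IP address. A stricter variant, sketched here with the standard-library ipaddress module, also accepts IPv6 literals and ignores ports. The name is_ip_host is a hypothetical helper, not part of the pipeline above, and it will not flag integer- or hex-encoded hosts, which would need separate handling.

```python
import ipaddress
from urllib.parse import urlparse

def is_ip_host(url):
    """True when the URL's host is a literal IPv4 or IPv6 address."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    # hostname strips the port, any userinfo, and IPv6 square brackets
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(is_ip_host("http://192.168.10.5:8080/login"))  # True
print(is_ip_host("https://example.com/"))            # False
```

Swapping this in for the digit test changes only the has_ip_host column; the rest of the feature dictionary is unaffected.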

Training the Classifier

Label a corpus. The standard setup uses a live phishing feed (PhishTank or OpenPhish) as the malicious class and a top-sites list (Tranco or Cisco Umbrella) as the benign class. A RandomForest handles the nonlinear interactions between these features without much tuning:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# urls and labels come from your own corpus:
# urls is a list of URL strings, labels is 1 for phishing and 0 for benign
feat = pd.DataFrame([features(u) for u in urls])
# stratify keeps the phishing/benign ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    feat, labels, test_size=0.2, random_state=42, stratify=labels
)

# class_weight="balanced" reweights classes to offset a benign-heavy corpus
clf = RandomForestClassifier(n_estimators=200, max_depth=14,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

With this feature set you can expect accuracy in the mid-90s on PhishTank-versus-Tranco splits. Do not report accuracy alone: a benign-heavy stream makes accuracy look great while the model quietly misses the rare positive. Read precision and recall on the phishing class, and pull clf.feature_importances_ to confirm the model is keying on structure (subdomain depth, tokens) rather than overfitting to the entropy of one DGA-like campaign in the training feed.
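
A small worked example shows why accuracy misleads on a benign-heavy stream. The numbers are hypothetical (1,000 URLs, 10 of them phishing), and the "model" simply labels everything benign:

```python
# Hypothetical stream: 1,000 URLs, 10 of them phishing (label 1).
y_true = [1] * 10 + [0] * 990
# A useless model that calls every URL benign.
y_pred = [0] * 1000

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}  phishing recall={recall:.2f}")
# accuracy=0.99  phishing recall=0.00
```

The model scores 99% accuracy while catching none of the phishing URLs, which is why the per-class precision and recall are the numbers to read.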

If you want more headroom, gradient boosting (LightGBM or XGBoost) on the same features usually buys a few points of recall at the same precision.

Tuning for Precision, Not Accuracy

In production the cost of errors is asymmetric. A false negative lets one link through; a false positive blocks a legitimate business email and lands on an analyst’s queue or, worse, breaks a customer workflow. Tune the decision threshold instead of accepting the default 0.5:

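from sklearn.metrics import precision_recall_curve
import numpy as np

probs = clf.predict_proba(X_test)[:, 1]  # probability of the phishing class
prec, rec, thr = precision_recall_curve(y_test, probs)

# pick the lowest threshold that reaches precision >= 0.98
# prec and rec have one more entry than thr, so drop the final point
target = 0.98
meets = prec[:-1] >= target
if not meets.any():
    # without this check np.argmax would silently return index 0
    raise ValueError(f"no threshold reaches precision {target}")
idx = np.argmax(meets)  # first True; thresholds are sorted ascending
print(f"threshold={thr[idx]:.3f}  precision={prec[idx]:.3f}  recall={rec[idx]:.3f}")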

Set the operating point against analyst bandwidth and tolerance for blocking, not against a leaderboard number. A model running at 0.98 precision and 0.80 recall is usually more useful at a mail gateway than one at 0.94 precision and 0.92 recall, because the second one’s false positives erode trust in the whole control.
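
To make that comparison concrete, precision and recall can be turned into expected counts. The sketch below assumes a hypothetical volume of 1,000 phishing URLs reaching the gateway. expected_counts is an illustrative helper that relies only on the definitions precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
def expected_counts(precision, recall, n_phish):
    """True and false positives implied by a precision/recall pair."""
    tp = recall * n_phish
    # rearranged from precision = tp / (tp + fp)
    fp = tp * (1 - precision) / precision
    return tp, fp

for name, p, r in [("A", 0.98, 0.80), ("B", 0.94, 0.92)]:
    tp, fp = expected_counts(p, r, n_phish=1000)
    print(f"model {name}: caught={tp:.0f}  legitimate URLs blocked={fp:.1f}")
# model A: caught=800  legitimate URLs blocked=16.3
# model B: caught=920  legitimate URLs blocked=58.7
```

Under these assumptions the second model catches 120 more phishing URLs but blocks roughly 3.6 times as many legitimate ones, which is the trade the operating point has to justify.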

Where Lexical Features Break

Be honest about the failure modes, because attackers read the same playbook:

Compromised legitimate sites. When a phishing kit is hosted on a hacked WordPress install at a normal-looking domain, every lexical feature says benign. You need URL reputation, page content analysis, or the absence of the brand’s real login flow to catch it.
URL shorteners and open redirects. bit.ly/x7k2 carries no signal until it is expanded. Resolve shorteners and follow redirects before scoring.
Homograph and IDN spoofing. Punycode domains like xn--pypal-4ve.com render as a trusted brand. Decode to Unicode and add a confusable-character check rather than relying on raw ASCII features.
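
The homograph case can be sketched with the standard library alone. The example below decodes punycode labels and flags any label that mixes letters from more than one script. The function names are hypothetical, and using the first word of each character's Unicode name as a stand-in for its script is a rough assumption, not a full confusables check. A label written entirely in one look-alike script would pass this test.

```python
import unicodedata

def decode_label(label):
    """Decode one punycode (xn--) label to Unicode; other labels pass through."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

def scripts_in(label):
    """Rough script set: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}

def looks_mixed_script(host):
    """True when any label mixes letters from more than one script."""
    return any(len(scripts_in(decode_label(part))) > 1
               for part in host.split("."))

print(looks_mixed_script("xn--pypal-4ve.com"))  # True (Latin plus Cyrillic)
print(looks_mixed_script("paypal.com"))         # False
```

The result could be added as one more binary column alongside the other lexical features.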

Lexical URL features are a fast, cheap first layer, not the whole control. Pair the classifier with sender reputation, DMARC/SPF results, and content features (a TF-IDF model over the email body catches campaigns that the URL alone does not). And retrain on a schedule: phishing structure drifts as kits evolve, so a model frozen six months ago will slowly bleed recall.

Classify the Pattern, Not the Indicator

A blocklist of known-bad URLs is obsolete the moment the next domain is registered. A model that scores structure keeps working against links nobody has seen yet, which is the same idea behind detecting DGA domains and hunting C2 beaconing: catch the generative behavior, not yesterday’s indicators.

This is the kind of applied ML we teach in GTK Cyber’s Applied Data Science and AI and Threat Hunting with Data Science courses, where students build and tune classifiers like this on real security data. The T1566 reference page has the ATT&CK detail on the phishing techniques these URLs are used to deliver.
