Building an ML Pipeline for Phishing URL Detection in Python

By Charles Givre · June 1, 2026

machine learning · phishing · Python · data science · threat detection · SOC

Phishing is still the most common way attackers get their first foothold (Phishing, T1566). Block one campaign’s domains and the next batch is registered an hour later, so an indicator feed of known-bad URLs is always a step behind. The structural tells of a phishing link, though, survive across campaigns, and those tells are measurable. That makes phishing URL detection a clean supervised-learning problem you can build and run with pandas and scikit-learn.

This is the pipeline: parse the URL, turn it into features, train a classifier, and tune the threshold for the metric that actually matters in a SOC. None of it requires a GPU or a deep-learning framework.

What a Phishing URL Gives Away

A credential-harvesting link (Spearphishing Link, T1566.002) has to do two things at once: look plausible to a human glancing at it, and resolve to infrastructure the attacker controls. That tension leaves fingerprints.

  • Brand impersonation in the path or subdomain rather than the registrable domain: paypal.com.account-verify.ru puts the trusted name where it is not authoritative.
  • Tokens that nudge urgency or trust: login, verify, secure, update, confirm, account.
  • Raw IPs as the host, excessive subdomain depth, long paths, and high digit counts in the domain.
  • A registrable domain that has nothing to do with the brand being spoofed.

These are exactly the features a model can score. The point is to classify the structure, not memorize the string.
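
The first bullet can be made concrete with a small check. The sketch below is an illustration under stated assumptions, not part of the feature set built later: the BRANDS tuple and the brand_outside_registrable_domain helper are hypothetical names, and the registrable domain is approximated as the last two labels of the host. That approximation is wrong for multi-part suffixes such as co.uk, which is why the feature code in the next section uses tldextract instead.

```python
from urllib.parse import urlparse

# Hypothetical brand list for illustration; a real deployment would maintain its own.
BRANDS = ("paypal", "microsoft", "apple")

def brand_outside_registrable_domain(url):
    """Return brands found in the subdomain or path but not in the registrable domain.

    Assumption: the registrable domain is the last two host labels. That is
    wrong for suffixes such as co.uk; it only keeps this sketch stdlib-only.
    """
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    labels = host.split(".")
    registrable = ".".join(labels[-2:])
    rest = ".".join(labels[:-2]) + parsed.path.lower()
    return [b for b in BRANDS if b in rest and b not in registrable]

print(brand_outside_registrable_domain("paypal.com.account-verify.ru/login"))
print(brand_outside_registrable_domain("https://www.paypal.com/signin"))
```

The first call reports paypal because the name sits in the subdomain while the registrable domain is account-verify.ru; the second returns an empty list because the brand is the registrable domain itself.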

Engineering URL Features

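Use tldextract to split the URL into subdomain, registrable domain, and suffix correctly, then derive features from each part. Everything here is plain string processing with no per-URL network lookups (by default tldextract may download the Public Suffix List once on first use and cache it), so it is cheap to apply across a whole corpus: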

import math
import tldextract
from urllib.parse import urlparse
from collections import Counter

SUSPICIOUS = ("login", "verify", "secure", "update", "confirm",
              "account", "signin", "webscr", "ebayisapi")

def shannon_entropy(s):
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

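def features(url):
    """Return a dict of lexical features for one URL string."""
    # urlparse only fills in the host when a scheme is present, so add one if missing
    parsed = urlparse(url if "://" in url else "http://" + url)
    ext = tldextract.extract(url)
    # .hostname drops any port and userinfo, so "1.2.3.4:8080" still counts as an IP host
    host = (parsed.hostname or "").lower()
    domain = ext.domain
    n = len(domain) or 1  # avoid division by zero when no domain is parsed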
    return {
        "url_length": len(url),
        "host_length": len(host),
        "subdomain_depth": ext.subdomain.count(".") + 1 if ext.subdomain else 0,
        "path_depth": parsed.path.count("/"),
        "num_query_params": parsed.query.count("=") if parsed.query else 0,
        "has_ip_host": 1 if host.replace(".", "").isdigit() else 0,
        "has_at_symbol": 1 if "@" in url else 0,
        "num_hyphens": host.count("-"),
        "digit_ratio": sum(c.isdigit() for c in domain) / n,
        "domain_entropy": shannon_entropy(domain),
        "suspicious_tokens": sum(t in url.lower() for t in SUSPICIOUS),
        "is_https": 1 if parsed.scheme == "https" else 0,
    }

subdomain_depth, suspicious_tokens, and has_ip_host carry a lot of the signal. is_https matters less than it used to now that free certificates are universal, but it is still mildly informative and costs nothing to keep.
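
To get a feel for what domain_entropy measures, it helps to run the same entropy function on a few strings. The block below repeats shannon_entropy from the feature code so it runs on its own; the sample strings are made up for illustration and do not come from a real feed.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy of the string s, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

# Illustrative strings only: a short label, a dictionary-like name, a random-looking one.
for name in ("abcd", "google", "x7k2q9zt"):
    print(f"{name:10s} len={len(name)} entropy={shannon_entropy(name):.3f}")
```

One design point worth keeping in mind: the entropy of a string of length n can never exceed log2(n), so this feature partly encodes domain length. A short random label and a long dictionary word can land on similar values, which is one reason to read it alongside digit_ratio rather than alone.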

Training the Classifier

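Label a corpus. The standard setup uses a live phishing feed (PhishTank or OpenPhish) as the malicious class and a top-sites list (Tranco or Cisco Umbrella) as the benign class. Top-sites lists contain bare domains rather than full URLs, so length and path features can separate the two classes for the wrong reason; where you can, sample full benign URLs instead. A RandomForest handles the nonlinear interactions between these features without much tuning: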

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# urls: list of URL strings; labels: parallel list, 1 = phishing, 0 = benign
feat = pd.DataFrame([features(u) for u in urls])
X_train, X_test, y_train, y_test = train_test_split(
    feat, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = RandomForestClassifier(n_estimators=200, max_depth=14,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
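
The training snippet assumes that urls and labels already exist. One way to produce them from two source lists is sketched below. The build_corpus helper is a hypothetical name, the sample URLs are invented, and the rule that a URL present in both lists keeps the phishing label is an assumption; dropping such conflicts entirely is an equally defensible choice.

```python
import random

def build_corpus(phishing_urls, benign_urls, seed=42):
    """Merge two URL lists into shuffled (urls, labels); 1 = phishing, 0 = benign."""
    seen = set()
    rows = []
    # Phishing first, so a URL that appears in both lists keeps the phishing label.
    for label, source in ((1, phishing_urls), (0, benign_urls)):
        for raw in source:
            url = raw.strip()
            if not url or url in seen:
                continue  # skip blank lines and duplicates
            seen.add(url)
            rows.append((url, label))
    random.Random(seed).shuffle(rows)  # fixed seed keeps the order reproducible
    urls = [u for u, _ in rows]
    labels = [y for _, y in rows]
    return urls, labels

urls, labels = build_corpus(
    ["http://paypal.com.account-verify.ru/login", "http://192.0.2.10/secure/update"],
    ["https://www.wikipedia.org/", "https://github.com/login"],
)
print(len(urls), sum(labels))
```

Deduplicating before the train/test split matters: the same URL landing on both sides of the split inflates every metric reported afterwards.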

With this feature set you can expect accuracy in the mid-90s on PhishTank-versus-Tranco splits. Do not report accuracy alone: a benign-heavy stream makes accuracy look great while the model quietly misses the rare positive. Read precision and recall on the phishing class, and pull clf.feature_importances_ to confirm the model is keying on structure (subdomain depth, tokens) rather than overfitting to the entropy of one DGA-like campaign in the training feed.
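
A quick worked example shows why accuracy alone misleads on a benign-heavy stream. The confusion counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion counts for 10,000 URLs, of which 100 are phishing.
tp, fn = 40, 60      # phishing caught / phishing missed
fp, tn = 10, 9890    # benign flagged / benign passed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything flagged, how much was phishing
recall = tp / (tp + fn)      # of all phishing, how much was flagged

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts accuracy comes out at 0.993 while recall on the phishing class is only 0.400: the model misses 60 of every 100 phishing links and still looks excellent on the headline number.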

If you want more headroom, gradient boosting (LightGBM or XGBoost) on the same features usually buys a few points of recall at the same precision.

Tuning for Precision, Not Accuracy

In production the cost of errors is asymmetric. A false negative lets one link through; a false positive blocks a legitimate business email and lands on an analyst’s queue or, worse, breaks a customer workflow. Tune the decision threshold instead of accepting the default 0.5:

from sklearn.metrics import precision_recall_curve
import numpy as np

probs = clf.predict_proba(X_test)[:, 1]
prec, rec, thr = precision_recall_curve(y_test, probs)

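# pick the lowest threshold whose precision is >= 0.98
# (in practice, choose it on a validation split rather than the final test set)
target = 0.98
ok = prec[:-1] >= target  # prec has one more entry than thr; drop the last
if not ok.any():
    raise ValueError(f"no threshold reaches precision {target}")
idx = np.argmax(ok)  # index of the first True, i.e. the lowest qualifying threshold
print(f"threshold={thr[idx]:.3f}  precision={prec[idx]:.3f}  recall={rec[idx]:.3f}")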

Set the operating point against analyst bandwidth and tolerance for blocking, not against a leaderboard number. A model running at 0.98 precision and 0.80 recall is usually more useful at a mail gateway than one at 0.94 precision and 0.92 recall, because the second one’s false positives erode trust in the whole control.
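
The trade-off between those two operating points is easier to judge as alert counts. The sketch below converts precision and recall into true and false positives for a hypothetical stream containing 1,000 phishing links. The alerts_for helper and the volume are assumptions for illustration, and it treats the measured precision as if it carried over unchanged to production, which it will not if the class balance differs.

```python
def alerts_for(precision, recall, phishing_count):
    """Return (true_positives, false_positives) implied by precision and recall."""
    tp = recall * phishing_count
    # precision = tp / (tp + fp)  =>  fp = tp * (1 - precision) / precision
    fp = tp * (1 - precision) / precision
    return tp, fp

phishing_count = 1000  # hypothetical number of phishing links in the scored stream
for precision, recall in ((0.98, 0.80), (0.94, 0.92)):
    tp, fp = alerts_for(precision, recall, phishing_count)
    print(f"precision={precision:.2f} recall={recall:.2f} "
          f"caught={tp:.0f} missed={phishing_count - tp:.0f} false_positives={fp:.1f}")
```

Under these assumptions the second operating point catches 120 more phishing links but produces roughly 59 false positives against roughly 16 for the first, more than three times as many blocked legitimate messages.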

Where Lexical Features Break

Be honest about the failure modes, because attackers read the same playbook:

  • Compromised legitimate sites. When a phishing kit is hosted on a hacked WordPress install at a normal-looking domain, every lexical feature says benign. You need URL reputation, page content analysis, or the absence of the brand’s real login flow to catch it.
  • URL shorteners and open redirects. bit.ly/x7k2 carries no signal until it is expanded. Resolve shorteners and follow redirects before scoring.
  • Homograph and IDN spoofing. Punycode domains like xn--pypal-4ve.com render as a trusted brand. Decode to Unicode and add a confusable-character check rather than relying on raw ASCII features.
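
The homograph case in the last bullet can be sketched with the standard library alone. Python's built-in idna codec (which implements IDNA 2003) decodes the punycode label, and a small lookup table maps look-alike characters back to Latin. The CONFUSABLES table here is a deliberately tiny, hypothetical sample, and decode_idn and ascii_skeleton are illustrative names; a production check would draw on the full Unicode confusables data.

```python
# Hypothetical, deliberately tiny confusables map (Cyrillic -> Latin look-alikes).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}

def decode_idn(host):
    """Decode punycode (xn--) labels to Unicode with Python's built-in idna codec.

    Expects an ASCII host string; non-ASCII input raises UnicodeEncodeError.
    """
    return host.encode("ascii").decode("idna")

def ascii_skeleton(host):
    """Replace known look-alike characters with their Latin counterparts."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in host)

host = "xn--pypal-4ve.com"
decoded = decode_idn(host)
skeleton = ascii_skeleton(decoded)
# A host that is not pure ASCII but whose skeleton matches a brand is suspicious.
print(decoded.isascii(), skeleton)
```

The decoded host contains a Cyrillic character in place of the Latin "a", so isascii() is False while the skeleton reads paypal.com. Comparing the skeleton against a brand list gives a feature the raw ASCII string cannot provide.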

Lexical URL features are a fast, cheap first layer, not the whole control. Pair the classifier with sender reputation, DMARC/SPF results, and content features (a TF-IDF model over the email body catches campaigns that the URL alone does not). And retrain on a schedule: phishing structure drifts as kits evolve, so a model frozen six months ago will slowly bleed recall.

Classify the Pattern, Not the Indicator

A blocklist of known-bad URLs is obsolete the moment the next domain is registered. A model that scores structure keeps working against links nobody has seen yet, which is the same idea behind detecting DGA domains and hunting C2 beaconing: catch the generative behavior, not yesterday’s indicators.

This is the kind of applied ML we teach in GTK Cyber’s Applied Data Science and AI and Threat Hunting with Data Science courses, where students build and tune classifiers like this on real security data. The T1566 reference page has the ATT&CK detail on the phishing techniques these URLs are used to deliver.
