Threat Hunting with Data Science

Apply machine learning and data science to hunt and identify threats. Build models for anomaly detection, phishing, DGA, and SQL injection detection.

Overview

Security teams generate more data than analysts can process manually. Signatures and rules catch known threats, but advanced attackers blend into normal traffic, move slowly, and use legitimate tools. Threat hunters need techniques that find what rules miss.

This 32-hour course teaches security professionals to apply machine learning and data science to hunt and identify threats within their organizations. 50% of class time is instructor-led, and 50% is hands-on labs using Jupyter notebooks with real security datasets.

What You Will Learn

  • Understand and apply machine learning to identify organizational anomalies
  • Create machine learning models specific to your organization’s data and threat profile
  • Operationalize ML projects for phishing detection, DGA identification, and SQL injection classification
  • Tune models to improve prediction performance and reduce false positives
  • Train systems to make detection decisions at scale

Who This Is For

Threat hunters, SOC analysts, and security engineers who want to move beyond signature-based detection. You should be comfortable with basic Python (or have completed GTK Cyber’s Python for Security Analysts course).

Students who complete this course are prepared for the AI Cyber Bootcamp, which covers advanced topics including generative AI, LLM security, and adversarial AI testing.

Topics Covered

  • Machine learning fundamentals for threat hunting
  • Anomaly detection and identification techniques
  • Organization-specific ML model creation
  • Phishing detection with ML
  • Domain generation algorithm (DGA) detection
  • SQL injection detection
  • Model tuning to reduce false positives
  • Operationalizing ML projects for security

Tools & Technologies

Python, Jupyter, Pandas, scikit-learn, Centaur VM

Frequently Asked Questions

Do I need machine learning experience before taking a threat hunting with data science course?
Basic Python is the main prerequisite. You should be able to read and write simple scripts, work with lists and dictionaries, and import libraries. You do not need prior machine learning or statistics experience. The course builds ML concepts from fundamentals applied directly to security data. If you have no Python background, completing a Python-for-security-analysts course first will make the labs significantly smoother.
How do you use machine learning to detect domain generation algorithms (DGA)?
DGA-generated domains have statistical properties that distinguish them from legitimate domains: high character entropy, unusual character n-gram distributions, distinctive length patterns, and absence from domain popularity lists such as Tranco (or the retired Alexa list). Features like character-level entropy, consonant-to-vowel ratio, n-gram frequency scores against a corpus of legitimate domains, and domain length feed into a classifier (a Random Forest, or a character-level LSTM for higher accuracy). Train on labeled datasets of known DGA families (e.g., Conficker, CryptoLocker) and benign domains. scikit-learn's RandomForestClassifier is a practical starting point before adding deep learning approaches.
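The feature-plus-classifier approach above can be sketched in a few lines of scikit-learn. The feature set, toy domain lists, and labels here are illustrative assumptions, not course material; a real model needs thousands of labeled domains per family:

```python
import math
from collections import Counter

from sklearn.ensemble import RandomForestClassifier

def entropy(s: str) -> float:
    """Shannon entropy of the character distribution in a string."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def features(domain: str) -> list[float]:
    """Simple lexical features over the registrable label (illustrative set)."""
    label = domain.split(".")[0]
    vowels = sum(ch in "aeiou" for ch in label)
    consonants = sum(ch.isalpha() and ch not in "aeiou" for ch in label)
    return [
        len(label),                                      # domain length
        entropy(label),                                  # character entropy
        consonants / max(vowels, 1),                     # consonant-to-vowel ratio
        sum(ch.isdigit() for ch in label) / len(label),  # digit ratio
    ]

# Tiny toy corpus: 0 = benign, 1 = DGA-like (hypothetical examples)
benign = ["google.com", "github.com", "wikipedia.org", "python.org"]
dga = ["xjw9qk2lfhz.com", "qzp8vmw3ktx.net", "h4k2j9xqwlf.biz", "zq9xk3vjw7p.info"]

X = [features(d) for d in benign + dga]
y = [0] * len(benign) + [1] * len(dga)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([features("wvq8zkx2jp9f.com")]))
```

In practice you would add n-gram frequency scores computed against a large benign corpus and train on public DGA feeds rather than a handful of hand-written strings.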
What is the practical difference between signature-based detection and ML-based threat hunting?
Signature-based detection matches known patterns exactly. It is reliable and fast for known threats but misses new variants, packed malware, and attackers who modify their tooling. ML-based detection learns statistical patterns from labeled data and generalizes to variants that match those patterns even if the exact signature doesn't exist. The tradeoff: ML models require training data, produce false positives that need tuning, and can be evaded by adversaries who understand the feature space. In practice, effective threat hunting uses both: signatures for known threats, ML for behavioral anomalies and novel variants.
How do you operationalize a machine learning phishing detection model in a production SOC?
Start with offline validation: train on labeled phishing and legitimate email samples, evaluate precision/recall on a holdout set, and tune the decision threshold based on analyst capacity (a precision of 0.90 at the operating point is more useful than maximizing AUC). Then integrate into your email pipeline as a scoring service: serialize the trained model with joblib or pickle, expose it via a REST endpoint, and attach the score as an email header or SOAR enrichment field. Retrain on a schedule using confirmed true/false positives from analyst feedback. Track model drift by monitoring score distributions over time.
What ML algorithms work best for SQL injection detection in web logs?
For SQL injection detection in HTTP request logs, start with feature engineering: request length, ratio of SQL keywords to total tokens, presence of comment sequences (-- and /**/), percentage of URL-encoded characters, and character entropy. A Random Forest classifier trained on labeled examples from datasets like CSIC 2010 HTTP or custom-labeled logs from your WAF is a solid baseline. For sequence-aware detection that captures multi-request injection attempts, an LSTM or 1D CNN over tokenized request sequences improves coverage. Tune the decision threshold carefully (or the contamination parameter, if you use an anomaly detector such as Isolation Forest): a high false-positive rate in a WAF context blocks legitimate traffic.
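The feature-engineering step above can be sketched as follows. The keyword list, toy query strings, and labels are illustrative assumptions; real training data would come from labeled WAF logs or a corpus like CSIC 2010:

```python
import math
import re
from collections import Counter
from urllib.parse import unquote

from sklearn.ensemble import RandomForestClassifier

SQL_KEYWORDS = {"select", "union", "insert", "update", "delete", "drop",
                "where", "from", "or", "and", "sleep", "benchmark"}

def entropy(s: str) -> float:
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0

def request_features(raw_query: str) -> list[float]:
    """Lexical features over a decoded query string (illustrative set)."""
    decoded = unquote(raw_query)
    tokens = re.findall(r"[A-Za-z_]+", decoded.lower())
    keyword_hits = sum(t in SQL_KEYWORDS for t in tokens)
    return [
        len(decoded),                                   # request length
        keyword_hits / max(len(tokens), 1),             # SQL-keyword ratio
        float("--" in decoded or "/*" in decoded),      # comment sequences
        decoded.count("'") + decoded.count('"'),        # quote count
        (len(raw_query) - len(decoded)) / max(len(raw_query), 1),  # URL-encoding share
        entropy(decoded),                               # character entropy
    ]

# Tiny illustrative labeled set: 1 = injection attempt, 0 = benign
attacks = ["id=1%27%20OR%20%271%27=%271",
           "q=1 UNION SELECT password FROM users--",
           "user=admin'--",
           "id=1; DROP TABLE users--"]
benign = ["id=42", "q=blue+running+shoes", "page=3&sort=price", "user=alice"]

X = [request_features(r) for r in attacks + benign]
y = [1] * len(attacks) + [0] * len(benign)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([request_features("id=0 UNION SELECT * FROM accounts--")]))
```

Decoding before feature extraction matters: attackers routinely URL-encode payloads, and the encoding share itself becomes a useful signal.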

Interested in this course?

Contact us for scheduling, custom corporate training, or conference availability.

Request This Course