Data Science Techniques That Speed Up Incident Response

By Charles Givre · May 1, 2026

data science · incident response · machine learning · Python · forensics · SOC

When you’re three hours into an incident with three hundred thousand log lines, “look at the logs” is not an action plan. Data science techniques exist to reduce that problem to something tractable.

This isn’t about replacing IR tools. It’s about augmenting them with analysis patterns that handle scale, identify structure in noisy data, and compress the time between “data dump” and “here’s what happened.”

Timeline Reconstruction with Pandas

Building a complete attack timeline is often the first priority in IR. Evidence comes from multiple sources: Windows Security events, Zeek connection logs, Sysmon events, file system timestamps. Getting them into a single chronological view manually is error-prone.

pandas handles this well. The key is normalizing timestamps to UTC and merging sources on time:

import pandas as pd
from evtx import PyEvtxParser
import json

def load_windows_events(path, event_ids=None):
    parser = PyEvtxParser(path)
    records = [json.loads(r['data']) for r in parser.records_json()]
    df = pd.json_normalize(records)
    df['timestamp'] = pd.to_datetime(
        df['Event.System.TimeCreated.#attributes.SystemTime'], utc=True
    )
    if event_ids:
        df = df[df['Event.System.EventID'].isin(event_ids)]
    return df

def load_zeek_conn(path):
    # Zeek TSV logs carry their column names in a '#fields' header line
    cols = None
    with open(path) as f:
        for line in f:
            if line.startswith('#fields'):
                cols = line.strip().split('\t')[1:]
                break
    if cols is None:
        raise ValueError(f"no #fields header found in {path}")
    df = pd.read_csv(path, sep='\t', comment='#', names=cols, na_values=['-', '(empty)'])
    df['timestamp'] = pd.to_datetime(df['ts'], unit='s', utc=True)
    return df

events = pd.concat([
    load_windows_events('Security.evtx', event_ids=[4624, 4625, 4688]).assign(source='windows'),
    load_zeek_conn('conn.log').assign(source='zeek'),
], ignore_index=True).sort_values('timestamp')

The source column preserves which log each event came from. Sort ascending and you have a cross-source timeline where successful logons (Event ID 4624) appear alongside the network connections they correspond to.
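To pair each logon with its closest connection in time, pd.merge_asof does a nearest-in-time join. A minimal sketch with made-up rows (the 5-second tolerance and the src_ip column name are assumptions for illustration; id.orig_h is Zeek's originator field):

```python
import pandas as pd

# Made-up rows standing in for the loaded frames: logons carry a source IP,
# Zeek connections carry the originating host in id.orig_h
logons = pd.DataFrame({
    'timestamp': pd.to_datetime(['2026-05-01 12:00:01'], utc=True),
    'src_ip': ['10.0.0.5'],
})
conns = pd.DataFrame({
    'timestamp': pd.to_datetime(['2026-05-01 12:00:00'], utc=True),
    'id.orig_h': ['10.0.0.5'],
})

# Nearest-in-time join: each logon gets the closest connection from the
# same host within a 5-second window (the tolerance is an assumption)
paired = pd.merge_asof(
    logons.sort_values('timestamp'),
    conns.sort_values('timestamp'),
    on='timestamp',
    left_by='src_ip', right_by='id.orig_h',
    tolerance=pd.Timedelta('5s'),
    direction='nearest',
)
```

merge_asof requires both frames sorted on the join key, which the timeline build already guarantees.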

The common failure: mixing naive (no timezone) and tz-aware timestamps. Force UTC on every source at load time to avoid merge errors later.
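For a source that emits naive local-time strings, the fix is to localize first and then convert, so the column matches every other loader's UTC output. A minimal sketch (America/New_York is an assumed example zone):

```python
import pandas as pd

def to_utc(series, source_tz='America/New_York'):
    """Parse naive local-time strings and normalize to UTC."""
    ts = pd.to_datetime(series)              # still naive at this point
    return ts.dt.tz_localize(source_tz).dt.tz_convert('UTC')

# A 09:00 Eastern timestamp on this date (EDT, UTC-4) becomes 13:00 UTC
utc = to_utc(pd.Series(['2026-05-01 09:00:00']))
```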

Clustering Related Artifacts

During triage, you often need to group a large number of related artifacts: commands executed, IPs contacted, file paths modified. Clustering finds structure that manual review misses at scale.

Suppose you pull a list of command-line executions from Sysmon Event ID 1 (MITRE ATT&CK T1059) and need to identify distinct malware families or attacker toolsets within them. TF-IDF vectors plus DBSCAN cluster similar commands without requiring a predefined number of clusters:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# df_cmds: DataFrame of Sysmon Event ID 1 records with a 'cmdline' column
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=500)
X = vectorizer.fit_transform(df_cmds['cmdline'].fillna(''))
X_normalized = normalize(X)

db = DBSCAN(eps=0.3, min_samples=2, metric='cosine')
df_cmds['cluster'] = db.fit_predict(X_normalized)

for cluster_id in sorted(df_cmds['cluster'].unique()):
    print(f"\nCluster {cluster_id}:")
    print(df_cmds[df_cmds['cluster'] == cluster_id]['cmdline'].head(5).to_string())

eps=0.3 on cosine distance controls how similar two commands need to be to belong to the same cluster. Cluster -1 is DBSCAN’s noise label for points that don’t group with anything, which is often where the most unusual activity lives: attacker tooling that appeared once and doesn’t resemble anything else in the dataset.

The same pattern applies to network activity: cluster destination IPs by shared ASN and reverse DNS patterns to separate C2 infrastructure from legitimate traffic, or cluster DNS queries by character entropy to identify DGA domain families (MITRE ATT&CK T1568.002).
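The entropy piece can be sketched in a few lines. The domains below are invented examples; Shannon entropy over the registered label's characters gives one feature that separates random-looking DGA labels from dictionary-word domains:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits per character over the string's character distribution."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented examples: score the registered label only, not the TLD
domains = ['google.com', 'xkvbq7w3jzpl.net', 'microsoft.com', 'q8zj2mvknrtw.org']
scores = {d: round(shannon_entropy(d.split('.')[0]), 2) for d in domains}
# A simple threshold (or DBSCAN on entropy plus label length) then
# separates high-entropy DGA candidates from the dictionary names
```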

NLP for Log Search at Scale

During IR, you often need to answer specific questions against log data that isn’t well-indexed: find any reference to this hostname across all log sources, or find commands that resemble known credential-dumping patterns.

For structured logs with machine-parseable fields, SQL-style filtering works. For free-form log text (application logs, bash history, webserver access logs), TF-IDF similarity lets you find relevant entries against a natural-language query without requiring exact string matches:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# log_lines: list of strings, one per log entry
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 4))
corpus_vectors = vectorizer.fit_transform(log_lines)

def search_logs(query, top_n=20):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, corpus_vectors)[0]
    top_indices = np.argsort(scores)[::-1][:top_n]
    return [(log_lines[i], round(scores[i], 3)) for i in top_indices if scores[i] > 0]

results = search_logs("certutil download base64 decode", top_n=25)

Character-level n-grams (char_wb, ngram_range=(3, 4)) are more tolerant of obfuscation than word-level tokenization. An attacker using cert util with a space, or CeRtUtIl with mixed case, still produces character trigrams that overlap with the query.
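A quick way to convince yourself: vectorize an original command, a space-obfuscated variant, and an unrelated command, and compare pairwise similarities (toy strings, not part of the pipeline above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 4))
docs = ['certutil -urlcache', 'cert util -urlcache', 'whoami /all']
X = vec.fit_transform(docs)
sims = cosine_similarity(X)
# The space-obfuscated variant shares most of its character n-grams with
# the original, so it scores far closer than the unrelated command does
```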

This doesn’t replace a SIEM with proper full-text indexing. It’s for working with log archives that aren’t in your SIEM, with log types your SIEM can’t parse, or in environments where your normal toolchain isn’t accessible.

When Notebooks Become Evidence

Jupyter notebooks used during IR are analysis artifacts that can become case evidence. Document analytical decisions inside cells: why you applied a specific filter, what a cluster ID represents, which IOCs you excluded and why. Future analysts and legal counsel will need to follow your reasoning.

When converting findings to a report for stakeholders, nbconvert exports the notebook including all output:

jupyter nbconvert --to html ir_analysis_2026-05-01.ipynb --output-dir ./reports/

Keep both the raw notebook and the exported HTML. The HTML is for sharing; the notebook preserves the analysis logic for follow-up questions.

What This Doesn’t Replace

These techniques are force multipliers, not substitutes for forensic tools. They don’t replace Autopsy, Volatility, or Plaso. The pattern is: Plaso builds the timeline, pandas lets you filter and analyze it; Volatility extracts memory artifacts, Python processes what Volatility extracts.

The gap most IR teams have isn’t in forensic tooling. It’s in analyzing data at scale once it’s collected. That’s where data science skills pay off in IR work.

GTK Cyber’s applied data science training covers these techniques hands-on, with labs built around realistic IR datasets and scenarios practitioners encounter in real investigations.

Frequently Asked Questions

How do you handle log sources with different timestamp formats when building an IR timeline?
Force UTC on every source at load time. Windows Security event timestamps come in ISO 8601 format inside the XML; Zeek conn.log timestamps are Unix epoch floats; syslog timestamps may lack year information entirely. Use pd.to_datetime() with utc=True in every load function. Mixing naive and tz-aware timestamps in a pandas concat produces an object-dtype column, and sorting it raises a TypeError; that failure is much easier to catch at load time than after merging four sources. Document the timezone assumption in a comment at the top of the notebook so it's clear when the notebook becomes evidence.
What DBSCAN parameters work for clustering command-line arguments in IR?
Start with eps=0.3 and min_samples=2 using cosine distance on TF-IDF vectors. Lower eps values (0.1-0.2) produce tighter clusters with more noise points (cluster -1); higher values (0.4-0.5) merge dissimilar commands into the same cluster. min_samples=2 makes sense for IR where even two similar unusual commands are worth grouping. The noise cluster (-1) is often the most analytically interesting: it contains commands that appear only once and don't group with anything, which is exactly where attacker one-off tooling tends to land.
When should I use TF-IDF similarity search instead of SIEM search during IR?
When the data isn't in your SIEM, or when your SIEM doesn't handle the log type. SIEM full-text search is faster and more operationally reliable for production log analysis. TF-IDF similarity is useful when you're working with log archives exported from a system you don't control, with application log formats your SIEM can't parse, or in an air-gapped environment where your normal toolchain isn't available. It's also useful for fuzzy matching: searching for commands that resemble known attacker patterns even when the exact string doesn't match, because character n-gram vectorization tolerates obfuscation through spacing, case changes, and argument reordering.

Want to learn more?

Explore our hands-on AI and cybersecurity training courses.

View Courses