OneSecurity — Phishing + AI Lab
End-to-end hands-on workflow: dataset → model → explainability

Phishing + AI — Step-by-step Lab

Follow this guided lab to take a phishing dataset through preprocessing, feature engineering, baseline training (RandomForest), evaluation, and SHAP-based explanation. Includes copyable code snippets and a downloadable sample CSV so you can run everything in Colab or locally.

Estimated time: 90–120 minutes (depends on dataset size)
Requirements: Python (Colab recommended), pandas, scikit-learn, shap, matplotlib, seaborn

Quick start

  1. Open the Colab notebook or download the sample CSV.
  2. If you have a full dataset (Kaggle/UCI/OpenPhish/PhishTank), upload it to Colab at /content/phishing.csv.
  3. Run cells from top to bottom; the notebook contains fallback logic to use the sample CSV if no dataset is provided.

Step-by-step workflow

Step 1 — Dataset acquisition, 10–20 min

Get the CSV you will use for the lab. Use the sample provided below if you don't have access to the full Kaggle dataset.

Purpose: obtain a dataset you will analyze and model. You should confirm the file format (CSV), encoding (UTF-8 preferred), and the presence of a usable label column (recommended name: label, values 1=phishing, 0=safe).
Checklist: dataset file (CSV), label column, URL column if available, and note column names.

Tips: if you have the Kaggle dataset, use the Kaggle API in Colab; for quick practice use the provided sample CSV. Do not open raw URLs from the data on an unprotected machine.
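A minimal sketch of the loading cell with the Quick-start fallback. It assumes the full dataset, if uploaded, lives at /content/phishing.csv, and that the sample file is named phishing_sample.csv (a placeholder name; adjust it to match the file you downloaded):

```python
import os
import pandas as pd

FULL_PATH = "/content/phishing.csv"     # path used when a full dataset is uploaded to Colab
SAMPLE_PATH = "phishing_sample.csv"     # placeholder name for the downloadable sample CSV

# Fall back to the sample CSV if no full dataset was uploaded
path = FULL_PATH if os.path.exists(FULL_PATH) else SAMPLE_PATH
df = pd.read_csv(path, encoding="utf-8")

print(f"Loaded {path}: {df.shape[0]} rows x {df.shape[1]} columns")
df.head()   # displays the first rows when run in a notebook cell
```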

Step 2 — Exploratory Data Analysis (EDA), 20–30 min

Inspect structure, missing values, and class balance.

Purpose: understand data quality and distribution before modeling. EDA tells you whether the dataset needs cleaning, whether labels are imbalanced, and highlights suspicious or missing values.
Checklist: shape, columns list, label counts, null counts, sample rows, unique values in key categorical columns.
Typical outputs: a table of missing counts, a bar chart of label distribution, and representative sample rows for manual inspection.
Tips: if label values are strings (e.g., 'phishing'/'legit'), map them to 1/0 before training.
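A sketch of the core EDA checks, assuming the label column is named label and that string labels use values like 'phishing'/'legit' (adjust the mapping to your data):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Structure and quality checks
print(df.shape)
print(df.dtypes)
print(df.isnull().sum().sort_values(ascending=False).head(10))   # columns with most missing values
print(df.sample(5, random_state=42))                             # representative rows for manual inspection

# Map string labels to 1/0 if necessary (adjust the mapping to match your data)
if df["label"].dtype == object:
    df["label"] = df["label"].map({"phishing": 1, "legit": 0})

# Class balance
print(df["label"].value_counts(normalize=True))
sns.countplot(x="label", data=df)
plt.title("Label distribution (1 = phishing, 0 = safe)")
plt.show()
```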

Step 3 — Preprocessing (pipelines), 30–45 min

Build a ColumnTransformer and Pipeline to impute, encode and scale features reproducibly.

What preprocessing is: converting raw columns into numeric inputs the model can use (handling missing values, encoding categorical flags, scaling numerical features). Doing this inside a Pipeline ensures reproducibility and simplifies deployment.
Why: models expect numeric arrays; inconsistent handling (e.g., different encodings during train/test) causes data leakage or runtime errors.
Checklist: choose an imputation strategy (median for numeric, most frequent for categorical), decide on scaling (StandardScaler helps linear models; tree-based models generally do not need it), and apply one-hot encoding to categorical/binary flags when needed.
Tips: build a dynamic preprocessor that only includes columns present in the dataset (prevents errors if schema differs). Use ColumnTransformer + Pipeline.
Pitfalls: do not fit preprocessors on full data (fit on training only) to avoid leakage.
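A minimal sketch of a dynamic preprocessor, written as a helper so it can be fitted later inside the training Pipeline (Step 6) rather than on the full data. It infers numeric and categorical columns from whatever schema the CSV actually has; drop high-cardinality text columns such as the raw URL string before passing data in, since the engineered features from Step 4 replace it:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(X):
    """Build a ColumnTransformer that only references columns present in X."""
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),    # optional for tree models, useful for linear ones
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])

    return ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ])

# Nothing is fitted here; fitting happens inside the full Pipeline on the training split only.
```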

Step 4 — Feature engineering (URL features), 45–60 min

Extract URL tokens, lengths, special-char flags, entropy, and other domain knowledge features.

Purpose: transform raw URL strings into predictive features (e.g., url_length, num_dashes, has_https_token) that capture signals used by ML models to detect phishing.
Common features: length measures, token counts, presence of suspicious tokens (login, verify), domain age/TLS if available, shortening service flags.
Why: these engineered features often give strong signals that simple models can exploit; feature engineering is frequently the most impactful step in tabular security ML.
Tips: implement extraction as functions that produce a small DataFrame, then merge into the main df; keep feature logic deterministic for reproducibility.
Pitfalls: avoid leaking features derived from future information (timestamps, labels), and remember to sanitize/avoid exposing PII when sharing data.
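A sketch of a URL feature extractor, assuming the raw URL column is named url and using illustrative token/shortener lists; the logic is deterministic, so repeated runs produce identical features:

```python
import math
from collections import Counter
import pandas as pd

SUSPICIOUS_TOKENS = ["login", "verify", "secure", "account", "update"]   # illustrative list
SHORTENERS = ["bit.ly", "tinyurl", "t.co", "goo.gl"]                     # illustrative list

def shannon_entropy(s: str) -> float:
    """Character-level entropy; higher values often indicate random-looking hosts."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values()) if s else 0.0

def extract_url_features(url: str) -> dict:
    url_l = url.lower()
    return {
        "url_length": len(url),
        "num_dashes": url.count("-"),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_https_token": int("https" in url_l),
        "has_suspicious_token": int(any(tok in url_l for tok in SUSPICIOUS_TOKENS)),
        "uses_shortener": int(any(s in url_l for s in SHORTENERS)),
        "entropy": shannon_entropy(url_l),
    }

# Build a small feature DataFrame and merge it back into the main df (assumes a 'url' column)
url_features = pd.DataFrame([extract_url_features(u) for u in df["url"].astype(str)], index=df.index)
df = pd.concat([df, url_features], axis=1)
```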

Step 5 — Train/test split, 5–10 min

Use stratified split to preserve class distribution; fix random_state for reproducibility.

Purpose: create independent train and test sets to measure generalization. Stratify by the label to ensure class ratios remain similar in both sets.
Checklist: set random_state, use stratify=y, choose test_size (0.2 typical) and consider a separate validation fold for tuning.
Tips: if dataset is small, use cross-validation with stratified folds instead of a single split.
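A minimal stratified split, assuming df carries the engineered features from Step 4 and that the raw url column is dropped in favor of those features:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["label", "url"], errors="ignore")   # raw URL strings are replaced by engineered features
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # typical hold-out fraction
    stratify=y,         # preserve the phishing/safe ratio in both sets
    random_state=42,    # fixed seed for reproducibility
)
print(y_train.value_counts(normalize=True), y_test.value_counts(normalize=True), sep="\n")
```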

Step 6 — Train baseline model, 15–30 min

Train a RandomForest baseline using the preprocessing pipeline.

Purpose: get a reliable, interpretable performance baseline. Tree-based models (RandomForest, XGBoost) are strong off-the-shelf choices for tabular features.
Checklist: wrap preprocessor and estimator in a single Pipeline so transforms are applied consistently during predict/score.
Tips: use a smaller n_estimators for fast iteration, then increase it for final runs; fix the random seed for reproducibility.
Pitfalls: training on the full dataset (including test) or re-fitting preprocessors on test data will leak information and produce over-optimistic results.
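A baseline sketch that reuses build_preprocessor from the Step 3 sketch, so the preprocessor and the forest are both fitted on the training split only:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("prep", build_preprocessor(X_train)),
    ("clf", RandomForestClassifier(
        n_estimators=200,    # lower for fast iteration, raise for final runs
        random_state=42,
        n_jobs=-1,
    )),
])

model.fit(X_train, y_train)    # preprocessor and forest are fitted on training data only
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```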

Step 7 — Evaluation & metrics, 15 min

Compute classification report, confusion matrix, and PR-AUC; review FP/FN impact.

Purpose: quantify model performance and understand operational trade-offs. For imbalanced classes, Precision, Recall, F1, and PR-AUC are more informative than accuracy.
Checklist: compute classification_report, confusion_matrix, plot PR curve, and compute PR-AUC.
Operational note: decide acceptable false positive rate vs. missed phishing; tune threshold accordingly and report precision@recall thresholds meaningful to SOC workflows.
Tips: save metrics and confusion matrix images for experiment logs (experiments/run_x/).
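An evaluation sketch covering the checklist above; average_precision_score serves as the PR-AUC summary, and artifacts are saved under experiments/run_x/ as suggested in the tips:

```python
import os
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay, PrecisionRecallDisplay,
    average_precision_score, classification_report, confusion_matrix,
)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=["safe", "phishing"]))
print("PR-AUC:", average_precision_score(y_test, y_proba))

# Save artifacts for the experiment log
os.makedirs("experiments/run_x", exist_ok=True)
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
plt.savefig("experiments/run_x/confusion_matrix.png", dpi=150)

PrecisionRecallDisplay.from_predictions(y_test, y_proba)
plt.show()
```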

Step 8 — Interpretation (SHAP), 30–45 min

Run SHAP to generate summary plots and inspect individual predictions to build analyst trust.

Purpose: explain model predictions at global and local levels. SHAP assigns contribution values to features for each prediction, making it easier for analysts to triage alerts.
Checklist: use TreeExplainer for tree models, compute on a sample to limit memory, create SHAP summary and force plots for key FP/FN examples.
Tips: use SHAP to generate reports for analysts (top contributing features and example explanations).
Pitfalls: SHAP can be slow and memory heavy—sample test rows or use approximate methods if necessary.
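A SHAP sketch that assumes the fitted Pipeline from the Step 6 sketch and reasonably recent scikit-learn/shap versions. The test set is transformed with the already-fitted preprocessor and subsampled to keep memory in check; the phishing-class SHAP values are selected whichever return format your shap version uses:

```python
import scipy.sparse as sp
import shap

prep = model.named_steps["prep"]    # already fitted on the training split
clf = model.named_steps["clf"]

# Explain a sample of the test set to keep memory and runtime manageable
X_sample = X_test.sample(min(200, len(X_test)), random_state=42)
X_sample_t = prep.transform(X_sample)
if sp.issparse(X_sample_t):
    X_sample_t = X_sample_t.toarray()
feature_names = prep.get_feature_names_out()    # requires a recent scikit-learn

explainer = shap.TreeExplainer(clf)             # fast path for tree ensembles
shap_values = explainer.shap_values(X_sample_t)

# Older shap returns a list per class, newer a 3D array; select the phishing class (1) either way
if isinstance(shap_values, list):
    sv = shap_values[1]
elif getattr(shap_values, "ndim", 2) == 3:
    sv = shap_values[:, :, 1]
else:
    sv = shap_values

shap.summary_plot(sv, X_sample_t, feature_names=feature_names)
```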

Steps 9–12 — Tuning, robustness, reproducibility, deployment

  1. Hyperparameter tuning (RandomizedSearchCV / Optuna) — 30–60 min. Use PR-AUC or a cost-sensitive objective aligned with SOC needs; see the tuning sketch after this list.
  2. Robustness & error analysis (FP/FN study, simulated evasion) — 30–60 min. Create perturbation tests (shortened URLs, token swaps).
  3. Reproducibility: environment spec, seeds, data checksums — 15–30 min. Save requirements.txt and metrics in experiments/.
  4. Optional deployment: model export + simple FastAPI wrapper — variable. Return probability and top-k explanation features for triage tools.
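As an illustration of item 1, a RandomizedSearchCV sketch scored by average precision (PR-AUC), reusing the Pipeline from the Step 6 sketch; swap in Optuna if you prefer:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Search ranges are illustrative; parameter names use the 'clf' step prefix from the Pipeline
param_distributions = {
    "clf__n_estimators": randint(100, 500),
    "clf__max_depth": randint(3, 30),
    "clf__min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    scoring="average_precision",    # PR-AUC as the tuning objective
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```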

Resources, expected outputs & troubleshooting

Expected outputs

  • Confusion matrix image (saved as PNG)
  • Classification report with Precision/Recall/F1
  • Feature importance table and SHAP summary plot
  • Saved model artifact (joblib/pickle) and experiments metadata

Common issues & troubleshooting

  • CSV read errors: check encoding and delimiter (use encoding='utf-8', sep=',')
  • Memory/time issues: reduce n_estimators or sample dataset
  • SHAP heavy: run on a sample of test rows
  • Label mismatch: confirm label column name and mapping (1 vs 'phishing')

Reproducibility checklist

  • requirements.txt or environment.yml
  • random_state seeds used for splits and models
  • data provenance & checksums (sha256); see the checksum sketch below
  • recorded metrics and saved artifacts under experiments/
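A minimal sha256 helper for the checksum item above, streaming the file so large CSVs do not need to fit in memory:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks and return the hex digest for provenance records."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of_file("/content/phishing.csv"))   # record this hash in experiments/ metadata
```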