Phishing + AI — Step-by-step Lab
Follow this guided lab to take a phishing dataset through preprocessing, feature engineering, baseline training (RandomForest), evaluation, and SHAP-based explanation. It includes copyable code snippets and a downloadable sample CSV, so you can run everything in Colab or locally.
Quick start
- Open the Colab notebook or download the sample CSV.
- If you have a full dataset (Kaggle/UCI/OpenPhish/PhishTank), upload it to Colab at /content/phishing.csv.
- Run cells from top to bottom; the notebook contains fallback logic to use the sample CSV if no dataset is provided.
Step-by-step workflow
Step 1 — Dataset acquisition — 10–20 min
Get the CSV you will use for the lab. Use the sample provided below if you don't have access to the full Kaggle dataset.
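A minimal loading sketch with the fallback described in the quick start. The paths and the `url`/`label` column names are assumptions; adjust them to your dataset's actual schema.

```python
import os

import pandas as pd

# Hypothetical paths: /content/phishing.csv is where Colab uploads land in this lab.
DATASET_PATH = "/content/phishing.csv"
SAMPLE_PATH = "sample_phishing.csv"

# Fallback: write a tiny sample CSV so the rest of the lab can run without the full dataset.
sample = pd.DataFrame({
    "url": [
        "http://secure-login.paypa1.com/verify",
        "https://www.wikipedia.org/wiki/Main_Page",
        "http://192.168.0.1/account/update",
        "https://github.com/scikit-learn/scikit-learn",
    ],
    "label": [1, 0, 1, 0],  # assumed mapping: 1 = phishing, 0 = benign
})

if os.path.exists(DATASET_PATH):
    df = pd.read_csv(DATASET_PATH)
else:
    sample.to_csv(SAMPLE_PATH, index=False)
    df = pd.read_csv(SAMPLE_PATH)

print(df.shape)
```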
Step 2 — Exploratory Data Analysis (EDA) — 20–30 min
Inspect structure, missing values, and class balance.
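A quick EDA pass covering the three checks above, shown on a toy frame with the assumed `url`/`label` schema; run the same calls on your loaded dataset.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (note the deliberately missing URL).
df = pd.DataFrame({
    "url": ["http://a.evil.test/x", "https://example.org", None, "http://b.evil.test/y"],
    "label": [1, 0, 0, 1],
})

df.info()                                        # column types and non-null counts
print(df.isna().sum())                           # missing values per column
print(df["label"].value_counts(normalize=True))  # class balance as proportions
```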
Step 3 — Preprocessing (pipelines) — 30–45 min
Build a ColumnTransformer and Pipeline to impute, encode and scale features reproducibly.
Step 4 — Feature engineering (URL features) — 45–60 min
Extract URL tokens, lengths, special-char flags, entropy, and other domain knowledge features.
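A minimal feature extractor covering lengths, special-character flags, an IP-host check, and character-level Shannon entropy. The feature set and names are illustrative, not the lab's definitive list.

```python
import math
import re
from urllib.parse import urlparse

def shannon_entropy(s: str) -> float:
    # Character-level Shannon entropy in bits; high values suggest random-looking strings.
    if not s:
        return 0.0
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at": int("@" in url),  # '@' can hide the real host in a URL
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "entropy": round(shannon_entropy(url), 3),
    }

feats = url_features("http://192.168.0.1/login@verify?id=77")
print(feats)
```

Applying `url_features` to every row (e.g. `df["url"].apply(url_features)`) yields a frame of numeric features that slots into the Step 3 pipeline.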
Step 5 — Train/test split — 5–10 min
Use stratified split to preserve class distribution; fix random_state for reproducibility.
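The split in one call, demonstrated on synthetic data with an imbalanced label (assumed 1 = phishing) so the effect of `stratify` is visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a 20% positive rate standing in for the real dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = np.array([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stratification preserves the 20% positive rate in both splits.
print(y_train.mean(), y_test.mean())
```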
Step 6 — Train baseline model — 15–30 min
Train a RandomForest baseline using the preprocessing pipeline.
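A baseline sketch on synthetic data. A simple scaler stands in for the full Step 3 preprocessor here; in the lab you would put your `ColumnTransformer` in its place so preprocessing and model fit as one unit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on feature 0, so there is signal to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = Pipeline([
    ("prep", StandardScaler()),  # stand-in for the Step 3 ColumnTransformer
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(X, y)
print(model.score(X, y))  # training accuracy (optimistic; evaluate on the test split)
```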
Step 7 — Evaluation & metrics — 15 min
Compute classification report, confusion matrix, and PR-AUC; review FP/FN impact.
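The three metrics from the step above, shown on toy predictions standing in for model output on the test set (`y_pred` from `model.predict`, `y_score` from `model.predict_proba(...)[:, 1]`).

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
)

# Toy test-set labels, hard predictions, and phishing probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.6, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3])

print(classification_report(y_true, y_pred, target_names=["benign", "phishing"]))
print(confusion_matrix(y_true, y_pred))  # rows = truth, cols = prediction
print("PR-AUC:", average_precision_score(y_true, y_score))
```

PR-AUC is the headline number here because with imbalanced phishing data it tracks the FP/FN trade-off better than accuracy or ROC-AUC.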
Step 8 — Interpretation (SHAP) — 30–45 min
Run SHAP to generate summary plots and inspect individual predictions to build analyst trust.
Steps 9–12 — Tuning, robustness, reproducibility, deployment
- Hyperparameter tuning (RandomizedSearchCV / Optuna) — 30–60 min. Use PR-AUC or a cost-sensitive objective aligned with SOC needs.
- Robustness & error analysis (FP/FN study, simulated evasion) — 30–60 min. Create perturbation tests (shortened URLs, token swaps).
- Reproducibility: environment spec, seeds, data checksums — 15–30 min. Save requirements.txt and metrics in experiments/.
- Optional deployment: model export + simple FastAPI wrapper — variable. Return probability and top-k explanation features for triage tools.
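For Step 9, a minimal tuning sketch with `RandomizedSearchCV` on synthetic data, using PR-AUC (`scoring="average_precision"`) as the objective as suggested above; the parameter ranges are illustrative, not tuned recommendations.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 201),
        "max_depth": randint(3, 15),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=5,                        # raise for a real search
    scoring="average_precision",     # PR-AUC, aligned with SOC triage costs
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```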
Resources, expected outputs & troubleshooting
Expected outputs
- Confusion matrix image (saved as PNG)
- Classification report with Precision/Recall/F1
- Feature importance table and SHAP summary plot
- Saved model artifact (joblib/pickle) and experiments metadata
Common issues & troubleshooting
- CSV read errors: check encoding and delimiter (use encoding='utf-8', sep=',')
- Memory/time issues: reduce n_estimators or sample dataset
- SHAP runs slowly: compute explanations on a sample of test rows
- Label mismatch: confirm label column name and mapping (1 vs 'phishing')
Reproducibility checklist
- requirements.txt or environment.yml
- random_state seeds used for splits and models
- data provenance & checksums (sha256)
- recorded metrics and saved artifacts under experiments/
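For the checksum item, a small helper that computes a streaming sha256 of a data file; the file name here is illustrative, and in the lab you would record the digest under experiments/ alongside your metrics.

```python
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large datasets need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative file; in the lab, checksum your real CSV and save the digest.
Path("sample.csv").write_text("url,label\nhttp://a.test,1\n")
digest = sha256_of("sample.csv")
print(digest)
```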