OneSecurity — Phishing + AI Lab
End-to-end hands-on workflow: dataset → model → explainability

Phishing + AI — Step-by-step Lab

Follow this guided lab to take a phishing dataset through preprocessing, feature engineering, baseline training (RandomForest), evaluation, and SHAP-based explanation. Includes copyable code snippets and a downloadable sample CSV so you can run everything in Colab or locally.

Estimated time: 90–120 minutes (depends on dataset size)
Requirements: Python (Colab recommended), pandas, scikit-learn, shap, matplotlib, seaborn

Quick start

  1. Open the Colab notebook or download the sample CSV.
  2. If you have a full dataset (Kaggle/UCI/OpenPhish/PhishTank), upload it to Colab at /content/phishing.csv.
  3. Run cells from top to bottom; the notebook contains fallback logic to use the sample CSV if no dataset is provided.

Step-by-step workflow

Step 1 — Dataset acquisition, 10–20 min

Get the CSV you will use for the lab. Use the sample provided below if you don't have access to the full Kaggle dataset.

Purpose: obtain a dataset you will analyze and model. You should confirm the file format (CSV), encoding (UTF-8 preferred), and the presence of a usable label column (recommended name: label, values 1=phishing, 0=safe).
Checklist: dataset file (CSV), label column, URL column if available, and note column names.

Tips: if you have the Kaggle dataset, use the Kaggle API in Colab; for quick practice use the provided sample CSV. Do not open raw URLs from the data on an unprotected machine.
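A minimal sketch of the loading cell with the Quick-start fallback. It assumes the full dataset, if uploaded, lives at /content/phishing.csv, and that the sample file is named phishing_sample.csv (a placeholder name; adjust it to match the file you downloaded):

```python
import os
import pandas as pd

FULL_PATH = "/content/phishing.csv"     # path used when a full dataset is uploaded to Colab
SAMPLE_PATH = "phishing_sample.csv"     # placeholder name for the downloadable sample CSV

# Fall back to the sample CSV if no full dataset was uploaded
path = FULL_PATH if os.path.exists(FULL_PATH) else SAMPLE_PATH
df = pd.read_csv(path, encoding="utf-8")

print(f"Loaded {path}: {df.shape[0]} rows x {df.shape[1]} columns")
df.head()   # displays the first rows when run in a notebook cell
```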

Step 2 — Exploratory Data Analysis (EDA), 20–30 min

Inspect structure, missing values, and class balance.

Purpose: understand data quality and distribution before modeling. EDA tells you whether the dataset needs cleaning, whether labels are imbalanced, and highlights suspicious or missing values.
Checklist: shape, columns list, label counts, null counts, sample rows, unique values in key categorical columns.
Typical outputs: a table of missing counts, a bar chart of label distribution, and representative sample rows for manual inspection.
Tips: if label values are strings (e.g., 'phishing'/'legit'), map them to 1/0 before training.
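A sketch of the core EDA checks, assuming the label column is named label and that string labels use values like 'phishing'/'legit' (adjust the mapping to your data):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Structure and quality checks
print(df.shape)
print(df.dtypes)
print(df.isnull().sum().sort_values(ascending=False).head(10))   # columns with most missing values
print(df.sample(5, random_state=42))                             # representative rows for manual inspection

# Map string labels to 1/0 if necessary (adjust the mapping to match your data)
if df["label"].dtype == object:
    df["label"] = df["label"].map({"phishing": 1, "legit": 0})

# Class balance
print(df["label"].value_counts(normalize=True))
sns.countplot(x="label", data=df)
plt.title("Label distribution (1 = phishing, 0 = safe)")
plt.show()
```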

Step 3 — Preprocessing (pipelines), 30–45 min

Build a ColumnTransformer and Pipeline to impute, encode and scale features reproducibly.

What preprocessing is: converting raw columns into numeric inputs the model can use (handling missing values, encoding categorical flags, scaling numerical features). Doing this inside a Pipeline ensures reproducibility and simplifies deployment.
Why: models expect numeric arrays; inconsistent handling (e.g., different encodings during train/test) causes data leakage or runtime errors.
Checklist: choose an imputation strategy (median for numeric, most frequent for categorical), decide on scaling (StandardScaler helps linear models; tree-based models generally do not need it), and apply one-hot encoding to categorical/binary flags when needed.
Tips: build a dynamic preprocessor that only includes columns present in the dataset (prevents errors if schema differs). Use ColumnTransformer + Pipeline.
Pitfalls: do not fit preprocessors on full data (fit on training only) to avoid leakage.
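A minimal sketch of a dynamic preprocessor, written as a helper so it can be fitted later inside the training Pipeline (Step 6) rather than on the full data. It infers numeric and categorical columns from whatever schema the CSV actually has; drop high-cardinality text columns such as the raw URL string before passing data in, since the engineered features from Step 4 replace it:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(X):
    """Build a ColumnTransformer that only references columns present in X."""
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),    # optional for tree models, useful for linear ones
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])

    return ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ])

# Nothing is fitted here; fitting happens inside the full Pipeline on the training split only.
```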

Step 4 — Feature engineering (URL features), 45–60 min

Extract URL tokens, lengths, special-char flags, entropy, and other domain knowledge features.

Purpose: transform raw URL strings into predictive features (e.g., url_length, num_dashes, has_https_token) that capture signals used by ML models to detect phishing.
Common features: length measures, token counts, presence of suspicious tokens (login, verify), domain age/TLS if available, shortening service flags.
Why: these engineered features often give strong signals that simple models can exploit; feature engineering is frequently the most impactful step in tabular security ML.
Tips: implement extraction as functions that produce a small DataFrame, then merge into the main df; keep feature logic deterministic for reproducibility.
Pitfalls: avoid leaking features derived from future information (timestamps, labels), and remember to sanitize/avoid exposing PII when sharing data.
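A sketch of a URL feature extractor, assuming the raw URL column is named url and using illustrative token/shortener lists; the logic is deterministic, so repeated runs produce identical features:

```python
import math
from collections import Counter
import pandas as pd

SUSPICIOUS_TOKENS = ["login", "verify", "secure", "account", "update"]   # illustrative list
SHORTENERS = ["bit.ly", "tinyurl", "t.co", "goo.gl"]                     # illustrative list

def shannon_entropy(s: str) -> float:
    """Character-level entropy; higher values often indicate random-looking hosts."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values()) if s else 0.0

def extract_url_features(url: str) -> dict:
    url_l = url.lower()
    return {
        "url_length": len(url),
        "num_dashes": url.count("-"),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_https_token": int("https" in url_l),
        "has_suspicious_token": int(any(tok in url_l for tok in SUSPICIOUS_TOKENS)),
        "uses_shortener": int(any(s in url_l for s in SHORTENERS)),
        "entropy": shannon_entropy(url_l),
    }

# Build a small feature DataFrame and merge it back into the main df (assumes a 'url' column)
url_features = pd.DataFrame([extract_url_features(u) for u in df["url"].astype(str)], index=df.index)
df = pd.concat([df, url_features], axis=1)
```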

Step 5 — Train/test split, 5–10 min

Use stratified split to preserve class distribution; fix random_state for reproducibility.

Purpose: create independent train and test sets to measure generalization. Stratify by the label to ensure class ratios remain similar in both sets.
Checklist: set random_state, use stratify=y, choose test_size (0.2 typical) and consider a separate validation fold for tuning.
Tips: if dataset is small, use cross-validation with stratified folds instead of a single split.
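A minimal stratified split, assuming df carries the engineered features from Step 4 and that the raw url column is dropped in favor of those features:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["label", "url"], errors="ignore")   # raw URL strings are replaced by engineered features
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # typical hold-out fraction
    stratify=y,         # preserve the phishing/safe ratio in both sets
    random_state=42,    # fixed seed for reproducibility
)
print(y_train.value_counts(normalize=True), y_test.value_counts(normalize=True), sep="\n")
```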

Step 6 — Train baseline model, 15–30 min

Train a RandomForest baseline using the preprocessing pipeline.

Purpose: get a reliable, interpretable performance baseline. Tree-based models (RandomForest, XGBoost) are strong off-the-shelf choices for tabular features.
Checklist: wrap preprocessor and estimator in a single Pipeline so transforms are applied consistently during predict/score.
Tips: use a smaller n_estimators for fast iteration, then increase it for final runs; fix the random seed for reproducibility.
Pitfalls: training on the full dataset (including test) or re-fitting preprocessors on test data will leak information and produce over-optimistic results.
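A baseline sketch that reuses build_preprocessor from the Step 3 sketch, so the preprocessor and the forest are both fitted on the training split only:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("prep", build_preprocessor(X_train)),
    ("clf", RandomForestClassifier(
        n_estimators=200,    # lower for fast iteration, raise for final runs
        random_state=42,
        n_jobs=-1,
    )),
])

model.fit(X_train, y_train)    # preprocessor and forest are fitted on training data only
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```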

Step 7 — Evaluation & metrics, 15 min

Compute classification report, confusion matrix, and PR-AUC; review FP/FN impact.

Purpose: quantify model performance and understand operational trade-offs. For imbalanced classes, Precision, Recall, F1, and PR-AUC are more informative than accuracy.
Checklist: compute classification_report, confusion_matrix, plot PR curve, and compute PR-AUC.
Operational note: decide acceptable false positive rate vs. missed phishing; tune threshold accordingly and report precision@recall thresholds meaningful to SOC workflows.
Tips: save metrics and confusion matrix images for experiment logs (experiments/run_x/).
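An evaluation sketch covering the checklist above; average_precision_score serves as the PR-AUC summary, and artifacts are saved under experiments/run_x/ as suggested in the tips:

```python
import os
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay, PrecisionRecallDisplay,
    average_precision_score, classification_report, confusion_matrix,
)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=["safe", "phishing"]))
print("PR-AUC:", average_precision_score(y_test, y_proba))

# Save artifacts for the experiment log
os.makedirs("experiments/run_x", exist_ok=True)
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
plt.savefig("experiments/run_x/confusion_matrix.png", dpi=150)

PrecisionRecallDisplay.from_predictions(y_test, y_proba)
plt.show()
```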

Step 8 — Interpretation (SHAP), 30–45 min

Run SHAP to generate summary plots and inspect individual predictions to build analyst trust.

Purpose: explain model predictions at global and local levels. SHAP assigns contribution values to features for each prediction, making it easier for analysts to triage alerts.
Checklist: use TreeExplainer for tree models, compute on a sample to limit memory, create SHAP summary and force plots for key FP/FN examples.
Tips: use SHAP to generate reports for analysts (top contributing features and example explanations).
Pitfalls: SHAP can be slow and memory heavy—sample test rows or use approximate methods if necessary.
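A SHAP sketch that assumes the fitted Pipeline from the Step 6 sketch and reasonably recent scikit-learn/shap versions. The test set is transformed with the already-fitted preprocessor and subsampled to keep memory in check; the phishing-class SHAP values are selected whichever return format your shap version uses:

```python
import scipy.sparse as sp
import shap

prep = model.named_steps["prep"]    # already fitted on the training split
clf = model.named_steps["clf"]

# Explain a sample of the test set to keep memory and runtime manageable
X_sample = X_test.sample(min(200, len(X_test)), random_state=42)
X_sample_t = prep.transform(X_sample)
if sp.issparse(X_sample_t):
    X_sample_t = X_sample_t.toarray()
feature_names = prep.get_feature_names_out()    # requires a recent scikit-learn

explainer = shap.TreeExplainer(clf)             # fast path for tree ensembles
shap_values = explainer.shap_values(X_sample_t)

# Older shap returns a list per class, newer a 3D array; select the phishing class (1) either way
if isinstance(shap_values, list):
    sv = shap_values[1]
elif getattr(shap_values, "ndim", 2) == 3:
    sv = shap_values[:, :, 1]
else:
    sv = shap_values

shap.summary_plot(sv, X_sample_t, feature_names=feature_names)
```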

Steps 9–12 — Tuning, robustness, reproducibility, deployment

  1. Hyperparameter tuning (RandomizedSearchCV / Optuna) — 30–60 min. Use PR-AUC or a cost-sensitive objective aligned with SOC needs; see the tuning sketch after this list.
  2. Robustness & error analysis (FP/FN study, simulated evasion) — 30–60 min. Create perturbation tests (shortened URLs, token swaps).
  3. Reproducibility: environment spec, seeds, data checksums — 15–30 min. Save requirements.txt and metrics in experiments/.
  4. Optional deployment: model export + simple FastAPI wrapper — variable. Return probability and top-k explanation features for triage tools.
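As an illustration of item 1, a RandomizedSearchCV sketch scored by average precision (PR-AUC), reusing the Pipeline from the Step 6 sketch; swap in Optuna if you prefer:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Search ranges are illustrative; parameter names use the 'clf' step prefix from the Pipeline
param_distributions = {
    "clf__n_estimators": randint(100, 500),
    "clf__max_depth": randint(3, 30),
    "clf__min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    scoring="average_precision",    # PR-AUC as the tuning objective
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```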

Resources, expected outputs & troubleshooting

Expected outputs

  • Confusion matrix image (saved as PNG)
  • Classification report with Precision/Recall/F1
  • Feature importance table and SHAP summary plot
  • Saved model artifact (joblib/pickle) and experiments metadata

Common issues & troubleshooting

  • CSV read errors: check encoding and delimiter (use encoding='utf-8', sep=',')
  • Memory/time issues: reduce n_estimators or sample dataset
  • SHAP heavy: run on a sample of test rows
  • Label mismatch: confirm label column name and mapping (1 vs 'phishing')

Reproducibility checklist

  • requirements.txt or environment.yml
  • random_state seeds used for splits and models
  • data provenance & checksums (sha256); see the checksum sketch below
  • recorded metrics and saved artifacts under experiments/
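A minimal sha256 helper for the checksum item above, streaming the file so large CSVs do not need to fit in memory:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks and return the hex digest for provenance records."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of_file("/content/phishing.csv"))   # record this hash in experiments/ metadata
```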