Data Mining Bias in Python
The statistical inflation of strategy performance metrics caused by exhaustive searching through data until spurious patterns are found.
Definition
Data Mining Bias (also called p-hacking, multiple comparisons bias, or the multiple testing problem) is the phenomenon where repeated testing of different hypotheses on the same dataset dramatically inflates the probability of finding spurious results by pure chance. In quantitative trading, it arises when a researcher tests dozens of indicators, asset classes, time periods, and parameter sets on the same historical data. With enough trials, random noise will inevitably produce what appears to be a profitable pattern — but the 'discovery' is a statistical artifact of exhaustive search, not a genuine market inefficiency.
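A minimal simulation sketch of this effect, using only the Python standard library. It generates strategies whose daily returns are pure zero-mean noise and reports the best annualized Sharpe ratio among them. The function names, the count of 200 strategies, the 1% daily volatility, and the seed are illustrative assumptions, not values from any real backtest.

```python
import random
import statistics


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate assumed zero)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * periods_per_year ** 0.5


def simulate_noise_strategies(n_strategies=200, n_days=252, daily_vol=0.01, seed=7):
    """Return the Sharpe ratios of strategies that have no true edge at all."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_strategies):
        # Zero-mean Gaussian noise: any apparent profitability is luck
        returns = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    return sharpes


sharpes = simulate_noise_strategies()
best = max(sharpes)
print(f"strategies tested: {len(sharpes)}")
print(f"median Sharpe:     {statistics.median(sharpes):.2f}")
print(f"best Sharpe:       {best:.2f}")
```

Because every simulated strategy has zero expected return, whatever Sharpe ratio appears at the top of the ranking is produced by the search itself rather than by any market inefficiency.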
Quantitative Formula
$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$
Where $\alpha$ is the significance level for a single test (typically 0.05) and $m$ is the number of independent tests performed. For example, for $m = 20$ tests at $\alpha = 0.05$, the probability of at least one false positive is $1 - 0.95^{20} \approx 0.64$. The Bonferroni correction addresses this by using an adjusted significance threshold of $\alpha / m$, ensuring that the family-wise error rate is controlled at level $\alpha$.
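The arithmetic can be checked with a short sketch. The helper names `family_wise_error_rate` and `bonferroni_threshold` are hypothetical, and the calculation assumes the tests are independent, as the formula does.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


alpha = 0.05
for m in (1, 10, 20, 100):
    uncorrected = family_wise_error_rate(alpha, m)
    # Re-evaluate the same formula with the stricter per-test level alpha / m
    corrected = family_wise_error_rate(bonferroni_threshold(alpha, m), m)
    print(
        f"m={m:>3}  uncorrected FWER={uncorrected:.3f}  "
        f"per-test threshold={bonferroni_threshold(alpha, m):.5f}  "
        f"corrected FWER={corrected:.3f}"
    )
```

The uncorrected rate climbs toward 1 as the number of tests grows, while the Bonferroni-adjusted rate stays just under the chosen level.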
Why It Matters in Backtesting
The quantitative finance literature has a well-documented data mining bias problem: a 2015 paper by Harvey, Liu & Zhu catalogued over 300 published 'factors' and argued that, once multiple testing is accounted for, many of the reported results are likely false discoveries. They propose that a newly claimed factor should clear a t-statistic of about 3.0 rather than the conventional 1.96. A strict Bonferroni correction is more demanding still: a researcher who tests 100 parameter combinations at a 0.05 significance level needs a two-sided t-statistic of roughly 3.5. Any backtesting workflow must maintain a strict log of every hypothesis tested, including those that failed, so that the correct multiple testing adjustment can be calculated.
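One way such a log might look is sketched below. `TrialLog`, its methods, and the strategy names are hypothetical. The hurdle assumes a two-sided test, a normal approximation to the t-distribution, and a Bonferroni adjustment over every recorded trial.

```python
from dataclasses import dataclass, field
from statistics import NormalDist


@dataclass
class TrialLog:
    """Hypothetical in-memory record of every backtest run, including failures."""

    alpha: float = 0.05
    trials: list = field(default_factory=list)

    def record(self, name: str, t_stat: float) -> None:
        """Log one tested hypothesis, whether or not it looked promising."""
        self.trials.append((name, t_stat))

    def required_t_stat(self) -> float:
        """Two-sided Bonferroni hurdle given the number of trials logged so far."""
        m = max(len(self.trials), 1)
        return NormalDist().inv_cdf(1 - (self.alpha / m) / 2)

    def survivors(self) -> list:
        """Names of the trials whose |t| clears the adjusted hurdle."""
        hurdle = self.required_t_stat()
        return [name for name, t_stat in self.trials if abs(t_stat) >= hurdle]


log = TrialLog()
for i in range(98):
    log.record(f"noise_{i}", 0.5)      # unremarkable trials still count toward m
log.record("ma_cross_50_200", 3.2)     # clears 1.96, but not the adjusted hurdle
log.record("momentum_20d", 3.9)

print(f"trials logged:   {len(log.trials)}")
print(f"required t-stat: {log.required_t_stat():.2f}")
print(f"survivors:       {log.survivors()}")
```

Recording the failed trials matters because the hurdle depends on the total count: dropping them from the log would understate the number of tests and lower the bar.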
Python Implementation
import pandas as pd
from scipy import stats


def multiple_testing_correction(results: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """
    Apply Bonferroni and Benjamini-Hochberg corrections to a set of strategy test results.

    results: DataFrame with columns ['strategy_name', 'sharpe_ratio', 'p_value', 'n_trades'].
    Returns a copy sorted by ascending p-value with the correction columns added.
    """
    results = results.copy().sort_values("p_value").reset_index(drop=True)
    m = len(results)
    if m == 0:
        raise ValueError("results must contain at least one tested strategy")

    # Bonferroni correction (conservative): compare every p-value with alpha / m
    results["bonferroni_threshold"] = alpha / m
    results["bonferroni_significant"] = results["p_value"] < results["bonferroni_threshold"]

    # Benjamini-Hochberg correction (FDR control, less conservative):
    # find the largest rank k with p_(k) <= (k / m) * alpha and reject ranks 1..k
    results["bh_threshold"] = (results.index + 1) / m * alpha
    passing = results.index[results["p_value"] <= results["bh_threshold"]]
    bh_max_rank = passing.max() if len(passing) else -1  # -1: nothing is rejected
    results["bh_significant"] = results.index <= bh_max_rank

    # Two-sided t-stat threshold implied by the Bonferroni-adjusted level (normal approximation)
    results["adjusted_t_stat_threshold"] = stats.norm.ppf(1 - (alpha / m) / 2)

    # Probability of at least one false positive across m independent, uncorrected tests
    false_positive_prob = 1 - (1 - alpha) ** m
    results.attrs["family_wise_false_positive_probability"] = false_positive_prob
    results.attrs["n_strategies_tested"] = m
    return results
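The function expects a p-value for each strategy but does not say how to obtain one. A minimal sketch of one option follows, using only the standard library: a two-sided test of whether the mean daily return differs from zero, with a normal approximation to the t-distribution. The helper `strategy_row`, the synthetic noise returns, and the use of the observation count as `n_trades` are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def strategy_row(name, daily_returns, periods_per_year=252):
    """Build one input row: Sharpe ratio and p-value for a mean return of zero."""
    n = len(daily_returns)
    mean, sd = fmean(daily_returns), stdev(daily_returns)
    t_stat = mean / (sd / math.sqrt(n))
    # Two-sided p-value under a normal approximation (reasonable for large n)
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {
        "strategy_name": name,
        "sharpe_ratio": mean / sd * math.sqrt(periods_per_year),
        "p_value": p_value,
        "n_trades": n,  # assumption: one observation per trading day
    }


rng = random.Random(0)
rows = [
    strategy_row(f"noise_{i}", [rng.gauss(0.0, 0.01) for _ in range(504)])
    for i in range(5)
]
for row in rows:
    print(row)
```

A list of rows in this shape could be wrapped with `pd.DataFrame(rows)` and passed to `multiple_testing_correction`.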