Data Mining Bias in Python

The statistical inflation of strategy performance metrics caused by exhaustive searching through data until spurious patterns are found.

Definition

Data Mining Bias (also called p-hacking, multiple comparisons bias, or the multiple testing problem) is the phenomenon where repeated testing of different hypotheses on the same dataset dramatically inflates the probability of finding spurious results by pure chance. In quantitative trading, it arises when a researcher tests dozens of indicators, asset classes, time periods, and parameter sets on the same historical data. With enough trials, random noise will inevitably produce what appears to be a profitable pattern — but the 'discovery' is a statistical artifact of exhaustive search, not a genuine market inefficiency.
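
The effect described above can be reproduced with pure noise. The following is a minimal sketch, not a definitive test: it assumes 200 hypothetical "strategies" whose daily returns are independent Gaussian draws with zero mean, so none of them has any real edge. The function name `noise_t_stats`, the 1% daily volatility, and the seed are illustrative choices that do not come from the text.

```python
import math
import random
import statistics

def noise_t_stats(n_strategies=200, n_days=252, seed=0):
    """t-statistics of the mean daily return for strategies that are pure noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    t_stats = []
    for _ in range(n_strategies):
        # Zero-mean returns: by construction there is no edge to discover
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        mean = statistics.fmean(returns)
        stdev = statistics.stdev(returns)
        t_stats.append(mean / (stdev / math.sqrt(n_days)))
    return t_stats

t_stats = noise_t_stats()
n_significant = sum(abs(t) > 1.96 for t in t_stats)
best = max(t_stats, key=abs)
print(f"{n_significant} of {len(t_stats)} noise strategies pass |t| > 1.96")
print(f"most extreme t-statistic: {best:.2f}")
```

At a 5% significance level, roughly one in twenty of these noise strategies is expected to look "significant", and the best of them would make a convincing backtest if the other trials went unreported.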

Quantitative Formula

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

Where $\alpha$ is the significance level for a single test (typically 0.05) and $m$ is the number of independent tests performed. For $m = 20$ tests at $\alpha = 0.05$, the probability of at least one false positive is $1 - 0.95^{20} \approx 64\%$. The Bonferroni correction addresses this by using an adjusted significance threshold of $\alpha^* = \alpha / m$, ensuring that the family-wise error rate is controlled at level $\alpha$.
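
The arithmetic above can be checked in a few lines. This is a small sketch under the same independence assumption as the formula; the helper name `family_wise_error_rate` is illustrative and not part of any library.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

alpha = 0.05
for m in (1, 5, 20, 100):
    uncorrected = family_wise_error_rate(alpha, m)
    # Bonferroni: run each individual test at alpha / m instead of alpha
    corrected = family_wise_error_rate(alpha / m, m)
    print(f"m={m:>3}  uncorrected FWER={uncorrected:.3f}  with Bonferroni={corrected:.3f}")
```

For every $m$, the Bonferroni-adjusted family-wise error rate stays at or just below $\alpha$, while the uncorrected rate climbs quickly toward 1.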

Why It Matters in Backtesting

The quantitative finance literature has a well-documented data mining bias problem. A 2015 paper by Harvey, Liu & Zhu catalogued over 300 published 'factors' and argued that, once multiple testing is accounted for, most of the claimed findings are likely false; the authors recommend that a newly proposed factor clear a t-statistic of about 3.0 rather than the conventional 1.96. A strict Bonferroni correction is even more demanding: for a researcher who tests 100 parameter combinations at $\alpha = 0.05$, the two-sided critical value is roughly 3.5. Any backtesting workflow should therefore keep a strict log of every hypothesis tested, including those that failed, because the number of trials $m$ determines the correct multiple testing adjustment.
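
To see how the required hurdle grows with the number of trials, the sketch below computes the two-sided Bonferroni critical value under a normal approximation to the t-statistic. The helper name `bonferroni_critical_value` is hypothetical; `statistics.NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

def bonferroni_critical_value(m: int, alpha: float = 0.05) -> float:
    """Two-sided critical z-value when each of m tests is run at level alpha / m."""
    return NormalDist().inv_cdf(1 - (alpha / m) / 2)

for m in (1, 10, 100, 1000):
    print(f"m={m:>4}  required |t| ~ {bonferroni_critical_value(m):.2f}")
```

With a single test this gives the familiar 1.96; with 100 logged trials it is about 3.48. The count $m$ should include every variant that was tried, not just the ones that were kept.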

Python Implementation

    import pandas as pd
    from scipy import stats

    def multiple_testing_correction(results: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
        """
        Applies Bonferroni and Benjamini-Hochberg corrections to a set of strategy test results.
        results: DataFrame with columns ['strategy_name', 'sharpe_ratio', 'p_value', 'n_trades'].
        Returns a copy sorted by ascending p-value with the correction columns added.
        """
        # Sort ascending so that the 0-based index equals (rank - 1) for Benjamini-Hochberg
        results = results.copy().sort_values("p_value").reset_index(drop=True)
        m = len(results)
        # Bonferroni correction (conservative): one threshold, alpha / m, for every test
        results["bonferroni_threshold"] = alpha / m
        results["bonferroni_significant"] = results["p_value"] <= results["bonferroni_threshold"]
        # Benjamini-Hochberg correction (FDR control, less conservative): threshold (rank / m) * alpha
        results["bh_threshold"] = (results.index + 1) / m * alpha
        # Step-up rule: find the largest rank whose p-value passes; every smaller rank is significant too
        passing = results.index[results["p_value"] <= results["bh_threshold"]]
        results["bh_significant"] = results.index <= passing.max() if len(passing) > 0 else False
        # Two-sided critical value at the Bonferroni-adjusted level (normal approximation to the t-stat)
        results["adjusted_t_stat_threshold"] = stats.norm.ppf(1 - (alpha / m) / 2)
        # Chance of at least one false positive at the unadjusted alpha, assuming independent tests
        false_positive_prob = 1 - (1 - alpha) ** m
        results.attrs["family_wise_false_positive_probability"] = false_positive_prob
        results.attrs["n_strategies_tested"] = m
        return results
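
The step-up rule in the Benjamini-Hochberg part is easy to misread, so here is a dependency-free sketch of the same two corrections on a toy list of p-values. The p-values are made up for illustration, and the function names `benjamini_hochberg` and `bonferroni` are hypothetical helpers rather than library calls.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans in input order: True where BH rejects the null."""
    m = len(p_values)
    # Rank the p-values ascending while remembering their original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * alpha; stays 0 if none passes
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            max_rank = rank
    # Step-up: reject every hypothesis ranked at or below max_rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= max_rank
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Return booleans in input order: True where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values for five strategy variants
p_values = [0.028, 0.5, 0.001, 0.035, 0.025]
bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)
print("BH rejects:        ", bh)
print("Bonferroni rejects:", bonf)
```

In this toy case Bonferroni keeps only the strategy with p = 0.001, while Benjamini-Hochberg keeps four. The p-value 0.025 fails its own rank threshold of 0.02 but is still rejected because a higher-ranked p-value passes, which is what "step-up" means. The trade-off is that Benjamini-Hochberg controls the expected share of false discoveries rather than the chance of any false discovery at all.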
