<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>David Maguire</title>
<link>https://davidmaguire.ai/research.html</link>
<atom:link href="https://davidmaguire.ai/research.xml" rel="self" type="application/rss+xml"/>
<description>Longer write-ups — maths, runnable code, and honest limitations.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Sat, 20 Jun 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>How Backtest Selection Inflates the Sharpe Ratio</title>
  <dc:creator>David Maguire</dc:creator>
  <link>https://davidmaguire.ai/research/backtest-overfitting/</link>
  <description><![CDATA[ 




<section id="the-problem-in-one-sentence" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-in-one-sentence">The problem in one sentence</h2>
<p>If you search over enough strategies, you are guaranteed to find one with an excellent in-sample Sharpe ratio — <em>even when none of them has any real edge</em> — because you are reporting the maximum of a large number of noisy estimates, and the maximum of noise is not zero.</p>
<p>This is the single most common way a backtest lies. Below I make it concrete with a simulation, derive how fast the problem grows with the number of trials, and summarise the standard correction.</p>
</section>
<section id="a-minimal-simulation" class="level2">
<h2 class="anchored" data-anchor-id="a-minimal-simulation">A minimal simulation</h2>
<p>Take <img src="https://latex.codecogs.com/png.latex?N"> candidate strategies. Give every one of them <strong>zero true edge</strong>: daily returns drawn i.i.d. from <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BN%7D(0,%20%5Csigma%5E2)"> with mean exactly zero. Nothing here can predict anything. Now compute each strategy’s annualised in-sample Sharpe and keep the best — exactly what a naive strategy search does.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"></span>
<span id="cb1-3">RNG <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.default_rng(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb1-4">A, T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">252</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1260</span>          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 252 trading days/yr; T = 5 years of daily data</span></span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ann_sharpe(returns):  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># returns shape (T, N) -&gt; (N,)</span></span>
<span id="cb1-7">    mu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> returns.mean(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb1-8">    sd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> returns.std(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, ddof<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> np.sqrt(A) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> mu <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> sd</span>
<span id="cb1-10"></span>
<span id="cb1-11">N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb1-12">sr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ann_sharpe(RNG.standard_normal((T, N)))   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># every strategy has TRUE Sharpe 0</span></span>
<span id="cb1-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(sr.mean(), sr.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>())   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ~0.01,  1.76</span></span></code></pre></div></div>
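<p>The average strategy sits at a Sharpe of ~0, as it should. But the <strong>best</strong> of the thousand posts an annualised Sharpe of <strong>1.76</strong> over a five-year backtest, the kind of number that gets a strategy funded, purely from sampling noise. This particular seed lands on the high side: averaged over many repeats, the best of 1,000 comes out nearer 1.4, which still looks impressive.</p>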
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://davidmaguire.ai/research/backtest-overfitting/fig-sharpe-distribution.png" class="img-fluid figure-img"></p>
<figcaption>Distribution of annualised in-sample Sharpe across 1,000 strategies with zero true edge. The average is ~0; the cherry-picked maximum (red) is 1.76.</figcaption>
</figure>
</div>
</section>
<section id="why-it-scales-like-2-ln-n" class="level2">
<h2 class="anchored" data-anchor-id="why-it-scales-like-2-ln-n">Why it scales like √(2 ln N)</h2>
<p>There’s a clean way to see how bad this gets. Under the null of zero edge, the estimated Sharpe over <img src="https://latex.codecogs.com/png.latex?T"> i.i.d. observations is approximately normal,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cwidehat%7BSR%7D%20%5C;%5Csim%5C;%20%5Cmathcal%7BN%7D%5C!%5Cleft(0,%5C;%20%5Ctfrac%7B1%7D%7BT%7D%5Cright)%0A%5Cquad%5Ctext%7B(per-period)%7D,%5Cqquad%0A%5Cwidehat%7BSR%7D_%7B%5Ctext%7Bann%7D%7D%20%5C;%5Csim%5C;%20%5Cmathcal%7BN%7D%5C!%5Cleft(0,%5C;%20%5Ctfrac%7BA%7D%7BT%7D%5Cright),%0A"></p>
<p>so the standard error of an annualised Sharpe is <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7BA/T%7D">. For <img src="https://latex.codecogs.com/png.latex?T=1260"> that is <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B252/1260%7D%5Capprox%200.45"> — already a wide band around zero.</p>
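<p>A quick numerical check of that standard error, as a sketch: simulate many zero-edge strategies and compare the spread of their annualised Sharpe estimates with √(A/T). The variable names, the 2,000 strategies and the seed are illustrative assumptions, not taken from the article’s script.</p>

```python
import numpy as np

# Assumptions matching the article's setup: A = 252 periods per year,
# T = 1260 daily observations, zero-mean i.i.d. normal returns.
A, T, N = 252, 1260, 2000
rng = np.random.default_rng(0)

returns = rng.standard_normal((T, N))            # N strategies with zero true edge
sr_ann = np.sqrt(A) * returns.mean(axis=0) / returns.std(axis=0, ddof=1)

empirical_se = sr_ann.std(ddof=1)                # spread of Sharpe estimates across strategies
theoretical_se = np.sqrt(A / T)                  # sqrt(252 / 1260) = sqrt(0.2)

print(f"empirical SE:   {empirical_se:.3f}")
print(f"theoretical SE: {theoretical_se:.3f}")   # 0.447
```

<p>The two numbers should agree to within a few thousandths; the empirical value wobbles with the seed.</p>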
<p>Now the punchline. The expected maximum of <img src="https://latex.codecogs.com/png.latex?N"> i.i.d. standard normals grows like <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%5Cln%20N%7D">. Mapping back to annualised Sharpe units, the best spurious Sharpe from <img src="https://latex.codecogs.com/png.latex?N"> independent trials scales as</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5C!%5Cleft%5B%5Cmax_%7Bi%5Cle%20N%7D%5Cwidehat%7BSR%7D_%7B%5Ctext%7Bann%7D%7D%5E%7B(i)%7D%5Cright%5D%0A%5C;%5Capprox%5C;%20%5Csqrt%7B%5Ctfrac%7BA%7D%7BT%7D%7D%5C;%5Csqrt%7B2%5Cln%20N%7D.%0A"></p>
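<p>It grows without bound in <img src="https://latex.codecogs.com/png.latex?N">: slowly, but relentlessly. The mean best Sharpe, simulated across many draws, follows the shape of this curve, running somewhat below it at practical <img src="https://latex.codecogs.com/png.latex?N"> because <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%5Cln%20N%7D"> is only the leading-order term:</p>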
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://davidmaguire.ai/research/backtest-overfitting/fig-max-sharpe-vs-trials.png" class="img-fluid figure-img"></p>
<figcaption>Mean best-of-N annualised Sharpe from pure noise (blue) versus the √(2 ln N)·√(A/T) approximation (red). More search buys a better-looking strategy, with no edge anywhere.</figcaption>
</figure>
</div>
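<p>The comparison in the figure can be sketched cheaply if, as an assumption, each annualised Sharpe estimate is drawn directly from the normal approximation <em>N</em>(0, A/T) instead of simulating T days of returns per strategy. The helper name <code>mean_best_sharpe</code> and the 400 repetitions are illustrative choices.</p>

```python
import numpy as np

# Sketch under the null above: each annualised Sharpe estimate is drawn directly
# from N(0, A/T) (the normal approximation) instead of simulating daily returns.
A, T = 252, 1260
se = np.sqrt(A / T)
rng = np.random.default_rng(1)

def mean_best_sharpe(n_trials, n_reps=400):
    """Average, over n_reps repeated searches, of the best Sharpe among n_trials."""
    draws = rng.normal(0.0, se, size=(n_reps, n_trials))
    return draws.max(axis=1).mean()

rows = []
for n in (10, 100, 1000, 10000):
    simulated = mean_best_sharpe(n)
    approx = se * np.sqrt(2 * np.log(n))         # leading-order sqrt(2 ln N) curve
    rows.append((n, simulated, approx))
    print(f"N={n:6d}  simulated {simulated:.2f}  approximation {approx:.2f}")
```

<p>At every N shown the simulated column sits below the approximation, and the relative gap narrows as N grows. That is the slow convergence of the √(2 ln N) asymptotic at work.</p>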
<p>Put in a table, the expected best Sharpe from <strong>zero-edge</strong> strategies over five years of daily data is:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: right;">Strategies tested <img src="https://latex.codecogs.com/png.latex?N"></th>
<th style="text-align: center;">Expected best annualised Sharpe</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">10</td>
<td style="text-align: center;">~0.6</td>
</tr>
<tr class="even">
<td style="text-align: right;">100</td>
<td style="text-align: center;">~1.1</td>
</tr>
<tr class="odd">
<td style="text-align: right;">1,000</td>
<td style="text-align: center;">~1.4</td>
</tr>
<tr class="even">
<td style="text-align: right;">10,000</td>
<td style="text-align: center;">~1.7</td>
</tr>
</tbody>
</table>
<p>A grad student trying a few hundred feature combinations is already in Sharpe-1 territory before adding a single unit of real signal.</p>
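<p>The table can also be computed without simulation. This is a sketch that keeps the same <em>N</em>(0, A/T) approximation. The expected maximum of N i.i.d. standard normals is the integral of x · N φ(x) Φ(x)<sup>N−1</sup>, evaluated here as a plain Riemann sum and then scaled by √(A/T). The grid range and spacing are arbitrary choices.</p>

```python
import math
import numpy as np

A, T = 252, 1260
se = math.sqrt(A / T)                            # std. error of an annualised Sharpe

# Integration grid for a standard normal; +-10 covers essentially all the mass.
x = np.linspace(-10.0, 10.0, 40001)
dx = x[1] - x[0]
pdf = np.exp(-0.5 * x**2) / math.sqrt(2 * math.pi)
cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2))) for v in x])

def expected_max_std_normal(n):
    """E[max of n i.i.d. N(0,1)]: integral of x * n * pdf(x) * cdf(x)**(n - 1)."""
    density_of_max = n * pdf * cdf ** (n - 1)
    return float(np.sum(x * density_of_max) * dx)

table = {n: se * expected_max_std_normal(n) for n in (10, 100, 1000, 10000)}
for n, best in table.items():
    print(f"N={n:6d}  expected best annualised Sharpe {best:.2f}")
```

<p>This gives the finite-sample expectation for each row of the table directly, with no Monte Carlo noise.</p>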
</section>
<section id="the-correction-the-deflated-sharpe-ratio" class="level2">
<h2 class="anchored" data-anchor-id="the-correction-the-deflated-sharpe-ratio">The correction: the Deflated Sharpe Ratio</h2>
<p>The fix is not to distrust all backtests — it’s to judge an observed Sharpe <em>against the distribution of the best you’d expect under the null, given how many things you tried.</em> Bailey and López de Prado formalise this as the <strong>Deflated Sharpe Ratio (DSR)</strong>. The key input is the expected maximum Sharpe across <img src="https://latex.codecogs.com/png.latex?N"> trials,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5C!%5Cleft%5B%5Cmax_n%20SR%5Cright%5D%0A%5Capprox%20%5Csqrt%7B%5Coperatorname%7BVar%7D(SR_n)%7D%5C,%0A%5CBig%5B(1-%5Cgamma)%5C,%5CPhi%5E%7B-1%7D%5C!%5Cleft(1-%5Ctfrac1N%5Cright)%0A+%20%5Cgamma%5C,%5CPhi%5E%7B-1%7D%5C!%5Cleft(1-%5Ctfrac1%7BN%20e%7D%5Cright)%5CBig%5D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5Capprox%200.5772"> is the Euler–Mascheroni constant, <img src="https://latex.codecogs.com/png.latex?%5CPhi%5E%7B-1%7D"> is the inverse normal CDF, and <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BVar%7D(SR_n)"> is the variance of Sharpe estimates <em>across the trials you ran</em>. The DSR then asks how far your observed Sharpe sits above this benchmark. It corrects for sample length and for the skew and kurtosis of the strategy’s returns: negative skew and fat tails widen the sampling distribution of the Sharpe estimate, so extreme values are more likely than the normal approximation suggests. My <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%5Cln%20N%7D"> expression above is just the leading-order version of the bracket.</p>
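<p>Both pieces can be sketched in a few lines of Python. The benchmark follows the bracket above. The deflated score then plugs that benchmark SR₀ into the probabilistic Sharpe ratio expression used in the DSR paper: Φ of (observed SR − SR₀)·√(T − 1), divided by √(1 − skew·SR + (kurtosis − 1)/4·SR²). Treat the function names, the per-period units and the plain sample-moment estimators as assumptions of this sketch, not as a reference implementation. Check the paper before relying on it.</p>

```python
import math
from statistics import NormalDist

import numpy as np

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, var_sr):
    """Expected best Sharpe among n_trials zero-edge trials (the bracket above).

    var_sr is the variance of the Sharpe estimates across trials; the result is
    in the same units (per-period or annualised). Needs at least two trials.
    """
    z1 = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(var_sr) * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(returns, n_trials, var_sr_trials):
    """Deflated Sharpe score for one return series, working in per-period units.

    var_sr_trials is the variance of the per-period Sharpe estimates across all
    trials you ran. Returns a probability-like number between 0 and 1.
    """
    r = np.asarray(returns, dtype=float)
    t = r.size
    sr = r.mean() / r.std(ddof=1)                       # observed per-period Sharpe
    z = (r - r.mean()) / r.std()                        # standardised returns
    skew = float(np.mean(z**3))
    kurt = float(np.mean(z**4))                         # non-excess kurtosis: 3 if normal
    sr0 = expected_max_sharpe(n_trials, var_sr_trials)  # best-of-N benchmark
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr**2)
    return STD_NORMAL.cdf((sr - sr0) * math.sqrt(t - 1) / denom)


# Usage: score the in-sample winner among 1,000 pure-noise strategies.
rng = np.random.default_rng(42)
noise = rng.standard_normal((1260, 1000))
trial_srs = noise.mean(axis=0) / noise.std(axis=0, ddof=1)
winner = noise[:, trial_srs.argmax()]
print(deflated_sharpe(winner, 1000, trial_srs.var(ddof=1)))
```

<p>For a cherry-picked winner from pure noise the score lands around the middle of its range rather than near 1. That is the point: once the search is priced in, an impressive in-sample Sharpe is unremarkable.</p>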
<p>In practice the corrections that matter are mundane:</p>
<ul>
<li><strong>Count your trials honestly.</strong> Every feature, parameter, and universe you tried is a trial — including the ones you discarded. The effective <img src="https://latex.codecogs.com/png.latex?N"> is almost always larger than you think.</li>
<li><strong>Hold out data you never touch</strong>, and treat a strategy as unproven until it survives a clean out-of-sample window.</li>
<li><strong>Prefer fewer, hypothesis-driven tests</strong> to brute-force search. Lower <img src="https://latex.codecogs.com/png.latex?N"> is the cheapest variance reduction available.</li>
<li><strong>Report the deflated number</strong>, not the cherry-picked one.</li>
</ul>
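<p>The hold-out advice can be illustrated in the same zero-edge world. This is a sketch under a hypothetical split: five years for selection and a further five years held out. The split, the 500 strategies and the 40 repetitions are illustrative choices. Pick the in-sample winner, then score it only on the held-out window.</p>

```python
import numpy as np

# Hypothetical setup: N zero-edge strategies, T_IN days used for selection and
# T_OUT further days held out and never used to choose the winner.
A, T_IN, T_OUT, N, REPS = 252, 1260, 1260, 500, 40
rng = np.random.default_rng(7)

def ann_sharpe(returns):
    """Annualised Sharpe of each column of a (days, strategies) return array."""
    return np.sqrt(A) * returns.mean(axis=0) / returns.std(axis=0, ddof=1)

in_best, out_best = [], []
for _ in range(REPS):
    returns = rng.standard_normal((T_IN + T_OUT, N))
    sr_in = ann_sharpe(returns[:T_IN])
    winner = sr_in.argmax()                          # selection uses in-sample data only
    sr_out = ann_sharpe(returns[T_IN:, [winner]])[0]  # score on the untouched window
    in_best.append(sr_in[winner])
    out_best.append(sr_out)

print(f"mean in-sample Sharpe of the winner:     {np.mean(in_best):.2f}")
print(f"mean out-of-sample Sharpe of the winner: {np.mean(out_best):.2f}")
```

<p>Because every strategy is pure noise, selection inflates the winner’s in-sample Sharpe, while its held-out Sharpe averages roughly zero. A clean out-of-sample window is designed to expose exactly this gap.</p>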
</section>
<section id="limitations-of-this-demonstration" class="level2">
<h2 class="anchored" data-anchor-id="limitations-of-this-demonstration">Limitations of this demonstration</h2>
<p>This is a deliberately clean caricature, and it departs from reality in two opposing directions. The trials here are <strong>independent</strong>. Real candidate strategies are highly correlated, so the <em>effective</em> number of independent trials is smaller than the raw count, and the honest <img src="https://latex.codecogs.com/png.latex?N"> for the DSR is nearer the number of independent bets. Conversely, real return series have <strong>fat tails and serial dependence</strong>, which the i.i.d. normal draw ignores and which push extreme Sharpes <em>higher</em> than shown. The <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%5Cln%20N%7D"> result is also asymptotic and converges slowly. At practical <img src="https://latex.codecogs.com/png.latex?N"> the finite-sample expected maximum sits noticeably below it: for <img src="https://latex.codecogs.com/png.latex?N=1000">, it is about 3.24 standard errors against <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7B2%5Cln%201000%7D%5Capprox%203.72">. None of this changes the qualitative conclusion; it only shifts where the curve sits.</p>
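<p>The effect of correlated trials is easy to see in a sketch. It assumes a hypothetical one-factor model in which every pair of zero-edge strategies has the same correlation ρ, applied directly to the <em>N</em>(0, A/T) Sharpe estimates rather than to daily returns. Under that assumption, the best of 1,000 shrinks as ρ rises.</p>

```python
import numpy as np

# Assumption: a hypothetical one-factor model in which every pair of zero-edge
# strategies has correlation rho, applied directly to the N(0, A/T) Sharpe
# estimates rather than to daily returns.
A, T, N, REPS = 252, 1260, 1000, 2000
se = np.sqrt(A / T)
rng = np.random.default_rng(3)

def mean_best_sharpe_correlated(rho):
    """Mean, over REPS searches, of the best of N equicorrelated Sharpe estimates."""
    common = rng.standard_normal((REPS, 1))       # shared component, one per search
    idio = rng.standard_normal((REPS, N))         # strategy-specific component
    sr = se * (np.sqrt(rho) * common + np.sqrt(1 - rho) * idio)
    return sr.max(axis=1).mean()

results = {rho: mean_best_sharpe_correlated(rho) for rho in (0.0, 0.5, 0.9)}
for rho, best in results.items():
    print(f"rho={rho:.1f}  mean best-of-{N} Sharpe {best:.2f}")
```

<p>In this model the common factor moves every strategy together, so the search only ranges over the idiosyncratic part. The expected best Sharpe is exactly √(1 − ρ) times the independent case, which gives one intuition for why the effective number of trials is smaller than the raw count.</p>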
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Bailey, D. H., &amp; López de Prado, M. (2014). <em>The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality.</em> Journal of Portfolio Management, 40(5).</li>
<li>Bailey, D. H., Borwein, J., López de Prado, M., &amp; Zhu, Q. J. (2017). <em>The Probability of Backtest Overfitting.</em> Journal of Computational Finance, 20(4).</li>
<li>Harvey, C. R., &amp; Liu, Y. (2015). <em>Backtesting.</em> Journal of Portfolio Management, 42(1).</li>
<li>Harvey, C. R., Liu, Y., &amp; Zhu, H. (2016). <em>… and the Cross-Section of Expected Returns.</em> Review of Financial Studies, 29(1).</li>
</ul>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Full script: <a href="https://github.com/davidolivermaguire-ai/davidmaguire.ai/blob/main/code/backtest_overfitting.py"><code>code/backtest_overfitting.py</code></a>. Reproduce the figures with <code>python code/backtest_overfitting.py</code>.</p>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>backtesting</category>
  <category>statistics</category>
  <category>overfitting</category>
  <guid>https://davidmaguire.ai/research/backtest-overfitting/</guid>
  <pubDate>Sat, 20 Jun 2026 00:00:00 GMT</pubDate>
  <media:content url="https://davidmaguire.ai/research/backtest-overfitting/fig-max-sharpe-vs-trials.png" medium="image" type="image/png" height="88" width="144"/>
</item>
</channel>
</rss>
