If a trading strategy seems to backtest successfully, why doesn’t it always work in live trading?

It’s widely acknowledged that a strategy that worked in the past may not work in the future. Market conditions change, other participants change their algorithms, adapt to your attempts to pick them off, and so on. This means you need to continually monitor and adjust even profitable strategies.

But there’s something even more problematic about backtesting strategies, which fewer people understand clearly. This is that a profitable backtest does not prove that a strategy “worked”, even in the past. This is because most backtests do not achieve any kind of “statistical significance”.

As everyone knows, it’s trivial to tailor a strategy that works beautifully on any given piece of historical data. It’s easy to contrive a strategy that fits the idiosyncratic features of a particular historical dataset, and then show that it is profitable when backtested. But when no mechanism actually exists relating the signal to future movement, the strategy will fail in live testing.

So how does one tell the difference? How can one show that a backtest is not only profitable, but statistically significant?

#### Statistical hypothesis testing in trading

If you’ve studied some basic statistics, you’ve probably heard of hypothesis testing.

In hypothesis testing, it’s not enough for a model to fit the data. It’s got to fit the data in a way that is “statistically significant”. This means that it’s unlikely that the model would fit the data to the extent that it does, by chance or for any other reason than that the model really is valid. The only way for the model not to be valid is to invoke an “unlikely coincidence”.

One proposes some hypothesis about the data, and then considers the probability (called the p-value) that the apparent agreement between the data and the hypothesis occurred by chance. By convention, if the p-value is less than 5%, the hypothesis is considered statistically significant.

It’s worthwhile to place backtesting within this framework of hypothesis testing to help understand what, if anything, we can really conclude from a given backtest.

#### Coin toss trading

Let’s keep it simple to start with. Let’s suppose we have an algorithm which predicts, at time steps \(t_1,…,t_n\), whether the asset will subsequently move up (change \(\geq 0\)) or down (change \(<0\)) over some time interval \(\Delta T\). We then run a backtest and find that our algorithm was right some fraction \(0 \leq x \leq 1\) of the time.

If our algorithm was right more than half of the time during the backtest, what’s the probability that our algorithm was right only by chance? This is calculated using the binomial distribution. To see some numbers, let’s suppose our algorithm makes 20 predictions (\(n=20\)) and is right for 12 of them. The probability of this happening entirely by chance is about 25%. If it’s right for 14 of them, the probability of this happening by chance is about 5.8%. This is approaching statistical significance according to convention. The idea is that it’s “unlikely” that our strategy is right by chance, therefore the mechanism proposed by the strategy is likely correct. So if our algorithm got 15 or more correct during the backtest, we’re in the money, right? Not so fast.

To take an extreme example, let’s suppose that our piece of historical data was a spectacular bitcoin bull run that went up 20 times in a row. And let’s suppose that our strategy is “Bitcoin only goes up!” Then our calculation above would prove that the strategy works with a statistical significance of 0.0001%! What’s gone wrong here?

When calculating the p-value for a linear regression, standard statistics usually assumes that the “noise” in the data is random and normally distributed. One mistake we have made in the above analysis is assuming that the actual price trajectory is like a coin toss – equally likely to go up or down. But market movements are not random. They can, for example, be highly autocorrelated. And they can go up in a highly non-random way for quite some time, before turning around and going down.

Secondly, we presumably looked at the data before deciding on the strategy. If you’re allowed to look at the data first, it’s easy to contrive a strategy that exactly matches what the data happened to do. In this case, it’s not “unlikely” that our strategy is profitable by mere coincidence, because we simply chose the strategy that we could see matched the data.

Another thing that can destroy statistical validity is testing multiple models. Suppose a given model has a p value of 0.05, that is, it has only a 5% chance of appearing correct by chance. But now suppose you test 20 different models. Suddenly it’s quite likely that one of them will backtest successfully by chance alone. This sort of scenario can easily arise when one tests their strategy for many different choices of parameter, and chooses the one that works. This is why strategy “optimization” needs to be done carefully.

#### So how do you backtest successfully?

In practice, we wouldn’t be checking whether the asset goes up or down. Instead, we’d likely check, across all pairs of buy and sell decisions, whether the sellprice minus the buyprice amounted to a profit greater than buy and hold. We would then ask, what is the probability that this apparent fit occurred by chance, and the strategy doesn’t really work? If it seems unlikely that the observed fit could be a coincidence, we may be onto a winner.

On the other hand, a trader may have some external or pre-existing reason for believing that a strategy could work. In this case, he/she may not require the same degree of statistical significant. This is analogous to Bayesian statistics where one incorporates a prior belief into their statistical analysis.

Now, HFT (high frequency trading) backtests can often achieve statistical significant much more easily because of the large amount of data and the large number of buy/sell decisions in a short space of time. More pedestrian strategies will have a harder time.

#### So does machine learning work for trading?

People often ask whether machine learning techniques are effective for developing trading strategies. The answer is: it depends on how they’re applied. When machine learning models are fit to data, they produce certain “p-value” statistics which are vulnerable to all the issues we’ve discussed. Therefore, some care is needed to ensure the models are in fact statistically significant.