Problems with backtesting investment performance

How’s this for a New Year’s resolution: stop backtesting investment scenarios. The results rarely have any useful meaning.

Many investment columns, websites and services centre on the idea that testing how specific investment criteria performed in the past should lead to positive future results. If nothing ever changed, this approach would actually work, but that’s not reality.

With more investors turning to automated investing alternatives, the risk of following a flawed strategy is increasing. With nascent services, there is often no performance track record, so the fallback position is to provide backtested results.

Yet the more you examine the process of backtesting, the more flaws are exposed. Here are five of those flaws.

1. Measuring stick changes

Over time, accounting rules change, impacting the measuring stick for results and ultimately the results themselves. Under such conditions, backtesting is akin to measuring in miles looking backward, while future results are denominated in kilometres.

We’ve looked at this issue before when critically assessing backward-looking quantitative screening (see “Strengths and weaknesses of quant investing”).

One might argue that backtested results are better than nothing. But we would counter that if those results give investors a false sense of security or optimism, it’s better not to share them.

Many say that when examining a real investment track record, the longer the timeframe, the better. Yet given that the measuring stick changes continuously, the longer the backtest, the more errors creep into the results.

2. Cherry-picked criteria

The use of backtesting has exploded with a surge in smart beta ETFs. Since many of these products are new, they are formulated using backtested results, instead of real-world, real-time testing.

In April 2017, Andrew Lo, professor at MIT and author of Adaptive Markets, told Bloomberg Businessweek, “The more you search over the past, the more likely it is you are going to find exotic patterns that you happen to like or focus on. Those patterns are least likely to repeat.”

Research Affiliates, an investment firm with considerable knowledge of smart beta strategies, noted in August 2017 that “the outperformance observed before a typical smart beta index is launched virtually disappears once it’s live, yet most investors are making decisions on backtest results.”

Cherry-picking is not limited to smart beta ETFs. Some investment articles or services will run portfolio results based on criteria that seem less than intuitive and more a function of “backtest overfitting,” which is the recognized industry term for selecting investment criteria that look great in testing but fall short of expectations when actually implemented.
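
To illustrate how backtest overfitting can arise, here is a minimal Python sketch under our own assumptions. It generates a few hundred hypothetical strategies whose returns are pure noise, so none has any real edge. It then picks the one with the best backtest and checks how that strategy did in a later period. The function name, the number of strategies, the 4% monthly volatility and the random seed are all invented for illustration and are not drawn from any real product.

```python
import random
from statistics import mean

def simulate_overfitting(n_strategies=200, n_periods=60, seed=1):
    """Pick the best in-sample strategy among pure-noise strategies,
    then report its in-sample and out-of-sample average return."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        # Hypothetical monthly excess returns with zero true edge.
        returns = [rng.gauss(0.0, 0.04) for _ in range(2 * n_periods)]
        in_sample = mean(returns[:n_periods])    # the "backtest" period
        out_sample = mean(returns[n_periods:])   # the period after launch
        results.append((in_sample, out_sample))
    # Cherry-pick the strategy with the best backtest.
    return max(results, key=lambda pair: pair[0])

best_in, best_out = simulate_overfitting()
print(f"best backtest, in-sample mean: {best_in:+.4f}")
print(f"same strategy, out-of-sample:  {best_out:+.4f}")
```

Because the winner is chosen for its backtest alone, its in-sample figure looks strong by construction, while its later figure is just another random draw.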

3. Convenient reference dates

“The best way to win any argument about the markets is to change your start and end dates,” noted Ben Carlson of Ritholtz Wealth Management in an October 2017 blog post. The same can be said of backtesting.

Just as investment criteria and benchmarks can be cherry-picked, so can reference dates. If you’re looking at the backtested performance of a certain strategy over the past 10 years, it’s important to compare the results over the past five years, 15 years, the 10 years that ended five years ago, and so on.
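
The comparison across start and end dates can be sketched in a few lines of Python. This is only an illustration: the yearly returns below are hypothetical numbers we made up, and the helper names `annualized_return` and `window_returns` are our own.

```python
def annualized_return(yearly_returns):
    """Compound a list of yearly returns and convert to an annualized rate."""
    growth = 1.0
    for r in yearly_returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(yearly_returns)) - 1.0

def window_returns(yearly_returns, window):
    """Annualized return for every consecutive window of the given length."""
    return [
        annualized_return(yearly_returns[start:start + window])
        for start in range(len(yearly_returns) - window + 1)
    ]

# Hypothetical yearly returns for a backtested strategy, oldest first.
hypothetical = [0.12, -0.08, 0.25, 0.03, -0.30, 0.28, 0.15, 0.02,
                0.18, -0.05, 0.10, 0.22, -0.12, 0.07, 0.16]

for window in (5, 10, 15):
    results = window_returns(hypothetical, window)
    print(f"{window}-year windows: worst {min(results):+.1%}, "
          f"best {max(results):+.1%}")
```

If the worst and best windows of the same length tell very different stories, a single quoted period says little about the strategy.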

4. Lack of follow-up

Many investment columns and services fail to follow up on their previously recommended criteria. Quite often there’s a new idea to replace the old one, or the benchmarks have changed, or the criteria themselves have changed. A likely reason is the growing body of evidence that backtested results fail once real money is invested, as noted in the August 2017 findings of Research Affiliates.

5. Manipulated checks

Many backtested results are never subjected to academic or proper statistical scrutiny. Criteria should be checked using other time periods, or so-called “out-of-sample” data, but most investment services or articles rarely disclose whether this step was taken. Outside academic journals, you can assume it hasn’t been done. But even out-of-sample checks can come back clean if enough different criteria sets are run.
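
The arithmetic behind that last point is simple. As a rough sketch, assume each criteria set has no real skill, has a 5% chance of passing an out-of-sample check by luck, and is tested independently of the others. All three assumptions are ours, chosen for illustration. The chance that at least one set comes back clean is then one minus the chance that every set fails.

```python
def chance_of_false_pass(n_criteria_sets, pass_rate=0.05):
    """Probability that at least one no-skill criteria set passes an
    out-of-sample check by luck, assuming independent tests."""
    return 1.0 - (1.0 - pass_rate) ** n_criteria_sets

for n in (1, 10, 50, 100):
    print(f"{n:>3} criteria sets tried: "
          f"{chance_of_false_pass(n):.0%} chance of at least one clean result")
```

Under these assumptions, trying 10 criteria sets already gives roughly a 40% chance of a lucky pass, and trying 100 makes one all but certain.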

Nothing replaces real results

There is simply no replacement for an actual performance track record. Researchers have shown that the more complicated a criteria set, the worse the backtested strategy tends to perform under real conditions.

In our opinion, the more intuitive the chosen criteria, the more likely the strategy is to succeed. Why? An intuitive strategy is typically already in use by human investors, so it has shown it can survive in the market. Further, it has survived the test of time.

What is backtesting?

Backtesting uses historical market data to see how a portfolio of stocks performed in the past. It examines a universe of stocks chosen for certain investment criteria, such as having cash flow growth greater than 2% or a return on capital greater than 10%. Quite often, five or six criteria might be used, as well as conditions for rebalancing, replacements and stop losses. In short, a simple idea can get complicated.
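
For readers who want to see the mechanics, here is a minimal one-period sketch in Python. It applies the two example criteria above to a small universe of stocks and averages the following year’s returns of those that pass. The tickers and figures are hypothetical, and real backtests add rebalancing, replacements and stop losses, which this sketch leaves out.

```python
def passes_screen(stock):
    """Example criteria from the text: cash flow growth above 2%
    and return on capital above 10%."""
    return stock["cash_flow_growth"] > 0.02 and stock["return_on_capital"] > 0.10

def backtest_one_period(universe):
    """Equal-weighted next-year return of the stocks that pass the screen."""
    picks = [s for s in universe if passes_screen(s)]
    if not picks:
        return 0.0, []
    average = sum(s["next_year_return"] for s in picks) / len(picks)
    return average, [s["ticker"] for s in picks]

# Hypothetical data, invented for illustration only.
universe = [
    {"ticker": "AAA", "cash_flow_growth": 0.05, "return_on_capital": 0.14, "next_year_return": 0.08},
    {"ticker": "BBB", "cash_flow_growth": 0.01, "return_on_capital": 0.20, "next_year_return": 0.15},
    {"ticker": "CCC", "cash_flow_growth": 0.04, "return_on_capital": 0.08, "next_year_return": -0.03},
    {"ticker": "DDD", "cash_flow_growth": 0.03, "return_on_capital": 0.12, "next_year_return": 0.02},
]

portfolio_return, tickers = backtest_one_period(universe)
print(tickers, f"{portfolio_return:.1%}")
```

Here two of the four made-up stocks pass both criteria. Each extra criterion or rule adds another choice that can be tuned to flatter the historical result.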

Al and Mark Rosen

Investments

Al and Mark Rosen run Accountability Research Corp., providing independent equity research to investment advisors across Canada. Dr. Al Rosen is FCA, FCMA, FCPA, CFE, CIP, and Mark Rosen is MBA, CFA, CFE.