Naive Bayes Market Timing | Theory and Speculation

Shown above are the total returns for a market timing model, the S&P 500, and 10-year Treasury bonds from 2005 to mid-2020. The model decides on the first of each month whether to invest 100% in the S&P 500 or 100% in 10-year Treasury bonds, depending on a few simple factors and historical relationships of those factors to S&P 500 and Treasury bond returns from 1950 - 2005. Looks pretty good! But should you believe this outperformance is real or just another exercise in data mining? Do I even believe this outperformance is real?

Building a market timing model that has real predictive power is notoriously difficult. It’s such a sticky problem because it’s too easy - that is to say, easy to build a model which fits the historical data extremely well. The problem is that such models rarely hold up when applied to new data, aka “the future”. We have a nearly infinite set of predictors to choose from (every economic indicator imaginable, dozens of stock valuation and historical price-related metrics, who won the Super Bowl this year, etc.), but if our target variable is “will the stock market as a whole go up or down over the next month”, we’re left with a pretty limited set of historical data for the target variable (the thing we're trying to predict). These two ingredients, abundant predictors and limited training data for the target, typically combine to produce overfitting - that is, a model which fits the historical data extremely well, but predicts new data poorly or not at all.

For many difficult problems in predictive modeling, the popular approach recently has been to employ more and more complex models which can take advantage of more and more computing power. The canonical example of this type of high-powered, highly complex model is the neural network. Neural networks are extremely powerful, capable of correctly identifying really complex relationships in data (such as recognizing digits in handwriting), but in addition to requiring a lot of processing power, they also typically need to be fed a ton of training data in order to be successful. So while my laptop may be 100x as powerful as those that Wall Street pros were using in the 1980s to build early quantitative investing models, the historical data for the target variable I’ve got to work with hasn’t grown all that much. For example: if I want to predict whether the S&P 500 will go up or down in value over a one-month time period, and have predictor data going back to 1950, I've got ~850 training data points to work with (71 years * 12 months/year). This may sound like a lot, but for neural networks, it's piddly.

Even worse, these 850 data points aren't really as good as 850 data points ought to be. That's because most everything in statistics and predictive model building assumes that a sample is full of independent data points, and stock market returns over consecutive chunks of time are unfortunately not independent. Rather, they are autocorrelated: if the market is up in January, it's a bit more likely than usual to also be up in February. If it's down in March, it's a bit more likely than usual to also be down in April. Similar issues appear with many of the potential predictors: GDP growth, for example, tends to form cycles, remaining high for a while, then dropping, then back to high for a while, then sharply negative, etc.
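To make "autocorrelated" concrete, here is a minimal sketch of a lag-1 autocorrelation calculation. The function name `lag_autocorrelation` and the toy series are illustrative assumptions (the numbers are made up, not market data). A value near zero is what you'd expect from independent months; a positive value means up months tend to follow up months.

```python
from statistics import mean


def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mu = mean(values)
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Numerator: co-movement between each value and the one `lag` steps earlier.
    numer = sum((values[t] - mu) * (values[t - lag] - mu) for t in range(lag, n))
    return numer / denom


# Made-up monthly returns that arrive in runs of gains and losses.
toy_returns = [0.02, 0.03, 0.01, -0.02, -0.03, -0.01,
               0.02, 0.03, 0.01, -0.02, -0.03, -0.01]
print(lag_autocorrelation(toy_returns, lag=1))
```

The more positive this number is for real monthly returns, the less independent information each additional month carries.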

On the far opposite end of the "accuracy vs complexity" spectrum is an unexciting, stodgy, nearly forgotten model which goes by the name of “Naive Bayes”. Naive Bayes models are about as simple as it gets for a classification problem; they take in a bunch of categorical predictors, like "was GDP growth positive last quarter?" or "are treasury spreads inverted?", to generate a single binary outcome variable, in our case "will the S&P 500 outperform the 10-year treasury over the next month?". The basic version of a Naive Bayes model just looks at the training data, considers each predictor variable by itself, and asks "How often is the target 'yes' when predictor x is also 'yes' - more or less often than the overall average?" More often: x gets a high score; less often, a low score. Then it just adds up the scores (actually, multiplies, but you get the idea) for all the predictors to get a final prediction. Lots of predictors associated with a "yes" for the target variable -> higher probability of a "yes". Lots of predictors associated with a "no" -> higher probability of a "no". And all starting from "how often do we get a 'yes' for the target on average over all the data?". For more detail on the underlying math, Wikipedia or analyticsvidhya.com have better explanations.
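Here is a minimal from-scratch sketch of that "multiply the scores" idea for yes/no predictors. It is not the e1071 implementation. The function names, the predictor names, and the toy data are all made up for illustration, and the Laplace smoothing (a pseudo-count that keeps any probability from being exactly zero) is an assumption of this sketch rather than something used above.

```python
def train_naive_bayes(rows, labels, smoothing=1.0):
    """Estimate P(label) and P(predictor is True | label) from yes/no data.

    rows:      list of dicts mapping predictor name -> True/False
    labels:    list of True/False outcomes, one per row
    smoothing: Laplace pseudo-count (an assumption of this sketch)
    """
    names = sorted(rows[0])
    class_counts = {True: 0, False: 0}
    true_counts = {True: dict.fromkeys(names, 0), False: dict.fromkeys(names, 0)}
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for name in names:
            if row[name]:
                true_counts[label][name] += 1
    # Starting point: how often is the target "yes" overall?
    priors = {c: class_counts[c] / len(labels) for c in (True, False)}
    # Per-predictor score: how often is the predictor "yes" within each class?
    cond_true = {
        c: {
            name: (true_counts[c][name] + smoothing)
            / (class_counts[c] + 2 * smoothing)
            for name in names
        }
        for c in (True, False)
    }
    return priors, cond_true


def predict_proba(model, row):
    """Probability that the target is True, treating predictors independently."""
    priors, cond_true = model
    scores = {}
    for c in (True, False):
        score = priors[c]
        for name, value in row.items():
            p_true = cond_true[c][name]
            score *= p_true if value else 1.0 - p_true
        scores[c] = score
    return scores[True] / (scores[True] + scores[False])


# Made-up training data: "did stocks beat bonds the following month?"
rows = [
    {"momentum_up": True, "curve_positive": True},
    {"momentum_up": True, "curve_positive": True},
    {"momentum_up": True, "curve_positive": False},
    {"momentum_up": False, "curve_positive": True},
    {"momentum_up": False, "curve_positive": False},
    {"momentum_up": False, "curve_positive": False},
]
labels = [True, True, True, False, False, True]

model = train_naive_bayes(rows, labels)
print(predict_proba(model, {"momentum_up": True, "curve_positive": True}))
print(predict_proba(model, {"momentum_up": False, "curve_positive": False}))
```

Each predictor contributes one multiplicative factor per class and never looks at any other predictor, which is the "naive" part.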

So why would anybody use this simple, limited model? Since it only looks at each predictor independently, it can't figure out complex relationships in the data (a Naive Bayes model doesn't have a chance at recognizing digits in handwriting). Naive Bayes models basically have two selling points. The first is that they don't require very much computation, so if you need to train a model or produce predictions really quickly without much computing power, that might be nice. The second is that they are resistant to overfitting. By just completely ignoring any complex interactions between predictor variables, a Naive Bayes model is naturally inoculated against that kind of overfitting. It is bad at finding complex relationships in data that turn out to be real, but it's good at NOT finding complex relationships in data that turn out to be not real. Since our primary concern here is to avoid overfitting, this simple, dumb model might actually be good.

Ok, so I've specifically chosen a model type which is pretty good at avoiding overfitting, but this is still a danger. If you throw enough predictors at pretty much any type of model, it will be in danger of overfitting. So we still need to be careful about which predictors we're considering. To that end, I only used a few that have been well documented to have some predictive quality for stock returns: momentum, valuation, the yield curve for US treasury bonds, and individual investor sentiment.

More specifically, for investor sentiment, I used data from the AAII Sentiment Survey to construct a binary variable indicating whether investor "bullish" sentiment is greater than 50%. For valuation, I used a risk premium metric derived from Robert Shiller's data, indicating whether the earnings yield based on his CAPE metric minus the T-bill rate (from FRED) exceeds 0.01. For a yield curve indicator, I used the 10-year 3-month Treasury spread, in the form of a binary variable indicating whether that spread is positive or negative.
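As a rough sketch, these three binary predictors could be built like this. The function names and dictionary keys are hypothetical, all rates are assumed to be decimals (0.02 for 2%), and the earnings yield is assumed to be simply 1 / CAPE.

```python
def risk_premium_high(cape, tbill_rate, threshold=0.01):
    """True when the CAPE earnings yield minus the T-bill rate exceeds the threshold.

    Assumes earnings yield = 1 / CAPE and that rates are decimals.
    """
    if cape <= 0:
        raise ValueError("CAPE must be positive")
    return (1.0 / cape) - tbill_rate > threshold


def yield_curve_positive(yield_10y, yield_3m):
    """True when the 10-year minus 3-month Treasury spread is positive."""
    return yield_10y - yield_3m > 0


def sentiment_bullish(bullish_share, cutoff=0.50):
    """True when the share of 'bullish' survey responses exceeds the cutoff."""
    return bullish_share > cutoff


def build_predictors(month):
    """Turn one month of raw inputs (hypothetical keys) into yes/no predictors."""
    return {
        "risk_premium_high": risk_premium_high(month["cape"], month["tbill_rate"]),
        "curve_positive": yield_curve_positive(month["yield_10y"], month["yield_3m"]),
        "sentiment_bullish": sentiment_bullish(month["bullish_share"]),
    }


# Made-up inputs for a single month, purely for illustration.
example_month = {
    "cape": 25.0,
    "tbill_rate": 0.02,
    "yield_10y": 0.030,
    "yield_3m": 0.021,
    "bullish_share": 0.38,
}
print(build_predictors(example_month))
```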

I constructed 3 different versions of momentum: binary variables indicating whether the S&P 500 rose or fell in price over the previous 12-month, 6-month, and 1-month periods. I knew momentum in general has been a particularly strong short-term market timing indicator, so I wanted to give the model multiple options to use on that front. The final model wouldn't necessarily use all versions of this predictor; I would let the performance against training and validation data decide which versions if any to employ.
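A minimal sketch of those three momentum flags, assuming a list of month-end index levels ordered oldest first. The function name `momentum_flags` and the price series are made up for illustration.

```python
def momentum_flags(prices, lookbacks=(12, 6, 1)):
    """For the latest month in `prices`, report whether the index rose over each lookback.

    prices: month-end index levels, oldest first.
    Returns {lookback_in_months: True/False}, or None where history is too short.
    """
    flags = {}
    for months in lookbacks:
        if len(prices) <= months:
            flags[months] = None  # not enough history for this lookback
        else:
            # Compare the latest level with the level `months` months earlier.
            flags[months] = prices[-1] > prices[-1 - months]
    return flags


# Made-up index levels: 13 month-end values, so the 12-month lookback is available.
toy_prices = [100, 102, 104, 106, 108, 110, 112, 110, 108, 106, 104, 102, 103]
print(momentum_flags(toy_prices))
```

In this toy series the index is up over 12 months and 1 month but down over 6 months, which is exactly the kind of disagreement that makes it worth offering the model more than one version.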

OK, enough setup; how do we figure out if the model actually performs well? To find out while trying to avoid biasing my results, I split my data into three sets: 1950 - 1980 for training, 1980 - 2005 for validation and tuning, and 2005 - 2020 for a final test. "Tuning" in this case really just means selecting the best combination of predictors, as the type of Naive Bayes model I used (the R package e1071 implementation) is otherwise completely rules-based. I trained a few different versions of the model with various combinations of predictors on the 1950 - 1980 data and measured the performance of each against the training and validation sets. Unfortunately this choice of time range meant I couldn't effectively test the presence or absence of the "sentiment" predictor; it is only available from 1987 on, so a model trained on 1950 - 1980 data doesn't know anything about it. So while ideally I would let the validation data determine whether to include this variable, I simply chose to include this predictor in the final model.
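The split-then-select workflow can be sketched roughly as below. Everything here is an assumption for illustration: the function names, the `year` key, the choice to put each boundary year (1980, 2005) in the later set, and the idea of enumerating every possible subset (only a handful of combinations were tried above). The scoring function is a placeholder for "train on the training set, measure performance on the validation set".

```python
from itertools import combinations


def split_by_year(records, train_end=1980, validation_end=2005):
    """Chronological split: no shuffling, so later data never leaks into training.

    Assumes each record is a dict with a 'year' key; boundary years go to the later set.
    """
    train = [r for r in records if r["year"] < train_end]
    validation = [r for r in records if train_end <= r["year"] < validation_end]
    test = [r for r in records if r["year"] >= validation_end]
    return train, validation, test


def candidate_subsets(predictors):
    """Yield every non-empty combination of predictor names."""
    for size in range(1, len(predictors) + 1):
        yield from combinations(predictors, size)


def pick_best_subset(predictors, score_on_validation):
    """Return the subset with the highest validation score.

    score_on_validation: callable taking a tuple of predictor names (hypothetical).
    """
    return max(candidate_subsets(predictors), key=score_on_validation)


# Toy usage with made-up records and a stand-in scoring function.
records = [{"year": y} for y in (1979, 1980, 2004, 2005, 2019)]
train, validation, test = split_by_year(records)
print(len(train), len(validation), len(test))

predictors = ["mom_12m", "mom_6m", "mom_1m", "risk_premium", "yield_curve"]
print(sum(1 for _ in candidate_subsets(predictors)))  # number of candidate models
```

The test set is only touched once, after the subset has been chosen on validation data. The more subsets you score on validation, the more that score itself starts to overfit.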

Apart from that, I tested several different versions of the model, each using a different set of predictors chosen from the potential set. Whichever version produced the best results against the validation data, I would choose as my "final" model to evaluate against the test data (2005 - 2020). To my surprise and delight, the best-performing model against validation was...the full model using all the predictors (well, not using sentiment yet). So that's the model I would elect to use if I had to employ one of these with real money at stake. Just for completeness, though, I measured the performance of each model against the test period as well. It turns out that my blind bet to include sentiment as a predictor wasn't the right call, as the model excluding sentiment slightly outperforms the full model.

| Model | Performance vs S&P 500 in Training Period (1950-1980) | Performance vs S&P 500 in Validation Period (1980-2005) | Performance vs S&P 500 in Test Period (2005-2020) |
| --- | --- | --- | --- |
| Full Model | 2.25% | 5.01% | 6.32% |
| Exclude Sentiment | 2.25% | 5.01% | 6.48% |
| Exclude Risk Premium | 2.26% | 4.18% | 5.31% |
| Exclude Yield Curve | 2.06% | 4.73% | 4.56% |
| Exclude Momentum | 0.63% | 1.56% | 1.25% |
| Momentum Only | 1.66% | 4.18% | 3.76% |

How close is this to being a closet momentum model? Kinda close, but not entirely. Performance against both the validation and test data indicates that while momentum accounts for the majority of outperformance, there is positive value added by including the other predictors (...except sentiment, but at least it's not too much of a drag).

Below is shown the performance of the full Naive Bayes model, the S&P 500, and 10-year treasury bonds for the test period (log-scaled total return).

And now the cumulative performance difference between the full Naive Bayes model and the S&P 500 over the test period. While the bulk of the outperformance comes during the 2008 financial crisis, in which pretty much any momentum-based model outperformed the market significantly, that's not the entire story. There's also a step up in outperformance during the 2016 jitters, and a big jump when the model very fortuitously gets out of the market in Feb and March of 2020. I'm not sure how much credit to give it for this move, as I certainly don't think my simple factor model based on momentum, risk premium, sentiment and treasury spreads predicted a global pandemic. Of course, in February 2020, you didn't actually have to predict a global pandemic, it was already out there, and perhaps reflected in the treasury spreads which (at least the version I used) briefly turned negative during that time.
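For reference, a switching strategy's growth path and its cumulative gap versus a benchmark can be computed along these lines. This is a sketch under assumptions: the function names are made up, the return numbers are invented, transaction costs are ignored, and the "cumulative difference" is taken to be the log of the ratio of the two growth paths, which may not match how the chart above was built.

```python
import math


def backtest_switching(in_stocks, stock_returns, bond_returns):
    """Growth of 1 unit when each month is held 100% in stocks or 100% in bonds.

    in_stocks: list of True/False decisions made at the start of each month.
    """
    wealth = 1.0
    path = []
    for hold_stocks, r_stock, r_bond in zip(in_stocks, stock_returns, bond_returns):
        wealth *= 1.0 + (r_stock if hold_stocks else r_bond)
        path.append(wealth)
    return path


def cumulative_log_difference(strategy_path, benchmark_path):
    """Log of strategy wealth over benchmark wealth at each point in time."""
    return [math.log(s / b) for s, b in zip(strategy_path, benchmark_path)]


# Made-up monthly total returns, for illustration only.
stock_returns = [0.10, -0.20]
bond_returns = [0.01, 0.02]
signals = [True, False]  # in stocks the first month, in bonds the second

strategy = backtest_switching(signals, stock_returns, bond_returns)
# Buy-and-hold stocks is the same backtest with the signal always True.
benchmark = backtest_switching([True, True], stock_returns, bond_returns)
print(cumulative_log_difference(strategy, benchmark))
```

A flat stretch in this series means the model was simply holding stocks; every step up or down comes from a month spent in bonds.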

So in the end, should you believe these results? Do I believe these results? Honestly, I'm still not sure. Even with everything I've tried to do to avoid overestimating the model's performance, the base case when building any market timing model is overestimating the model's performance. I chose predictors that have a long history of some predictive value for market timing, but the most important one is momentum, and I already knew that was going to work well in 2008-2009 because every momentum model did well then, which means that even the out-of-sample performance isn't entirely out-of-sample. Still, the model performs consistently well over many different market environments, and like any good market timing model, is usually just 100% invested in stocks, so I do think there may be something here.