Posted 08/12/2020

Interpretable Machine Learning for Long Term Investing



ML AI Interpretability Investing Long-Term Shapley

Opening the ML Blackbox

Machine learning (ML) algorithms are often dismissed as inexplicable blackboxes, but interpretable ML has been prying open that blackbox for some time now. Interpretable ML increases the degree to which a human can understand the reasons behind an algorithm’s decision. With increased interpretability, a human can alter, intervene in, or build upon an algorithm’s decision-making process. After all, human plus machine is often far more powerful than either on its own.

In our case, predicting the performance of investments and investment ideas, we gain several advantages from learning how our models come to a final decision. For one, if we know how a model makes a decision, we can check that decision against economic intuition, which reduces the risk of overfitting and should lead to more reliable performance on out-of-sample, future data. Moreover, we can use explanations from interpretable ML as a research tool. With our models and their interpretations, we can develop new hypotheses and theories about the relationships of economic, financial, and behavioral factors (which we will call features).

Example ML Problem

For this post, we’ll build a model to forecast one-year stock returns relative to an index (S&P 500) after that stock is recommended in online investment communities. In order to eventually interpret our model, we first train it with data consisting of various features (explanatory variables or factors) to predict a target (the outcome we’d like to predict). Here is an example snapshot of how our data is structured for this example problem before any pre-processing, natural language processing, or normalization:

Sample Raw Data Structure

Fabricated sample data, showing fake numbers.
“…” indicates more data and features included in actual dataset

Data, Features, and Pre-processing

This data consists of several thousand global equities that were pitched to online communities with associated natural language text write-ups. Text write-ups are descriptions of why someone recommends buying (or selling) the pitched stock at that particular time. We also collect financial data and calculate 1-year trading returns for all of those write-ups. The dataset spans from 2006 through today and is carefully corrected for things like survivorship bias and potentially missing or incorrect data. The other pre-processing steps we perform, like normalization, market neutralization, and feature engineering, all have a large impact on model performance.

For market data, we look for traditional fundamental and valuation metrics along with their respective growth rates over different periods of time–normalized by market movements. We collect metrics like FCF Yield, Net Debt to EBITDA, ROE, ROIC, along with their close cousins, and associated growth rates, derivatives, historical premiums/discounts, and more. We clean and normalize many other types of data to include in our live models because a lot of signal lies in creating unique features derived from some sort of domain expertise and intuition.

For this example, we’ll only include a subset of features. And although price movement itself is normally not a large determinant of when to make a purchasing decision, we’ll also look at data of 3-month, 6-month, and 1-year relative returns of a stock prior to its write-up submission date. We think these momentum features can act as a measure of how sentiment about a stock has changed recently.

With text data from the write-ups and pitches, there are several ways that we can extract information to include natural language components in our feature set. First, we can take all of the words from a write-up, count how many times each one occurs, and normalize those counts. This gives us a sparse TF-IDF matrix of normalized term frequencies across all documents. We can build a model directly on these term frequencies or try to condense the information a bit more.
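As a rough illustration of the term-frequency step, here is a minimal, standard-library sketch of TF-IDF weighting. The write-up snippets are made up, the whitespace tokenizer is deliberately naive, and the smoothed IDF formula is just one common variant, not necessarily the one used in our pipeline. It also builds a dense list of lists rather than a true sparse matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, matrix); matrix[i][j] is the L2-normalized
    TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of write-ups that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # Normalize each write-up so long and short texts are comparable.
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        matrix.append([w / norm for w in row])
    return vocab, matrix


# Hypothetical write-up snippets, for illustration only.
docs = [
    "strong free cash flow and buybacks",
    "insider buying and strong balance sheet",
    "free cash flow yield is high",
]
vocab, matrix = tfidf(docs)
```

Terms that appear in many write-ups (like "and") end up with lower weights than terms that appear in only one (like "buybacks"), which is the behavior we want before any dimensionality reduction.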

To condense the information, we can perform dimensionality reduction or unsupervised learning with fancy acronyms (like SVD/LSA or LDA). These are methods of taking the term frequencies in write-ups and representing them in a lower-dimensional space so our algorithms can better understand them and learn different concepts. The overall goal is to learn different mathematical representations of an investment write-up. We will call these different representations components. Components can represent abstractions like the industry of a write-up, whether the author mentions insider buying or stock buybacks, and any number of other details that might be contained in investment write-ups. These components contain important signal that can indicate the quality of an idea, especially when paired with financial data.

Neural models are built with neural networks and allow for the use of embeddings, which can act as multidimensional representations of words, sentences, or documents. More recent neural language models like Transformers are being trained on enormous datasets to allow better understanding of documents than ever before. Although we are using these improved neural language models in production, this example problem will stick only to using dimensionality reduction on term frequencies. See our posts on Language Models to see what new neural language models with self-attention are capable of.


Remember, with the data we’ve gathered for this example problem our target is to predict whether or not a stock outperformed the S&P 500 over a 1-year time frame. First, we find a stock’s 1-year return vs. the S&P 500 (stock’s 1Y return – S&P500 1Y return) starting from the date that a stock was pitched to the algorithm. If the relative return is positive, the target is labeled “1” for outperformance and if the number is negative, the target is labeled “0” for underperformance.
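The labeling rule above can be sketched in a few lines. The tickers and returns below are made up, and treating an exactly zero relative return as underperformance is our assumption for this sketch, since the rule above only covers the positive and negative cases.

```python
def label_outperformance(stock_return_1y, index_return_1y):
    """Return 1 if the stock beat the index over the year, else 0."""
    relative_return = stock_return_1y - index_return_1y
    # Assumption: a relative return of exactly zero is labeled 0.
    return 1 if relative_return > 0 else 0


# Fabricated example pitches (returns expressed as fractions).
pitches = [
    {"ticker": "AAA", "stock_1y": 0.25, "index_1y": 0.10},
    {"ticker": "BBB", "stock_1y": 0.05, "index_1y": 0.10},
    {"ticker": "CCC", "stock_1y": -0.10, "index_1y": -0.20},
]
labels = {
    p["ticker"]: label_outperformance(p["stock_1y"], p["index_1y"])
    for p in pitches
}
```

Note that the third stock lost money in absolute terms but is still labeled as outperformance, because it fell less than the index did.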

Selecting the target and transforming it correctly is important, similar to the pre-processing needed for the features. In production, we are flexible about time horizons, benchmarks, and more. By removing the effect of the market on our targets, we’re taking a step closer to answering the question of how we can find features (and patterns of features) that generalize to out-of-sample outperformance, hopefully in a wide range of market conditions.

Model Selection, Training, and Evaluation

Finally, before getting to model interpretations, we need to pick a model to train. In this instance, because our target is binary (1-year outperformance or underperformance relative to an index), we’ll train a classifier to predict the outperformance of certain stocks vs. the S&P 500 over a 1-year period. In particular, Gradient Boosting Classifiers (e.g. xgboost, lightgbm) perform well and can be interpreted with the tools we are going to use (shap/TreeSHAP).

What we’re really doing during the training process is learning how different features (aka factors) over the training period were predictive of outperformance a year ahead. However, the more important question we need to answer is: to what extent can the model’s inference generalize to out-of-sample, future returns? There are plenty of techniques we can employ to ensure, or at least encourage, generalization (e.g. purged k-fold cross-validation) during the model training and backtesting process.
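To give a feel for what purging does, here is a simplified sketch of a purged k-fold splitter. It assumes samples are ordered by time on an integer index and that each label depends on data from its own index through `horizon` steps ahead (think of a 1-year forward return). The function name, the `embargo` parameter, and the integer time index are all illustrative choices, not our production implementation.

```python
def purged_kfold(n_samples, n_splits, horizon, embargo=0):
    """Yield (train_idx, test_idx) pairs for time-ordered samples.

    Sample i's label is assumed to use data from time i through
    i + horizon, so training samples whose label window overlaps the
    test fold's label windows are dropped (purged). An optional
    embargo drops a few extra samples right after the purged region.
    """
    fold_size, remainder = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + fold_size + (1 if k < remainder else 0)
        test_idx = list(range(start, stop))
        # Test label windows cover times start .. stop - 1 + horizon.
        first_overlap = start - horizon
        last_excluded = stop - 1 + horizon + embargo
        train_idx = [
            j for j in range(n_samples)
            if j < first_overlap or j > last_excluded
        ]
        yield train_idx, test_idx
        start = stop


splits = list(purged_kfold(n_samples=20, n_splits=4, horizon=2, embargo=1))
```

With a plain k-fold split, a training sample just before the test fold would have a label that partly overlaps the test period, leaking future information into training. Purging removes those samples at the cost of a smaller training set.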

Then, while training the model, we need to choose one of many evaluation metrics to score how well the model is doing. We can select from metrics like balanced accuracy, precision, recall, F1-score, ROC-AUC, etc. Which metric we choose to optimize ultimately depends on whether we are looking to increase the chance of correctly identifying a successful investment idea (which favors recall) or to decrease the chance of incorrectly picking an unsuccessful one (which favors precision). Long term investing is a probability weighting game.
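For reference, several of these metrics fall straight out of the confusion matrix. The sketch below uses made-up labels and predictions. In practice a library implementation would normally be used, but writing the formulas out shows what each metric rewards.

```python
def classification_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Of the ideas we picked, how many actually outperformed?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Of the ideas that outperformed, how many did we pick?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Of the ideas that underperformed, how many did we avoid?
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Fabricated labels: 1 = outperformed, 0 = underperformed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example the model is fairly precise (two of its three picks outperformed) but misses half of the true winners, which is exactly the kind of trade-off the choice of metric has to weigh.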

Interpreting the Model

After training our model with satisfactory test results, we can now feed it validation data, which are new investment ideas that the model has never seen before. From the classification of that new data, we will be able to see how the machine tries to generalize its decision making function. In other words, we can see roughly why a model made a specific decision about any proposed investment idea in terms of the features that we provided.

From a technical standpoint, we’ll be using Shapley/SHAP values, which can be thought of as our feature importance scores. SHAP values allow us to also compute explanations for each prediction a model makes. For our purposes, think of SHAP values as how much the model “likes” or “dislikes” a given feature. If we combine all of the SHAP values for each investment submission, we can see a global interpretation of the model’s decision making process:

SHAP Value Summary Plot

The summary plot above is computed by looking at every single SHAP value for all investment write-ups over each feature. Each dot is a data point and each feature’s explanation is another row on the summary plot. The different features are ordered on the y-axis (vertical) according to their total importance. The x-axis (horizontal) is the SHAP value; to the right of zero means the feature value for that datapoint had a positive impact on the model’s output (more likely to outperform the market) and to the left of zero means the feature value for that datapoint had a negative impact on the model’s final probability output (less likely to outperform the market). The color from blue to red represents the normalized value of the feature (factor) from low to high. A very bright red value would represent the highest value of a certain feature and a very bright blue value would represent the lowest value of a certain feature.
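To make the numbers behind such a plot more concrete, here is a brute-force sketch of exact Shapley values for a tiny, made-up scoring function. The toy model, the feature names, the all-zeros baseline, and the choice to treat a feature as "absent" by setting it to its baseline value are all assumptions for illustration. Libraries like shap use far more efficient algorithms (such as TreeSHAP) for real tree models. The last lines rank features by mean absolute SHAP value, which is one common way to define the "total importance" used to order rows.

```python
from itertools import combinations
from math import factorial

FEATURES = ["fcf_yield", "net_debt_to_ebitda", "momentum_6m"]


def toy_model(z):
    """Made-up score standing in for a trained classifier's output."""
    fcf, debt, mom = z
    return 2.0 * fcf - 1.0 * debt + 0.5 * fcf * mom


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature coalition."""
    n = len(x)

    def value(present):
        # Features outside the coalition are replaced by the baseline.
        return predict([x[i] if i in present else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                contribution += weight * (value(present | {i}) - value(present))
        phi.append(contribution)
    return phi


baseline = [0.0, 0.0, 0.0]
# Fabricated, already-normalized feature rows.
samples = [[1.0, 2.0, 1.0], [-1.0, 0.5, 2.0], [0.5, -1.0, -1.0]]
shap_rows = [shapley_values(toy_model, x, baseline) for x in samples]

# Global importance: mean absolute SHAP value per feature.
importance = {
    name: sum(abs(row[j]) for row in shap_rows) / len(shap_rows)
    for j, name in enumerate(FEATURES)
}
ranking = sorted(importance, key=importance.get, reverse=True)
```

A useful property to check is additivity: for each row, the SHAP values sum to the difference between the model's output for that row and its output at the baseline. That is what lets us read each dot as a signed contribution to the final prediction.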

Let’s try to decipher it a bit more by looking at the second row of the summary plot, FCF Yield LTM. We can see low (blue) values towards the left end of the chart and we can see high (red) values towards the right end of the chart. This means that the model likes high FCF yield companies, but dislikes low FCF yield companies. That interpretation of the model’s decision making seems to make a lot of sense from a theoretical, value-minded perspective too, where we would prefer to invest in companies with high free cash flow relative to their market values.

Next, if we look at net debt to EBITDA, closer to the middle of the chart, the model has the expected, opposite relationship. The model likes when a company has a low net debt to EBITDA ratio and dislikes when companies have high net debt to EBITDA ratios. This also makes sense from a long term, value investing standpoint.

So with the above summary plot, we know that our model is suggesting that, in order to have the best opportunity to outperform the market over a 1-year time frame, we should buy high FCF yield, low net debt to EBITDA businesses with healthy balance sheets and recent relative return momentum. The businesses should also be trading at discounts to their historical 1-year and 3-year P/E and EV/Sales valuations.

But if you can learn these heuristics in any introductory investment course, why are these model decisions so novel? For one, the model figured out all of these investing heuristics on its own, solely from viewing past examples. We did not teach the model these rules; it learned these lessons and many more from the data itself. These rules of thumb aren’t clouded by the emotional biases that humans carry. And the model didn’t just figure out general rules like “buy high FCF yield”; it has also learned multidimensional heuristics based on the interplay of various factors, which are more complex than many can fathom and difficult to visualize on a 2D plane. The model’s further advantage over humans is that it can calibrate probabilities and, therefore, bet sizes (we’ll get into the important details of bet sizing in another post) more accurately.

Yet another advantage is that we can dig a bit deeper into the decision making patterns the model has created by looking at a single feature’s SHAP values. We know that this model likes stocks with high FCF yields and dislikes low FCF yields, so we can zoom into the decision plot of the model for the FCF yield feature. Each dot is a single data point (a stock) from the validation dataset: ideas that the model has never seen before. The x-axis is the value of the post-processed FCF yield and the y-axis is the SHAP value for that feature, which represents how much knowing that feature’s value changes the output of the model for that stock’s prediction. In other words, SHAP values above 0 increasingly indicate the model believes a stock is more likely to outperform the S&P 500 over a 1-year time frame. The inverse holds true for SHAP values below 0. We can interpret this for the FCF yield feature specifically:

FCF Yield Decision Plot

The x-axis shows normalized FCF yield values, not the actual FCF yield numbers.
It looks like the model has created a non-linear decision function for the FCF yield feature. The values of FCF yield seem to be split into thirds: 1) the first third is all contained below the 0 SHAP value, so the model dislikes low FCF yield numbers, 2) the second third is tightly packed between 0 and 0.1, meaning the model only slightly likes middling FCF yield numbers, and 3) the last third looks like a positive linear function, so beginning at higher FCF yield numbers, the model likes large FCF yields with increasing “enthusiasm.”

For another example, let’s see what the model thinks of some of our momentum features, “Price Change 1M Bench Adj” and “Price Change 6M Bench Adj,” which measure a stock’s return relative to the S&P 500 over the 1 month and 6 months, respectively, before the stock was pitched to the model. They’re essentially two momentum factors over different durations. The decision plots below show a more granular picture of how the model reacts to different ideas of momentum:

Momentum Decision Plot Over Different Durations

In the 1-month momentum plot on the left, we can see a parabola. The model thinks that stocks with either strongly negative or strongly positive 1-month returns relative to the market are more likely to outperform the market over the next year. This effect flips with more average 1-month momentum numbers, such that the model thinks stocks with middling recent performance are less likely to outperform the market over the next year.

In the 6-month momentum plot on the right, we see a very different effect. There is a positive linear relationship between 6-month momentum and 1-year forward outperformance up until a certain point. To start, negative 6-month momentum contributes to a decrease in the likelihood of outperforming the market over the next year. As 6-month momentum increases, so too does the stock’s likelihood of outperforming the market (according to our model). Eventually, if a stock has had very strong recent 6-month momentum, the model’s probability of outperformance for that stock begins to decrease.

We can also interpret multiple different models that were trained on the same data but came to slightly different conclusions. There can be advantages to training multiple models and combining their predictions, often called ensemble modeling. Here is the 6-month momentum effect from model #1 above, compared to a different model, model #2:

Momentum Decision Plot Between Different Models

Although these decision plots look similar, we can tell that model #2 doesn’t show a decrease once momentum has reached a certain point, whereas model #1 clearly slopes downwards after a certain amount of 6-month momentum is reached. Different market regimes may exhibit different momentum relationships, so it’s best to include a wide variety of possible models. In general, ensemble modeling reduces the risk of overfitting because the idiosyncratic errors of individual models tend to partly cancel out. If we were to allow both of these models to vote on a final prediction, the combination is likely to perform better, over time, than either model on its own.
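One simple way to let models "vote" is to average their predicted probabilities (often called soft voting). The sketch below is only an illustration: the probabilities are fabricated, and the equal weighting and 0.5 decision threshold are assumptions rather than how our production ensemble is configured.

```python
def soft_vote(model_probabilities, weights=None):
    """Weighted average of each model's outperformance probabilities.

    model_probabilities holds one list per model, each with one
    probability per investment idea, in the same order.
    """
    n_models = len(model_probabilities)
    if weights is None:
        weights = [1.0] * n_models  # assumption: equal say per model
    total_weight = sum(weights)
    n_ideas = len(model_probabilities[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probabilities))
        / total_weight
        for i in range(n_ideas)
    ]


# Fabricated probabilities of outperformance for three ideas.
model_1 = [0.70, 0.40, 0.55]
model_2 = [0.60, 0.30, 0.35]
combined = soft_vote([model_1, model_2])
picks = [1 if p > 0.5 else 0 for p in combined]
```

For the third idea the two models disagree, and the averaged probability falls below the threshold. Where the models agree, the combined probability simply reflects their shared conviction.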

Next, we can see how different features interact with one another. An interaction effect is “the additional combined feature effect after accounting for the individual feature effects” (Molnar). It’s best explained through another example. Colors represent the value of another feature that has an interaction effect with the main plotted feature. If an interaction effect occurs between features, plotting it will show patterns of color:

Premium to Analyst Price Target Decision Plot with Interactions

In this figure, we’re looking at how a stock’s price discount or premium to its sell-side target price influences our model’s predicted probability of outperformance 1 year ahead. We see that stocks trading at discounts to their sell-side targets (to the left of 0 on the x-axis in the chart above) with strong recent 6-month momentum (red dots) are more likely to outperform than stocks trading at discounts to their sell-side targets with less 6-month momentum (blue dots). However, once stocks start trading at a premium to their sell-side price targets (to the right of 0 on the x-axis in the chart above), it seems that higher recent relative price momentum reduces the likelihood of outperformance over the next year.
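The idea of an interaction effect can be sketched with a two-feature toy function. The function and its coefficients below are invented purely to mimic the sign pattern described above; they are not taken from our fitted model. The interaction is what remains of the combined effect after subtracting each feature's individual effect, all measured relative to an assumed baseline of zero for both features.

```python
def toy_score(premium_to_target, momentum_6m):
    """Made-up score: two main effects plus a cross term."""
    return (
        -0.5 * premium_to_target
        + 0.2 * momentum_6m
        - 0.8 * premium_to_target * momentum_6m
    )


def interaction_effect(f, a, b, a0=0.0, b0=0.0):
    """Combined effect minus the two individual effects, vs. a baseline."""
    combined = f(a, b) - f(a0, b0)
    effect_a = f(a, b0) - f(a0, b0)
    effect_b = f(a0, b) - f(a0, b0)
    return combined - effect_a - effect_b


# Stock at a discount to its target, with strong 6-month momentum.
discount_case = interaction_effect(toy_score, -0.5, 1.0)
# Stock at a premium to its target, with the same momentum.
premium_case = interaction_effect(toy_score, 0.5, 1.0)
```

In this toy setup, momentum adds to the score when the stock trades at a discount and subtracts from it when the stock trades at a premium, which is the kind of color pattern an interaction plot is meant to reveal.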

Lastly, we’ll touch on the NLP components from the investment write-ups. As discussed, these components represent different abstract topics about write-ups. These abstract topics might be something like “secular industry growth” or “management and culture.” Each write-up is then scored based on how much of that component is expressed in the write-up:

Decision Plot for NLP Component 3

We won’t get into what this NLP Component 3 specifically represents, but there is a clear relationship between the component and the probability of outperformance 1 year ahead. As a write-up contains “more” of NLP Component 3, the model views the recommendation as less likely to outperform the market. Of course, this is not causal, but it gives us insight into what categories or topics in investment write-ups are important to note.

Building Blocks

We view these interpretations as only a small part of the building blocks for investment decision making pipelines. On top of this example problem, there are countless opportunities for modeling and interpretation improvements. For our NLP components, we can (and do!) use neural language models to encode write-ups. For our targets, we use triple barrier labeling, as well as multi-class and multi-label classification. For interpretations, we can employ confidence intervals. We are testing macro data to better understand macro regimes and improve the ensemble model and idea selection. Also, feature weights and interpretations can be constantly updated and re-trained as new data comes in. The process is constantly improving upon itself.

As noted earlier, the risk of having overfit an algorithm is significantly reduced when the explanation for a model’s decision makes intuitive sense to humans. With the tools we use and build upon, not only do we have a way to identify probabilities and predict outperformance, we’ve also created infrastructure to educate and assist critical decision makers. Important decisions can be informed by probabilistic rules rather than by emotion and implicit bias.

Further Reading and Sources:
* (Molnar)
* (Lopez De Prado)