Model Drift, Automatic Retraining and How Not to Ruin your Models
Predictive models used in business processes tend to lose their business value over time. This can be due to model drift or a result of unforeseen side effects of automatically retraining the model. In this blog we explain what model drift is and why (automatic) retraining can be a good way to deal with it. We also discuss why retraining should be done with care, since it can have unanticipated effects. Most of all, we advocate closely monitoring your models and setting up alerting to track their performance, and we provide tips on how to do that.
Predictive models in production
In many companies the use of predictive models has become common practice. Everywhere in the organization algorithms are used to make campaigns more effective and processes more efficient. In recent years it has become a lot easier to use these models to make predictions on new data frequently, scoring new cases on a daily basis or even continuously. Therefore, many organizations have productionized frequent scoring of their models to get the most out of their efforts. In addition to frequent scoring, automatic retraining of models has become much more feasible as well: every night a new (and supposedly better) version of the model is created. Automatic retraining is becoming the new standard, as it seems a great way to keep models performing as well as they can.
Or is it? Does this always keep these models performing the best they can? Often the answer is yes, but there are some things to keep in mind and to monitor closely, to be sure your model still ‘does best what it is supposed to do’. In this article we go into more detail on the caveats of ‘auto-retrain-by-default’ and help you set up the checks and balances to stay in full control.
Model drift, concept drift, data drift
Let’s start with a simple question. Some time ago your rock star data scientist trained a killer predictive model that generated great business results. That model was implemented and has been used ever since to score new data and feed those scores into business processes. Why would you consider changing that piece of gold? Well, there are some good reasons: things that might impact the performance of your model. Most importantly: concept drift and data drift.
Concept drift refers to the fact that we live in an ever-changing world. What made us tick before might not be what makes us tick today. Preferences, markets, circumstances, regulations, they all change. What we called trendy not that long ago might be considered old-fashioned today. And spam messages that were easy to point out in your mailbox some time ago can be hard to recognize nowadays. Therefore, the rules that our predictive models have learned during training can wear out. This is known as concept drift and it’s a good reason to consider updating our models, trained on ‘old world data’.
Data drift is another reason your models might stop predicting accurately. Even when the concepts we’ve modeled are stable over a longer period of time, the data we use – to score our models for one – does change constantly. Data storage techniques evolve, preprocessing pipelines migrate, software is updated, and sensors are replaced. And there’s the human factor impacting data as well: workflows and processes change, human errors occur. For these and other reasons, data changes and suddenly the new data used for scoring can look quite different from the data we used to train our model. This is what we call data drift.
Together, concept drift and data drift are what we could summarize as model drift: as a result of these types of drift, the model no longer does what it is intended to do, or no longer does it as well as before. Retraining might solve this. And it might not…
Why not just retrain your models (continuously)?
We have seen that the world changes constantly, resulting in concept drift, and that data changes continuously as well, resulting in data drift. ‘Then why not retrain every day?’ you might ask. We see a trend of companies adopting this ‘retrain-by-default’ strategy to avoid using ‘outdated’ models, often as the (only) solution to these problems. We’re not claiming that it can’t be a good solution: we believe that continuous retraining can be a very good idea, but not always, and we don’t see it as the one and only solution. Let’s discuss some reasons why not, before looking at what else you can do.
Why not #1: throwing the baby out with the bathwater
Often, predictive models are initially trained on a rich dataset, collected and prepared to get the most predictive value from the available data. The observations are gathered with care, sometimes using test campaigns, A/B tests or careful selection of historic data, to represent the population and the actual target (behavior) we want to predict as well as possible. Also, many features are crafted as potential top predictors by the data scientist when building the initial model, to get the most predictive value out of the data available. The model that is trained on this rich data learns a lot from it and summarizes that knowledge as efficiently as possible (in a well-trained model).
When the model is constantly retrained, the quality of the newly gathered data used for retraining is key to the quality of the new predictions. No fuel, no gain! But is that newly acquired data as rich as the initial training data? This question is often overlooked. And there is another often neglected effect: the implemented initial model might itself reduce the richness of the new data we acquire! Because we started to use that model, we might never see the possible target behavior of specific types of observations again, since our model told us not to act on them.
For instance: a fraud prediction model might have learned that it is best to exclude specific visitors of your website from buying your products or services on credit. Once this model is implemented, you might have little or no new observations of those fraudulent visitors in the newly acquired data. However, if certain important predictive behaviour is not seen any more, the model is not able to ‘relearn’ whether that truth still holds in the ‘new world’. Is there a risk your newly retrained model reopens doors you so carefully closed with your prior model? Retraining only on data gathered after the initial model was implemented might therefore destroy the value of your model, if no action is taken to gather new data that is rich enough.
Why not #2: putting band-aids on actual data problems (data drift)
All of you who have data for breakfast know this for a fact: data changes over time. Not only due to actual changes – like the increasing world temperature and seawater levels – but also due to changes in data collection, storage and processing. As stated before: storage techniques, pipelines, sensors, workflows, they all change over time. And things can break occasionally. That means that the features we used in training might look different by the time we use them for scoring.
Retraining could fix that, if we retrained on a set in which the feature is defined consistently across the complete retraining period. Often, this is not the case, though! Part of the data is on the ‘old version’ of the feature and part is on the ‘new version’. Sometimes the data can be restored, but not always: think of replaced or malfunctioning sensors generating the data. Retraining the model on a period that contains both versions will probably cause the model to dismiss the feature, because its content is inconsistent over the whole retraining period, even though it might have been a very important feature before… Checking whether the feature data is restorable, instead of just retraining, might be the better option here. Just continuing to retrain automatically until the ‘old’ or ‘bad’ data is out of scope is often not the best option…
Why not #3: The world is reshaped every day! Isn’t it? (concept drift)
A third reason to be somewhat conservative in retraining models every month / week / day / hour is the honest answer to the question: Is the concept we try to predict really that ever-changing? Do the customers that respond to our campaigns do that for reasons that differ from the customers that responded to the same offer yesterday? Or the day before? Are we retraining all our models that frequently because we think we need to or because we know we can? Are we actually seeing that the model stopped performing?…
Don’t get us wrong, automatically retraining models is in many contexts a very valuable tool to achieve our goals, but what holds for models in general also holds for automatically retrained models: it should not be a goal in itself to retrain models as frequently as we can. We should look for the optimal frequency of retraining, and for that we need to know how well (or badly) our model is performing, and whether the data we use for scoring is changing. And that’s what we’ll discuss next: how can we keep track of that?
To stay in control of our models: monitor outputs and inputs!
If we shouldn’t retrain models automatically all the time, what should we do then? Our advice: before automating model retraining, set up essential checks and balances on both the outputs and the inputs of your model scoring process! This way you stay in control and can be confident that the predictive model you’ve implemented keeps doing what it’s supposed to do.
We advise monitoring three things to confidently keep using a model. This can be a static model that’s only trained once or retrained ad hoc, or it can be an automatically retrained model. Further on, we provide tips on when to consider automatically retraining a model. We will also go into detail on how to set up really helpful and valuable monitoring – alerting only when relevant developments in relevant metrics arise – so you won’t be overwhelmed with yet another stream of metrics and alerts.
#1. Monitor the business outcome
A model is a means to accomplish a goal. This goal is a business outcome: response to an offer, retention of a customer, failure of a process, … The model is not the only factor responsible for success, but it has an important role in accomplishing this goal. Monitoring the business effect with regard to the model outcome – such as the business result per model score decile – is closest to what the model is supposed to do in business terms. Sometimes – not always – it is possible to monitor the business outcome in terms of model accuracy, precision or recall over time. If so, this is an important way to monitor how well the model keeps on performing. When these metrics change, it might be that the model is wearing out or something goes wrong in the scoring process.
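As an illustration of the ‘business result per model score decile’ idea, here is a minimal sketch in plain Python. The function name, the binary outcome and the example data are our own assumptions for this illustration; in practice you would run this per scoring period on your own scores and observed outcomes.

```python
from statistics import mean


def outcome_per_decile(scores, outcomes, n_bins=10):
    """Mean business outcome per score bin, highest-scoring bin first."""
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    # Rank cases from highest to lowest model score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rates = []
    for b in range(n_bins):
        # Integer arithmetic keeps the bin sizes within one case of each other.
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        rates.append(mean(outcome for _, outcome in chunk) if chunk else float("nan"))
    return rates


# Hypothetical data: 100 scored cases, of which the 20 highest-scoring responded.
scores = [i / 100 for i in range(100)]
outcomes = [1 if i >= 80 else 0 for i in range(100)]
rates = outcome_per_decile(scores, outcomes)
```

For a model that still works, the outcome rate should fall from the first bin to the last. Tracked per period, a flattening of that pattern is a signal that the model may be wearing out.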
#2. Monitor model scoring output
Frequently – once per month, week, day, minute, second – your model is applied to new data to score those new records. This periodic scoring process results in a prediction score: the probability that a customer will respond to an offer, the probability that a process will break or the predicted number of inbound calls for tomorrow. Keeping a close eye on these prediction outcomes is an important way to see whether something is changing, due to concept drift or data drift. Does the minimum, maximum, mean or median value of these scores change, or are more cases suddenly without a score? Big changes and long-term trends should result in alerts, to signal change that might impact business results. Since changes in model scores can have a delayed effect on business results, it’s very important to monitor model scoring output.
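The checks described above could be sketched as follows, using only the Python standard library. The function names, the use of None for an unscored case and the threshold of three standard deviations are assumptions made for this example, not a recommendation for every model.

```python
from statistics import mean, median, stdev


def summarize(scores):
    """Summarize one scoring run; None marks a case that received no score."""
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("scoring run contains no scored cases")
    return {
        "min": min(valid),
        "max": max(valid),
        "mean": mean(valid),
        "median": median(valid),
        "missing_rate": 1 - len(valid) / len(scores),
    }


def alerts(reference_runs, new_run, threshold=3.0):
    """Flag metrics of new_run that deviate strongly from the reference runs."""
    flagged = {}
    for metric, value in new_run.items():
        history = [run[metric] for run in reference_runs]
        spread = stdev(history)
        if spread == 0:
            # No historical variation: any change at all is worth a look.
            deviation = 0.0 if value == history[0] else float("inf")
        else:
            deviation = (value - mean(history)) / spread
        if abs(deviation) > threshold:
            flagged[metric] = deviation
    return flagged


# Hypothetical history of four stable scoring runs, followed by a run in which
# the scores have shifted upwards and one case received no score.
reference = [
    summarize([0.1 + d, 0.2 + d, 0.3 + d, 0.4 + d])
    for d in (0.00, 0.01, -0.01, 0.02)
]
new_run = summarize([0.5, 0.6, None, 0.8])
flagged = alerts(reference, new_run)
```

Here every metric of the new run is flagged, while a run that resembles the history produces no alert. With only a handful of reference runs the standard deviation is a rough estimate, so in practice a longer history gives more stable alerts.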
#3. Monitor model scoring input
When data drift occurs, this is best identified by keeping a close eye on the changes over time in the features used in the model. These changes have a direct impact on the model scoring results. When automatic retraining is in place, you might not directly see a big change in the model scoring results, as parameters are adjusted and changing features might be dropped. It might decrease the predictive power of your model though, if the feature that changed is an important predictor, as we discussed earlier. Therefore, it is crucial that you know whether a feature changed, so you can check whether a repairable data issue caused the change.
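One common way to quantify how much a feature has changed between training and scoring is the population stability index (PSI). It is not the only option and it is our choice for this illustration; the equal-width bins, the function name and the example data are assumptions of this sketch.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a training and a scoring sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the training range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature: uniform during training, shifted upwards at scoring time.
train_values = [i / 1000 for i in range(1000)]
shifted_values = [v + 0.3 for v in train_values]
psi_same = psi(train_values, train_values)
psi_shifted = psi(train_values, shifted_values)
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a substantial shift, but these cut-offs are conventions rather than guarantees. A high value is a reason to investigate the feature, not automatically a reason to retrain.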
Monitoring in practice: popmon
So, we know that most companies have many different predictive models in production. And we just indicated that all these models – their outputs and inputs – need monitoring. Most of these models have tens if not hundreds of features. Keeping a close eye on possibly important changes in all these outputs and inputs seems unfeasible! Luckily, there are some good tools to help you monitor only relevant changes.
Popmon is one of those tools. It helps you set up monitoring that only results in an alert when something significant happens to an output or input. It does so by summarizing the distribution of variables over time and comparing new periods to prior periods. Multiple metrics are calculated based on these comparisons, and when a significant change occurs, an alert is triggered. This saves you from having to keep a close eye on all those hundreds of outputs and inputs of scoring. When an alert is triggered, you can (and should) investigate it to see what is off and what you need to do to make sure the model keeps on performing at its best.
To help you get started with popmon, we’ve written a notebook that shows how you can apply popmon to your data and get the relevant metrics and alerts to integrate with your own reporting and dashboarding tools. Popmon also provides out-of-the-box HTML reports; you can read more on those in the popmon documentation.
Other things to consider before you start (automatic) retraining
When it is apparent that your model starts to wear out because of concept drift or data drift that is not the result of repairable issues, you should retrain your model. Model retraining can be done ad hoc until a new signal of decay comes up, or you can set up an architecture to automate retraining, either periodically or when decay is significant. Before retraining, there are some things to keep in mind to avoid destroying the value of your model.
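Retraining ‘when decay is significant’ needs a concrete rule. The sketch below shows one possible rule, with hypothetical names and thresholds: retrain only when a performance metric has stayed below its baseline for several consecutive periods and no repairable data issue is open.

```python
def should_retrain(metric_history, baseline, tolerance=0.05, patience=3,
                   open_data_issue=False):
    """Decide whether decay is persistent enough to justify retraining.

    metric_history: performance per period, oldest first (higher is better).
    baseline: performance of the model when it was implemented.
    """
    # A repairable data issue should be fixed first, not trained around.
    if open_data_issue or len(metric_history) < patience:
        return False
    recent = metric_history[-patience:]
    # Require every recent period to be clearly below the baseline,
    # so that a single bad period does not trigger retraining.
    return all(value < baseline - tolerance for value in recent)


# Hypothetical histories for a model that started at 0.80.
persistent_decay = should_retrain([0.80, 0.74, 0.73, 0.72], baseline=0.80)
single_dip = should_retrain([0.80, 0.81, 0.72, 0.80], baseline=0.80)
blocked = should_retrain([0.80, 0.74, 0.73, 0.72], baseline=0.80,
                         open_data_issue=True)
```

The tolerance and patience values determine how conservative the rule is. Which values make sense depends on how noisy the metric is and how costly a worn-out model is for your business process.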
#1 seasonality
Is it apparent that there is seasonality in the target you are trying to predict? Sales or processes and their relationship with features used for prediction might be different during the year. If this is the case, make sure retraining is done on data that enables you to address this seasonality!
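A simple sanity check before retraining, assuming a yearly pattern at monthly granularity (both are assumptions of this sketch, as is the function name), is to verify that the retraining window contains observations from every calendar month:

```python
from datetime import date


def missing_months(observation_dates):
    """Return the calendar months (1-12) absent from the retraining data."""
    seen = {d.month for d in observation_dates}
    return sorted(set(range(1, 13)) - seen)


# Hypothetical retraining window covering only January to June 2023.
window = [date(2023, month, 1) for month in range(1, 7)]
gaps = missing_months(window)
```

If the result is not empty, a model retrained on this window has never seen the target behaviour in the missing months, so a longer window or explicit seasonal features may be needed.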
Temporary changes in the market can be seen as a special case of seasonality. Something like a short-term production issue, an event in society or bad publicity might impact short-term behaviour. Do you really want your model to incorporate that in its predictions, or is it best for the model to keep its knowledge of how things will be when they go back to normal?
#2 prior model impact
From the moment the model was first implemented, it might have started changing the new data you gather. Remember our example: a fraud model excludes specific profiles from buying your products or services. Therefore, you might have little or no new observations of those profiles in the newly acquired data. Is there a risk your newly retrained model reopens doors you so carefully closed with your prior model?
On the other hand, retraining on the narrowed-down population might actually further decrease the impact of the model. Imagine a response model that predicts the most likely prospects for one of your products. The previous version might have already ruled out specific parts of the target audience – age classes, geographic regions – as not relevant. When these implemented rules are strict and no random prospects are selected to enable future learning, your retraining efforts can result in smaller and smaller target audiences within the previously narrowed boundaries.
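Selecting some random prospects to enable future learning could look like the sketch below. The function name, the 5% exploration rate and the logging of the selection source are assumptions for this example.

```python
import random


def select_with_exploration(model_selected, explore_rate=0.05, rng=random):
    """Decide whether to act on a case and record why it was selected."""
    if model_selected:
        return {"selected": True, "source": "model"}
    # Let a small random share of rejected cases through, so that future
    # training data still contains outcomes for profiles the model rules out.
    if rng.random() < explore_rate:
        return {"selected": True, "source": "exploration"}
    return {"selected": False, "source": "model"}


# Hypothetical run: 10,000 cases that the model would all have rejected.
rng = random.Random(42)
decisions = [select_with_exploration(False, rng=rng) for _ in range(10_000)]
explored = sum(d["source"] == "exploration" for d in decisions)
```

Storing the source with every decision makes it possible to retrain on a population that is not filtered by the previous model. The exploration rate is a business trade-off: for a fraud model, every explored case carries a real risk, so the rate may have to be much lower than for a marketing campaign.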
#3 focus on your (business) goal
We already mentioned it earlier: a model is a means to accomplish a (business) goal. Quite often, we encounter organisations in which implementing automatic retraining for all their models seems to be the higher goal. In our view this is not how it should be. Automatic retraining can be a most valuable tool, but it is still a means, not a goal.
Wrapping it up
In this article we introduced model drift and its variants, data drift and concept drift, and showed how they can impact your predictive models. Many organizations start automatically retraining models, anticipating possible wear-out due to model drift. We showed that this can be very beneficial but should be done with care, because there is a risk of destroying the performance of your models as well. We advise starting with the right monitoring and alerting, for instance using tools like popmon, and we ended with some last tips to consider before you start ‘auto-retraining everything’.