A practical guide to using the popmon package

Introduction

Our clients often ask us what the expiration date of a predictive model is. Unfortunately, there is no general rule of thumb you can apply here. The expiration date of a model completely depends on the changes in the world around the model, a.k.a. the model drift, which can be split into concept drift and data drift, as my colleague Jurriaan Nagelkerke explains in detail in this very interesting article. There you can not only learn about model drift, concept drift and data drift, but also why (automatic) retraining isn't always the best solution and, better yet, what you should do to keep your models alive and kicking! I highly recommend reading that article before reading this one. Here, I assume you already know about model drift and are looking for a practical example of how to implement monitoring of models in production.

To be able to monitor model drift, ING has created their own Python package: popmon. We love (their work on) this package, but it did take us a little while to figure out how to use it in a way that would help us get a sense of our model drift. That's why we decided to help all of you newbies to popmon who are also interested in monitoring model drift. The beauty of this package lies in the fact that ING definitely applied the KISS (keep it simple, stupid) principle when developing it, meaning we only need a couple of smart functions to reach our goals. Unfortunately for us Data Scientists, we always want to grasp exactly what we are doing, so this also meant a deep dive into the package and into what these functions do. So, let us take you on a (short and simple) ride through popmon.

Our use case

Since there are multiple use cases for popmon, we want to start by describing the use case we will be focusing on in this article: a Data Scientist/Analyst created a predictive model, which will be taken into production, and therefore model drift monitoring needs to be set up for this model.

To be more precise: we want to keep track of the data that is used to score our model and continuously check whether this data is not too different from the data the model was trained on. This is a very common use case, although this monitoring is often not set up.

Installation of the package

Before we can really start, we need to install popmon, which is as simple as:

!pip install popmon


Initialize the notebook

In [1]:
# import packages
import pandas as pd
import pickle
import popmon
from popmon import resources
In [2]:
# make sure all output in each cell shows, without explicitly printing them
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [3]:
# adjust default display settings of pandas dataframes
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)


The data

The example dataset contains multiple features. For the sake of simplicity, we only keep a couple of them and pretend we used these features to predict a column named isActive. Since this is the dependent variable, which we can't monitor in production because it is unknown at scoring time, we've removed it from our training set as well.

Import our example data
In [4]:
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df = df[["age", "balance", "eyeColor", "favoriteFruit", "gender", "latitude", "transaction", "currency", "date"]]
df["date"] = df["date"].dt.to_period("M")
In [5]:
df.head()
Out[5]:
age balance eyeColor favoriteFruit gender latitude transaction currency date
0 16 $1,333.24 grey banana female 18.821230 1131.03 DOL 2015-05
1 57 $3,777.44 brown banana female -82.134610 -896.55 DOL 2015-06
2 26 $1,248.97 blue strawberry female 62.383017 1224.04 DOL 2015-10
3 63 $1,049.77 green strawberry male -26.913326 541.08 DOL 2015-06
4 74 $2,942.64 red strawberry male -2.777228 -372.62 DOL 2015-03
In [6]:
_ = df['date'].value_counts().sort_index().plot(kind='bar')



Create the different batches we would have had if this model had been in production
Because we're using a single example dataset for now, we need to split it into a couple of different batches to recreate our example of taking a model into production and scoring new data regularly:

  • our traindata
  • the first batch of scoring data when the model is in production
  • the second batch of scoring data when the model is in production
  • the third batch of scoring data when the model is in production

Let's say we want to check our model scoring data for changes on a monthly basis.

In [7]:
# our training data consists of the first 8 months of 2015
df_train = df[df.date < "2015-09"]
In [8]:
# batch_1 would be the first batch to be fed to the model after this has been put into production
batch_1 = df[df.date == "2015-09"]
In [9]:
# batch_2 would be the second batch to be fed to the model after this has been put into production
# we won't use this dataset in this example, but you can use it when playing around with this notebook yourself
batch_2 = df[df.date == "2015-10"]
In [10]:
# batch_3 would be the third batch to be fed to the model after this has been put into production
# we won't use this dataset in this example, but you can use it when playing around with this notebook yourself
batch_3 = df[df.date == "2015-11"]


Initializing the monitoring

We will use the training data to teach popmon what our data looks like, already dividing the data into the different months, so popmon knows what it should expect from a new month. This function uses another package built by ING: histogrammar. A key strength of popmon (using histogrammar) is that it saves everything needed to monitor each feature in a histogram object. Histograms are small in terms of storage and safe in terms of privacy risks, as no individual values need to be stored. Also, histograms can easily be compared to other histogram objects, resulting in fast processing once the histograms are created. These histograms are used to calculate all the metrics that drive the alerts. Popmon generates a whole load of metrics; we've given a short overview and explanation of all of these metrics in this article.

In [11]:
hists = df_train.pm_make_histograms(time_axis="date")
hists
2022-04-15 08:12:51,536 INFO [histogram_filler_base]: Filling 8 specified histograms. auto-binning.
100%|█████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.60it/s]
Out[11]:
{'date:age': <Categorize values=SparselyBin size=8,
 'date:balance': <Categorize values=Categorize size=8,
 'date:currency': <Categorize values=Categorize size=8,
 'date:eyeColor': <Categorize values=Categorize size=8,
 'date:favoriteFruit': <Categorize values=Categorize size=8,
 'date:gender': <Categorize values=Categorize size=8,
 'date:latitude': <Categorize values=SparselyBin size=8,
 'date:transaction': <Categorize values=SparselyBin size=8}




We can get the binning specifications from the histogrammar objects. This way we can make sure that the histograms created on a new batch will be binned in exactly the same way.

In [12]:
bin_specs = popmon.get_bin_specs(hists)
bin_specs
Out[12]:
{'date:age': [{}, {'binWidth': 2.0, 'origin': 9.5}],
 'date:balance': [{}, {}],
 'date:currency': [{}, {}],
 'date:eyeColor': [{}, {}],
 'date:favoriteFruit': [{}, {}],
 'date:gender': [{}, {}],
 'date:latitude': [{},
  {'binWidth': 4.347556988888889, 'origin': -85.24212737777779}],
 'date:transaction': [{},
  {'binWidth': 62.90755555555556, 'origin': -1019.9291111111112}]}




Now we're ready to compare new periods to the reference (train) period. This comparison is done using the histograms we've created; the output consists of other metrics calculated from the histogram metrics, which is why you won't see any histograms in the next piece of code. Popmon provides us with a lot of useful output and will guide us in evaluating whether new data is similar (enough) to the reference data. We will use that information without looking at the histograms.

Before we look at more popmon output, it's good to briefly introduce some key concepts of popmon. For more detail, we refer to the package documentation, but these concepts are essential to understand:

  • profiles: per batch (time period), popmon calculates several profile statistics for each feature, like the mean, min, max, number of missings and distinct value count.
  • comparisons: each new batch is compared to the specified reference (see the next concept) and several comparison statistics are calculated, mainly significance tests (chi-square, Kolmogorov-Smirnov) that evaluate the difference between the histogram distribution of the new batch and that of the reference.
  • references: what to compare with? popmon offers four options: an external reference - a specified external source, such as the training data; rolling - a rolling window over preceding time periods; prev1 - the preceding time period; and expanding - all preceding time periods. A sketch of these options follows after this list.
  • alerts: traffic light style alerting based on the profiles and comparisons, to only warn when something important (enough) is different in the data compared to the reference. Using customizable thresholds, green, yellow and red traffic lights are defined for each feature and each metric.
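To make these reference options concrete, here is a minimal sketch of how each could be requested. This assumes popmon's pm_stability_report DataFrame accessor and its reference_type and window parameters; prev1 is expressed here as a rolling window of size 1, and exact parameter names and defaults may differ per popmon version.

# external: compare each new batch against a fixed external reference,
# e.g. histograms made on the training data
report = popmon.stability_report(hists=new_hists, reference_type="external", reference=hists)

# rolling: compare each batch against a rolling window of preceding batches
report = df.pm_stability_report(time_axis="date", reference_type="rolling", window=3)

# prev1: compare each batch against only the preceding batch
# (expressed as a rolling window of size 1)
report = df.pm_stability_report(time_axis="date", reference_type="rolling", window=1)

# expanding: compare each batch against all preceding batches
report = df.pm_stability_report(time_axis="date", reference_type="expanding")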

We won't go into detail for now, but it is useful to know that it is possible to change the monitoring rules popmon uses to create the alerts. We can change the rules for a single metric (which is then applied to all features) or for a (group of) feature(s). Below you can find a code example of how to change the monitoring rules, where we, for instance, set the boundaries for age to 18 and 100 for a "yellow" alert and to 0 and 120 for a "red" alert, meaning that an alert is triggered if a value of the feature age falls outside these boundaries. Popmon uses very sensible monitoring rules by default, so most of the time it probably won't be necessary to change them at all.

In [13]:
# each rule lists four boundaries, from high to low: [red_upper, yellow_upper, yellow_lower, red_lower]
monitoring_rules = {
    "*_pull": [9, 3, -3, -9],
    "*_zscore": [7, 4, -4, -7],
    "[!p]*_unknown_labels": [0.5, 0.5, 0, 0],
    "age:min": [120, 100, 18, 0],
    "age:max": [120, 100, 18, 0],
}

Oh, and one more thing, because there is one term you will often see in the output that might not be immediately clear to you: the pull. Pull refers to 'the normalized residual of a value of interest with respect to the selected reference'. Not sure if that made things clearer yet... In our own words, the pull is the standardized version of each statistic, so that all statistics are on the same scale and we can use the same thresholds for different statistics. The pull indicates to what extent a value of a statistic should be interpreted as an actual difference between the new data and the reference data. It's calculated for every profile of every feature like this:

$pull = \frac{value - reference_{mean}}{reference_{std}}$
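As a quick illustration, here is the pull computed by hand for a single statistic; the numbers are made up and only serve to show the mechanics:

# hypothetical numbers: mean of `age` in the new batch vs. the reference
ref_mean = 48.9   # mean of the statistic over the reference batches
ref_std = 3.2     # standard deviation of the statistic over the reference batches
new_value = 55.1  # value of the statistic in the new batch

pull = (new_value - ref_mean) / ref_std  # ~1.94
# with our "*_pull" rule [9, 3, -3, -9]: |pull| <= 3 is green,
# 3 < |pull| <= 9 is yellow, |pull| > 9 is red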


Ready to calculate the monitoring metrics and alerts

To illustrate the type of objects popmon creates, we start with calculating all the metrics (and alerts) on the training data. Do note that this is not necessary when we want to see whether new scoring data is similar (enough) to the training data to be used for scoring.

The training data contains multiple periods/batches of data (the time_axis parameter we used to make the histograms). The datastore object we're creating here is essentially a dictionary holding all calculated metrics and, more importantly, the alerts that were generated based on these metrics. It is recommended to explore this dictionary to see what you can expect and, even better, to decide how you want to use this information to create your own alerts. You can use these values to trigger the messages (for instance via Slack or email) you want to receive, making the monitoring completely and easily adjustable to the preferred way of working in your company.
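For example, a minimal sketch of such a custom alert, assuming a hypothetical send_slack_message() helper that wraps your messaging tool of choice (this helper is not part of popmon):

def notify_on_red_alerts(datastore, feature):
    # the alerts object holds, per period, the number of red/yellow/green lights
    alerts = datastore["alerts"][feature]
    n_red = alerts["n_red"].sum()
    if n_red > 0:
        # send_slack_message() is a hypothetical helper, not part of popmon
        send_slack_message(f"popmon: {int(n_red)} red alert(s) for feature '{feature}'")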

In [14]:
# calculate the metrics based on all histograms and the monitoring rules we adjusted
# (when no adjustments are made, you can omit this parameter and popmon will use its own defaults)
datastore = popmon.stability_metrics(hists=hists, monitoring_rules=monitoring_rules)
2022-04-15 08:12:53,842 INFO [hist_splitter]: Splitting histograms "hists" as "split_hists"
2022-04-15 08:12:53,860 INFO [hist_comparer]: Comparing "split_hists" with rolling sum of 1 previous histogram(s).
2022-04-15 08:12:56,736 INFO [hist_comparer]: Comparing "split_hists" with reference "split_hists"
2022-04-15 08:12:57,252 INFO [pull_calculator]: Comparing "comparisons" with median/mad of reference "comparisons"
2022-04-15 08:12:57,332 INFO [hist_profiler]: Profiling histograms "split_hists" as "profiles"
2022-04-15 08:12:57,412 INFO [pull_calculator]: Comparing "profiles" with median/mad of reference "profiles"
2022-04-15 08:12:57,821 INFO [apply_func]: Computing significance of (rolling) trend in means of features
2022-04-15 08:12:57,880 INFO [compute_tl_bounds]: Calculating static bounds for "profiles"
2022-04-15 08:12:58,174 INFO [compute_tl_bounds]: Calculating static bounds for "comparisons"
2022-04-15 08:12:58,203 INFO [compute_tl_bounds]: Calculating traffic light alerts for "profiles"
2022-04-15 08:12:58,285 INFO [compute_tl_bounds]: Calculating traffic light alerts for "comparisons"
2022-04-15 08:12:58,325 INFO [apply_func]: Generating traffic light alerts summary.
2022-04-15 08:12:58,363 INFO [alerts_summary]: Combining alerts into artificial variable "_AGGREGATE_"

Since we've asked popmon to calculate stability_metrics on the set of training period histograms without specifying a reference, it will check, for each period within the training data, whether its values and distributions are similar to those of all other periods within the training data.

In [15]:
# let's look at what content can be found in the datastore
datastore.keys()
Out[15]:
dict_keys(['hists', 'split_hists', 'comparisons', 'profiles', '08546e4a-2d6d-4e81-b81c-671ae2806d58', 'dynamic_bounds', '1323e273-0562-4924-93d5-dc54d7966970', 'dynamic_bounds_comparisons', 'static_bounds', '70e37164-47bd-4027-8807-173461b65def', 'traffic_lights', 'static_bounds_comparisons', 'd043a793-9735-4d7a-8693-4f1eff423a3d', 'alerts'])
In [16]:
# you can access the datastore like you would any other dictionary
# we will just show you one random example here
datastore['comparisons']['age']
Out[16]:
prev1_ks prev1_ks_zscore prev1_ks_pvalue prev1_pearson prev1_chi2 prev1_chi2_norm prev1_chi2_zscore prev1_chi2_pvalue prev1_chi2_max_residual prev1_chi2_spike_count prev1_max_prob_diff prev1_unknown_labels ref_ks ref_ks_zscore ref_ks_pvalue ref_pearson ref_chi2 ref_chi2_norm ref_chi2_zscore ref_chi2_pvalue ref_chi2_max_residual ref_chi2_spike_count ref_max_prob_diff ref_unknown_labels ref_max_prob_diff_std ref_max_prob_diff_mean ref_max_prob_diff_pull mean_trend10_zscore
date
2015-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.442418 -2.312559 0.989627 0.446020 29.987816 0.749695 -1.153065 0.875558 2.368508 0.0 0.065737 NaN 0.0251 0.077203 -0.456798 NaN
2015-02 0.946779 0.435895 0.331457 -5.061864e-17 33.504079 1.015275 0.143809 0.442826 2.001351 0.0 0.100000 NaN 0.918138 0.336679 0.368180 0.258062 30.261057 0.756526 -1.116126 0.867816 2.732871 0.0 0.080237 NaN 0.0251 0.077203 0.120885 NaN
2015-03 0.665432 -0.731410 0.767735 -7.380564e-02 31.669907 0.959694 -0.083522 0.533282 2.510413 0.0 0.148148 NaN 0.462763 -2.118823 0.982947 0.469252 27.002309 0.675058 -1.572555 0.942089 2.296013 0.0 0.104670 NaN 0.0251 0.077203 1.094311 NaN
2015-04 0.583586 -1.201607 0.885242 2.000875e-02 26.657297 0.952046 -0.092784 0.536962 2.115063 0.0 0.148148 NaN 0.907939 0.300592 0.381863 0.346271 29.532473 0.738312 -1.215136 0.887843 2.379007 0.0 0.083427 NaN 0.0251 0.077203 0.247989 NaN
2015-05 0.557784 -1.370834 0.914787 3.796117e-01 17.787931 0.684151 -1.191912 0.883352 1.414657 0.0 0.102217 NaN 0.652802 -0.798327 0.787660 0.487455 22.740538 0.568513 -2.230893 0.987156 2.489760 0.0 0.102358 NaN 0.0251 0.077203 1.002200 NaN
2015-06 1.332077 1.575995 0.057514 -4.175666e-02 27.576876 0.919229 -0.234877 0.592848 2.237766 0.0 0.137931 NaN 1.183783 1.168678 0.121267 0.580993 22.905891 0.572647 -2.203816 0.986231 1.824449 0.0 0.074169 NaN 0.0251 0.077203 -0.120885 NaN
2015-07 0.720158 -0.460633 0.677469 9.555846e-02 26.051765 0.814118 -0.710531 0.761312 1.940285 0.0 0.117647 NaN 0.586110 -1.185666 0.882123 0.114513 28.046507 0.701163 -1.422374 0.922541 2.154558 0.0 0.054809 NaN 0.0251 0.077203 -0.892202 NaN
2015-08 0.656204 -0.780122 0.782340 -2.902757e-01 29.590741 0.896689 -0.352139 0.637633 1.517574 0.0 0.074074 NaN 0.321042 -3.893301 0.999951 0.464718 20.608440 0.515211 -2.593056 0.995244 1.940474 0.0 0.042454 NaN 0.0251 0.077203 -1.384456 NaN


Depending on the way your company prefers to work, you can store these objects as (pickle) files, or write them to a database. This way you can retrieve and expand them every time you use the model on a new batch of data.

In [17]:
# save the objects as pickles (we could also create JSONs and/or store them in a database)
pickle.dump(hists, open("all_hist.pkl", "wb"))
pickle.dump(monitoring_rules, open("monitoring_rules.pkl", "wb"))
pickle.dump(datastore, open("datastore.pkl", "wb"))
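Later, in the scoring pipeline, these objects can be loaded back the same way; a minimal sketch:

# retrieve the stored histograms and monitoring rules when a new batch arrives
hists = pickle.load(open("all_hist.pkl", "rb"))
monitoring_rules = pickle.load(open("monitoring_rules.pkl", "rb"))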


Now let's try to use the monitoring on our first new batch of data

Earlier, we prepared the scoring batches. Using those batches of data, we are ready to check for drift against the histograms we made from the training data. So let's start doing that now.

Create the histograms of the new batch, using the same bin specifications
In [18]:
# start by generating the histograms on the new dataset, using the same bin specifications to make the histograms comparable
new_hists = batch_1.pm_make_histograms(time_axis="date", bin_specs=bin_specs)
new_hists
2022-04-15 08:12:58,655 INFO [histogram_filler_base]: Filling 8 specified histograms. auto-binning.
100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 149.55it/s]
Out[18]:
{'date:age': <Categorize values=SparselyBin size=1,
 'date:balance': <Categorize values=Categorize size=1,
 'date:currency': <Categorize values=Categorize size=1,
 'date:eyeColor': <Categorize values=Categorize size=1,
 'date:favoriteFruit': <Categorize values=Categorize size=1,
 'date:gender': <Categorize values=Categorize size=1,
 'date:latitude': <Categorize values=SparselyBin size=1,
 'date:transaction': <Categorize values=SparselyBin size=1}


Now that we have histogram objects for the new scoring batch, and we had already created those for the training data, we can calculate the metrics of the new batch, using our training data as the reference.

In [19]:
batch_datastore = popmon.stability_metrics(hists=new_hists, monitoring_rules=monitoring_rules, reference_type='external', reference=hists)
2022-04-15 08:12:58,786 INFO [hist_splitter]: Splitting histograms "hists" as "split_hists"
2022-04-15 08:12:58,796 INFO [hist_comparer]: Comparing "split_hists" with rolling sum of 1 previous histogram(s).
2022-04-15 08:12:58,851 INFO [hist_splitter]: Splitting histograms "ref_hists" as "split_ref_hists"
2022-04-15 08:12:58,865 INFO [hist_comparer]: Comparing "split_hists" with reference "split_ref_hists"
2022-04-15 08:12:59,019 INFO [pull_calculator]: Comparing "comparisons" with median/mad of reference "comparisons"
2022-04-15 08:12:59,102 INFO [hist_profiler]: Profiling histograms "split_hists" as "profiles"
2022-04-15 08:12:59,154 INFO [hist_profiler]: Profiling histograms "split_ref_hists" as "ref_profiles"
2022-04-15 08:12:59,274 INFO [pull_calculator]: Comparing "profiles" with reference "ref_profiles"
2022-04-15 08:12:59,547 INFO [apply_func]: Computing significance of (rolling) trend in means of features
2022-04-15 08:12:59,575 INFO [compute_tl_bounds]: Calculating static bounds for "profiles"
2022-04-15 08:12:59,765 INFO [compute_tl_bounds]: Calculating static bounds for "comparisons"
2022-04-15 08:12:59,803 INFO [compute_tl_bounds]: Calculating traffic light alerts for "profiles"
2022-04-15 08:12:59,933 INFO [compute_tl_bounds]: Calculating traffic light alerts for "comparisons"
2022-04-15 08:12:59,973 INFO [apply_func]: Generating traffic light alerts summary.
2022-04-15 08:13:00,003 INFO [alerts_summary]: Combining alerts into artificial variable "_AGGREGATE_"

To get a deeper understanding of what we end up with, let's have a closer look at four objects in the batch_datastore we just created:

  • profiles We introduced these earlier on: this object contains a lot of statistics on the distribution of the features in the new batch: the number of values, mean, min, max, etc. And since the data contains multiple observations, it also contains information on the deviation of each of these statistics: the standard deviation of the mean, of the number of values, of the min, ... Finally, it contains a comparison of these statistics with those in the reference data, resulting in all the _pull metrics in the profiles. These are used to determine which profile statistics differ markedly from the reference data and get a yellow or red traffic light.
  • comparisons We also introduced these earlier: this object contains actual comparisons between the distributions in the new batch and those in the reference data. It holds many test statistics, like Pearson, chi-square and Kolmogorov-Smirnov, and the statistical significance of their values. For alerting, _pull metrics are added to the comparisons as well, to determine which differences are big enough to get a yellow or red traffic light.
  • alerts This object summarizes the findings and tells us whether there are significant issues with the new data. The monitoring_rules we specified earlier are used to evaluate whether thresholds for pulls, z-scores or custom threshold values, like the minimum or maximum age, are exceeded.
  • traffic_lights This object indicates which of the pull, z-score or custom thresholds for feature statistics triggered a traffic light. In this overview of all values considered for alerting, a 0 indicates no alert (a 'green' traffic light), a 1 indicates a yellow traffic light and a 2 is the most severe: a red traffic light.

Let's explore the profiles, comparisons and alerts now for one feature, age:

In [20]:
# let's checkout the same element in the datastore as we saw before
batch_datastore['profiles']['age']
Out[20]:
histogram filled overflow underflow distinct nan count most_probable_value mean std min max p01 p05 p16 p50 p84 p95 p99 filled_std overflow_std underflow_std distinct_std nan_std count_std most_probable_value_std mean_std std_std min_std max_std p01_std p05_std p16_std p50_std p84_std p95_std p99_std filled_mean overflow_mean underflow_mean distinct_mean nan_mean count_mean most_probable_value_mean mean_mean std_mean min_mean max_mean p01_mean p05_mean p16_mean p50_mean p84_mean p95_mean p99_mean filled_pull overflow_pull underflow_pull distinct_pull nan_pull count_pull most_probable_value_pull mean_pull std_pull min_pull max_pull p01_pull p05_pull p16_pull p50_pull p84_pull p95_pull p99_pull
date
2015-09 <SparselyBin binWidth=2.0 bins=Count nanflow=C... 38.0 0 0 24 0.0 38.0 32.5 49.078947 25.616228 10.5 88.5 10.5 13.033333 22.564 50.5 79.92 87.1 88.5 4.768058 0.0 0.0 2.0 0.0 4.768058 22.681215 3.169404 1.311268 1.0 1.561249 1.0 1.58015 4.268252 4.193249 3.33033 1.484152 1.561249 31.625 0.0 0.0 23.0 0.0 31.625 44.25 48.889361 21.438507 11.5 88.25 11.5 14.125 25.094167 48.833333 73.27 85.891667 88.25 1.337022 NaN NaN 0.5 NaN 1.337022 -0.51805 0.059818 3.186017 -1.0 0.160128 -1.0 -0.690863 -0.592788 0.397464 1.996799 0.814157 0.160128
In [21]:
# let's checkout the same element in the datastore as we saw before
batch_datastore['comparisons']['age']
Out[21]:
prev1_ks prev1_ks_zscore prev1_ks_pvalue prev1_pearson prev1_chi2 prev1_chi2_norm prev1_chi2_zscore prev1_chi2_pvalue prev1_chi2_max_residual prev1_chi2_spike_count prev1_max_prob_diff prev1_unknown_labels ref_ks ref_ks_zscore ref_ks_pvalue ref_pearson ref_chi2 ref_chi2_norm ref_chi2_zscore ref_chi2_pvalue ref_chi2_max_residual ref_chi2_spike_count ref_max_prob_diff ref_unknown_labels ref_max_prob_diff_std ref_max_prob_diff_mean ref_max_prob_diff_pull mean_trend10_zscore
date
2015-09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.981093 0.550987 0.290821 0.092822 39.948714 0.998718 0.068895 0.472537 2.368508 0.0 0.063137 NaN 0.0 0.063137 NaN NaN
In [22]:
# let's look at the alerting summary: age has 2 yellow traffic lights.
# since green is coded as 0, yellow as 1 and red as 2, the worst value is 1, hence yellow.
batch_datastore['alerts']['age']
Out[22]:
worst n_red n_yellow n_green
date
2015-09 1.0 0.0 2.0 25.0
In [23]:
# the traffic_lights object tells us what triggered the yellow traffic lights:
# - the min value was in the 0-18 yellow range we specified ourselves
# - the pull value of the standard deviation indicates a difference in spread in the new data
batch_datastore['traffic_lights']['age']
Out[23]:
min max filled_pull overflow_pull underflow_pull distinct_pull nan_pull count_pull most_probable_value_pull mean_pull std_pull min_pull max_pull p01_pull p05_pull p16_pull p50_pull p84_pull p95_pull p99_pull ref_max_prob_diff_pull prev1_ks_zscore prev1_chi2_zscore ref_ks_zscore ref_chi2_zscore mean_trend10_zscore ref_unknown_labels
date
2015-09 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
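Putting these objects to work, here is a minimal sketch that scans the alerts for all features and flags the batch when any red traffic light occurs. The _AGGREGATE_ key is popmon's own summary variable (see the log output above), which we skip here:

# collect the worst traffic light per feature (0 = green, 1 = yellow, 2 = red)
worst_per_feature = {
    feature: batch_datastore["alerts"][feature]["worst"].max()
    for feature in batch_datastore["alerts"]
    if feature != "_AGGREGATE_"
}
if max(worst_per_feature.values()) >= 2:
    print("Red alert: investigate this batch before trusting the model scores!")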


In the same way, we can create a monitoring report that gives us more insightful information

Since the datastore can be a bit overwhelming, you can also create a report that presents all the information visually, making it possible to get a better overview and to click around the features to get a sense of all the possibilities.

In [24]:
batch_monitoring_report = popmon.stability_report(hists=new_hists, monitoring_rules=monitoring_rules, reference_type='external', reference=hists)
2022-04-15 08:13:00,164 INFO [hist_splitter]: Splitting histograms "hists" as "split_hists"
2022-04-15 08:13:00,181 INFO [hist_comparer]: Comparing "split_hists" with rolling sum of 1 previous histogram(s).
2022-04-15 08:13:00,278 INFO [hist_splitter]: Splitting histograms "ref_hists" as "split_ref_hists"
2022-04-15 08:13:00,286 INFO [hist_comparer]: Comparing "split_hists" with reference "split_ref_hists"
2022-04-15 08:13:00,470 INFO [pull_calculator]: Comparing "comparisons" with median/mad of reference "comparisons"
2022-04-15 08:13:00,556 INFO [hist_profiler]: Profiling histograms "split_hists" as "profiles"
2022-04-15 08:13:00,604 INFO [hist_profiler]: Profiling histograms "split_ref_hists" as "ref_profiles"
2022-04-15 08:13:00,712 INFO [pull_calculator]: Comparing "profiles" with reference "ref_profiles"
2022-04-15 08:13:01,004 INFO [apply_func]: Computing significance of (rolling) trend in means of features
2022-04-15 08:13:01,044 INFO [compute_tl_bounds]: Calculating static bounds for "profiles"
2022-04-15 08:13:01,224 INFO [compute_tl_bounds]: Calculating static bounds for "comparisons"
2022-04-15 08:13:01,261 INFO [compute_tl_bounds]: Calculating traffic light alerts for "profiles"
2022-04-15 08:13:01,314 INFO [compute_tl_bounds]: Calculating traffic light alerts for "comparisons"
2022-04-15 08:13:01,358 INFO [apply_func]: Generating traffic light alerts summary.
2022-04-15 08:13:01,396 INFO [alerts_summary]: Combining alerts into artificial variable "_AGGREGATE_"
2022-04-15 08:13:01,404 INFO [report_pipelines]: Generating report "html_report".
2022-04-15 08:13:01,404 INFO [histogram_section]: Generating section "Histograms".
100%|█████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.47it/s]
2022-04-15 08:13:06,859 INFO [traffic_light_section_generator]: Generating section "Traffic Lights". skip empty plots: True
100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 125.59it/s]
2022-04-15 08:13:06,927 INFO [alert_section_generator]: Generating section "Alerts". skip empty plots: True
  0%|                                                                         | 0/9 [00:00<?, ?it/s]C:\ProgramData\Miniconda3\envs\popmon_env\lib\site-packages\popmon\visualization\utils.py:186: RuntimeWarning: invalid value encountered in long_scalars
  a = data[c1][c2] / row_max
100%|████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 342.60it/s]
2022-04-15 08:13:06,953 INFO [section_generator]: Generating section "Comparisons". skip empty plots: True
100%|█████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.08it/s]
2022-04-15 08:13:10,813 INFO [section_generator]: Generating section "Profiles". skip empty plots: True
100%|█████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.03it/s]
In [25]:
batch_monitoring_report
Out[25]:
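In a notebook, the report renders inline as an interactive HTML page (hence the empty output above). To share it outside the notebook, it can also be written to disk; as far as we know, the report object offers a to_file method for this:

# write the interactive report to an HTML file, e.g. to share with colleagues
batch_monitoring_report.to_file("batch_monitoring_report.html")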

Next steps

We know we've only scratched the surface of all the beautiful and smart things popmon has to offer, but this article was just meant to give you a quick start in using the package. The possible next steps, as we see them, would be:

  • finding out which metrics are used and what they do/mean, which we explained here
  • finding out if the default settings of these metrics apply to your model

This last step is a difficult one. In our experience, the only reason we've found so far to deviate from popmon's default values was to set fixed minimum and maximum values for a feature like age; other than that, we haven't found a reason to adjust the default settings yet. But please let us know if you do, and why!

We hope this article helped you get started with using popmon for monitoring model drift. If you are looking for more example notebooks, ING created some really nice ones as well:

  • Basic tutorial
  • Advanced tutorial
  • Incremental datasets
  • Report interpretation

You can also check out our other two articles in this series:

  • Model drift, automatic retraining and how not to ruin your models
  • The statistics underlying the popmon hood

Good luck, and more importantly, have fun!