You might not be aware of this, but you have a higher probability of attending (multiple) marriages in 2022 than in previous years, so make sure you have a sufficiently filled wardrobe for all the parties ahead. During covid-19 local restrictions prohibited the attendance of marriages in many countries. Marriages were only allowed in small groups, and many were postphoned or cancelled as a result. Therefore we expect a catch-up of postponed marriages in 2022. But how much weddings do you have to prepare for?
In this blog we use timeseries analyses to model the trend in Dutch marriages and extrapolate this to the first two Covid-19 years (2020 and 2021). Doing so we are able to shed some light on the expected amount of marriages postponed, and what might be ahead of us in 2022, of course dependent on local restrictions next year. We use public available data provided by Statistics Netherlands on the number of marriages in past years.
First, we set up our workbook environment with the required packages to perform timeseries analyses. Besides the basics for data handling (pandas, numpy) and graphics (seaborn and mathplotlib) we use specific packages for our task at hand:
# Basics
import re, numpy as np, pandas as pd
import datetime
# Statsmodel package for inference
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose
# DARTS for timeseries modeling
from darts import TimeSeries
from darts.models import AutoARIMA, ARIMA
from darts.models import ExponentialSmoothing
from darts.models import Prophet
from darts.models import NBEATSModel
from darts.utils.utils import ModelMode
from darts.metrics import mape, mae
# Graphics
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
The initial dataset that Statistics Netherlands provides has data on weekday level. As we do not require that level of granularity we performed several data cleaning steps to have data on weeklevel from 1995 untill 2019. Aggregating the dataset to weeks instead of using weekday level also has a major advantage: in timeseries you are often faced with multiple seasonal effects in your data. Hence, you can imagine that getting married on a Friday is more attractive than on a Tuesday. The same holds for weeks and months. Fitting a timeseries model with multiple seasonal effects can be quite complicated. And, we do not need predictions on a day-level to answer our main question. Therefore we aggregate to weeks.
Next we removed outliers from our dataset. As we explain in more detail in this article people tend to get married on specific magical dates like '2012-12-12' or '2008-08-08', a calendar heatmap clearly displays most wanted marriage dates. As a result you are faced with large spikes in your data that are impossible to model. Values above 1.5 times the interquartile range were substituted by the median value of that month and year. Preprocessed data can be downloaded here. As you can see we have years, weeknumbers (iso), the date on which the week starts and the total number of marriages in that week.
# Read marriage data into a pandas dataframe
path = '/dbfs/mnt/timeseries/NL_marriages_1995_2019.csv'
marriages = pd.read_csv(path, sep=';', parse_dates=['week_startdate'])
marriages.head(5)
year | week | week_startdate | number_of_marriages | |
---|---|---|---|---|
0 | 1995 | 1 | 1995-01-02 | 616 |
1 | 1995 | 2 | 1995-01-09 | 616 |
2 | 1995 | 3 | 1995-01-16 | 621 |
3 | 1995 | 4 | 1995-01-23 | 675 |
4 | 1995 | 5 | 1995-01-30 | 644 |
So, let's dive into the dataset we have and see if we can get some understanding on the trend and seasonal effects. Yes, there are many metrics we can use, but the most appealing method is to show a visual of Dutch marriages by week from 1995 - 2019. Straight away we see a very volatile dataset. Within a year there seem to be two spikes, probably one just before the summer holidays and one after. On a higher level we observe a downward trend in the absolute number of marriages.
def plot_df(df, x, y, title="", xlabel='', ylabel='Number of marriages', dpi=100):
plt.figure(figsize=(16,5), dpi=dpi)
plt.plot(x, y, color='tab:blue')
plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
plt.show()
plot_df(df=marriages, x=marriages.week_startdate, y=marriages.number_of_marriages, title='Weekly Dutch marriages from 1995-2019')
To get a better view on the variance across years and weeks we draw boxplots for both. For years we see that the variance before entering this century was somewhat higher than it is in recent years. See for instance 1997, the size of the colored box (the interquartile range) is much bigger than recent years. When we look at weekdata you can now clearly see the popularity of marriages just before and after the summer holidays. In the first week of a year the number of marriages is at the lowest of the year with hardly any differences between years.
fig, (axes1, axes2) = plt.subplots(2, 1, figsize=(15,9), dpi= 80)
sns.boxplot(x=marriages['year'], y=marriages['number_of_marriages'], data=marriages, ax=axes1)
sns.boxplot(x=marriages['week'], y=marriages['number_of_marriages'], data=marriages, ax=axes2)
axes1.set_title('Boxplots of marriages per year')
axes2.set_title('Boxplots of marriages per week')
fig.subplots_adjust(hspace=.3)
Now that we've got some understanding of the high level trend and what happens from year to year, it's time to perform some analysis. Using the TSA class from statsmodels we can break down the data into a trend, a seasonal effect and the remaining data points (noise). Why is this important? In timeseries modeling it is best practice to work with a so-called stationary dataset. This means that the way in which all the different parts (trend, seasonal, noise) lead to the observed values do not change over time; or formulated differently: the observed values are the result of a certain formula. This formula can be additive or multiplicative which means that the different parts of the formula will be added or multiplied. Both the trend and the seasonality can be additive or multiplicative. An additive trend indicates a linear trend while multiplicative indicates a curved trend line. When the seasonality is additive it means that the seasonality stays the same over time. With multiplicative seasonality the seasonality peaks are increasing or decreasing over time. The model parameter in the seasonal_decompose function refers to the type of seasonal component.
When dealing with a non-stationary dataset transformations need to be done to reach a stationary dataset. Many timeseries libraries do this by default. Let's take a look at our data.
# seasonal_decompose requires an input with a string containing the dates and the values for marriages
marriages['year_week'] = marriages['year'].astype(str) + '_' + marriages['week'].astype(str)
marriages_short = marriages[['year_week','number_of_marriages']]
# Make a decomposition where the actual values are additive: trend + seasonal + noise
result = seasonal_decompose(marriages_short['number_of_marriages'], period= 52, model='additive')
plt.rcParams["figure.figsize"] = (15,10)
result.plot(observed=True)
As you can see in the output of seasonal_decompose
there is a downward trend, mainly for early years. Next we see the isolated distinct seasonal effect and at the bottom the remaining data points. In the noise we do not see any pattern, which is good because otherwise the number of marriages would be hard to model. We can perform a statistical test to see if we are dealing with a stationary dataset. The most widely used test for stationarity is the 'Augmented Dickey-Fuller test'. It determines how strongly a timeseries is defined by a trend. The null hypothesis for the ADF-test is that the time series is not stationary, so the formula of observed values is changing over time. If the ADF-test results in a p-value below 0.05 we are dealing with a stationary dataset.
Alternatively we can also use the 'Kwiatkowski-Phillips-Schmidt-Shin' (KPSS) test, which will test the null hypothesis that a trend is stationary. The KPSS test differs from the ADF test in the sense that it tests for stationarity of a serie around a trend. So the observed values may increase or decrease over time, as long as the formula leading up to the observed values remains the same.
# We test for stationarity around a trend: regression = 'ct'
# ADF Test
result = adfuller(marriages['number_of_marriages'], autolag='AIC', regression = 'ct')
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
print(f'ADF result: the series is {"not " if result[1] > 0.05 else ""}stationary')
# KPSS Test
result = kpss(marriages['number_of_marriages'], regression='ct')
print('\nKPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print(f'KPSS result: the series is {"not " if result[1] < 0.05 else ""}stationary')
In the output of both the ADF- and KPSS- test you can see that our dataset is stationary. This means that no transformations are necessary for this dataset.
Next, we are going to start modeling. We use the DARTS library to model our timeseries as this library embeds many popular methods. DARTS requires a timeseries object with a date and an observed value. An overview of all available forecasting models within this library can be found here. Next we split our dataset into a train part (1995-2018) and a validation year (2019) to see how our model performs.
# create timeseries object for DARTS
series_week = TimeSeries.from_dataframe(marriages, 'week_startdate', 'number_of_marriages')
# Create train and validation set
train, val = series_week.split_before(pd.Timestamp('20190101'))
A classic when it comes to timeseries modeling is the 'autoregressive integrated moving average' analysis, or ARIMA. Within DARTS you can either use ARIMA or Auto-ARIMA, which performs a grid-search for best results. We already learned that we have a stationary dataset, that it contains seasonal characteristics and that it has a small downward trend. There are various ways to get a notion of what values to use for best results. An Auto-ARIMA model is able to use a brute-force gridsearch on available parameters. What we know is that our dataset has a clear seasonal effect, every 52 week we roughly observe the same pattern. We provide that information to the model and keep all other parameters at their defaults.
model_arima = AutoARIMA(
m = 52, # the number of periods in each season
seasonal=True,
maxiter = 20,
n_jobs = -1,
trace = True,
suppress_warnings=True
)
model_arima.fit(train)
prediction_arima = model_arima.predict(len(val))
print("Mean absolute percentage error for Auto-ARIMA: {:.2f}%.".format(
mape(series_week, prediction_arima)))
print("Mean absolute error for Auto-ARIMA: {:.0f}.".format(
mae(series_week, prediction_arima)))
It will take some time for an Auto-ARIMA model to finish. You can decrease the number of iterations for each run or use an alternative method (method='nm' for the Nelder-Mead approach) to speed things up. But if you happen to have a large computing cluster at your disposal and half an hour to spare you can pretty much stick to the default settings. After about 20 minutes the AIC is not decreasing significantly and we are left with a model that produces a 12.4% average error rate on the validation dataset, roughly 150 marriages a week. Below we've put the forecast on top on the actual values to get a better feeling of the difference the model produces; please note: we did not train on 2020. Overall the predicted values are lower than the actual values and the model seems to find it difficult to predict the peaks for spring and september.
We feel there is room for improvement, so we will test a range of different models to see if we can improve this first result.
plt.rcParams["figure.figsize"] = (20,5)
series_week.plot(label='actual', color='grey' , lw=1.3)
prediction_arima.plot(color='red' , lw=1.3, label='Auto-ARIMA MAPE={:.2f}%'.format(mape(series_week, prediction_arima)))
plt.legend(loc='center right', bbox_to_anchor=(1.15, 0.5))
plt.title('Actual and forecast values of Auto-ARIMA with DARTS')
plt.xlim('20150101', '20200101')
plt.xlabel('')
display()
The next method to forecast future marriages is Exponential Smoothing. Like ARIMA the Exponential Smoothing model is mainly using past observations but it also uses a decreasing weight for those observations. So the further an observation is from today, the lower the weight of that observation is for the prediction of tomorrow. It has the ability to account for seasonality and trends in the data, so it might be very suitable for our dataset. We already know from the seasonal decomposition plots that the additive seasonality works very well. As for the estimated trend we use so-called dampening because the main marriages trend we saw earlier is not a straight linear line downwards but it flattens out at the end.
# Model parameters
seasonal_periods = 52
trend = ModelMode.ADDITIVE
seasonal = ModelMode.ADDITIVE
damped = True
# Model specification
model_expsmthng = ExponentialSmoothing( damped = damped
, seasonal_periods = seasonal_periods
, seasonal = seasonal
, trend=trend)
model_expsmthng.fit(train)
prediction_expsmthng = model_expsmthng.predict(52)
print("Mean absolute percentage error for Exponential Smoothing: {:.2f}%.".format(
mape(series_week, prediction_expsmthng)))
print("Mean absolute error for Exponential Smoothing: {:.0f}.".format(
mae(series_week, prediction_expsmthng)))
With the Exponential Smoothing model an average percentage error of 8.62% has been reached, which is certainly not bad. In the visual below it is clear that the model is able to predict the spring peaks better, and also the september peak is getting closer. So, it is clear that this model is much better than Auto-ARIMA but still has an average error of 102 marriages per week.
plt.rcParams["figure.figsize"] = (20,5)
series_week.plot(label='actual', color='grey' , lw=1.3)
prediction_expsmthng.plot(color='red' , lw=1.3, label='Exponential smoothing MAPE={:.2f}%'.format(mape(series_week, prediction_expsmthng)))
plt.legend(loc='center right', bbox_to_anchor=(1.2, 0.5))
plt.title('Actual and forecast values of Exponential smoothing with DARTS')
plt.xlim('20150101', '20200101')
plt.xlabel('')
display()
Next in line is Facebook Prophet. This architecture received quite some attention a few years ago, So I'm curious if we can furter improve results. The model uses an additive approach for predicting the next case and has options to provide holiday information for best seasonal fits. According to the characteristics of this model it should work well with our dataset: strong seasonal effects, a great outlier handler and works best with quite some historical data.
# Model parameters Prophet
import holidays
country_holidays = holidays.NL()
# Model specification
model_prophet = Prophet(weekly_seasonality=True
,country_holidays = 'NL'
)
model_prophet.add_seasonality('weekly_seasonality', seasonal_periods = 52, fourier_order = 2, mode='additive')
model_prophet.fit(train)
prediction_prophet = model_prophet.predict(52)
print("Mean absolute percentage error for Facebook Prophet: {:.2f}%.".format(
mape(series_week, prediction_prophet)))
print("Mean absolute error for Facebook Prophet: {:.0f}.".format(
mae(series_week, prediction_prophet)))
Prophet has a MAPE of 9.8%, a bit higher than the Exponential Smoothing model. In the visual below you can see that the prediction is more smooth than previous models. However, the predictions for the September marriages are still too low.
# Display Prophet predictions
series_week.plot(label='actual', color='grey' , lw=1.3)
prediction_prophet.plot(color='red' , lw=1.3, label='Prophet MAPE={:.2f}%'.format(mape(series_week, prediction_prophet)))
plt.legend(loc='center right', bbox_to_anchor=(1.15, 0.5))
plt.title('Actual and forecast values of using Prophet with DARTS')
plt.xlim('20150101', '20200101')
plt.xlabel('')
display()
DARTS has many more models to play around with. The last model we are testing in this blog is a deep learning model based on the NBEATS architecture or Neural Basis Expansion Analysis for Interpretable Time Series Forecasting. We first heard about NBEATS in this data Skeptic podcast and decided to give it a go. The architecture works very different than models we've used before. The model will take an entire window of past values, and computes many future values in one iteration. It will not only predict forwards but also backwards (backcasting). Then it will choose a different window to minimise the prediction error untill no further improvement can be made.
# Model parameters NBEATS
input_chunk_length=104
output_chunk_length=52
generic_architecture=True
n_epochs=50
# Model specification
model_nbeats = NBEATSModel(input_chunk_length = input_chunk_length
,output_chunk_length = output_chunk_length
,n_epochs=n_epochs
,generic_architecture=generic_architecture
)
model_nbeats.fit(train)
prediction_nbeats = model_nbeats.predict(52)
print("Mean absolute percentage error for NBEATS: {:.2f}%.".format(
mape(series_week, prediction_nbeats)))
print("Mean absolute error for NBEATS: {:.0f}.".format(
mae(series_week, prediction_nbeats)))
After 50 epochs of training we reach an average percentage error of 9.85% which is comparable to earlier models. We played around with different parameters of the NBEATS model but failed to produce a stunning result. From the visual below it is clear that the model has difficulty in the months leading up to spring, from the summer onwards the prediction is almost flawless.
# Display NBEATS predictions
series_week.plot(label='actual', color='grey' , lw=1.3)
prediction_nbeats.plot(color='red' , lw=1.3, label='NBEATS MAPE={:.2f}%'.format(mape(series_week, prediction_nbeats)))
plt.legend(loc='center right', bbox_to_anchor=(1.15, 0.5))
plt.title('Actual and forecast values of NBEATS prediction')
plt.xlim('20150101', '20200101')
plt.xlabel('')
display()
This means we have a winning timeseries model for predicting Dutch marriages based upon data from 1995 to 2019. The Exponential Smoothing model has the lowest MAPE of all tested models. From the graph below it seems that this model is better able to accomodate the peaks in the number of marriages in june/july and september than the other models are.
plt.rcParams["figure.figsize"] = (10,5)
expsm_err = mape(series_week, prediction_expsmthng)
prophet_err = mape(series_week, prediction_prophet)
nbeats_err = mape(series_week, prediction_nbeats)
arima_err = mape(series_week, prediction_arima)
# Display predictions of various models
series_week.plot(label='Actual', color='#2CA02C')
prediction_expsmthng.plot(label='Exponential Smoothing MAPE={:.2f}%'.format(expsm_err), lw=2, color='#ACACAC')
prediction_prophet.plot(label='Prophet MAPE={:.2f}%'.format(prophet_err), lw=2, color='#7C3F00')
prediction_nbeats.plot(color='#F5A507', lw=2, label='NBEATS MAPE={:.2f}%'.format(nbeats_err))
prediction_arima.plot(color='#003D7C', lw=2, label='ARIMA MAPE={:.2f}%'.format(arima_err))
plt.ylabel("Number of marriages")
plt.xlabel("")
plt.legend(loc='right', bbox_to_anchor=(1.4, 0.5))
plt.xlim('20190101', '20191231')
display()
Now that we have our winning model we can predict future marriages for 2020 and 2021. The model is not aware of the covid-19 pandemic and therefore will keep on predicting a marriagerate based on past observations. By now we know better ;) With these predictions we can calculate the difference between the predicted number of marriages and the actual marriages that took place. We use the winning model to predict cases for 2020 and 2021.
# Model parameters
seasonal_periods = 52
trend = ModelMode.ADDITIVE
seasonal = ModelMode.ADDITIVE
damped = True
# Model specification
model_expsmthng = ExponentialSmoothing( damped = damped
, seasonal_periods = seasonal_periods
, seasonal = seasonal
, trend=trend)
# Fit model on entire dataset
model_expsmthng.fit(series_week)
# Predict values for 2020 and 2021
prediction_2020_2021_expsmthng = model_expsmthng.predict(104)
# Display predictions 2020 and 2021
series_week.plot(label='actual', color='grey' , lw=1.3)
prediction_2020_2021_expsmthng.plot(color='red' , lw=1.3, label='Predicted marriages')
plt.legend(loc='center right', bbox_to_anchor=(1.15, 0.5))
plt.title('Actual and forecast values of Exponential Smoothing model')
plt.xlabel('')
display()
Unfortunately, Statistics Netherlands only provides the marriages at a detailed level, such as the marriages on a daily basis, once every two years. That means that at the moment marriages for 2020 and 2021 are only provided at a less detailed level, i.e. on a monthly basis. Since our main goal was to estimate the number of postponed marriages for the first two Covid-19 years and not specifically in a certain week, we will make a cumulative comparison by year.
# Read 2020/2021 marriages actuals into a pandas dataframe
path = '/dbfs/mnt/timeseries/NL_marriages_2020_2021.csv'
marriages_2020_2021 = pd.read_csv(path, sep=';', parse_dates=['month_startdate'])
marriages_2020_2021.head(5)
year | month | month_startdate | number_of_marriages | |
---|---|---|---|---|
0 | 2020 | 1 | 2020-01-01 | 2756 |
1 | 2020 | 2 | 2020-02-01 | 3772 |
2 | 2020 | 3 | 2020-03-01 | 2786 |
3 | 2020 | 4 | 2020-04-01 | 2298 |
4 | 2020 | 5 | 2020-05-01 | 2969 |
In the dataframe above we have collected the monthly marriages provided by Statistics Netherlands. Below we match the predicted numbers for the same years and add a cumulative column for both. Both are then plotted againts each other, the gap between the prediction and the actuals is the amount of maariages parties we've missed. It might not seem like much to the naked eye, but wait till you see the numbers.
# Predicted marriages per year for 2020 and 2021
prediction_2020_2021 = prediction_2020_2021_expsmthng.pd_dataframe().reset_index()
prediction_2020_2021['time'] = pd.to_datetime(prediction_2020_2021['time'])
prediction_2020_2021['cumulative'] = prediction_2020_2021['number_of_marriages'].cumsum()
# Actual marriages per year for 2020 and 2021
marriages_2020_2021['cumulative'] = marriages_2020_2021['number_of_marriages'].cumsum()
# Create time series objects for both actual and predicted
series_actual = TimeSeries.from_dataframe(marriages_2020_2021, 'month_startdate', 'cumulative')
series_predicted = TimeSeries.from_dataframe(prediction_2020_2021, 'time', 'cumulative')
# Plot of cumulative differences
series_actual.plot(label='Actual marriages 2020 and 2021', color='grey' , lw=1.3)
series_predicted.plot(color='red' , lw=1.3, label='Predicted marriages')
plt.legend(loc='center right', bbox_to_anchor=(1.15, 0.5))
plt.title('Predicted marriages for 2020 and 2021 compared to actual marriages')
plt.xlabel('Date')
plt.ylabel('Cumulative marriages')
display()
The data available rusn until the 1st of November for 2021, so we sum the months for which we have full availability of data and arrive at a conclusion: we've missed a little over 7.300 marriages!
# Cumulative comparison, actuals available until the end of October
date_comparison = pd.to_datetime(datetime.date(2021, 10, 31))
actual_cumulative = marriages_2020_2021['cumulative'].max()
predicted_cumulative = round(prediction_2020_2021[prediction_2020_2021['time']<=date_comparison]['cumulative'].max(),2)
diff_act_pred = round(predicted_cumulative - actual_cumulative, 2)
print("The actual marriages in 2020 and 2021 until the 1st of November: ", actual_cumulative)
print("The predicted marriages in 2020 and 2021 until the 1st of November: ", predicted_cumulative)
print("Therefore, the marriages that are probably postponed because of Corona are around: ", diff_act_pred)
According to our predictions there were 7370 fewer marriages concluded than there would otherwise if Covid-19 hadn't come up. Probably part of these people did get married already, maybe with fewer people than planned or with some restrictions. However, we think that there are also many people that postponed their marriage until they can celebrate it the way they would like to. If this would be the case, we are going to have a great festive year, fingers crossed. So in short: just make sure you have plenty of party outfits in your closet!