NLP: preparing data

Posted on: September 1, 2020

Check out all our blogs in this NLP series. Notebooks and dataset are freely available from our GitLab page.

This article is written by:
Jurriaan Nagelkerke
jurriaan.nagelkerke@cmotions.com
Wouter van Gils
wouter.vangils@cmotions.com

Bayesian inference in practice: marketing campaign evaluation (part 1)

Posted on: August 13, 2020

The setting

For the past 5.5 years I have been working mostly in customer/marketing analytics departments. A common task is the evaluation of marketing campaigns, for example:

  • Split a homogeneous group of customers (e.g., 'new') into 2 groups, A & B
  • Send group A an email with a €10,- coupon
  • Send group B the same email without the €10,- coupon
  • Compare the difference in conversion rate
$$ \text{conversion rate} = \frac{\text{# customers who responded}}{\text{# customers who were sent the email}} $$

The idea is that, given big enough groups, random fluctuations will cancel out, and the only thing that is different between these groups is the coupon. Hence, the difference in conversion rate between these groups can be attributed to the coupon.
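To make this concrete with purely made-up numbers: if 1,000 customers receive the email and 80 of them respond, then

$$ \text{conversion rate} = \frac{80}{1000} = 8\% $$

and if the other group converts at 6%, the uplift attributed to the coupon would be 2 percentage points.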

Although this sounds simple, it's quite amazing how easily you can misinterpret and misuse the results, depending on the evaluation technique. In this first part I will demonstrate how, especially in a commercial business context, you can be fooled by randomness. In part 2 I will - hopefully - convince you that Bayesian inference is the most natural and informative evaluation technique to handle this type of challenge. Python code is included so you can easily apply, modify, and reproduce my findings.

Random fluctuations

Campaign evaluation is essentially an experiment and, like any experiment, it suffers from random fluctuations. Random fluctuations cannot be predicted; however, given enough data, we expect them to occur evenly in each group. For an introduction to randomness, have a look at Wikipedia.

Enough data

The key question at this point is: how much data do we need to not be fooled by randomness?

To answer this question and to learn more about randomness, we are going to simulate a marketing campaign as described in the introduction. To enhance the experimental feeling, we refer to the group that receives the coupon as the test group, while the other group is referred to as the control group.

  • Group A (test) receives the email with the coupon and consists of 540 customers
  • Group B (control) receives the email without the coupon and consists of 60 customers

The 'max 10% control group rule' is typical in a commercial business context, since the risk of missing out on additional turnover (those coupons should work, right?!) must be avoided at all costs. We will come back to this point later.

The experiment

One of the interesting parts of simulation is that we can decide how the different groups will perform. Since we expect an effect of the coupon, we decide on the following set-up:

  • Group A (test) will have an underlying probability of conversion of 48%
  • Group B (control) will have an underlying probability of conversion of 45%
  • Hence, we expect an uplift in conversion of 3 percentage points due to the coupon

To create the simulated data, we use the Bernoulli distribution. This distribution is commonly used for coin-flip simulations, which is essentially what we are doing. The difference is that instead of flipping a coin with a 50% chance of success (a fair coin), we flip a coin with a 48% or 45% chance of success. More information on the Bernoulli distribution can be found here.

In [1]:
import numpy as np
from scipy.stats import bernoulli
In [2]:
np.random.seed(4)
A = bernoulli.rvs(p=0.48, size=540)
B = bernoulli.rvs(p=0.45, size=60)
In [3]:
print(f"The length of A is {len(A)}, a glimpse of the first few records: {A[:15]}")
print(f"The length of B is {len(B)}, a glimpse of the first few records: {B[:15]}")
The length of A is 540, a glimpse of the first few records: [1 1 1 1 1 0 1 0 0 0 1 0 1 1 0]
The length of B is 60, a glimpse of the first few records: [0 0 1 1 1 0 0 1 1 1 1 1 1 1 0]

Results

Now that we have our data, which are simply two arrays of 0s (did not convert) and 1s (converted), it's time to evaluate. In this first part we are going to use two common evaluation techniques:

  • subtraction
  • a statistical proportion test

Subtraction

Subtraction of the conversion rates is obviously the simplest method, and perhaps therefore still quite popular. Usually an Excel file is created that looks as follows:

In [4]:
import pandas as pd
In [5]:
result_df = pd.DataFrame(data={
    'count': [len(A), len(B)],
    'converted': [sum(A), sum(B)]},
    index=['A (test)', 'B (control)']
).assign(conversion_rate=lambda x: x['converted'] / x['count'])
In [6]:
subtraction = pd.DataFrame(data={
    'count': '',
    'converted': '',
    'conversion_rate': result_df.loc['A (test)', 'conversion_rate'] - result_df.loc['B (control)', 'conversion_rate']},
    index=['Result'])
In [7]:
pd.concat([result_df, subtraction])
Out[7]:
count converted conversion_rate
A (test) 540 272 0.503704
B (control) 60 32 0.533333
Result -0.029630

Since the conversion rate of the control group is almost 3 percentage points higher than that of the test group, the most likely conclusion in a commercial business context would be that the coupon has no effect. Although it is actually possible that the coupon has a negative effect, in my experience this is seldom considered.

Proportion test

A statistically more sound method is the proportion test. This test is based on classical statistics and tests whether we can reject the null hypothesis:

$$ \text{null hypothesis: } \text{conversion rate}_a = \text{conversion rate}_b $$

$$ \text{alternative hypothesis: } \text{conversion rate}_a \neq \text{conversion rate}_b $$

In [8]:
from statsmodels.stats.proportion import proportions_ztest
In [9]:
conversions = [272, 32]
samples = [540, 60]
In [10]:
z_stat, p_value = proportions_ztest(conversions, samples)
print(f"the p-value of the test is {p_value}")
the p-value of the test is 0.6631969581949767
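For those curious what happens under the hood, the same number can be reproduced by hand; a minimal sketch, assuming the pooled two-sided z-test that proportions_ztest performs by default:

import numpy as np
from scipy.stats import norm

conv_a, n_a = 272, 540
conv_b, n_b = 32, 60

# pooled conversion rate under the null hypothesis
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z_stat = (conv_a / n_a - conv_b / n_b) / se
p_value = 2 * norm.sf(abs(z_stat))  # two-sided p-value
print(f"the manually computed p-value is {p_value}")

This should reproduce the p-value printed above.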

Since the p-value is larger than 0.05, we cannot reject the null hypothesis; in practice this is usually taken to mean that the conversion rates are equal, which again leads to the conclusion that the coupon has no effect.

For those of you who want more information about the p-value and its long history of debate, misinterpretation, and misuse, please have a look at the references section.

What just happened?

So we ran a simulation, set the conversion rates to 48% and 45% respectively, and found that the control group 'beat' the test group with 53% versus 50%. Is it possible that we have been fooled by randomness?

We cannot see the truth

The first important thing is to understand the difference between the observed frequency and the true frequency of an event. The following is quoted directly from Bayesian Methods for Hackers [1]:

"The true frequency can be interpreted as the probability of an event occuring, and this does not necessarily equal the observed frequency. For example, the true frequency of rolling a 1 on a six-sixed die is $ \frac{1}{6} $, but if we roll the die six times we may not see a 1 show up at all (the observed frequency!). (...) Unfortunately, noise and complexities hide the true frequency from us and we must infer it from observed data."

Please take a moment to let this sink in, since it is essential for your understanding. Next, we are going to use possibly the best tool for enhancing intuition: visualization.
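A quick way to see this for yourself is to simulate the die example; a minimal sketch (any seed will do):

import numpy as np

np.random.seed(4)

# the true frequency of rolling a 1 is 1/6 ~ 0.167, but in only six rolls the
# observed frequency can easily be 0, 2/6 or anything else
rolls = np.random.randint(1, 7, size=6)
print(rolls, "observed frequency of a 1:", np.mean(rolls == 1))

# with (much) more data the observed frequency gets close to the true frequency
many_rolls = np.random.randint(1, 7, size=100_000)
print("observed frequency of a 1 in 100,000 rolls:", np.mean(many_rolls == 1))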

Intuition by visualization

As mentioned earlier, random fluctuations should cancel out given enough data. Therefore, we are going to simulate our experiment many times with different sample sizes and see what we can learn. For now, we will relax the 'max 10% control group rule' and let groups A and B be of equal size.

In [11]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
In [12]:
small_n = np.arange(1, 1000, 10)
medium_n = np.arange(1000, 10000, 100)
In [13]:
n_sizes = np.hstack((small_n, medium_n))
In [14]:
np.random.seed(1337)
mean_test_a = [np.mean(bernoulli.rvs(p=0.48, size=x)) for x in n_sizes]
mean_test_b = [np.mean(bernoulli.rvs(p=0.45, size=x)) for x in n_sizes]
In [15]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(mean_test_a, lw=2, label='Observed data group A (48%)')
ax.hlines(y=0.48, xmin=0, xmax=190, color='red', alpha=0.7, lw=3, label='True underlying probability group A (48%)')

ax.plot(mean_test_b, ls='--', lw=2.5, color='orange', label='Observed data group B (45%)')
ax.hlines(y=0.45, xmin=0, xmax=190, color='green', alpha=0.7, lw=3, label='True underlying probability group B (45%)')

ax.set_xlabel('Sample size of the observed data points (n)', size=15, weight='normal')
ax.set_xticks([1, 5, 10, 20, 35, 50, 75, 100, 140, 190])
ax.set_xticklabels([11, 51, 101, 201, 351, 501, 751, 1000, 5000, 9900])
ax.set_xlim(0, 190)

ax.set_ylabel('Observed mean of the sample', size=15)
ax.set_yticks(np.arange(0, 1.1, 0.05))
ax.set_ylim(0.3, 0.65)

ax.legend();

Every time the dotted orange line crosses the solid blue line, the control group outperformed the test group. Note that the two lines either overlap or remain close until a sample size of ~2000 for both A and B.

Hopefully from this plot alone, you now understand that you need a very large sample when you expect the difference between 2 groups to be very small (e.g., < 0.5%). Sometimes this means your experiment is not viable, and you need to go back to the drawing board.
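As an aside, and jumping ahead to the frequentist tools mentioned in the p.s. at the end of this article: a classical power analysis arrives at a similar ballpark. A rough sketch with statsmodels, assuming a two-sided test at a 5% significance level, 80% power, and equal group sizes:

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# effect size (Cohen's h) for detecting 48% versus 45%
effect = proportion_effectsize(0.48, 0.45)

# required number of customers per group
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                           power=0.8, ratio=1.0,
                                           alternative='two-sided')
print(f"roughly {n_per_group:.0f} customers per group")

The answer is in line with the plot above, where the two lines only start to separate clearly around a couple of thousand observations per group.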

Since we are finally talking business, what would happen if we kept the 'max 10% control group rule'?

In [16]:
control_group = np.arange(1, 500, 3)
test_group = control_group * 9
In [17]:
np.random.seed(1337)
mean_test = [np.mean(bernoulli.rvs(p=0.48, size=x)) for x in test_group]
mean_control = [np.mean(bernoulli.rvs(p=0.45, size=x)) for x in control_group]
In [18]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(mean_test, lw=2, label='Observed data group A (48%)')
ax.hlines(y=0.48, xmin=0, xmax=167, color='red', alpha=0.7, lw=3, label='True underlying probability group A (48%)')

ax.plot(mean_control, ls='--', lw=2.5, color='orange', label='Observed data group B (45%)')
ax.hlines(y=0.45, xmin=0, xmax=167, color='green', alpha=0.7, lw=3, label='True underlying probability group B (45%)')

ax.set_xlabel('Total sample size of A (90%) + B (10%) of the observed data points (n)', size=15)
ax.set_xticks([3, 13, 23, 33, 83, 133, 166])
ax.set_xticklabels([100, 400, 700, 1000, 2500, 4000, 4990])
ax.set_xlim(0, 166)

ax.set_ylabel('Observed mean of the sample', size=15)
ax.set_yticks(np.arange(0, 1.1, 0.05))
ax.set_ylim(0.3, 0.65)

ax.legend();

Although the observed data from group A stabilizes around a total sample size of ~2500, the observed data from the control group keeps deviating substantially from its true underlying probability. Note that at the x-axis value of 4000, the control group consists of 400 customers (10%) and the test group of 3600 customers (90%).

The practical implication of this insight is that 'more data' alone is not enough: you need more data for the group with the most uncertainty, in this case the control group.
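To put a number on how often a 540/60 split like ours gets fooled, we can simply repeat the original campaign many times and count how often the control group appears to win; a minimal sketch in the spirit of the simulations above (the exact percentage will vary with the seed):

import numpy as np

np.random.seed(4)
n_runs = 10_000

# a binomial draw is the sum of Bernoulli draws, so this simulates n_runs
# complete campaigns with 540 test and 60 control customers in one go
rate_a = np.random.binomial(n=540, p=0.48, size=n_runs) / 540
rate_b = np.random.binomial(n=60, p=0.45, size=n_runs) / 60

print(f"the control group 'won' in {np.mean(rate_b > rate_a):.1%} of the simulated campaigns")

Despite the test group having the better underlying probability, the observed winner flips in a substantial share of these simulated campaigns.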

By now I hope you have come to the conclusion that common evaluation techniques for analyzing differences in conversion rates have flaws, and that simulation can be remarkably useful. Luckily for us, Bayesian inference can really help in interpreting our experimental results and drawing better conclusions. In part 2 I will show you how.


p.s. I am aware of things like frequentist power analysis, sample size calculators, etc., but a discussion of frequentist-vs-Bayesian techniques is not the goal of this article. The goal is to address potential pitfalls in experimentation, enhance your intuition on randomness, and show you a technique that - in my belief - is most informative. However, if you are looking for a serious introduction to the differences between frequentist and Bayesian techniques, I highly recommend the Bayesian New Statistics paper [2]. Note: the author is a Bayesian :).

References / further inspiration

Direct references:

  • [1]: Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference by Cameron Davidson-Pilon (2016).
  • [2]: The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian Perspective by John K. Kruschke and Torrin M. Liddell (2017). link to article

The following references are not directly cited in this article. However, they were read and re-read with a lot of joy and wonder, and significantly contributed to my understanding of and interest in this topic. Therefore, I did not want to leave them out. These references require no technical background.

  • The Theory That Would Not Die: how Bayes' rule cracked the Enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy by Sharon Bertsch McGrayne (2011).
  • The Signal and the Noise: why so many predictions fail - but some don't by Nate Silver (2015).
  • Fooled by Randomness: the hidden role of chance in life and in the markets by Nassim Nicholas Taleb (2007).
  • The Black Swan: the impact of the highly improbable by Nassim Nicholas Taleb (2008).

** This post is a Jupyter Notebook, which can be found here.

This article is written by:
Ernst Bolle
info@cmotions.com

How we inform and inspire ourselves as Data Scientists

Posted on: August 8, 2020

The world of Data Science is changing at such a rapid pace and in so many directions that it is impossible to stay informed on every aspect of our interesting field, let alone use all of these marvelous insights in practice. Therefore, at The Analytics Lab and Cmotions, we focus on the parts of Data Science we can actually use in our daily work; enough innovation there already! But how do we keep ourselves informed, and which sources do we find interesting to follow on a regular basis? Not only is our field very broad, the number of sources to read or listen to about it is also huge. Periodically, we exchange our preferred channels with each other. Since many of you might face the same struggle to stay informed and inspired, we thought this might be interesting to share with you. By no means is this an exhaustive list, but these are the sources most commonly used by our Data Science consultants. Enjoy!

Read

Listen

  • https://www.superdatascience.com/podcast – Long-running podcast show (almost 400 episodes) covering specific topics and many interviews with data science practitioners. Lots of career advice for starting data scientists.
  • https://dataskeptic.com/ – Very informative and fun series with interviews and multi-episode themes on NLP (natural language processing), Deep Learning and model interpretability. With many fun episodes where Kyle tries to explain all sorts of Data Science topics to his wife Linda, often with vivid examples including their parrot Yoshi.
  • http://lineardigressions.com/ – Short (20-40 min) episodes in which data pros Katie and Ben zoom in on specific analytical techniques and developments in data analysis and data science tools and techniques.
  • https://talkpython.fm/ – On Python in general, but also episodes focused on Deep Learning specifically.
  • https://twimlai.com/ – Interviews with inspiring people from the Data Science field.
  • https://soundcloud.com/nlp-highlights – Academic research in the field of NLP.

Are we missing a valuable source in this list? Please let us know, we always love to broaden our horizons. For now, we can only wish you a lot of fun while immersing yourself in the world of Data Science!

This article is written by:
Jurriaan Nagelkerke
jurriaan.nagelkerke@cmotions.com
Wouter van Gils
wouter.vangils@cmotions.com

Machine Learning Intuition

Posted on: June 10, 2020

Machine Learning Intuition, by Ernst Bolle

Let's say you are an ambitious data analyst and want to further expand your skillset. You have already applied numerous complex algorithms, but you want more in-depth knowledge of their inner workings. For starters, you decide to dive into the details of the mother of all algorithms: linear regression. A colleague of yours recommended The Elements of Statistical Learning as a great resource, and now you are facing the following formula:

$$ \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T y $$

For some, this formula may feel intuitive, but for you it doesn't. Since you can apply this algorithm with just a few lines of code, you decide that it's not worthwhile to pursue your interest further.

Intuition matters

User-friendly APIs like scikit-learn make it easy to use very complex models without much in-depth knowledge. This is fine if you want to simply try stuff, but there are a number of reasons you want to avoid this in a more professional setting:

  • If you do not know how something works, you will almost certainly not know how to fix it when it stops working
  • Your colleagues might question your approach
  • Your business stakeholders might want to know what they can learn from your model
  • Mastering challenging material can be very rewarding

Do not get me wrong, in my opinion you do not need to know every last detail of an algorithm before you may use it, but having a good intuitive understanding of its inner workings is quite important. Given that the linear algebra approach is out of the question, what other ways are there?

In this article we are going to take a naive approach to solving a simple linear regression task. Although naive, it will uncover an important aspect of machine learning in general and serve as a gateway to more complex algorithms. Furthermore, terms like parameter space, loss function, and brute-force approach are explained intuitively. As a bonus, Python code is included so you can reproduce this approach yourself.

Setup

We are going to solve a simple linear regression task, which means we only have 1 independent variable. The equation:

$$ y = \alpha + \beta x $$

where

  • $ \alpha $ is the intercept (in simple linear regression, the point where the line crosses the y-axis)
  • $ \beta $ is the slope of the line
  • $ x $ is the input variable

The essence of our linear model is that we are looking for a straight line that best fits our data. How we create this line and verify the best fit is discussed shortly, but first let's create a fictitious dataset.

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
In [2]:
x = np.arange(11)
y = np.array([3, 5.5, 9, 7.5, 11, 5, 8, 11, 18.5, 16, 13.5])
In [4]:
fig, ax = plt.subplots()

ax.plot(x, y, 'o', markersize=10)
ax.set_xlabel('$ x_1 $', fontsize=14)
ax.set_ylabel('$ y $', fontsize=14)
ax.set_xticks(x)
ax.set_ylim([0, 20])

fig.tight_layout()

This is where we draw the line

The equation can be translated into an actual line by inserting values for the intercept and slope parameters. In our example, the intercept refers to the place where the line starts on the y-axis, and the slope determines the steepness. The best fit is the line that minimizes the distance between the line and the data points.

Since we can visually examine this dataset let's plot a few lines and see which line fits best.

In [5]:
fig, ax = plt.subplots()

ax.plot(x, y, 'o', markersize=10)
plt.plot(x, 2+0.5*x, '-r', alpha=0.8, label='$ y=2+0.5x $')
plt.plot(x, 2.5+0.8*x, ':g', alpha=0.8, label='$ y=2.5+0.8x $')
plt.plot(x, 3.5+x, '--b', alpha=0.8, label='$ y=3.5+x $')
plt.plot(x, 4.5+1.2*x, '-k', alpha=0.8, label='$ y=4.5+1.2x $')

ax.set_xlabel('$ x_1 $', fontsize=14)
ax.set_ylabel('$ y $', fontsize=14)
ax.set_xticks(x)
ax.set_ylim([0, 20])

plt.legend(loc='upper left')

fig.tight_layout()

From this plot I would say the blue dashed line appears to fit best, but how can we decide this more formally? A commonly used measure is the root mean square error (RMSE), which takes the square root of the mean of the squared differences between the observed and predicted values:

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(Y_i - \hat{Y_i})^2} $$

where

  • $ \hat{Y} $ is the prediction
  • $ Y $ is the actual observation
  • $ i $ is the index of the observation and $ n $ is the number of observations

Visually, this means we are looking for the line with the smallest total length of orange arrows:

In [6]:
# we take the absolute differences for visualization purposes
example_y_hat = 3.5+x
example_diff = abs(y-example_y_hat)
In [7]:
upperlimits = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1], dtype='bool')
lowerlimits = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0], dtype='bool')
In [8]:
fig, ax = plt.subplots()

ax.plot(x, y, 'o', markersize=10)
plt.plot(x, 3.5+x, '--b', alpha=0.8, label='$ y=3.5+x $')
plt.errorbar(x, y, yerr=example_diff, uplims=upperlimits, lolims=lowerlimits, ls='none', elinewidth=3)

ax.set_xlabel('$ x_1 $', fontsize=14)
ax.set_ylabel('$ y $', fontsize=14)
ax.set_xticks(x)
ax.set_ylim([0, 20])

plt.legend(loc='upper left')

fig.tight_layout()

More information can be found here; for now, remember that the RMSE is also a loss function, an often-heard term in the world of Machine Learning that we will revisit at the end of this article. Now that we have our data, a way to make predictions (lines), and a formal measure to check the best fit, we are ready to execute our naive approach.

Being naive has its perks

From the second plot we can eyeball which values of the intercept and slope parameters are likely to result in a decent fit. Our naive approach is to just try a lot of different combinations of these parameter values and look for the combination where the RMSE is at its minimum.

My initial guess would be that the intercept is somewhere between 3 and 5, and the slope between 0.5 and 2.5. We are going to use itertools to create every possible combination of this initial guess.

In [9]:
import itertools

intercept = np.linspace(3, 5, 500)
slope = np.linspace(0.5, 2.5, 500)

combination = itertools.product(intercept, slope)

Next we are going to execute the linear regression formula and RMSE measure for every combination and save the results as tuples in the result_data list. As a reminder we defined the x and y values as follows:

x = np.arange(11)
y = np.array([3, 5.5, 9, 7.5, 11, 5, 8, 11, 18.5, 16, 13.5])
In [10]:
result_data = []

for intercept, slope in combination:
    y_hat = intercept + slope*x    
    rmse = np.sqrt(np.mean((y-y_hat)**2))
    
    result_data.append((intercept, slope, rmse))
In [11]:
result_data[:5]
Out[11]:
[(3.0, 0.5, 5.504130680266825),
 (3.0, 0.5040080160320641, 5.483655240723603),
 (3.0, 0.5080160320641283, 5.463205976919804),
 (3.0, 0.5120240480961924, 5.442783183892689),
 (3.0, 0.5160320641282565, 5.422387160739607)]

Since this article is about intuition, let's label our result_data list of tuples and visualize it.

In [12]:
data = {
'intercept': [x[0] for x in result_data],
'slope': [x[1] for x in result_data],
'rmse': [x[2] for x in result_data],
}
In [13]:
fig, ax = plt.subplots()

scatter = ax.scatter('intercept', 'slope', c='rmse', data=data)
legend  = ax.legend(*scatter.legend_elements(),
                    loc="lower left", title="RMSE", bbox_to_anchor=(1, 0))
ax.add_artist(legend)
ax.set_xlabel('$ \\alpha $', fontsize=14)
ax.set_ylabel('$ \\beta $', fontsize=14)

plt.show()

From this graph we can clearly see a region of parameter combinations with lower RMSE values compared to the upper-right and lower-left corners. To find the index of the minimum RMSE value and the corresponding parameter values, we transform the data once more, this time into a pandas DataFrame.

Food for thought: can you explain the slightly negative relation between the intercept and slope values?*

In [14]:
import pandas as pd
data_df = pd.DataFrame(data)
In [15]:
# retrieve the index
low_idx = data_df['rmse'].idxmin()
In [16]:
# retrieve the values for the corresponding index
data_df.iloc[low_idx, ]
Out[16]:
intercept    4.050100
slope        1.153307
rmse         2.713758
Name: 131163, dtype: float64

The moment of truth

This has all been a blast, but I am sure you are wondering whether this result even comes close to the exact linear algebra result. Let's find out using our beloved few-lines-of-code approach: scikit-learn.

In [17]:
from sklearn.linear_model import LinearRegression
In [18]:
X = x.reshape(-1, 1)
In [19]:
lin_model = LinearRegression()
lin_model.fit(X, y)
Out[19]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [20]:
print(lin_model.intercept_, lin_model.coef_)
4.045454545454545 [1.15454545]

Not too shabby.
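As a final sanity check, the closed-form normal equation from the introduction gives the same answer; a minimal numpy sketch using the same x and y as above:

import numpy as np

x = np.arange(11)
y = np.array([3, 5.5, 9, 7.5, 11, 5, 8, 11, 18.5, 16, 13.5])

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x, dtype=float), x])

# beta_hat = (X^T X)^-1 X^T y, solved without explicitly inverting the matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # [intercept, slope], matching the scikit-learn result above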

If you're being smart about being naive, you have a winner

The essence of what we have been doing is the following:

  • We have defined possibly decent combinations of parameter values, also known as a parameter space
  • We have defined a measure to formally check the best combination, also known as a loss function
  • We have tried every possible combination, also known as a brute force approach

The process of searching for and validating the best combinations of parameter values is very common in Machine Learning, also for much more complex challenges. A big difference between our naive approach and more sophisticated techniques is the search efficiency of the latter. The inner workings of these techniques are out of scope for this article, but the following terms are related and worthwhile for further study (see Further studying for tips):

  • (Stochastic) Gradient Descent
  • Hyperparameter optimization

If you are wondering whether we could not simply expand our combinations further for more complex challenges, realize that for this extremely simple example we already tried 250,000 different combinations, and we were lucky enough to be able to narrow our search thanks to the visualizations. In almost any real-life scenario this approach is not feasible due to time and memory constraints.
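To give a flavour of what a more efficient search looks like, below is a minimal, deliberately untuned gradient descent sketch on the same data, minimizing the mean squared error instead of trying every combination; the learning rate and the number of iterations are ad-hoc choices for this tiny example:

import numpy as np

x = np.arange(11)
y = np.array([3, 5.5, 9, 7.5, 11, 5, 8, 11, 18.5, 16, 13.5])

intercept, slope = 0.0, 0.0
learning_rate = 0.01

for _ in range(10_000):
    residual = y - (intercept + slope * x)
    # gradients of the mean squared error with respect to both parameters
    grad_intercept = -2 * np.mean(residual)
    grad_slope = -2 * np.mean(x * residual)
    intercept -= learning_rate * grad_intercept
    slope -= learning_rate * grad_slope

print(intercept, slope)  # ends up very close to the brute-force and scikit-learn results

Instead of evaluating a fixed grid of 250,000 combinations, the parameters are nudged step by step in the direction that lowers the loss; (stochastic) variants of this idea underlie many of the more sophisticated techniques mentioned above.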

Regardless of the shortcomings, I hope this approach was helpful to create a mental picture when you are waiting for your notebook to finish after hitting the fit button.

Further studying

The following resources not only inspired this article, but are also (a lot of!) fun and interesting to study:

Books

  • An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
  • Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow by Aurelien Geron
  • Statistical Rethinking by Richard McElreath

YouTube Series

MOOC

Do not forget to enjoy the journey, since it's all too easy to get hung up on the latest, newest, and coolest techniques. There is nothing wrong with getting the basics right.

*The higher the intercept, the lower the slope must be in order to remain close to the observed values, thereby minimizing the RMSE.

** This post is a Jupyter Notebook, which can be found here.

This article is written by:
Ernst Bolle
info@cmotions.com

DeepFaking myself in the office

Posted on: May 20, 2020

In the last couple of years, my Data Science attention has gone mostly towards text (NLP / NLU), but that does not prevent me from playing around with video. Inspired by Trump's response to his Corona approach and Jim Carrey vs Allison Brie, here is my first attempt at playing with Deep Learning for video and DeepFakes.

While it is by no means perfect, I must say that I am very impressed with how easy it was to accomplish this result. In my opinion, it is therefore a good example of the democratization of AI. Although this trend carries risks (from fake news to garbage in / garbage out), I am a big believer that it will bring us more good than bad.

It took me four days of training time - well, not me, my computer (MSI Prestige with NVIDIA GTX1070). While it was crunching, I was actually away for the weekend. Personally, I think I didn't spend more than two or three hours working on it, most of which went into recording a video of myself to create training data, plus some after-effects.

Another emerging trend is a greater focus on efficiency. Days, weeks or months of training, and pretrained models consisting of billions of parameters and GBs or even TBs in size, are not always easy and quick to get into production. And only when algorithms are in production can they add value. A while back I heard that only 20% of algorithms make it into production. Looking back at my years as a data science consultant, I've seen quite a few organizations that didn't even reach 20%. But that's a story for another time.

For now, let's get back to DeepFakes - or rather, how you can do this yourself! This is not going to be a detailed tutorial, but more the steps I took and the things I learned from them. You also don't need to be very techy to be able to do this!

For starters: currently there are two major packages for deepfakes, FaceSwap and DeepFaceLab. I used the latter, for the sole reason that I had heard great stories about it.

  1. Find a video onto which you want to put a face (for example, your own).
  2. Create a video of around 20 minutes of yourself. I just recorded myself while I was working. I actually found the destination video while I was recording; looking back, I would definitely find the video first. The reason is quite simple: if you imitate the destination video, the results will be better.
  3. Some tips: 
    • Try to get the lighting the same (mine was too light),
    • imitate speech (as I was filming while working, I didn't really speak, so in the results my teeth are somewhat blurry and Jeroen Scott doesn't articulate very well),
    • maybe shave the same way as the person you're imitating.
  4. Then go through the batch files:
    1. clear workspace.bat
    2. extract images from video data_src
    3. extract images from video data_dst FULL FPS.bat
    4. data_src faceset extract.bat
    5. data_dst faceset extract.bat
    6. train SAEHD.bat
      I used SAEHD for training; there are other options, but from all the blogs I read, this one was used most.
    7. TIP: go away for the weekend. Your computer will make a lot of noise and you won't be able to concentrate on other stuff without earplugs. My main question here was: how long do I need to train? I found multiple answers. One said three to seven days (wow, big help), another said until you reach a loss of 0.02. Well, I got to 0.16 after four days; the next day I needed to go to work again and didn't want to be distracted by the noise of my computer, so I stopped. (You can always pick up the training at a later stage!) And I am quite happy with the results.
    8. merge SAEHD.bat
    9. merged to mp4.bat. There is an interactive shell that allows you to alter things like the size of your head (I didn't need to change it), skin color (I ended up using mix-m) and other stuff. The secret here: almost nobody knows what is really happening, so just try some things out.

Check out my DeepFake video!

This article is written by:
Jeroen Kromme
info@cmotions.com

Nachos hackathon postponed

Posted on: March 16, 2020

This week a sizeable batch of Mexican beer was intercepted in the port of Rotterdam. To prevent further spread of this addictive beverage, the government has imposed strict restrictions for the coming weeks on all goods imported from high-risk areas.

This means that the import of tortilla chips, avocados and jalapeño peppers will also be restricted for the time being, and we have unfortunately had to decide to postpone the Nachos hackathon of Cmotions and The Analytics Lab on Friday April 3rd.

An announcement will follow as soon as the new date for the hackathon is final. This will most likely be Friday October 16th; go ahead and pencil this date into your calendars!

If you would like to be kept informed about registering your cartel for our hackathon, email j.kromme@cmotions.nl and we will make sure you don't miss any of the updates about our hackathon.

This article is written by:
Niek Agema
info@cmotions.com

Conversational Analytics Meetup

Posted on: February 16, 2020

Storm Ciara had only just died down when it was already time for our Meetup on Wednesday February 12th at True in Amsterdam. Actually, the storm came too early, because it was starting to become a tradition that our Meetup coincides with a major national event. We have already had a Champions League final, the national public transport strike and the farmers' protest to contend with. Still, we did not worry, because so far we have always managed to turn it into a successful and interesting evening. Last Wednesday's Meetup was of course no exception.

This time the evening was all about Conversational Analytics, with a very interesting programme. After a nice meal and a drink, the evening was kicked off by Lotte Wijngaards (Product Owner Online Service & Conversational Bots) and Nigel Pouw (Web Analyst) of PostNL, who took us into the world of Daan, the digital colleague of the PostNL customer service. Next up were Jeroen Kromme and Elroy Verhoeven, founders of start-up Tailo, a conversational intelligence platform with which they turn customer conversations into valuable insights and actions.

Lotte and Nigel introduced us to Daan, the digital colleague of the PostNL customer service. Lotte told us at length about how Daan came to be, how he can support the customer service, make the work of the customer service colleagues more fun and, on top of that, make sure the customer is helped faster. Of course Daan could not do all of this from day one, so Lotte took us through the journey they had to make to get to this result and the various challenges that came with it. Naturally it does not stop here: PostNL wants to keep improving its customer contact, and Daan can help a lot with that. But even this chatbot cannot do that on his own, so every now and then someone has to take a look at his data. Nigel told us more about how, thanks to analytics, you can continuously gain insights and act on them, in order to help the customer even better. The audience had a lot of questions about Daan, which made it a nice interactive session.

After a short break it was the turn of two of the founders of Tailo. Jeroen and Elroy are data scientists with an extensive background in machine learning who saw that many organizations extract little to no value from the stories and conversations of their customers. Jeroen took us through the various possibilities of Tailo, such as more customer insight, better quality management, better coaching and faster analyses, zooming in a bit deeper on a few aspects. The real data science enthusiasts in the audience could indulge themselves when it was Elroy's turn: he gave us a look under the hood and explained all about the underlying technology. This, of course, also led to a lot of questions and ideas, which meant people were still talking about it during the closing drinks.

In short, it was once again a very successful evening and we are already looking forward to the next one, so keep an eye on our page on the platform https://www.meetup.com/nl-NL/The-Analytics-Lab-Meetup/. If you want to know more about our Meetups or the specific topics we cover, please contact me.

This article is written by:
Jurriaan Nagelkerke
jurriaan.nagelkerke@cmotions.com
Wouter van Gils
wouter.vangils@cmotions.com

Sign up for our Nachos Hackathon @ April 3, 2020

Posted on: February 12, 2020

For years the newspapers have been full of it: the Netherlands is in danger of becoming a nacho state. The use of tortilla chips in nightlife is the rule rather than the exception, and no cargo ship from South America enters the port without hidden packages of jalapeño peppers and crème fraîche stashed in banana boxes. In shady Dutch basements, avocados are being cut into guacamole, and the police's nachotics squad has its hands full rolling up the most exotic varieties of this once so innocent triangular corn product.

'Data was the new gold, nachos are the new data.'

To anticipate this development and grab a piece of the action, the theme of our hackathon this year is 'Nachos'. On Friday April 3rd 2020, the participating teams will try to set up their own data-driven nacho line, from production to sales. To stay ahead of the nacho squad, this nacho line will be set up with the help of data (science). Who will end up with the best nacho network, and who will get caught?

The briefing on the details and the location of the transport will follow, but put the date in your calendar already and form a rock-solid cartel of 3-6 people with your colleagues. Register your nacho gang by sending an email to j.kromme@cmotions.nl.

Read more about our previous hackathons here:

This article is written by:
Jurriaan Nagelkerke
jurriaan.nagelkerke@cmotions.com
Wouter van Gils
wouter.vangils@cmotions.com

Conversational Analytics Meetup @ February 12, 2020

Posted on: January 12, 2020

On Wednesday February 12th it is time for a new edition of our Meetup. This time we will talk about conversational analytics and its applications. Voice is 'hot'. That is why we have secured two interesting speakers who can tell us more about it: PostNL and Tailo.

About the speakers

Nigel Pouw and Lotte Wijngaards of PostNL will talk about Daan: the digital colleague of the PostNL customer service. The motivation, how Daan became "Daan", the results and the continuous insights & steering thanks to analytics... it will all be covered on the 12th!

Our other speakers are the founders of start-up Tailo. They will take us into the world of A.I. & customer contact.

Location & language

Once again we are organizing our Meetup at our 'new' location in Amsterdam, at True. This location is very easy to reach both by car and by public transport. And this session will again be held in Dutch, because both cases were carried out in the Dutch language.

True BV

Keienbergweg 100, Amsterdam

Agenda

18:00 – 19:00 – Welcome with food and drinks

19:00 – 19:50 – PostNL

20:00 – 20:50 – Tailo

20:50 – 21:30 – Closing drinks

Sign up

We look forward to seeing you on February 12th! You can sign up via our Meetup page.

This article previously appeared on Meetup.com.

This article is written by:
Jurriaan Nagelkerke
jurriaan.nagelkerke@cmotions.com
Wouter van Gils
wouter.vangils@cmotions.com