Getting value out of geodata with AI: getting started
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a biologist's question about wolf presence. What started with a small question and some enthusiastic colleagues ended up in a whole lot of Python code, an interpretable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data for use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
What do you need to know to start working with geographical data?
Have you ever wondered how Google Maps calculates the distance between two places? Or how the government keeps track of where utilities such as sewers, gas pipes and electricity cables are located? Both are examples of Geographical Information Systems (better known as GIS) working with geographical data. Geographical data is data related to a specific place or area on earth, for example your address or a set of coordinates. If you want to start working with geographical data, the most important concept is GIS. A GIS is an information system in which spatial data, or information about geographical objects, can be stored, managed, edited, analyzed, integrated and presented.[1]
GIS systems have been around for a while and have been further developed in recent years. But how did it actually start? In the 1960s, Canadian geographer Roger Tomlinson came up with the idea of using a computer to aggregate natural resource information and create an overview per province. With this, the first Geographic Information System was born. In 1985, GIS was used for the first time in the Netherlands.[2]
Projecting information on a map
Today there are numerous ways to use GIS and thus to work with geographical data. Before you choose how you want to work with it, it is important to understand a few concepts. Let's start with map projections and the associated coordinate reference systems (CRS): how do we make sure the round earth can be shown on a flat, two-dimensional map? To represent the Earth on a map with reasonable accuracy, cartographers have developed map projections. Map projections try to represent the round world in 2D with as little distortion as possible. Each projection handles this differently and has its own advantages and disadvantages. For example, one projection preserves shape well but does not display the correct size of all countries, while another does not keep the right shape but is more accurate in size. If you want to see the true size and shape of the world, you will always have to look at a 3D map or a globe. If you want to know more about the consequences of different projections, we can recommend the video Why all world maps are wrong.

[3] https://medium.com/nightingale/understanding-map-projections-8b23ecbd2a2f
Coordinate reference systems are a framework that defines how a point on the round earth translates to the same point on a two-dimensional map. There are two types of reference systems: geographic CRS and projected CRS. A geographic CRS defines where the data is located on the earth's surface, while a projected CRS tells the data how to be drawn on a flat surface, such as a paper map or a computer screen.[4] A geographic CRS is based on longitude and latitude, the numbers that describe where on the round Earth you are. Longitude is the angle between the Prime Meridian (at Greenwich) and a point on Earth, measured in an easterly direction. Latitude is the angle between the equator and a point, measured in two directions, where all points on the Southern Hemisphere are negative. A projected CRS defines the place on a two-dimensional map instead of on the round world. Here, x- and y-coordinates are used and the distance between all neighboring x- and y-coordinates is the same.[5]

[4] https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
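To make this concrete, here is a minimal Python sketch of translating a point from a geographic CRS to a projected CRS using the pyproj library. The EPSG codes and the example coordinates (roughly Utrecht) are illustrative choices, not taken from the original notebooks.

from pyproj import Transformer

# A geographic CRS (WGS84, EPSG:4326) works in degrees of longitude and latitude;
# the Dutch projected CRS "RD New" (EPSG:28992) works in meters on a flat grid.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28992", always_xy=True)

lon, lat = 5.1214, 52.0907  # example point, roughly Utrecht
x, y = transformer.transform(lon, lat)

print(f"geographic (lon, lat): ({lon}, {lat})")
print(f"projected  (x, y):     ({x:.0f}, {y:.0f})")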
Different types of layers
Two other important concepts are raster and vector layers. GIS files are built up from different map layers, and these layers can be constructed in two ways. Just as there is a difference between raster images and vector images, there are raster layers and vector layers, and this determines how a layer is created. Raster layers consist of a collection of pixels. Vector layers, on the other hand, consist of a collection of objects. These objects can be points, lines, or polygons. Points consist of x and y coordinates, usually latitude and longitude. Line objects are vectors that connect points. And polygons are areas on the map. Sometimes multiple areas are represented as one object; these are called multipolygons. Vector layers are the most commonly used when working with geographical data.

[6] https://medium.com/analytics-vidhya/raster-vs-vector-spatial-data-types-11325b83852d
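As a small illustration, the shapely library (which GeoPandas builds on) lets you create exactly these vector objects in Python. The coordinates below are made-up example values.

from shapely.geometry import Point, LineString, Polygon, MultiPolygon

point = Point(5.12, 52.09)                         # a single location (x, y)
line = LineString([(5.12, 52.09), (4.90, 52.37)])  # a vector connecting points
polygon = Polygon([(4.8, 52.3), (5.0, 52.3), (5.0, 52.5), (4.8, 52.5)])  # an area
multipolygon = MultiPolygon([                      # several areas treated as one object
    polygon,
    Polygon([(5.1, 52.0), (5.2, 52.0), (5.2, 52.1)]),
])

print(point.geom_type, line.geom_type, polygon.geom_type, multipolygon.geom_type)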
Saving your map
If you are working with geographical data and want to save your file, it is important to know which file formats exist for geodata. Where text files can be saved as .txt or .docx and Excel files as .xlsx or .csv, there are also specific file formats for geodata. The most common format for vector data is the Shapefile. A Shapefile is not a single file, but a collection of files with the same name, placed in the same directory, each with a different extension. To open a Shapefile you need at least a .shp file (the Shapefile itself), a .shx file (the Shapefile index) and a .dbf file (the Shapefile attribute data). Other files such as .prj (the Shapefile projection file) can be included for extra information.[7] Another, relatively new, format is GeoPackage. This format stores vector features, tables and rasterized tiles in a SQLite database.
Both Shapefiles and GeoPackage files can be exported from any GIS and loaded into any other GIS. If you keep working in the same GIS and the same directory, it is also possible to save your work as a GIS project. In that case, it is important that the data you have loaded stays in the same directory, because the project does not save the data itself, only the reference to it.
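In Python, reading and writing both formats with GeoPandas could look like the sketch below; the file and layer names are placeholders for your own data.

import geopandas as gpd

# GeoPandas infers the format from the file extension.
gdf = gpd.read_file("areas.shp")  # reads the .shp together with its .shx/.dbf/... companions

# Save as a Shapefile (again written as a set of files with the same name) ...
gdf.to_file("areas_copy.shp")

# ... or as a GeoPackage: a single SQLite-based file that can hold multiple named layers.
gdf.to_file("areas.gpkg", layer="areas", driver="GPKG")
gdf_from_gpkg = gpd.read_file("areas.gpkg", layer="areas")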
Create a map yourself
Now you know the basic concepts of working with geographical data. The next step is to decide which software you want to use. In general, there are two types of options:
- Via dedicated GIS software
- Via common programming languages
Two well-known dedicated GIS systems are ArcGIS and QGIS. ArcGIS is paid software that you use by means of a license. QGIS, on the other hand, is open-source software, which means it is free for the user. QGIS is an official project of the Open Source Geospatial Foundation, a non-profit organization that aims to make the use of geodata accessible to everyone. Both programs are similar in use and offer similar functionality and capabilities.
In addition to GIS software, it is nowadays also possible to work with geodata in Python or R. Several geospatial packages are available that make this possible. A well-known package for Python is GeoPandas. The goal of GeoPandas is to make working with geographic data in Python easier; it combines the capabilities of pandas and shapely.[8] GeoPandas stores data in GeoDataFrames. These are similar to pandas DataFrames, but an important difference is that a GeoDataFrame always contains a geometry column, which stores the geographic data for each row.
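A minimal sketch of what that looks like: a plain pandas DataFrame with coordinate columns becomes a GeoDataFrame once a geometry column and a CRS are attached. The city names and coordinates are illustrative.

import pandas as pd
import geopandas as gpd

df = pd.DataFrame(
    {"city": ["Utrecht", "Amsterdam"], "lon": [5.12, 4.90], "lat": [52.09, 52.37]}
)

# The geometry column holds a Point for every row; the CRS says the values are WGS84 degrees.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",
)

print(gdf)
gdf.plot()  # a quick static map of the points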
QGIS and GeoPandas with Python are used differently, but the possibilities of both options are largely the same. Both can load different file formats and easily plot the geographical data for you. However, the visualization is better in QGIS, since the map in Python is a static map that does not let you zoom. In QGIS you can easily zoom and add a standard base layer (such as a layer from Google Maps) to put your data into a broader perspective. Furthermore, many analyses are available in both, for instance determining the distance between two places or creating a buffer around your polygons. These analyses of geographical data are discussed in more detail in this blog.
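For example, a distance and a buffer calculation in GeoPandas might look like the sketch below. Note the reprojection to a projected CRS (here EPSG:28992, an illustrative choice) so that the results are in meters rather than degrees.

import geopandas as gpd
from shapely.geometry import Point

# Two points in a geographic CRS (degrees) ...
points = gpd.GeoSeries([Point(5.12, 52.09), Point(4.90, 52.37)], crs="EPSG:4326")

# ... reprojected to a projected CRS so distances and buffers are in meters.
points_m = points.to_crs("EPSG:28992")

print(points_m.iloc[0].distance(points_m.iloc[1]))  # distance between the two points in meters

buffers = points_m.buffer(1000)  # a 1 km buffer around each point, resulting in polygons
print(buffers.geom_type)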
In short, you can work with geographical data in a GIS. For this, you can use dedicated software such as QGIS, or the GeoPandas package in Python. There are different map projections and associated coordinate reference systems to project the three-dimensional world onto a two-dimensional map. Furthermore, GIS files are built up from different map layers, which can be vector layers or raster layers. And these GIS files can be saved as a Shapefile or as a GeoPackage.
So, now you know everything you need to know to start working with geographical data.
This article is part of our series about working with geographical data. The entire series is listed here:
- Getting value out of geodata with AI: getting started
- Getting value out of geodata with AI: convert locations to their lat and lon
- Getting value out of geodata with AI: data preparation
- Getting value out of geodata with AI: train the model
- Getting value out of geodata with AI: explainability using SHAP
- Getting value out of geodata with AI: visualize the model predictions
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Sources
[1] https://nl.wikipedia.org/wiki/Geografisch_informatiesysteem
[2] https://www.esri.nl/nl-nl/over-ons/wat-is-gis/geschiedenis
[3] https://medium.com/nightingale/understanding-map-projections-8b23ecbd2a2f
[4] https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
[5] https://desktop.arcgis.com/en/arcmap/10.3/guide-books/map-projections/about-projected-coordinate-systems.htm#:~:text=A%20projected%20coordinate%20system%20is%20always%20based%20on%20a%20geographic,the%20center%20of%20the%20grid.
[6] https://medium.com/analytics-vidhya/raster-vs-vector-spatial-data-types-11325b83852d
[7] https://www.e-education.psu.edu/geog585/node/691
[8] https://geopandas.org/en/stable/
Complete Azure DevOps CI/CD guide for your Azure Synapse based Data Platform – part II
Posted on: June 23, 2023
I recently wrote an article describing how to implement Continuous Integration and Continuous Deployment (CI/CD) for your Azure Synapse based data platform. At the end of part one we set up the infrastructure required for CI/CD and did our first parameterized deployment from our development environment to the production environment.
In part two we are going to focus on specific use cases to make our CI/CD process more complete and more robust.
Situation
We have just set up our first Azure DevOps CI/CD pipeline for our Azure Synapse based data platform (preferably with the help of part 1). It does the job well, but we run into some limitations that are not immediately obvious to solve. Specifically, two use cases: what to do with triggers that should behave differently between our development and production environment, and what to do with our serverless SQL databases, which are currently not easy to deploy to production.
Dynamically adjust triggers between your development and production environment
It is likely that you want different triggers running your Synapse pipelines in the development and production environment. Azure Synapse has the functionality to start and stop triggers; when a trigger is stopped, it will not run the pipeline it is attached to. This gives us the opportunity to start and stop specific triggers in development and production. This is important for multiple reasons. For example, we do not want to run our development pipelines at the same time as our production pipelines because of workload constraints.
Luckily, we can automate the process of starting and stopping triggers. We toggle the triggers on or off in our Azure DevOps pipeline by adding an extra task to our YAML file. The following code will stop all the triggers in our development environment.
- task: toggle-triggers-dev@2
  displayName: 'Toggle all dev triggers off'
  inputs:
    azureSubscription: '${{ parameters.subscriptionDev }}'
    ResourceGroupName: '${{ parameters.resourceGroupNameDev }}'
    WorkspaceName: '${{ parameters.synapseWorkspaceNameDev }}'
    ToggleOn: false
    Triggers: '*'
Code snippet 1: Stopping all triggers in dev
It is worth paying attention to the specific value for "Triggers". In the example above we give it the wildcard value '*', which means that all the triggers in the development environment will be stopped. We could also give it the hard-coded names of the triggers we want to stop, as seen in code snippet 2.
- task: toggle-triggers-dev@2
  displayName: 'Toggle specific triggers off'
  inputs:
    azureSubscription: '${{ parameters.subscriptionDev }}'
    ResourceGroupName: '${{ parameters.resourceGroupNameDev }}'
    WorkspaceName: '${{ parameters.synapseWorkspaceNameDev }}'
    ToggleOn: false
    Triggers: 'trigger1,trigger2,trigger3'
Code snippet 2: Stop specific triggers
In this example, only the triggers called trigger1, trigger2, and trigger3 will be stopped. This solves part of our problem, because we can now toggle the production triggers in the production environment by adding them to the list, and the same applies to the development environment. However, it would mean that we always need to adjust our Azure DevOps pipeline when we create new triggers, which is not ideal. That is why we need to go one step further and automate this process.
We start with naming conventions for our triggers. We give every trigger in our development environment that needs to be started a name containing "_dev", and every trigger in our production environment that needs to be started a name containing "_prd". Now we have something that we can use to recognize which environment a trigger is meant for. Unfortunately, at the time of writing this article, it is not possible to use any kind of string matching in the "Triggers" parameter; we need to give it the exact names of the triggers. That is why we need a workaround.
In order to get the correct list of triggers, we are going to use a PowerShell script, which we will run in the Azure DevOps pipeline instead of the task described above.
$triggers = az synapse trigger list `
    --workspace-name ${{ parameters.synapseWorkspaceNameDev }} `
    --query "[].name"

foreach ($trigger in $triggers) {
    if ($trigger.Contains("_dev")) {
        $trigger = $trigger.Trim() -replace '[\W]', ''
        az synapse trigger start `
            --workspace-name ${{ parameters.synapseWorkspaceNameDev }} `
            --name $trigger
    }
}
Code snippet 3: Starting triggers dynamically in development
First, we create a variable named "triggers", which will contain a list of all the triggers in our development environment. We achieve this with the Azure CLI. Make sure to add --query "[].name" to only get the names of the triggers. Next, we loop over this list and check for every trigger whether its name contains "_dev". If so, we strip whitespace and other non-word characters from the name and run an Azure CLI command to start the trigger. This way, all the triggers with "_dev" in their name will be started.
We run this PowerShell script using an Azure CLI task in our YAML file.
#Start triggers in dev synapse environment
- task: AzureCLI@2
  displayName: 'Toggle _dev triggers on'
  continueOnError: false
  inputs:
    azureSubscription: '${{ parameters.subscriptionDev }}'
    scriptType: pscore
    scriptLocation: inlineScript
    inlineScript: |
      $triggers = az synapse trigger list `
          --workspace-name ${{ parameters.synapseWorkspaceNameDev }} `
          --query "[].name"
      foreach ($trigger in $triggers) {
          if ($trigger.Contains("_dev")) {
              $trigger = $trigger.Trim() -replace '[\W]', ''
              az synapse trigger start `
                  --workspace-name ${{ parameters.synapseWorkspaceNameDev }} `
                  --name $trigger
          }
      }
Code snippet 4: Azure DevOps pipeline task to run PowerShell script
The same can be done for our production environment. By using the naming conventions in combination with the PowerShell script, we can now automatically and dynamically start and stop triggers in our Azure Synapse environments. We no longer need to manually add triggers to our Azure DevOps pipeline; we only need to stick to our naming conventions, which will hopefully result in fewer bugs.
How to deal with your serverless SQL pool?
At the moment of writing this blog, Microsoft does not have an out-of-the-box solution for automatically deploying our serverless databases and the associated external tables from our development to our production environment. Therefore, we use a pragmatic and fairly simple workaround for this limitation.
In this example we are going to focus on creating external tables on our data, which we need in our production environment. The external tables are not automatically deployed by the Azure DevOps pipeline, so we need a workaround. We do this by manually creating a migration pipeline in our development Synapse workspace.

Figure 1: Example of migration pipeline
As an example, I have created three Script activities in the pipeline, each containing a SQL script that creates external tables on the existing data. We already ran these scripts in our development environment, but we want to run them in our production environment as well.
Since we do not want to do this manually, we need to find a way to automate this. We will do this by adding the following task at the end of our Azure DevOps deployment pipeline.
#Trigger migration pipeline
- task: AzureCLI@2
  condition: eq('${{ parameters.triggerMigrationPipeline }}', 'true')
  displayName: 'Trigger migration pipeline'
  inputs:
    azureSubscription: '${{ parameters.subscriptionPrd }}'
    scriptType: pscore
    scriptLocation: inlineScript
    inlineScript: |
      az synapse pipeline create-run `
          --workspace-name ${{ parameters.synapseWorkspaceNamePrd }} `
          --name ${{ parameters.migrationPipelineName }}
Code snippet 5: Azure DevOps pipeline task to trigger migration pipeline
After the Synapse workspace is fully deployed to our production environment, this task triggers the migration pipeline we just created. We run the migration pipeline by executing the Azure CLI command "az synapse pipeline create-run", specifying the Synapse workspace name and the name of the migration pipeline. By running this pipeline in our Synapse production environment, we ensure that the external tables are created in the production serverless SQL pool.
As you can see, we added a condition to the task stating that it will only run if our parameter "triggerMigrationPipeline" is set to "true". By adding this to our CI/CD pipeline, we can trigger the migration pipeline in our production environment only when we want to.
In the example above we focused on creating external tables in the serverless SQL database, but the migration pipeline can be used for multiple purposes. For instance, if we have pipelines that need special initialization to run properly, we can put those initialization activities in the migration pipeline, for example creating a stored procedure that one of our pipelines uses. In short, the migration pipeline can contain all the activities needed to make sure the triggered pipelines run without errors and everything is properly initialized.
If you are looking for a more complete version of implementing CI/CD for your serverless SQL pool, you can check out this blog by Kevin Chant, in which he uses a .NET library called DbUp.
Summary
The use cases discussed above will make your CI/CD process more dynamic and robust. By dynamically adjusting triggers between your environments and adding a migration pipeline to migrate your serverless databases, you will have less manual work when deploying to your production environment. Of course, there are more automation possibilities and use cases to further enhance the CI/CD process for your Azure Synapse based data platform, and as the Microsoft stack develops, we will get new possibilities to make our lives easier.
If you would like more in-depth code or want to contribute yourself, check out: https://github.com/atc-net/atc-snippets/tree/main/azure-cli/synapse/Publish.
Princess Peach rescued from the claws of Bowser at our Super Mario Hackathon
Posted on: May 3, 2023
On April 21, 2023, we held an epic hackathon where 10 teams from KPN, Rabobank, DPG, Athora, ANWB, UWV, Lifetri, and Eneco joined forces to rescue our beloved Princess Peach from the claws of Bowser. It was a fierce battle, but in the end only one team emerged victorious. And who was that, you may ask? None other than the heroic team from UWV! Congratulations, heroes!

But let’s not forget all the other teams who participated. You all showed great courage and skill, and we thank you from the bottom of our hearts. It was truly an adventure we’ll never forget.
Now, without further ado, we present to you the amazing aftermovie of this epic hackathon, created by none other than Jan Persoon. Check it out below, and relive the excitement!
But wait, there’s more! We’re already hard at work preparing for our next hackathon of 2024, so keep an eye on our social media channels for updates. In the meantime, enjoy the aftermovie and let it inspire you for your next adventure.
Until next time, may the stars guide your path!
