Bridging the Gap: How AI Translates Sign Language into Speech
Posted on: February 28, 2025
Exploring how AI-powered models transform sign language into natural-sounding spoken words
At Cmotions, we enjoy leveraging AI to solve business or societal challenges. One of the challenges we set out to address is communication accessibility for individuals who rely on sign language. What if AI could bridge the gap between sign language and spoken language, making interactions more seamless?
With that vision in mind, we set out to explore whether we could build something that interprets (WL)ASL sign language and converts it into English speech: in other words, sign-to-speech translation using AI. The complete process consists of several steps. First, we fine-tuned a pre-trained model to translate videos of sign language into predicted ‘glosses’ (textual representations of signs); in another article we walk you through that fine-tuning process. In this notebook we use the fine-tuned model to interpret short video clips of sign language and convert them into written glosses. These glosses are then fed into a large language model (LLM), which transforms them into a grammatically correct English sentence. Finally, the written output is converted into speech, making sign language accessible through audio. In short, below we show how we use the model we trained to (1) transform a hand sign video into glosses (the words or terms expressed), then (2) into a sentence everyone can understand and finally (3) into spoken text.
Follow along to see how a video of someone signing the sentence “Where do you prefer to go on vacation?” is converted into a clear, natural-sounding audio note.
Let’s start by displaying the video. As you can see, this video is a combination of separately expressed signs. This also reflects how we currently trained our model: on individual signs. For each clip, our model classifies the most probable sign (or ‘gloss’ in sign language lingo) that is expressed.
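For completeness, this is roughly how we display the clip in the notebook; a minimal sketch, assuming the combined video is stored locally as 'full_sentence.mp4' (a hypothetical file name):

from IPython.display import Video

# Show the combined sign video in the notebook (hypothetical local file name)
Video("full_sentence.mp4", embed=True, width=480)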
Tools & Packages
What packages do we use? Here’s an overview:
- Torch: Torch is an open-source machine learning library, the backbone for many deep learning use cases.
- Ollama: Ollama is a lightweight, extensible framework for building and running language models efficiently.
- Langchain: LangChain is a framework for developing applications powered by large language models (LLMs).
In the three steps (video to glosses, glosses to natural text sentence, and natural text sentence to audio) we use different tools and models:
- Transformers, opencv-python, torchvision and pytorchvideo: to apply our finetuned model to the new video file snippets to predict the glosses.
- Ollama and langchain: to transform the predicted glosses to a natural sentence.
- gTTS and ipython audio: to create and play the natural sentence as an audio file.
You can learn more about these packages in the following resources:
- https://ipython.org/ipython-doc/3/api/generated/IPython.display.html
- https://pytorch.org/
- https://ollama.com/
- https://www.langchain.com/
- https://pypi.org/project/opencv-python/
- https://pytorch.org/vision/stable/index.html
- https://pypi.org/project/pytorchvideo/
- https://pypi.org/project/gTTS/
# make sure you have ollama installed and pulled the llm used: %sh ollama pull qwen2.5:32b-instruct
import torch
import os
import pytorchvideo
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from pytorchvideo.transforms import ApplyTransformToKey, UniformTemporalSubsample
from torchvision.transforms import Compose, Lambda, Normalize, Resize
import pytorchvideo.transforms as transforms
import pytorchvideo.data
from pytorchvideo.data.encoded_video import EncodedVideo
import imageio
import numpy as np
from IPython.display import Image
import glob
import itertools
Code snippet 1: Import the packages we use
Data Collection
For this project we aim to develop an AI-driven tool capable of interpreting (WL)ASL sign language and converting it into English speech. The first step is to interpret short video clips of sign language and convert them into written glosses.
What is a gloss?
“When a word is associated with a sign it’s called a GLOSS: In simplest terms, a GLOSS is a label. In ASL it is an English word or words that we use to name ASL signs so that we can talk about these signs. The word or words associated with that sign do not represent the full meaning of the sign; at best they approximate its meaning. A GLOSS is a label with very weak adhesive, it’s not stuck on very securely. Some signs have several different possible glosses. For instance, the words: “IMPORTANT”; “WORTH” and “VALUE” could all be used to label the same ASL sign a gloss is a brief notation, especially a marginal or interlinear one, of the meaning of a word or wording in a text.” – Rick Mangan 2002
Training Data
To train a model that can translate a video fragment of a sign into the most probable gloss, we use the dataset from “Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison”. This WLASL dataset is the largest video dataset for Word-Level American Sign Language (ASL) recognition and features 2,000 common ASL words.
Connect to blob storage
To develop our solution, we created a subset of signs from this huge dataset. Based on this subset (29 hand gestures) we split the data into training, test and validation sets to develop and evaluate our fine-tuned model.
Load the fine-tuned model
The model was fine-tuned on this selection of training data; in our other blog we show how we fine-tuned it to predict glosses. Here we load our model and apply it to a sequence of sign videos to predict their glosses. Now we want to use it in practice! Since we used a pretrained open-weights model, which we could download and fine-tune using the Transformers package, we can use the same package for inference (applying the model to our new series of sign videos).
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
model_ckpt = '/dbfs/mnt/handsigntospeech/Models/-sign_finetuned-2024-11-03T08/best_model/'
image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
model = VideoMAEForVideoClassification.from_pretrained(
model_ckpt,
ignore_mismatched_sizes=True,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
Code snippet 2: Load the model we finetuned
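As a quick sanity check, you can inspect the label mapping stored in the model config; for our fine-tuned checkpoint this should list the 29 glosses we trained on:

# The classification head of our fine-tuned model maps indices to gloss labels
print(model.config.num_labels)   # expected: 29
print(model.config.id2label)     # index-to-gloss mapping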
Pre-processing
Before we can apply our model to the sequence of sign videos, we have to preprocess those videos. This preprocessing transforms the videos into the same format as the training videos were given before we fine-tuned the Transformer model. The following custom function pre-processes the videos in terms of size and encoding, so they can be classified.
# Defaults for all videos
mean = image_processor.image_mean
std = image_processor.image_std
clip_duration = 1.71
# Define the inference function for a single video
def predict_video_class(video_path, model=model):
    # Set size for resize transform
    if "shortest_edge" in image_processor.size:
        height = width = image_processor.size["shortest_edge"]
    else:
        height = image_processor.size["height"]
        width = image_processor.size["width"]
    resize_to = (height, width)

    # Define video transformations
    val_transform = Compose([
        UniformTemporalSubsample(model.config.num_frames),
        Lambda(lambda x: x / 255.0),
        Resize(resize_to, antialias=True),
        Lambda(lambda x: x.permute(1, 0, 2, 3)),
        Normalize(mean, std)
    ])

    # Load and decode the video
    video = EncodedVideo.from_path(video_path)

    # Extract a clip from the video
    video_data = video.get_clip(start_sec=0, end_sec=clip_duration)

    # Apply transformations to the video
    video_frames = video_data['video']
    video_frames = val_transform(video_frames)

    # Add batch dimension (batch_size=1) before passing to the model
    video_frames = video_frames.unsqueeze(0)  # Shape: [1, T, C, H, W]
    video_frames = video_frames.to(device)

    # Run inference on the model
    with torch.no_grad():
        output = model(pixel_values=video_frames)  # Pass inputs into model

    # Extract logits and get the predicted class
    logits = output.logits
    predicted_class_idx = logits.argmax(-1).item()
    predicted_class_label = model.config.id2label[predicted_class_idx]
    return predicted_class_label
Code snippet 3: Function to preprocess video and score with our finetuned model
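As a small illustration, this is how you would score a single, locally stored clip with the function above ('where.mp4' is a hypothetical file name):

# Score one local sign clip with the fine-tuned model (hypothetical file name)
predicted_gloss = predict_video_class("where.mp4")
print(predicted_gloss)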
Inference
Now that we have the packages loaded and the preprocessing function ready, we can load the separate sign videos (GLOSS_1, GLOSS_2, GLOSS_3 and GLOSS_4), preprocess them and run them through our fine-tuned model. This results in the predicted glosses.
import requests
import os
# Blob storage URL
blob_path = 'https://bhciaaablob.blob.core.windows.net/handsigntospeech/article sentence/'
video_names = ['GLOSS_1.mp4', 'GLOSS_2.mp4', 'GLOSS_3.mp4', 'GLOSS_4.mp4']
# Create URL list
video_list = [blob_path + video_name for video_name in video_names]
# Local storage path
save_dir = '/dbfs/mnt/handsigntospeech/'
# Ensure directory exists
os.makedirs(save_dir, exist_ok=True)
gloss_list = []
for video_name, video_url in zip(video_names, video_list):
    try:
        # Download video
        response = requests.get(video_url)
        response.raise_for_status()  # Raise error if request fails

        # Save video with unique filename
        local_video_path = os.path.join(save_dir, video_name)
        with open(local_video_path, 'wb') as f:
            f.write(response.content)

        # Process video using model
        predicted_class = predict_video_class(local_video_path, model)
        gloss_list.append(predicted_class)
        print(f"Processed {video_name} -> Predicted class: {predicted_class}")
    except requests.RequestException as e:
        print(f"Failed to download {video_name}: {e}")
    except Exception as e:
        print(f"Error processing {video_name}: {e}")
# Print final results
print("Final Gloss List:", gloss_list)
#> Processed GLOSS_1.mp4 -> Predicted class: where
#> Processed GLOSS_2.mp4 -> Predicted class: vacation
#> Processed GLOSS_3.mp4 -> Predicted class: you
#> Processed GLOSS_4.mp4 -> Predicted class: prefer
#> Final Gloss List: ['where', 'vacation', 'you', 'prefer']
Code snippet 4: Load the sign language videos and process them with our finetuned model
This result already looks pretty understandable, right? ‘where’, ‘vacation’, ‘you’ and ‘prefer’ are the predicted glosses. Maybe you can guess the meaning of this set of glosses, but you will also recognize that it’s not that ‘fluent’ yet and can easily lead to misinterpretation. American English follows a Subject-Verb-Object (SVO) word order, whereas in ASL the word order depends on topic-comment relations. Therefore, we can find multiple word orders, such as Subject-Verb-Object or Subject-Verb, and also Time-Subject-Verb-Object or Time-Subject-Verb. Luckily, such in-depth knowledge of sign language is to a great extent present in most powerful pretrained language models. Therefore, in the next step we take advantage of large language models to transform the list of glosses into a natural-sounding sentence that everyone can understand.
We use the popular langchain package in combination with ollama, the hub where many large language models are made easily accessible. The template contains the prompt we engineered (prompt engineering is short for trial and error until you end up with a prompt that makes the LLM do what you want it to do) to translate the gloss texts into a fluent sentence representing the meaning of the video. We use the LLM qwen2.5 32b for this task since it performs well on many tasks and is not too big in size (still 32 billion parameters… but it compared favorably to other models at the time of our project).
The last step is to transform the written text into audio. We can use the gTTS package (Google Text-to-Speech) to do this with one line of code. Listen to the result below!
from typing import List
from langchain import PromptTemplate, LLMChain
from langchain_ollama import OllamaLLM
from gtts import gTTS
import os
import IPython
def gloss_list_to_speech(gloss_text_list: List[str], llm: OllamaLLM, template: str) -> gTTS:
    """
    Converts a sequence of glosses into natural language text and then into speech.

    Args:
        gloss_text_list (List[str]): A list of strings of glosses.
        llm (OllamaLLM): The language model to use for converting glosses to natural language.
        template (str): The prompt template to use for the language model.

    Returns:
        gTTS: The generated speech object.
    """
    prompt = PromptTemplate(template=template, input_variables=["text"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)

    gloss_text = ' | '.join(gloss_text_list)
    natural_text = llm_chain.invoke(gloss_text)['text']

    language = 'en'
    natural_audio = gTTS(text=natural_text, lang=language, slow=False)
    audio_name = gloss_text.replace(' | ', '_') + '.mp3'
    natural_audio.save(audio_name)

    IPython.display.display(IPython.display.Audio(audio_name, autoplay=True))
    return natural_audio
llm = OllamaLLM(model="qwen2.5:32b-instruct")
template = """
[INST] <<SYS>>
You are a professional WLASL (World Level American Sign Language) sign language translator. You receive a sequence of glosses, each delimited by a pipeline (|) symbol. Each gloss represents a word or phrase that conveys the meaning of a sign in WLASL. Your task is to convert the sequence of glosses into a well-structured, concise sentence that can be easily understood by someone who does not know sign language. Ensure the sentence is grammatically correct and natural in tone. Important: the output must not contain the pipeline (|) delimiter.
Here are some examples:
input: 'You | Name | What | You'
output: 'What is your name?'
input: 'You | Live | Where | You'
output: 'Where do you live?'
Only provide back the requested output, do not introduce your answer! Here is the glosses string, between three backticks: <</SYS>>
```{text}```[/INST]
Your answer:
"""
gloss_list_to_speech(gloss_list, llm, template)
Code snippet 5: Translate gloss labels to natural text and audio
Here is the result of our pipeline, which transforms a hand sign video into gloss labels, those labels into a sentence everyone can understand, and finally into spoken text:
We hope you like how we combined a pretrained model and public data to fine-tune a model, applied it to videos, and used large language models and text-to-speech technology to translate sign videos into fluent, spoken text. This can help those who depend on sign language to communicate more easily with the wide audience many of them deserve, no longer limited by how much sign language that audience knows. This is how technology can help improve lives, and that’s why we love it!
Fine-Tuning a Video Classification Model for Hand Sign Recognition in Python
Posted on: February 28, 2025
Data Science is an ever-evolving field, and we at Cmotions are always ready to evolve with it. Each year, we select a project that pushes our boundaries—preferably by exploring a new skill, addressing a relevant societal issue, and simply engaging in a fun team challenge. This year, we set out to tackle AI video models, specifically focusing on translating (WLASL) sign language into spoken words. This project has yet again been a fantastic opportunity to work collaboratively, learn from one another, and have fun along the way. If you want to read more about the outline of the project, read our other article. In this article we will focus more on the Python code and more specifically on the changes we made to make it work for our project. You can check out the Python code we developed on our Gitlab repository. In the following sections, we’ll dive into the model and training approach that brought our project to life.
1. The data: WLASL
When starting to work on this project we first had to find suitable data, fit for our project and preferably also well known within the world of hand sign language. This search has led us to the WLASL dataset, which contains over 2000 glosses with multiple videos. In sign language, a gloss is a way to represent a sign with a word, like writing “thank you” to describe the hand gesture for saying thank you. Since the WLASL dataset provided us with 2000 of these glosses, we already had a large amount of labeled training data to start off with.
To make this data suitable for our project, we first restructured the way the files were stored. We needed them to be organized into separate train and test folders, with each containing subfolders for individual glosses. Each subfolder held the corresponding videos for that gloss. Since the dataset included a JSON overview of the files, we were able to efficiently reorganize them using simple Python code.
Here you can see the json we used as input:
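Since the screenshot is not reproduced here, the sketch below illustrates what a single entry in that JSON roughly looks like; the values are made up, but the keys (gloss, split, video_path) are the ones the restructuring code below relies on:

# Illustrative structure of one item in WLASL_parsed_data_adjustedpath.json
# (values are made up; only the keys matter for the code below)
example_item = {
    "gloss": "book",
    "split": "train",
    "video_path": "videos/book_00335.mp4",
}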

And the Python code we used to restructure the files:
import os
import shutil
import json
# Load the JSON data
with open('WLASL_parsed_data_adjustedpath.json', 'r') as f:
    data = json.load(f)

# Base directory where the new folders will be created
base_dir = 'data'

moved_files = 0
missing_files = 0

for item in data:
    # Get the current path to the video
    current_path = item['video_path']

    # Check if the file exists before taking the next steps
    if os.path.exists(current_path):
        # Get the split (train/test/val) and gloss (label) from the JSON item
        split = item['split']
        gloss = item['gloss']

        # Create the split and gloss directories if they don't exist
        split_dir = os.path.join(base_dir, split)
        os.makedirs(split_dir, exist_ok=True)
        gloss_dir = os.path.join(split_dir, gloss)
        os.makedirs(gloss_dir, exist_ok=True)

        # Create the new path of the video
        new_path = os.path.join(gloss_dir, os.path.basename(current_path))

        # Move the video to the new directory
        shutil.move(current_path, new_path)
        moved_files += 1
        print(f"the video {current_path} is moved to {new_path}")
    else:
        missing_files += 1
        print(f"the video {current_path} does not exist")

print(f"Moved {moved_files} files and {missing_files} files are missing")
Code snippet 1: Restructure WLASL data files
When this was done, the last step we took in preparing the data was to select a subset of 29 glosses to focus our training on.
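The exact way we selected the subset is not shown here, but a minimal sketch could look like this; it assumes the train/test/val folder structure created above and a hypothetical list of gloss names:

import os
import shutil

# Hypothetical subset; in our project we kept 29 glosses
selected_glosses = ["hello", "tea", "work", "thank you"]

base_dir = "data"
for split in ["train", "test", "val"]:
    split_dir = os.path.join(base_dir, split)
    if not os.path.isdir(split_dir):
        continue
    for gloss in os.listdir(split_dir):
        # Remove every gloss folder that is not part of the selected subset
        if gloss not in selected_glosses:
            shutil.rmtree(os.path.join(split_dir, gloss))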
2. The model: VideoMAE
Our efforts on building a model that is capable of understanding hand sign language start with a pretrained model: VideoMAE. This is a state-of-the-art self-supervised video pre-training model that excels at learning spatial and temporal representations from video data. Unlike traditional models that rely heavily on labeled datasets, VideoMAE uses a high masking ratio during pre-training, enabling it to learn from vast amounts of unlabeled video footage. This makes it particularly effective for video-based tasks requiring motion understanding, so it seemed like a perfect fit for our hand sign use case. Luckily, the developers of the model also shared example scripts on Huggingface, helping us hit the ground running.
Hand sign recognition relies on capturing both static hand postures and dynamic transitions between gestures—an area where VideoMAE excels due to its strong motion-centric attention mechanisms. By fine-tuning VideoMAE with a labeled hand sign dataset, we can significantly improve model accuracy while reducing the need for extensive labeled training data. In our view, this makes it a powerful and practical choice for a real-world issue in gesture-based communication and accessibility.
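The full training code is in the notebook on our Gitlab repository; the sketch below only shows the core idea of the fine-tuning setup, loosely following the Hugging Face video classification example. The base checkpoint name "MCG-NJU/videomae-base" and all hyperparameter values are illustrative assumptions, not our exact settings:

import os
from transformers import (
    TrainingArguments,
    Trainer,
    VideoMAEForVideoClassification,
    VideoMAEImageProcessor,
)

# Label mappings derived from the folder structure (one folder per gloss)
class_labels = sorted(os.listdir("data/train"))
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

# Start from a pretrained VideoMAE backbone and attach a fresh classification head
model_ckpt = "MCG-NJU/videomae-base"
image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
model = VideoMAEForVideoClassification.from_pretrained(
    model_ckpt,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # new head sized for our glosses
)

# Illustrative training setup (datasets omitted here)
training_args = TrainingArguments(
    output_dir="videomae-sign-finetuned",
    learning_rate=5e-5,
    num_train_epochs=4,
    per_device_train_batch_size=4,
    remove_unused_columns=False,
)
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=val_dataset)
# trainer.train()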
3. Training: eyes on the target
Training a model on hand signs means that we want to focus the weights in the neural network on the hands and face, and not on the colors in the video or other (static) objects that are visible. During our first epochs, we found that the model was learning extremely fast on the training dataset, but the loss on the test dataset was not dropping as we expected. We suspect that the model converged on the colors in the background or the type of person making the hand sign. To overcome this, we had to introduce quite some noise into the original videos to force the model to focus on the hand movements. We used the following torchvision transformations and augmentations:
- RandomHorizontalFlip: flip the video frame horizontally, like looking in a mirror;
- RandomRotation: randomly rotate a video frame by a specified angle;
- RandomAutocontrast: automatically adjust the contrast of the pixels of a video frame with a given probability;
- RandomInvert: invert the colors of the pixels in a video frame with a given probability;
- RandomGrayscale: randomly convert a frame to grayscale with a given probability;
- ElasticTransform: transform a video frame with an elastic transformation, a sort of stretch;
- AddDistortion (custom): add random noise to a video frame by shifting the original tensor by a random value.
The transformations are applied to a random sample of all video frames of a video that are fed to the model as training input. So, some frames are kept as original, while others are augmented with one or more of the options above. We tried different settings to see how the model would behave, specifically whether the evaluation loss would drop across all glosses.
train_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    # same arguments as test set
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    Resize(resize_to, antialias=True),
                    # additional noise to avoid overfitting
                    RandomHorizontalFlip(p=0.4),
                    RandomRotation(degrees=10),
                    ElasticTransform(alpha=30.0),
                    AddDistortion(0.1),
                    # Use generalized RandomTransformCustom for both RandomInvert and RandomAutocontrast
                    RandomTransformCustom(RandomAutocontrast(p=1.0), p=0.2),
                    RandomTransformCustom(RandomInvert(p=1.0), p=0.3),
                ]
            ),
        ),
    ]
)
Code snippet 2: Add Random transformations in train videos
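The snippet above uses RandomTransformCustom, a small helper that is not shown in this article: it wraps any transform and applies it to the whole clip with a given probability. A plausible minimal implementation (the actual helper in our training script may differ in details) could look like this:

import torch


class RandomTransformCustom(torch.nn.Module):
    """Apply the wrapped transform to the whole clip with probability p."""

    def __init__(self, transform, p=0.5):
        super().__init__()
        self.transform = transform
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Draw once per clip, so either all frames are transformed or none are
        if torch.rand(1).item() < self.p:
            return self.transform(x)
        return x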
All transformations come from torchvision and pytorchvideo, except for the RandomTransformCustom wrapper sketched above and AddDistortion. AddDistortion is a custom PyTorch module that adds a constant random “shift” (noise) to every pixel in a video tensor. In effect, it offsets the entire video by a single random value each time the forward pass is called. We observed that excessive distortion makes it difficult for the model to converge; a value of 0.1 produced the best results.
import numpy as np
import torch


class AddDistortion(torch.nn.Module):
    """
    Adds distortion to a video.
    """

    def __init__(self, distortion=0.5):
        super().__init__()
        self.distortion = distortion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == 4, "video must have shape (C, T, H, W)"
        # Create a tensor with the same shape as x, filled with a single random
        # value drawn from a normal distribution with std `distortion`
        random_values = torch.rand_like(x) * 0 + np.random.normal(0, self.distortion)
        # Add the random values to x
        x = x + random_values
        return x
Code snippet 3: Add Distortion in train videos
The result is a random set of frames which is transformed for all training videos.
4. Conclusion
When looking at this article, you might think “Wow, that was easy, how could they have possibly spent so much time on this….” We get that, but it has not been easy at all, you must take our word for this. We have ploughed through many advanced code scripts and theoretical articles on video classification to get a sense of what is important and what would suit our use case. When we found the VideoMAE model and example scripts, it really felt like a little breakthrough. But of course, this led us to the next challenge: what helps the model learn as efficiently as possible, and which limitations of the model do we have to take into account?
Frames
As far as limitations go, the biggest one was the fact that the model uses a fixed number of frames from the video (16), which is embedded in the architecture of the model and thus is not something we were even thinking about changing. But it is something to be aware of, especially when deciding on how to pick these frames: random, with fixed intervals, etc.
Distortion
Then, when you have decided on how to pick the frames, the next question is distortion… What options do we have to pick from, how does each option affect the frames and, most importantly, how do we make sure the core of the video – the gesture – stays intact while playing around with all the other aspects? This took a lot of experimenting and playing around with the different options. The result is amazingly simple, but do not let those few lines of code deceive you into thinking it was easy to create them.
Finetuning settings
Then we get to think about all the training parameters. We all know there is no way to deduce what the “best” settings are. But we also know that when working with this type of model and data, we do not have the compute resources to endlessly experiment either. So, we had to “thoughtfully experiment” with the settings, based on the results we saw when running just a small number of epochs. Are the settings we ended up with the perfect settings? Maybe. Probably not. But that is something we have to deal with in our day-to-day work as well. We try to approach this from theory as much as we can and mix that knowledge with our own experience and some experimenting until we feel comfortable with the results the model gives us.
Model performance
Ok, talking about results: how well did our model perform in the end when it comes to classifying our 29 glosses? The evaluation results (which can be found in the notebook on Gitlab) reflect a highly effective model, achieving 98.25% accuracy and a 98.32% F1 score. Next to the overall accuracy, we also zoomed in on the precision and recall for each gloss. Most glosses have perfect precision and recall, indicating reliable predictions, especially for signs like “hello,” “tea,” and “work.” However, “thank you” showed weaker precision and recall, suggesting the model struggles with this sign. Overall, the model performs well, but to truly evaluate its performance we would have to use videos from the real world or other WLASL datasets. Because that is where the challenge lies, of course.
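For reference, the per-gloss precision and recall can be computed with a few lines once you have the true and predicted gloss labels for the test videos (y_true and y_pred below are hypothetical variable names collected during evaluation):

from sklearn.metrics import classification_report

# y_true / y_pred: actual and predicted gloss labels for the test set videos
print(classification_report(y_true, y_pred, zero_division=0))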
The importance of Unit Testing for Data Science
Posted on: November 26, 2024
Getting value out of geodata with AI: visualize the model predictions
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a question about wolf presence coming from a biologist. Starting with a small question and some enthusiastic colleagues, it ended up in a whole bunch of Python code, an understandable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Getting value out of geodata with AI: explainability using SHAP
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a question about wolf presence coming from a biologist. Starting with a small question and some enthusiastic colleagues, it ended up in a whole bunch of Python code, an understandable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Getting value out of geodata with AI: train the model
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a question about wolf presence coming from a biologist. Starting with a small question and some enthusiastic colleagues, it ended up in a whole bunch of Python code, an understandable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Getting value out of geodata with AI: data preparation
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a question about wolf presence coming from a biologist. Starting with a small question and some enthusiastic colleagues, it ended up in a whole bunch of Python code, an understandable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Getting value out of geodata with AI: convert locations to their lat and lon
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a question about wolf presence coming from a biologist. Starting with a small question and some enthusiastic colleagues, it ended up in a whole bunch of Python code, an understandable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Getting value out of geodata with AI: getting started
Posted on: May 12, 2024
At Cmotions we love to get inspired by working on projects that are just a little different from our day-to-day work. That is why we have The Analytics Lab, where we organize our so-called Project Fridays. This specific Project Friday was initiated by a question about wolf presence coming from a biologist. Starting with a small question and some enthusiastic colleagues, it ended up in a whole bunch of Python code, an understandable xgboost model to predict wolf presence in an area, a scientific article (please let us know if you are interested in reading it) and a series of articles to share our approach (see links below). This article is part of that series, in which we show how we gather and prepare geographical data to use in a predictive model, visualize the predictions on a map and understand/unbox our model using SHAP.
What do you need to know to start working with geographical data?
Have you ever wondered how Google Maps calculates the distance between two places? Or how the government keeps track of where utilities like sewers, gas and electricity pipes are located? Both are examples of the use of Geographical Information Systems (better known as GIS) with geographical data. Geographical data is data related to a specific place or area on earth, for example your address or a set of coordinates. If you want to start working with geographical data, the most important concept is GIS. A GIS is an information system with which spatial data or information about geographical objects can be saved, managed, edited, analyzed, integrated and presented.[1]
GIS systems have been around for a while and have been further developed in recent years. But how did it actually start? In the 1960s, the Canadian geographer Roger Tomlinson came up with the idea of using a computer to aggregate natural resource information and create an overview per province. With this, the first Geographic Information System was born. In 1985, GIS was used for the first time in the Netherlands.[2]
Project information on a map
Today there are numerous possibilities to use GIS and thus to work with geographical data. Before you choose how you want to work with it, it is important to understand some concepts. Let’s start with map projections and the associated coordinate reference systems (CRS): how do we ensure that the round earth can be shown on a flat, two-dimensional map? To represent the Earth on a map with reasonable accuracy, cartographers have developed map projections. Map projections try to represent the round world in 2D with as few errors as possible. Each projection deals with this in a different way and has advantages and disadvantages. For example, one projection is good at preserving shape but doesn’t display the correct size of all countries, while another doesn’t keep the right shape but is more accurate in size. If you want to see the real size and shape of the world, you will always have to look at a 3D map or globe. The following video is recommended if you want more information about the consequences of different projections: Why all world maps are wrong.

[3] https://medium.com/nightingale/understanding-map-projections-8b23ecbd2a2f
Coordinate reference systems are a framework to define the translation from a point on the round earth to the same point on a two-dimensional map. There are two types of reference systems: projected CRS and geographic CRS. A geographic CRS defines where the data is located on the earth’s surface and a projected CRS tells the data how to draw on a flat surface, like on a paper map or a computer screen.[4] Geographic CRS is based on longitude and latitude. Longitude and latitude are numbers that explain where on the round Earth you are. Longitude defines the angle between the Prime Meridian (at Greenwich) and every point on Earth, where the angle is calculated in an easterly direction. Latitude defines the angle between the equator and every point. However, latitude is calculated in two directions and all points on the Southern Hemisphere are negative. Projected CRS defines the place on a two-dimensional map instead of the round world. Here, x- and y-coordinates are used and the distance between all neighboring x- and y-coordinates are the same. [5]

[4] https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
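To make this a bit more concrete, here is a small GeoPandas sketch that reprojects a point from a geographic CRS (WGS84, EPSG:4326) to a projected CRS (the Dutch RD New system, EPSG:28992); the coordinates are only approximate:

import geopandas as gpd
from shapely.geometry import Point

# A point in a geographic CRS: longitude/latitude in degrees (WGS84, EPSG:4326)
gdf = gpd.GeoDataFrame({"name": ["Amersfoort"]},
                       geometry=[Point(5.39, 52.16)],
                       crs="EPSG:4326")

# The same point in a projected CRS: x/y in meters (Dutch RD New, EPSG:28992)
gdf_rd = gdf.to_crs("EPSG:28992")
print(gdf_rd.geometry.iloc[0])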
Different types of layers
Two other important concepts are raster and vector layers. GIS files are constructed from different map layers, which can be built up in two different ways. Just as there is a difference between raster images and vector images, there are raster layers and vector layers; the type defines the way a layer is created. Raster layers consist of a collection of pixels. Vector layers, on the other hand, consist of a collection of objects. These objects can be points, lines, or polygons. Points consist of X and Y coordinates, usually longitude and latitude. Line objects are vectors that connect points. And polygons are areas on the map. Sometimes multiple areas are represented as one object; these are called multipolygons. Vector layers are the most commonly used when working with geographical data.

[6] https://medium.com/analytics-vidhya/raster-vs-vector-spatial-data-types-11325b83852d
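The three vector object types are easy to create yourself, for example with the shapely package that GeoPandas builds on (the coordinates below are arbitrary):

from shapely.geometry import Point, LineString, Polygon

# A point, a line connecting two points, and a polygon describing an area
point = Point(5.12, 52.09)
line = LineString([(5.12, 52.09), (4.90, 52.37)])
polygon = Polygon([(4.8, 52.3), (5.0, 52.3), (5.0, 52.5), (4.8, 52.5)])

print(point.geom_type, line.geom_type, polygon.geom_type)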
Saving your map
If you are working with data and you want to save your file, it is important to know which file formats exist for geo data. Where text files can be saved in .txt or .docx and Excel files can be saved in .xlsx or .csv, there are also specific file formats for geo data. The most common format for vector data is a Shapefile. A Shapefile does not consist of one file, but a collection of files with the same name and which are placed in the same directory, but all with different formats. To be able to open a Shapefile it is necessary to at least have a .shp file (Shapefile), a .shx file (Shapefile index file) and a .dbf file (Shapefile data file). Other files such as .prj (Shapefile projection file) can be included as well for extra information.[7] Another, relatively new, format is GeoPackage. This format stores vector features, tables, and rasterized tiles into a SQLite database.
Both Shapefiles and GeoPackage files can be downloaded from any GIS and can also be uploaded into any other GIS. If you keep working on the same GIS at the same directory, it is also possible to save your project as a GIS project. In that case, it is important that the data you have uploaded into your file stays at the same directory, since the project does not save the data, but the reference to the data.
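In GeoPandas, reading and writing these formats is a one-liner each; a small sketch with hypothetical file names:

import geopandas as gpd

# Read a Shapefile (.shp plus its companion .shx and .dbf files)
gdf = gpd.read_file("municipalities.shp")

# Write the same data to a single-file GeoPackage
gdf.to_file("municipalities.gpkg", driver="GPKG")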
Create a map yourself
Now you know the basic concepts of working with geographical data. The next step is to decide which software you want to use for geographical data. In general, you could divide it into two types of possibilities:
- Via specifically developed GIS software
- Via common programming languages
Two well-known specifically developed GIS systems are ArcGIS and QGIS. ArcGIS is a paid software where you can use the software by means of a license. QGIS, on the other hand, is an open-source software. That means that it’s free for the user. QGIS is an official project of the Open Source Geospatial Foundation, a non-profit organization that aims to make the use of geodata accessible to everyone. Both programs are similar in use and have similar functionality and capabilities.
In addition to GIS software, it is nowadays also possible to work with geodata in Python or R. Several geo packages are available that make this possible. A well-known package for Python is GeoPandas. The goal of GeoPandas is to make working with geographic data in Python easier. It combines the capabilities of pandas and shapely.[8] GeoPandas stores data in GeoDataFrames. These GeoDataFrames are similar to pandas DataFrames, but an important difference is that a GeoDataFrame always contains a geometry column, which stores the corresponding geographic data for each row.
The usage of QGIS and of GeoPandas with Python differs, but the possibilities of both options are mostly the same. Both can load different file formats and easily plot the geographical data for you. However, the visualization is better in QGIS, since the map in Python is static and does not give you the possibility to zoom. In QGIS you can easily zoom and add a standard map layer (such as a layer from Google Maps) to put your data into a broader perspective. Furthermore, a lot of analyses are accessible in both systems, for instance determining the distance between two places or creating a buffer around your polygons. More information regarding these analyses with geographical data is discussed in this blog.
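As a small illustration of such analyses in GeoPandas: with the data in a projected CRS (here EPSG:28992, the Dutch projected CRS, so units are meters), distances and buffers are single method calls. The coordinates below are arbitrary example values:

import geopandas as gpd
from shapely.geometry import Point

# Two points in a projected CRS (EPSG:28992), so results are in meters
places = gpd.GeoSeries([Point(155000, 463000), Point(121000, 487000)],
                       crs="EPSG:28992")

print(places.iloc[0].distance(places.iloc[1]))  # straight-line distance in meters
buffers = places.buffer(5000)                   # 5 km buffer around each point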
In short, you can work with geographical data in a GIS. For this, you can use a specific software like QGIS or you can use the GeoPandas package in Python. You have different map projections and associated coordinates reference systems to plot the three-dimensional world into a two-dimensional map. Furthermore, GIS files are constructed from different map layers which can be vector layers or raster layers. And these GIS files can be saved as a Shapefile or as a GeoPackage.
So, now you know everything you need to know to start working with geographical data.
This article is part of our series about working with geographical data. The entire series is listed here:
- Getting value out of geodata with AI: getting started
- Getting value out of geodata with AI: convert locations to their lat and lon
- Getting value out of geodata with AI: data preparation
- Getting value out of geodata with AI: train the model
- Getting value out of geodata with AI: explainability using SHAP
- Getting value out of geodata with AI: visualize the model predictions
If you want to use the notebooks that are the base of the articles in this series, check out our Cmotions Gitlab page.
Sources
[1] https://nl.wikipedia.org/wiki/Geografisch_informatiesysteem
[2] https://www.esri.nl/nl-nl/over-ons/wat-is-gis/geschiedenis
[3] https://medium.com/nightingale/understanding-map-projections-8b23ecbd2a2f
[4] https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
[5] https://desktop.arcgis.com/en/arcmap/10.3/guide-books/map-projections/about-projected-coordinate-systems.htm#:~:text=A%20projected%20coordinate%20system%20is%20always%20based%20on%20a%20geographic,the%20center%20of%20the%20grid.
[6] https://medium.com/analytics-vidhya/raster-vs-vector-spatial-data-types-11325b83852d
[7] https://www.e-education.psu.edu/geog585/node/691
[8] https://geopandas.org/en/stable/