This is the third blog in our series about working with geodata. In our first blog, we discussed the core concepts involved in working with geodata, such as GIS, projections (CRS), and longitude and latitude.

This blog builds on the concepts laid out in our first blog. We will go into what data preparation is and how you can approach it, and then take a deeper look at what is involved in geo analysis. Before we start, we've added a quick recap of some core concepts below, which will help in better understanding the things we'll be discussing here.

GIS: A Geographical Information System (GIS) is an information system with which spatial data, or information about geographical objects, can be saved, managed, edited, analyzed, integrated and presented [1].

Projection / CRS: a Coordinate Reference System (sometimes 'Spatial Reference System') defines how all places on earth can be linked to a two-dimensional projected map [2]. It is a framework used to precisely measure locations on the surface of the Earth as coordinates [3]. Combined with a projection, which allows the round 3D world to be shown on a flat 2D map, a CRS helps in plotting locations on earth onto a map.

Raster / vector: a raster is data represented in a grid, with rows and columns forming cells; raster layers consist of a collection of pixels. Vector layers, on the other hand, consist of a collection of objects. Vector data is based on coordinates and can represent these objects in three different forms: points (for example city locations on a map), lines (for example roads and railways) or polygons (for example city areas).
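
To make these vector forms a bit more concrete, below is a minimal sketch (with made-up coordinates) of how points, lines and polygons look as geometries in Python, using the shapely library that GeoPandas builds on.

In [ ]:
# Shapely is the geometry library that GeoPandas uses under the hood
from shapely.geometry import Point, LineString, Polygon

# A point: a single coordinate pair, for example a city location (made-up coordinates)
point = Point(4.9, 52.4)

# A line: an ordered series of coordinates, for example a road or railway
line = LineString([(4.9, 52.4), (5.1, 52.2), (5.5, 52.0)])

# A polygon: a closed ring of coordinates, for example a city area
polygon = Polygon([(4.8, 52.3), (5.0, 52.3), (5.0, 52.5), (4.8, 52.5)])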

How to: fast track into general data prep

Data preparation is the process of readying your data for further use. Part of this process is, for example, making sure your data is consistent and valid. Data preparation usually involves several steps, the first of which is collecting your data in the first place. Once you have gathered your data, the next steps involve cleaning and validation. There could be more steps involved depending on your needs and data (for example: structuring, deduplication and enrichment), but today we will not go too deep into all of these. Moreover, the steps involved in prepping your data are more or less continuous, or circular, so you will often revisit previous steps while moving forward.

In prepping our data, we will be using Pandas. "Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language" [4].

You will often see that data sets come with null values (meaning that they are empty / contain no value), have inconsistent data types within a column (e.g. text and numbers) or have incorrect formatting (numbers stored as text). Moreover, when combining multiple sets, which happens in most cases, you'll often see different formats between your sets as well. These are just some examples of things you would want to fix before working further with your data. Luckily, Pandas comes with built-in functions to help with these steps.

Below is an example of a minimal approach.

In [ ]:
# import Pandas library
import pandas as pd

# Read example source file..
df = pd.read_csv('/path/to/file.csv')

# .. and print a preview of your data
df.head()

This will give you a first glance at your data set. From here, you can take a closer look at the data and start preparing it for further use. Below are some examples for cleaning your data in its simplest form.

In [ ]:
# Drop rows with missing values. With 'axis = 1' you drop columns containing missing values instead
df = df.dropna()

# Fill empty values
df = df.fillna("your_value")

# Drop duplicate rows
df = df.drop_duplicates()

# Filter your set on specific values
df_1 = df.loc[(df["column_1"] == "A")]
df_2 = df.loc[(df["column_2"] == 5)]

# Merge two sets together (by default, Pandas merges on the columns the sets have in common)
merged = pd.merge(df_1, df_2)

# Note that instead of reassigning your df every time, you could also use 'inplace = True'
df.fillna(0, inplace = True)
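
The examples above focus on missing values and duplicates. The data type and formatting issues mentioned earlier (text and numbers mixed in one column, or numbers stored as text) can be tackled with Pandas as well. A minimal sketch, with hypothetical column names:

In [ ]:
# Convert numbers stored as text to a numeric type; values that cannot be parsed become NaN
df["column_1"] = pd.to_numeric(df["column_1"], errors="coerce")

# Force a column with mixed types to a single (string) type
df["column_2"] = df["column_2"].astype(str)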

Another step in prepping your data is finding the data that is relevant for your purpose. You'll often see that not all data in a set is needed, and removing, combining or selecting the right data will greatly improve the further usability of your set. As mentioned, note that you'll often move back and forth between cleaning and validating your data to some extent.

In [ ]:
# Say, for example, your set contains a lot of columns containing no data, or at least not the data you will be using.
# You could just remove these columns as shown below:
df = df.drop("column_name", axis = 1)

# But it's good practice to always work on a copy of your dataframe instead of editing your original set.
# Here, we simply select the columns we want to use:
df_clean = df[["column_1", "column_2", "column_3"]]

# To first check whether a column contains empty values, you can use 'notna()', which returns a mask with True or False
df.notna()
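
As a quick follow-up on the mask above: combining 'isna()' (the inverse of 'notna()') with a sum gives a handy overview of the number of missing values per column.

In [ ]:
# Count the number of missing values per column
df.isna().sum()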

Data prep for geodata

When working with geodata specifically, in addition to the above, using a geospatial package is super helpful (you could even say: necessary). There are several possibilities, but since we are working with Python and Pandas, we have chosen to use GeoPandas. GeoPandas allows you to work with points and lines, but also works with coordinates and addresses. In other words, GeoPandas simplifies working with geodata in Python and, comparable to Pandas, uses dataframes to store data. The big difference between regular dataframes and geodataframes is that the latter always have a geometry column. This column contains the geographical data for each row in the data set. Also, GeoPandas uses geoseries rather than the series used in Pandas (as a refresher: series are one-dimensional arrays and are the main data structure in Pandas).

Below is a setup example.

In [ ]:
# Import library
import geopandas as gpd
from shapely import wkt

# Load data
df = pd.read_csv('/path/to/file.csv')

# Our file already contains a geometry column. Prep to make sure the geometry column will be recognized by geopandas
df['geometry'] = df['geometry'].apply(wkt.loads)

# And convert to a geodataframe, with a specific crs
gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')

# Alternatively, you could convert a Pandas Dataframe to a GeoPandas Dataframe by first pointing to a lat and long column to create a geometry
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.decimalLongitude, df.decimalLatitude), crs='EPSG:4326')

Buffering

In working with geodata, there are some extra data preparation steps possible compared to working with 'standard' data. For example, you will often want to calculate the distances between certain points, find out which points are closest to each other, or check which points are within x distance of a certain point. Two techniques that help with this are buffering and clipping.
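
As a small illustration of a plain distance calculation before we get to those two, GeoPandas can compute the distance from every geometry in a geodataframe to a given point. A minimal sketch, assuming 'gdf' is in a projected, meter-based CRS and using a made-up reference point:

In [ ]:
from shapely.geometry import Point

# A made-up reference point, in the same (projected, meter-based) CRS as the geodataframe
reference_point = Point(155000, 463000)

# Distance (in meters) from every geometry in the geodataframe to the reference point
distances = gdf.geometry.distance(reference_point)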

A buffer is "a GeoSeries of geometries representing all points within a given distance of each geometric object." [5]. Simply put, you can think of this as a circle with 'x' radius around a point. A buffer creates a polygon, which represents an area rather than just a single point.

Below we are creating a 5 kilometer buffer around our data points to create polygons. Pay close attention to your CRS projection when doing this! Remember from our previous blog that this is important because there are two types of reference systems: projected CRS and geographic CRS. A geographic CRS is the most common one to use and is based on longitude and latitude. With a geographic CRS, latitude is measured in two directions from the equator, and all points on the Southern Hemisphere are negative. With a projected CRS, places are defined on a two-dimensional map instead of the round world; x- and y-coordinates are used and the distances between x- and y-coordinates are consistent [6].

In [ ]:
# Buffering works in the units of your CRS, so for a buffer in meters, first reproject to a projected, meter-based CRS (for example the Dutch RD New, EPSG:28992)
gdf = gdf.to_crs(epsg=28992)

# Create a buffer by inputting a radius. It can be as simple as this:
gdf_w_buffer = gdf.geometry.buffer(5000)

Clipping

Another interesting geodata preparation task is finding the overlap between, for example, areas. When there is overlap, you can use a technique called 'clipping' to keep only the overlapping area. This can be as simple as using gpd.clip, which takes your GeoDataFrame and a mask as input and returns a GeoDataFrame with the data from your original GeoDataFrame clipped to the boundaries of the mask. You can think of the mask as the part you are interested in for your result, essentially 'filtering' your GeoDataFrame to a specific (sub)set. Clipping is based on locations: it essentially lays two world layers on top of each other, clips one to the other, and so creates another layer of data in your set.
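
A minimal sketch of what that can look like, assuming 'gdf' holds our (buffered) data and 'mask_gdf' is a hypothetical geodataframe with the area we are interested in, both in the same CRS:

In [ ]:
# Clip the geodataframe to the boundaries of the mask; only the parts inside the mask remain
gdf_clipped = gpd.clip(gdf, mask_gdf)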

Of course, the overlap between areas can depend heavily on your chosen buffer. Therefore, you might want to explore other options to find the distance between points, for example by finding the distance from the center of an area using centroids (the middle point of a shape or geometry), or by using nearest neighbours to find which points are closest to a certain point.
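
As a sketch of what those two options can look like in GeoPandas (the 'areas_gdf' and 'points_gdf' names are hypothetical, and sjoin_nearest requires a fairly recent GeoPandas version):

In [ ]:
# Centroids: the middle point of each area geometry (again, a projected CRS gives meaningful distances)
centroids = areas_gdf.geometry.centroid

# Nearest neighbours: for each point, find the nearest area and the distance to it
nearest = gpd.sjoin_nearest(points_gdf, areas_gdf, distance_col="distance")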

So there you have it. The above is just a quick guide to setting up your projects and preparing your (geo)data for further use. Using packages like Pandas and GeoPandas can greatly simplify your work, allowing you to start working with your data in a concise way and making sure your data is ready to go. This blog is meant to help you get started and is not exhaustive. There are, of course, many more things to think about when working with geodata, which we'll cover in other blogs in this series.

Sources

[1] https://en.wikipedia.org/wiki/Geographic_information_system / https://nl.wikipedia.org/wiki/Geografisch_informatiesysteem

[2] https://www.esri.nl/nl-nl/over-ons/wat-is-gis/geschiedenis

[3] https://en.wikipedia.org/wiki/Spatial_reference_system

[4] https://pandas.pydata.org/

[5] https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.buffer.html

[6] https://desktop.arcgis.com/en/arcmap/10.3/guide-books/map-projections/about-projected-coordinate-systems.htm#:~:text=A%20projected%20coordinate%20system%20is%20always%20based%20on%20a%20geographic,the%20center%20of%20the%20grid