Scraping data
Within Cmotions it is common that groups of colleagues work together on projects about topics of their interest in order to strengthen their expertise on those topics and to collaborate and learn from each other. These groups are called CIAs (Cmotions Interest Areas) and one of these groups has recently worked on a project about a meal delivery service.
Sometimes we come across data that inspires questions. As frequent customers of meal delivery services, we wondered which questions we could answer if we had access to all the public data of one of these service providers. How many new restaurants were added to the website during COVID? What would be the perfect place for a new sushi restaurant? And, to do ourselves a favor, where can we buy the best-rated burger? To find answers to these questions, we needed to go through every step of working with data, from storing it to building models and visualizing the results. However, it all starts with gathering data. Curious how we did this? Continue reading!
Nothing beats a clean set of data. Therefore, we initially contacted the company to access data through their API. Unfortunately, they had no resources available at the time, so we decided to scrape the data. This article will explain what we did in detail.
Note: scraping can be illegal. Whether it is depends, among other things, on the source, on why you gather the data and on how you use it. Make sure you’re allowed to scrape the data for your purpose before you start!
The source
For those who are not familiar with meal delivery providers: on the website you enter your postal code, after which it shows a list of restaurants that deliver in your area. From this list you can navigate to subpages that show each restaurant’s menu and its reviews. To accommodate our many ideas for analyses, we scraped all three kinds of information for all postal codes in the Netherlands:
- restaurant names (including location)
- menus
- reviews
Obviously, in order to GDPR-proof our endeavor, we deleted all reviewer names.
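To illustrate that clean-up step, the sketch below drops the reviewer's name from each scraped review before it is stored. The field names (`reviewer_name`, `rating`, `text`) are assumptions made for this example; the actual review records may be shaped differently.

```python
def strip_reviewer_names(reviews, name_field="reviewer_name"):
    """Return copies of the review dicts without the personal name field."""
    return [
        {key: value for key, value in review.items() if key != name_field}
        for review in reviews
    ]


# Hypothetical sample records, only for illustration.
reviews = [
    {"reviewer_name": "Jan", "rating": 9, "text": "Great burger"},
    {"reviewer_name": "Anna", "rating": 7, "text": "Arrived cold"},
]
cleaned = strip_reviewer_names(reviews)
print(cleaned)
```

The function builds new dictionaries instead of changing the input, so the raw records stay untouched until you decide to discard them.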
What is webscraping?
Webscraping is a technique where software is used to extract information from webpages. We’ve built a bot that simulates human behaviour on a website: entering information, clicking, scrolling, etc.
How did we do it?
Tooling – Selenium and Python
The website was scraped using Selenium, a tool for automating interactions with a webpage. Examples include moving the mouse, clicking, navigating through pages and downloading files. This makes it a widely used tool for software testing, automating repetitive tasks and scraping websites. Selenium is available for several programming languages; we used the Python library.
The steps
After studying the structure of the site, the following steps were defined:
- Import a list of all postal codes in the Netherlands
- Scrape all restaurant information for each postal code
- Create a unique list of restaurant URLs
- Fetch the menu for each restaurant
- Fetch all customer reviews for each restaurant
- Store all information in a data warehouse
Restaurants specify their delivery range as a list of 4-digit postal codes (a complete Dutch postal code consists of 4 digits and 2 letters). To gather all restaurants, we need to enter every 4-digit postal code on the website. We could have tried all 10,000 possible 4-digit combinations, but we only wanted to use the approximately 4,000 postal codes that actually exist. We used open data from Statistics Netherlands (CBS) to get a list of the postal codes of all households (January 2021). The data can easily be imported with CBS's own Python package: cbsodata.
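A minimal sketch of this import step could look as follows. `cbsodata.get_data` returns a table as a list of dictionaries; the table id and the column name `Postcode` are placeholders here, because the exact table is not named in this article. The helper names are our own.

```python
def existing_pc4(records, column="Postcode"):
    """Collect the unique 4-digit postal codes found in a list of record dicts."""
    codes = set()
    for record in records:
        value = str(record.get(column, "")).strip()
        pc4 = value[:4]  # '3511AB' -> '3511'
        if len(pc4) == 4 and pc4.isdigit():
            codes.add(pc4)
    return sorted(codes)


def load_cbs_records(table_id):
    """Download a CBS open-data table as a list of dicts (needs `pip install cbsodata`)."""
    import cbsodata  # imported lazily so existing_pc4 works without the package

    return cbsodata.get_data(table_id)


# Offline example with made-up rows. With network access you could instead call
# existing_pc4(load_cbs_records("<table id>"), column="<postal code column>").
sample = [
    {"Postcode": "3511"},
    {"Postcode": "3511AB"},
    {"Postcode": "1012"},
    {"Postcode": "Totaal"},
]
print(existing_pc4(sample))
```

Rows that do not start with four digits, such as a totals row, are skipped, and duplicates collapse into one entry.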
The results of steps 3, 4 and 5 are incrementally stored, so the scraper can be turned off and on (or break) without losing results.
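One way to get this stop-and-resume behaviour is to append every result to a file as soon as it is scraped and to skip anything already in that file on the next run. The sketch below shows the idea with a JSON-lines file; the file layout, the function names and the stand-in scraper are assumptions for illustration, not a description of our actual code.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of keys already stored in a JSON-lines checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {json.loads(line)["key"] for line in handle if line.strip()}


def scrape_incrementally(keys, scrape_one, path):
    """Scrape every key not yet in the checkpoint, storing each result right away."""
    done = load_done(path)
    for key in keys:
        if key in done:
            continue  # finished in an earlier run
        result = scrape_one(key)
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps({"key": key, "data": result}) + "\n")
        done.add(key)


# Demo with a stand-in scraper: the second run only handles the new key.
checkpoint = os.path.join(tempfile.mkdtemp(), "menus.jsonl")
calls = []


def fake_scrape(key):
    calls.append(key)
    return {"items": []}


scrape_incrementally(["a", "b"], fake_scrape, checkpoint)
scrape_incrementally(["a", "b", "c"], fake_scrape, checkpoint)
print(calls)
```

Because each result is written before the next request starts, an interruption costs at most the item that was in progress.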
Scraping: the technique
The Selenium package automates browser behavior from code. One of its options is to download the HTML code of a webpage. This is the same code you see when you right-click on anything in your browser and choose ‘inspect element’. In most cases, however, we were only interested in specific elements of the page (e.g., restaurant name, menu items, prices). This means that we need to look for the small pieces of HTML that carry that useful information.
Browser -> Inspector
To give our browser tasks to retrieve information, we need to know which HTML corresponds to the elements of interest. We use the element inspector in our browser to manually find the HTML code we need. Open the website, search for an element of interest (e.g., the restaurant name) and right-click > inspect element. In the example below, we inspected one of the restaurants:
The indentation shows that the HTML has a hierarchical structure. At the top level, you can see an element with the classes ‘restaurant’ and ‘js-restaurant’ (1), both useful for selecting the element. A child element has the class ‘restaurantname’ (2), which holds the title of the restaurant. With this information (the class names) we can identify the elements in Python and collect the names of all restaurants that are shown on the website. In Python, this looks as follows:
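The sketch below is a reconstruction of that idea rather than our original code. It selects every element that carries both the ‘restaurant’ and ‘js-restaurant’ classes and reads the text of its ‘restaurantname’ child. The function names are our own, the example assumes Chrome is installed, and it does not wait for content that the page loads later.

```python
# Locator strategy; the same string as selenium.webdriver.common.by.By.CSS_SELECTOR,
# spelled out so restaurant_names can be used without importing Selenium.
CSS = "css selector"


def restaurant_names(driver):
    """Return the text of every 'restaurantname' element inside a restaurant card."""
    names = []
    # Elements that carry both classes: 'restaurant' and 'js-restaurant'.
    for card in driver.find_elements(CSS, ".restaurant.js-restaurant"):
        # The child element whose class marks the restaurant title.
        title = card.find_element(CSS, ".restaurantname")
        names.append(title.text.strip())
    return names


def scrape_names(url):
    """Open the page in a real browser and collect the names (needs Selenium)."""
    from selenium import webdriver  # lazy import: only needed for a live run

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return restaurant_names(driver)
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling `scrape_names` with the URL of a results page would return a list of restaurant names, provided the page still uses these class names.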
It takes time
To fetch all restaurants for one 4-digit postal code (start the browser, fill out the postal code, fetch the information, store the information), the Python script takes about 15 seconds. For the 4,000 postal codes in the Netherlands, this adds up to about 17 hours.
- Restaurants: 15 sec * 4,000 postal codes, a total of about 17 hours
- Menus: 20 sec * 13,000 restaurants, a total of about 72 hours
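These totals are plain multiplication, assuming the pages are scraped one after another. A small helper (the name is our own) makes it easy to redo the estimate for other timings:

```python
def total_hours(seconds_per_item, item_count):
    """Total runtime in hours for a sequential scrape."""
    return seconds_per_item * item_count / 3600


print(round(total_hours(15, 4_000), 1))   # restaurants: prints 16.7
print(round(total_hours(20, 13_000), 1))  # menus: prints 72.2
```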
Although this takes quite some time, it is not an issue, since the process can run in the background. Still, improvements to the code or helpful shortcuts on the website can reduce the runtime significantly.
Website changed? Scraper breaks!
While scraping the reviews, we noticed many errors. After investigation, it turned out that elements of the website had been reorganized, so selecting elements by the class names we had identified earlier no longer worked properly. Possible explanations are that the organization was updating the website in phases or running A/B tests, both of which mean that different versions of the website can be shown to the scraper.
Browser Inspector to the rescue
Fortunately, there is a solution to this problem. Apart from looking at the (HTML) code in the inspector, you can dive a level deeper into the code. To do this, you can navigate to the Network tab in the inspector and see which ‘requests’ are made.
As it turns out, you can find the API that the website itself uses to retrieve review information! This means we do not need to scrape the reviews from the website anymore, but we can directly approach the API to gather reviews.
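Calling such an API from Python can be sketched as follows. Everything specific here is an assumption: the URL template is a placeholder, and we assume the endpoint returns a JSON list of reviews per page and an empty list once there are no more pages. The real endpoint you find in the Network tab may page its results differently.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all_reviews(restaurant_id, url_template, fetch=fetch_json):
    """Request review pages one by one until the API returns an empty page."""
    reviews = []
    page = 1
    while True:
        batch = fetch(url_template.format(restaurant_id=restaurant_id, page=page))
        if not batch:
            break
        reviews.extend(batch)
        page += 1
    return reviews


# Offline demo: a stand-in for the network call serves two pages of reviews.
pages = {1: [{"rating": 8}, {"rating": 9}], 2: [{"rating": 6}]}


def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    return pages.get(page, [])


# Hypothetical URL template, not the real endpoint.
template = "https://example.com/api/restaurants/{restaurant_id}/reviews?page={page}"
print(len(fetch_all_reviews("abc123", template, fetch=fake_fetch)))
```

Passing the download function in as a parameter keeps the paging logic testable without touching the network.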
Results
The scraping process creates four tables:
- All restaurants per postal code
- All reviews per restaurant
- All menus per restaurant
- The addresses and geo-location per restaurant (longitude and latitude)
Together, these tables contain over 13,000 restaurants with about 7 million reviews, spread out over 4,000 postal codes. Curious about the best place to order a pizza in Utrecht? We know!
All data is stored in JSON format. How this raw data is subsequently processed to make it available for analyses in a database can be read in this article about storing the data.