In an earlier post, I talked about the MOOC Data Journalism course that I did. Part of that course covers how to do a data scrape of information from other sites for reporting purposes.
In this post, I want to share what I learnt.
What is a Data Scrape
Scraping data is, essentially, a way of grabbing content from lists and tables on other websites.
And with this information, you can the really study it, and twist it and turn it to see what other insights you can draw out of it.
Example
As an example, consider the passengers on the Titanic. You might want to do some analysis on who there was on the ship, who survived, ages, etc.
By scraping the data from a reliable source, you can then put it into a spreadsheet and sorting, and grouping, etc, in a way that will give you the information that you want.
How to do a Data Scrape
There are several tools that you can use to do a data scrape.
The tool that I am going to describe is Google Sheets.
As described above, I’m going to scrape the list of Titanic passengers from Wikipedia.
The Titanic
Wikipedia has a list of the passengers that were on the Titanic.
The address of the Wikipedia page is:
http://en.wikipedia.org/wiki/List_of_Titanic_passengers#Survivors_and_victims
If you visit that link, you see a large list of everyone who was on board the Titanic on her maiden voyage. (It can be quite disheartening to read.)
Scraping the Data
I am going to show you how to data scrape of the passenger information so that you can put it into a spreadsheet.
- In your browser, go to Google Drive. (You will need to have a Google account for this.)
- Click on New and then select Google Sheets
The Google Sheet will be displayed. - In the first cell, enter the following:
=importhtml
Google will autosuggest as you are typing. - Continue typing the following
(“https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic, “Table”, 1) - Press enter.
Here’s the full command. You can also copy this and paste it into the spreadsheet:
=importhtml("https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic, "Table", 1) |
Initially, you’ll see “loading”, and then the list of passengers in First Class can be seen.
Quick Explanation of IMPORTHTML
As seen above, the command to use is IMPORTHTML
Then, between brackets, you need the following:
url | the URL of the page that has the information that you want to scrape |
query | “Table” or “List” depending on whether the information you want is in a table, or a list. |
index | this is the number of the table or list that is on the web page. |
In our case, we used:
url | https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic |
query | “Table” |
index | 1 |
Here’s an actual example of a Google spreadsheet with the list of passengers.
And the other passengers?
As you might have noticed, the list has only the First Class passengers.
This is because the Second Class passengers and Third Class passenger are in separate tables.
So to get that data we’ll do the following:
Adding the Second Class passengers
First – let’s add an extra column so that we know which passengers are First class
- Go to the first empty column after the data. (In my case, it was Column H)
- Enter “Class” on the first row.
- Enter “1” on the next row.
- Copy that value into each cell down to the end of the table.
Now let’s add the Second Class passengers
- Go to the first empty row at the bottom of the table.
- Again, enter
=importHTML
- And follow with
(“https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic, “Table”, 2)
(note that the index is now “2”). - Press Enter
Here’s the Example table with the Second Class passengers
In the Class column (that we created above), add the number “2”
Treating the Third Class passengers Differently
You read that right. We are going to have to handle the Third Class passengers differently.
Why?
Because, if you look closely on the Wikipedia page, the table for the Third Class passengers has an extra column.
In the tables for the First and Second Class passengers, the column “Hometown” included both the town, and the country. In the table for the Third Class passengers, the “Hometown” column has the “Town”, and there is a separate column for “Home Country”.
The extra column makes it difficult to combine it with the other data.
However, there is a workaround for this. I will be covering that in a later post.
Want to learn more?
(Important Disclosure)