I’ve just finished Module 2 of the MOOC Data Journalism course (that I mentioned in an earlier post).
The description for this module is:
“This module deals with the range of skills that journalists use to obtain data. This includes setting up alerts to regular sources of information, simple search engine techniques that can save hours of time and using laws in your own and other countries.”
And (like all the other modules) it is made up of four parts:
- Setting up ‘data newswires’
- Strategic searching – tips and tricks
- Introduction to scraping
- Data laws and sources
In Part 3, I learnt to do some basic data scraping. This, essentially, is a way of grabbing content from lists and tables on websites.
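To show what "grabbing content from tables" actually involves, here is a minimal sketch in plain Python using only the standard library. This isn't one of the course's tools, just an illustration; the HTML snippet is a made-up stand-in for a real page, and a live scrape would fetch the URL first.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of <td>/<th> cells, one list per <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# A tiny stand-in for a real page's HTML:
html = """
<table>
  <tr><th>Name</th><th>Class</th></tr>
  <tr><td>Allen, Miss Elisabeth</td><td>1st</td></tr>
</table>
"""
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)
# [['Name', 'Class'], ['Allen, Miss Elisabeth', '1st']]
```

Tools like the one below do essentially this (plus fetching the page) for you.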
We covered a few tools that make this possible. The one that surprised me was Google Sheets: you can scrape directly from a spreadsheet created in Google Drive.
The formula is IMPORTHTML(url, query, index), where query is either "table" or "list" and index says which table or list on the page to grab (counting from 1).
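For example, to pull a table from a Wikipedia page into a sheet, the formula looks something like this (the index of 1 is an assumption; the right value depends on where the target table sits on the page):

```
=IMPORTHTML("http://en.wikipedia.org/wiki/List_of_Titanic_passengers", "table", 1)
```

Paste it into cell A1 and the table spills into the cells below and to the right.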
As practice, I used it to scrape the list of Titanic passengers from Wikipedia.
Here’s the Wikipedia link: http://en.wikipedia.org/wiki/List_of_Titanic_passengers#Survivors_and_victims
And here is the Google spreadsheet that I imported the data to: https://docs.google.com/spreadsheets/d/1g_ngM049ZgAPh25UXMmwWwMDlvGY_nD6aGbFLhBJwXo/edit?usp=sharing
It was my first scrape, and nothing fancy. The data also needs a bit of cleaning (in one case, the scrape pulled in extra info from the HTML).
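As an example of the kind of cleaning involved, here is a small Python sketch that strips Wikipedia-style footnote markers from scraped cells. The sample values are hypothetical, not my actual data; the pattern just removes anything in square brackets and tidies whitespace.

```python
import re

def clean_cell(value):
    """Strip footnote markers such as '[4]' or '[note 1]' and
    collapse any leftover whitespace."""
    value = re.sub(r"\[[^\]]*\]", "", value)
    return " ".join(value.split())

# Hypothetical scraped cells with typical Wikipedia debris:
cells = ["Astor, Col. John Jacob[5]", "  Southampton ", "1st Class"]
print([clean_cell(c) for c in cells])
# ['Astor, Col. John Jacob', 'Southampton', '1st Class']
```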
This functionality isn't limited to Google Sheets, either. I have read that Excel can do something similar. If you know of other tools, please let me know.