How to do a Data Scrape on the Titanic

In an earlier post, I talked about the MOOC Data Journalism course that I did. Part of that course covers how to do a data scrape of information from other sites for reporting purposes.

In this post, I want to share what I learnt.

What is a Data Scrape?

Scraping data is, essentially, a way of grabbing content from lists and tables on other websites.

And with this information, you can then really study it, and twist it and turn it to see what other insights you can draw out of it.

Example

As an example, consider the passengers on the Titanic. You might want to do some analysis on who was on the ship, who survived, their ages, etc.

By scraping the data from a reliable source, you can then put it into a spreadsheet and sort it, group it, etc., in a way that will give you the information that you want.

How to do a Data Scrape

There are several tools that you can use to do a data scrape.

The tool that I am going to describe is Google Sheets.

As described above, I’m going to scrape the list of Titanic passengers from Wikipedia.

The Titanic

Wikipedia has a list of the passengers that were on the Titanic.

The address of the Wikipedia page is:
http://en.wikipedia.org/wiki/List_of_Titanic_passengers#Survivors_and_victims

If you visit that link, you see a large list of everyone who was on board the Titanic on her maiden voyage. (It can be quite disheartening to read.)

Scraping the Data

I am going to show you how to scrape the passenger information so that you can put it into a spreadsheet.

  1. In your browser, go to Google Drive. (You will need to have a Google account for this.)
  2. Click on New and then select Google Sheets

    The Google Sheet will be displayed.
  3. In the first cell, enter the following:
    =importhtml


    Google will autosuggest as you are typing.

  4. Continue typing the following (use straight quotes, not curly ones):
    ("https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic", "table", 1)
  5. Press enter.

Here’s the full command. You can also copy this and paste it into the spreadsheet:

=importhtml("https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic", "table", 1)

Initially, you’ll see “Loading”, and then the list of First Class passengers appears.

Quick Explanation of IMPORTHTML

As seen above, the command to use is IMPORTHTML.

Then, between brackets, you need the following:

  • url – the URL of the page that has the information that you want to scrape.
  • query – “table” or “list”, depending on whether the information you want is in a table or a list.
  • index – the number of the table or list on the web page (counting from 1).

 

In our case, we used:

  • url – https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic
  • query – “table”
  • index – 1
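
To make the three parameters more concrete, here is a minimal Python sketch of the same idea: find every table on a page and return the one at a given position. It is only an illustration, not how Google Sheets actually implements IMPORTHTML. The names TableExtractor, import_html_table and SAMPLE_PAGE are made up for this example, the sample page is invented rather than real Wikipedia markup, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in page order
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table, counting from 1 as IMPORTHTML does."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page with two tables (placeholder names, not real data).
SAMPLE_PAGE = """
<table><tr><th>Name</th><th>Age</th></tr><tr><td>Passenger A</td><td>30</td></tr></table>
<table><tr><th>Name</th><th>Age</th></tr><tr><td>Passenger B</td><td>25</td></tr></table>
"""

first_table = import_html_table(SAMPLE_PAGE, 1)
second_table = import_html_table(SAMPLE_PAGE, 2)
print(first_table)
print(second_table)
```

Changing the index from 1 to 2 is all it takes to pick up the next table on the page, which is the same trick used in the spreadsheet.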

Here’s an actual example of a Google spreadsheet with the list of passengers.

And the other passengers?

As you might have noticed, the list has only the First Class passengers.

This is because the Second Class passengers and Third Class passengers are in separate tables.

So to get that data we’ll do the following:

Adding the Second Class passengers

First – let’s add an extra column so that we know which passengers are First Class.

  1. Go to the first empty column after the data. (In my case, it was Column H)
  2. Enter “Class” on the first row.
  3. Enter “1” on the next row.
  4. Copy that value into each cell down to the end of the table.

Now let’s add the Second Class passengers

  1. Go to the first empty row at the bottom of the table.
  2. Again, enter =importhtml
  3. And follow with
    ("https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic", "table", 2)
    (note that the index is now 2).
  4. Press Enter

Here’s the Example table with the Second Class passengers

In the Class column (that we created above), add the number “2” to each of the Second Class rows.
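
The same two steps (tag each table with its class, then put one table under the other) can be sketched in Python. This is only an illustration of the idea, not part of the spreadsheet method. The function names are made up for this example, and the rows are placeholders rather than real passenger data.

```python
def add_class_column(table, passenger_class):
    """Return a copy of the table with a 'Class' column added to every row."""
    header, *rows = table
    return [header + ["Class"]] + [row + [passenger_class] for row in rows]


def stack_tables(first, second):
    """Put the data rows of second under first, keeping one header row."""
    if first[0] != second[0]:
        raise ValueError("tables have different columns and cannot be stacked")
    return first + second[1:]


# Made-up placeholder rows, not real passenger data.
first_class = [["Name", "Hometown"], ["Passenger A", "Town A, Country A"]]
second_class = [["Name", "Hometown"], ["Passenger B", "Town B, Country B"]]

combined = stack_tables(
    add_class_column(first_class, 1),
    add_class_column(second_class, 2),
)
for row in combined:
    print(row)
```

The check on the header row matters: stacking only works when both tables have exactly the same columns.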

Treating the Third Class passengers Differently

You read that right. We are going to have to handle the Third Class passengers differently.

Why?

Because, if you look closely on the Wikipedia page, the table for the Third Class passengers has an extra column.

In the tables for the First and Second Class passengers, the column “Hometown” included both the town, and the country. In the table for the Third Class passengers, the “Hometown” column has the “Town”, and there is a separate column for “Home Country”.

The extra column makes it difficult to combine it with the other data.

However, there is a workaround for this. I will be covering that in a later post.
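
In the meantime, here is one possible way to think about the problem, sketched in Python. This is an assumption on my part and not necessarily the workaround mentioned above: it joins the “Hometown” and “Home Country” columns into a single “Hometown” column so that the Third Class table has the same shape as the other two. The function name and the sample row are made up for this example.

```python
def merge_town_and_country(table, town_col="Hometown", country_col="Home Country"):
    """Fold the country column into the town column as 'Town, Country'."""
    header, *rows = table
    town_i = header.index(town_col)
    country_i = header.index(country_col)

    # Keep every header except the country column.
    new_header = [name for i, name in enumerate(header) if i != country_i]

    new_rows = []
    for row in rows:
        merged = list(row)
        merged[town_i] = f"{row[town_i]}, {row[country_i]}"
        del merged[country_i]
        new_rows.append(merged)
    return [new_header] + new_rows


# Made-up placeholder row, not real passenger data.
third_class = [
    ["Name", "Hometown", "Home Country"],
    ["Passenger C", "Town C", "Country C"],
]

reshaped = merge_town_and_country(third_class)
for row in reshaped:
    print(row)
```

After this step the Third Class rows have the same columns as the First and Second Class rows.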

 



Related Post

Journalism with Data


If you browse through the posts in this blog, you’ll see that there are several that are related to “telling a story”, “using pictures to present data”, and similar.

Because I want to be able to present data graphically, in a proper way, I have started an online course titled: “Doing Journalism With Data: First Steps, Skills and Tools”.

It’s a 5-module online (MOOC) introductory course that “gives you the essential concepts, techniques and skills to effectively work with data and produce compelling data stories under tight deadlines”.

Awfully exciting stuff! It’s actually being taught by 5 tutors (one for each module) from Britain, America, and France. Here are the five modules:

Module 1 – Data journalism in the newsroom
Module 2 – Finding data to support stories
Module 3 – Finding story ideas with data analysis
Module 4 – Dealing with messy data
Module 5 – Telling stories with visualisation

You can read more about the course here.

I’ve just started module 1 (along with 21,280 other students), and I’m keen to work my way through the rest of the modules.

At the end, I’ll give an idea what I thought of the course along with any real gems that I got out of it.

 

 

  • Journalism Course
  • Launching a MOOC for data journalism
  • Top 10 skills new journalists should have

Related Post

Asking Stupid Questions – what can happen?

People consider that you are asking a stupid question when the question can be answered by using Google.

While easy to do, consider this – by simply finding the answer straight away, it removes the opportunity for dialogue, for discussing, and learning…

For example, I want to know what HTML5 is. I could go to Google, (or Bing, or any search engine) type the four letters and one numeral in, and get an abundance of results.

However, if I ask someone, there are a number of outcomes:

Do you see what happened there?

The easy solution was to Google the answer. Simple, easy & fast.

However, by asking someone, I engaged in dialogue, and when the person started explaining the answer, the dialogue started becoming rich, and each interaction created new richness.

People communicating, and sharing ideas, thoughts, knowledge, and concerns is, actually, a pretty great thing. :O)

I would love to hear what your experiences are.

Oh, and by the way, if you like this post, share it with others.

Related Post

5 FREE Computer Tools for Every 21st Century Teacher

Alanna is a music teacher and ICT coordinator with “a passion for everything education related”.

Recently she put together a list of 5 FREE Computer Tools for Every 21st Century Teacher.

It’s a great resource. I encourage you to visit her site and give her a word of encouragement for the great work she is doing!

Related Post

It’s like working in an Encyclopaedia

Every day I am truly in awe.

I work in an open-plan office where there are developers, designers, hardware people, project managers, business analysts and a few people whose roles I am not sure of.

It’s a great work environment, and one that I have found to be incredibly educational.

Because it is open-plan, and because all the developers, designers, hardware people, project managers, and business analysts are so passionate and enthusiastic about what they do, I get to sit in on some very interesting discussions. (Hell, sometimes I’m almost able to contribute something useful to the conversations.)

A great example is the other day. In that one day I was able to listen to two designers talk with passion about design techniques, as well as some of the new technology available. Then I was involved with a group of business analysts discussing a successful project that had taken place. Later that day I was able to follow another passionate discussion related to UI design, and usability. And then I had a chance to sit in on a debate between two developers on the benefits, and downsides, of Scrum and Kanban.

I always left these discussions feeling like I had just been watching a TED talk, or had been reading through an Encyclopaedia.

Related Post

Carl’s obstacles revisited

In one of my earlier posts I talked about my friend Carl. If you recall, Carl was full of great ideas, and had passion and enthusiasm. This passion and enthusiasm, however, was just not harnessed in the right way. End result…Carl was despondent and feeling depressed. He ended up leaving his employer, which was a pity.

Forbes have published an article that lists 10 mistakes big companies make when it comes to keeping their talented staff. Looking at these, most of them seemed to apply to Carl. Namely:

1. You Failed To Unleash Their Passions: Carl was passionate about what he was doing.  This was not recognised.

2. You Failed To Challenge Their Intellect: Carl was given a slap on the wrist for “doing the wrong thing”, and was relegated to mundane tasks that did not stimulate his intellect at all.

3. You Failed To Engage Their Creativity: Carl was looking for new ways of doing things. He was trying to be innovative. He was told to stop “thinking outside the box.”

4. You Failed To Develop Their Skills: Not getting any support from his managers, Carl was attempting to develop his skills himself. This was perceived as “wasting time”.

5. You Failed To Give Them A Voice: Carl had ideas. No-one listened to them though.

6. You Failed To Care: At one stage, Carl really enjoyed his job. He worked hard, and often (at least in his eyes) went “above and beyond”. He knew that it was about ‘give and take’. He didn’t expect to be rewarded, but was hoping to get a little bit of respect.

7. You Failed To Lead: Carl was not getting the leadership that he needed. In the end – he tried to use his initiative to find his own way.

8. You Failed To Recognize Their Contributions:

9. You Failed To Increase Their Responsibility: In Carl’s situation, his responsibilities actually decreased. This had a big impact on Carl’s morale.

10. You Failed To Keep Your Commitments:

The above-mentioned “mistakes” are the ones that really capture the frustration that Carl was experiencing. Also read the original Forbes article for an extended explanation.

  • Avoid Losing Your Most Talented Employees
  • 10 Reasons Your Top Talent Will Leave You
  • Courage, Failure, & Leadership
  • 15 Ways To Identify Bad Leaders

Related Post

Dousing the flame

 
A friend of mine gave me a call the other day.

His name is Carl. I’ve known him for a long time so he lets me call him “Carl”.

Carl’s a young guy and has been quite passionate about the computer world. He has his own blog and used to write quite some eclectic material. Recently, however, he had been rather reticent with his ponderings.

As I hadn’t spoken to Carl for a while, I arranged to meet him for a drink. It was quiet at the bar. There was a group of guys who seemed to be discussing how to run a “search” project, but we tried to steer clear of that. We headed to a quiet table and ordered a drink.

After the usual small-talk I asked him, directly, what was happening. Why hadn’t he written any blog posts recently?

“They’ve killed me”, he said. “Huh?! – what do you mean?” I replied almost choking on my beer. Carl went on “I’m dead…the passion’s gone”. I grilled Carl a bit more and gradually the story came out.

As I mentioned above Carl was really enthusiastic about the computer industry. He wrote some great blog posts, and would attend industry shows, and user group sessions when he could, just to see what the latest thing happening was and also to learn from others. In fact Carl had built up a great circle of what he called “Social 2.0 friends”. (I was a “social 1.5” friend according to him). And Carl was happy. He liked learning.

He wasn’t always like this. Once upon a time Carl was just a standard “computer guy”. He did his work well, but when he was at home he didn’t really do anything special. He watched TV, he went to the movies with some of his Social 1.0 friends, and that was pretty much it. Any  “further education” he got, any training, was always related to his job.

Then Carl had decided to improve himself. He started cautiously with his blog. (This is when I got to know him.) And he started reading more and more. Not only things that were related to his job, but articles and posts that discussed all facets of the computer industry. He even expanded this to include things that, on the surface, had nothing at all to do with computers.

I had been following Carl’s progress for a while, and I could see that he was growing, and developing. Normally Carl was a reserved guy with not much self-esteem, but I could see a new confidence appearing. In our rare face-to-face opportunities, Carl had also mentioned the same. He was enthusiastic and didn’t want to stop.

But then, apparently, someone where Carl worked had taken exception to all this. Someone from high up had come way down to talk with him. “Carl”, they had said (apparently), “you are wasting time. What has all this to do with your work? You’re clearly a ‘fuzzy thinker’. You write all this crap, but with no real value.” Carl had tried to protest, but he was too shocked. “We’ve read a lot of what you’ve written…90% of it is just cut-and-paste bullshit. You don’t write anything original.”

And this is when Carl “died”. “After hearing that,” he said to me quietly, “I just lost the passion. I thought I was doing so well, and was hoping that someone would recognise the potential I was showing. Instead, they just want me to plod through my job.”

I bought Carl another beer and let him rave on a little bit more. I wanted to tell him that what he had been doing was brilliant, and how he had really been making leaps and bounds in not only his knowledge but also in the sort of person he was. He now had “drive” and a voracious appetite for discovery. But Carl was feeling so morose at this stage that it seemed that nothing I could say would make a difference.

We decided to call it a night. We sidled past the group of “search project” guys and headed out the door. I ordered Carl a taxi and got one myself.

Later that evening, back in my flat, I had a chance to think about Carl’s situation. It seemed a shame that his newly developed talents were not being recognised. In fact, it seemed the opposite. Even though Carl’s blog was a hobby, it seemed to have been used against him. I know, myself, that writing a blog often “exposes” you as a person. If you want to write something “real”, your ideas, your opinions, and your personality will get reflected in the blog posts. And it seemed that this had allowed Carl’s employer to make a judgement on who Carl was in the workplace. And this was a shame.

In an ideal world, what Carl was doing, the metamorphosis that he had achieved, would be recognised and utilised somehow. Instead, it seemed in Carl’s case, that someone had decided that this “new Carl” wasn’t fitting nicely into the hole that he was meant to be fitting into.

Carl’s flame has been doused. And that’s a real pity. I’m trying to give Carl real encouragement so that he won’t “lose himself”.

Carl – if you’re reading this, don’t be silenced. Be yourself.

Related Post

Learning how to use Google+ (or “I’m new here – 2”)

As I mentioned in my previous post, I didn’t really “get” Google+ (having never actually used Facebook).

Just after writing that post, I found this article, by Matt Heinz, in which he said:

You can drive yourself nuts and waste loads of time chasing after every new technology, gadget, productivity tool, social network or other flavor of the week.

And if you insist on being the first to try everything and position yourself on the bleeding edge, knock yourself out.

In the article was also a link to another useful article in which Chris Brogan lists 50 useful things to do with Google+.

The Google+ 50

 

I’m going to read through this.

Maybe this social media troglodyte will learn something…

Related Post

Is Microsoft a Religious Experience?

A Tweet by @pelujan the other day started me thinking. The tweet was:

I responded to his tweet because I do remember “workflo”. It was something that FileNet developed back in 1985. I admit that this was indeed 10 years before I got into IT (having spent those 10 years doing stuff in laboratories), but I was very aware of it as it played a big part in a lot of their technology.

In fact, my first introduction to ECM was PC Docs, and also FileNet’s early Content Management application “Saros Mezzanine”. This was followed by their Image Management Services application running on an AIX system. It stored scanned images on WORM disks in an OSAR unit, and had a robotic arm jukebox. It was a bloody impressive, but also daunting, system (especially when you are new on the job, and you’ve been told to support this system at a very hostile client site).

Over the years I got more and more involved with FileNet and their products, getting to know the idiosyncrasies of each one. I worked as a consultant, and each client had its own unique requirements, environments, and situations. Very often I would go home at the end of the day feeling beaten up.

At the end of 2006 I moved into a position working with Documentum, and quickly after, SharePoint. However, this time, I was the client, and so if something didn’t work, someone else was responsible for “fixing it”. This gave me more time to think about the potential of the systems in terms of the industry I was now working in. I actually went home feeling a lot more relaxed.

Now, the one thing that always struck me, when I was working with FileNet, was that, compared to a Microsoft product, there was not a lot of material available. The majority of what you learnt came about through personal experience. You were on the battle field getting the scars. You felt that you had “earned it”.

Of course, there were forums available, and FileNet themselves had a great store of answers to questions, etc. (I used to trawl their partner site just to pick up nuggets of knowledge). Documentum (now EMC) have the same thing which I still use.

At the end of the last century (gawd – that sounds awful) I got my MCSE, and have kept up to speed with Microsoft technology since then. In 2007 I developed a Portal site that hooked into Documentum, and then, having got some scars with that, I got my SharePoint 2007 certification.


Is Microsoft a Religious experience?

Now I am trying to build up my knowledge of SharePoint 2010. This time I’m trying to take a more business application view of the technology. I did AIIM’s SharePoint Master course, which gives a more “real” view of SP2010, especially with regards to Document Management. (See this post, and this one.) However, I realise that it’s still handy to have the MS certification under my belt, so I am working towards Microsoft SP2010 certification also.

I don’t want to pay for a course, and so I’m using the over-abundant resources that can be found on the internet (white papers, MS videos, MS learning material, etc). The more material I cover, the more I am aware that the same message is being thrown at me – “how great SharePoint 2010 is”. (I’m not going to get into a discussion regarding this, as this has been covered by multitudes of blogs and forums on the internet.)

The fact is I find myself slowly, (and blindingly), convinced. I’ve started chanting the mantra, and doing the dance.

Microsoft has produced so much stuff on their latest “shiny object”. It’s amazing. There are books, videos, whitepapers, forums, FAQs, TechNet articles, etc, etc, etc. There is also a conference/user group/gathering for the devout, almost every second week. And there are “evangelists” – people who spread the Word.

Got to admit, I am going to one of these conferences in April – the Best Practices Conference, being held in London (#bpcuk). The US one has just finished, and I was following the tweet stream (#bpc11). The funny thing was – I got to the point where I was “religiously” checking on the progress of the conference, and the activities of the participants (albeit the more “tweetal”  – think of the word “vocal” but in terms of tweeting – amongst them). And I found myself just wishing I was there, wishing I was with these people and seeing, and sharing, what they were. (Quick – slap me!)

I never got this “ecstatic feeling” with FileNet. It was all mud and barbed wire. You were earning your stripes “old school”. And even though I have attended the Documentum user group conferences (Momentum) for a few years now (which is one of the high-points of my year – have only missed one over the last 5 years), I’ve never felt the (illogical, zealot-like) fervour that I am starting to experience now.

Related Links

  • Is the SharePoint Community Past Its Prime?
  • Best Practices Conference 2011 – Europe
  • Best Practices Conference 2011 (US) Twitter activity (thanks to @VeroniquePalmer)
  • Momentum (2010)
  • AIIM SharePoint Course


Related Post

Trust the Process

I saw a tweet today from @visualloop with a link to a graphic that really hit me.

It is a graphic that sums up the characteristics of achieving success, and was created by Garrick Gibson, and features in a post he wrote about his road to where he is now.

I liked this graphic because he sums it up so well.  It’s advice that I’d like to pass along.

[Graphic: success, achievement, goals, dream]

(Note – you can see his post by clicking on the graphic.)
