Small Brain Notes on “Big Data”

Big Data… a term I had heard a lot, but never really understood exactly what it meant.

Thanks to a whitepaper written by Mike Ferguson, of Intelligent Business Strategies, I don’t have to fake it any more.

Mike’s excellent paper (“Architecting A Big Data Platform for Analytics“) explains what Big Data is, as well as what is needed to “do” Big Data. It is, however, quite wordy and, for my own sake, I have tried to capture the essence of his words of wisdom in my “Small Brain Notes”… 

(Note – the notes below are just a condensed, and slightly modified, version of Mr Ferguson’s work)

Architecting A Big Data Platform for Analytics – Small Brain Notes

Traditional analytical systems are based on a classic pattern where data from multiple operational systems is captured, cleaned, transformed and integrated before loading it into a data warehouse.

BI tools can be used to analyse, compare and report on business performance over time.

Many new complex types of data are emerging that businesses want to analyse to enrich what they already know, and the rate at which much of this new data is being created and/or generated is far beyond what we have ever seen before:

  • Social network data
  • Review sites
  • On-line news items
  • Weather data
  • Competitor web site content
  • Data marketplaces
  • Web logs
  • Archived data
  • Growing numbers of sensor networks

Complexity is growing in both the characteristics of the data itself and in the types of analyses businesses now want to perform.


With regards to data, complexity has increased in the following ways:

  • The variety of data types being captured by enterprises
  • The volumes of data being captured by enterprises
  • The velocity or rate at which data is being generated
  • The veracity or trustworthiness of the data

Variety of Data Types

New data types are now being captured by enterprises. These include:

  • Semi-structured data e.g. email, e-forms, HTML, XML
  • Unstructured data e.g. document collections (text), social interactions, images, video and sound
  • Sensor and machine generated data

This collection of new, more complex data types is often referred to as multi-structured data.

A major problem with multi-structured data is that it is often unmodelled and therefore has to be ‘explored’ to derive structured data from it that has business value.

Investigative analysis needs to be done on multi-structured data upstream of any traditional analytical environment to identify data that could enrich what is already stored in existing data warehouses.

Data Volume

The rate at which companies are accumulating data is also increasing leading to much larger data volumes. Examples include: collections of documents and emails, web content, call data records (CDRs) in telecommunications, weblog data and machine generated data.

These sources can run into hundreds of terabytes or even into petabytes.

Velocity of Data Generation

The rate at which data is being created is increasing rapidly.

Financial markets are a good example: data is generated and emitted at very high rates, and there is a need to analyse it immediately to respond to market changes in a timely manner.

Other examples include sensor and machine-generated data, where the same requirement applies, and cameras whose video and images require analysis.


New algorithms and several types of analysis are needed to produce the necessary insight required to solve business problems.

Each of these analyses may need to be done on data that has different characteristics in terms of variety, volume and velocity.

E.g.: Retail Marketing:

  • Historical analysis and reporting of customer demographics and customer purchase transaction activity (structured data) to determine customer segmentation and purchase behaviour
  • Market basket analysis to identify products that sell together to identify cross-sell opportunities for each customer
  • Click-stream analysis to understand customer on-line behaviour and product viewing patterns when traversing web site content to produce accurate up-sell offers in real-time
  • Analysis of user generated social network data such as profiles (e.g. Facebook, LinkedIn), product reviews, ratings, likes, dislikes, comments, customer service interactions etc.
  • Real-time analysis of customer mobile phone location services (GPS) data to detect when a customer may be in the vicinity of an outlet to target them with offers that tempt them to come in
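Market basket analysis, mentioned above, boils down to counting which products appear together in the same transactions. A toy sketch of the idea (my own illustration, not from the paper):

```python
from itertools import combinations
from collections import Counter

def co_occurrence(baskets):
    """Count how often each pair of products appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted() makes pair order deterministic, so (A, B) == (B, A)
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
]
counts = co_occurrence(baskets)
print(counts[("bread", "butter")])  # 2 -- bread and butter sell together twice
```

Real implementations add support/confidence thresholds (e.g. the Apriori algorithm), but the core is the same co-occurrence counting.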

Determining the insight needed to solve a business problem is now a process involving multiple analyses on different data sources where both the data and the analyses vary in complexity.

Analysis of both structured and unstructured data may be needed in any single analytical process to produce the insight required.

Data integration is required to merge multi-modal data to improve actionable insights.

Given that some data sources may be un-modelled, the steps in an analytical process cannot all be done on a single analytical platform and require multiple underlying technologies to solve the business problem.

What is Big Data?

The spectrum of analytical workloads is now so broad that it cannot all be dealt with in a single enterprise data warehouse

The new environment includes multiple underlying technology platforms in addition to the data warehouse, each of which is optimised for specific analytical workloads.

It should be possible to make use of these platforms independently for specific workloads and also together to solve business problems.

Big Data is, therefore, a term associated with the new types of workloads and underlying technologies needed to solve business problems that could not previously be solved due to technology limitations and/or prohibitive cost.

Big Data is not just about data volumes; it is also about complexity.

Big Data analytics is about analytical workloads that are associated with the combination of data volume, data velocity and data variety that may include complex analytics and complex data types.

Big Data can be associated with both structured and multi-structured data.

For this reason, Big Data analytics can include the traditional data warehouse environment, because some analytical workloads may need both traditional and workload-optimised platforms to solve a business problem.

The new enterprise analytical environment encompasses traditional data warehousing and other analytical platforms best suited to certain analytical workloads.

Analytical requirements and data characteristics will dictate the technology deployed in a Big Data environment. Big Data solutions can be implemented on a range of technology platforms including:

  • stream-processing engines
  • relational DBMSs
  • analytical DBMSs (e.g. massively parallel data warehouse appliances)
  • non-relational data management platforms, such as a commercialised Hadoop platform or a specialised NoSQL data store (e.g. a graph database)

It could be a combination of all of these that is needed to support business requirements.

Types of Big Data

Types of data frequently associated with Big Data analytical projects include:

  • web data
  • industry-specific transaction data
  • machine-generated/sensor data
  • text

Web data includes web log data, e-commerce logs and social network interaction data, e.g. Twitter streams.

Industry-specific transaction data examples include telecommunications call data records (CDRs) and geo-location data, retail transaction data and pharmaceutical drug test data.

Machine-generated / sensor data is one of the fastest growing areas. Sensors exist to monitor everything from movement, temperature, light, vibration, location (e.g. inside smart phones), airflow and liquid flow to pressure. More and more data-generating electronic components are going into products, all of which can be connected to the Internet to send data back to collectors and command centres.

Text includes archived documents, external content sources and customer interaction data.


Technology advances now make it possible to analyse entire data sets and not just subsets. For example, every interaction rather than every transaction can be analysed. The analysis of multi-structured data may therefore produce additional insight that can be used to enrich what a company already knows and so reveal additional opportunities that were previously unknown.

There are still inhibitors to analysing Big Data. Two reasons for this are as follows:

  1. The shortage of skilled people and
  2. Confusion around what technology platform to use


Industry Use Cases

  • Financial Services – improved risk decisions, “know your customer”, 360º customer insight, fraud detection, programmatic trading
  • Insurance – driver behaviour analysis (smart box), broker document analysis to deepen insight on insured risks and improve risk management
  • Healthcare – medical records analytics to understand why patients are being re-admitted, disease surveillance, genomics
  • Manufacturing – ‘smart’ product usage and health monitoring, improved customer service by analysing service records, field service optimisation, production and distribution optimisation by relating reported service problems to sensor data to detect early warnings in product quality
  • Oil and Gas – sensor data analysis in wells, rigs and pipelines for health and safety, risk, cost management and production optimisation
  • Telecommunications – network analytics and optimisation from device, sensor and GPS inputs to enhance social networking and promotion opportunities
  • Utilities – smart meter data analyses, grid optimisation, customer insight from social networks

Web data, sensor data and text data have emerged as popular data sources for big data analytical projects.

With respect to web data, analyses of clickstream and social network content have been popular. Web log data is often analysed to understand site navigation behaviour (session analysis) and to link this with customer and/or login data.
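Session analysis is usually done by splitting a user's click timeline wherever the gap between page views exceeds an inactivity timeout. A rough sketch of the idea (my own, not from the paper):

```python
def sessionize(events, timeout=1800):
    """Group a user's (timestamp, page) views into sessions, starting a
    new session after `timeout` seconds (default 30 min) of inactivity."""
    sessions = []
    current = []
    last_ts = None
    for ts, page in sorted(events):
        if last_ts is not None and ts - last_ts > timeout:
            sessions.append(current)  # gap too long: close the session
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

views = [(0, "/home"), (60, "/product/42"), (5000, "/home"), (5100, "/checkout")]
print(sessionize(views))  # [['/home', '/product/42'], ['/home', '/checkout']]
```

The resulting sessions can then be joined to customer or login data for behaviour analysis.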

Analysis of machine generated / sensor data is being adopted for supply/distribution chain optimisation, asset management, smart metering, fraud and grid health monitoring to name a few examples.

In the area of unstructured content, text in particular is being targeted for analysis. Case management, fault management for field service optimisation, customer sentiment analysis, research optimization, media coverage analysis and competitor analysis are just a few examples of Big Data analytic applications associated with unstructured content.


There are a number of Big data analytical workloads that extend beyond the traditional data warehouse environment:

  • Analysis of data in motion
  • Exploratory analysis of un-modeled multi-structured data
  • Complex analysis of structured data
  • The storage and re-processing of archived data
  • Accelerating ETL and analytical processing of un-modeled data to enrich data in a data warehouse or analytical appliance


The purpose of analysing data-in-motion is to analyse events as they happen to detect patterns in the data that impact (or are predicted to impact) on costs, revenue, budget, risk, deadlines and customer satisfaction etc.

This type of big data analytical workload is known as event stream processing and is most often used to support everyday operational decisions, where all kinds of events can occur throughout a working day.

Examples include:

  • a sale of shares on the financial markets
  • a price change
  • an order change or an order cancellation
  • a large withdrawal from a savings account
  • the closure of an account
  • a mouse click on a web site
  • a missed loan payment
  • a product or pallet movement in a distribution chain (detected via RFID tag)
  • a tweet
  • a competitor announcement
  • a frame of CCTV video from a street corner
  • a reading from an electrocardiogram (ECG) monitor
  • etc.

Whatever the events or streams of data, thousands or millions of these can occur in business operations each second. While not all events are of business interest, many require some kind of responsive action to seize an opportunity or to prevent a problem occurring or escalating. That response may need to be immediate and automatic in some cases, or subject to human approval in others.

With stream processing, analysis of data needs to take place before the data is stored in a database or a file system. The velocity at which data is generated and the volumes of data involved in stream processing means that human analysis is often not feasible.

Analysis has to be automated using a variety of analytic methods, such as predictive and statistical models or acoustic analysis to determine or predict the business impact of these events.

Decision-making may also have to be automated to respond in a timely manner to keep the business optimised and on track to achieving its goals. Actions may vary from alerts to completely automated actions (e.g., invoke transactions or close a valve in an oil well).
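As a toy illustration of such an automated rule (my example, not the paper's), here is a sliding-window threshold check of the kind a stream-processing rules engine might evaluate on each incoming event:

```python
from collections import deque

class ThresholdAlert:
    """Flag when the sum of values seen in the last `window` seconds
    exceeds `limit` -- a stand-in for a stream-processing rule that
    would trigger an alert or a fully automated action."""

    def __init__(self, window, limit):
        self.window, self.limit = window, limit
        self.events = deque()  # (timestamp, value) pairs inside the window

    def observe(self, ts, value):
        self.events.append((ts, value))
        # Evict events that have fallen out of the time window
        while self.events and self.events[0][0] < ts - self.window:
            self.events.popleft()
        total = sum(v for _, v in self.events)
        return total > self.limit  # True would trigger the action

# E.g. flag >10,000 withdrawn from an account within any 60-second window
alert = ThresholdAlert(window=60, limit=10000)
print(alert.observe(0, 4000))   # False
print(alert.observe(30, 5000))  # False (running total 9000)
print(alert.observe(45, 2000))  # True  (running total 11000 within 60s)
```

Real engines (and their CEP rule languages) handle out-of-order arrival, millions of events per second and multiple correlated streams; the windowed evaluation is the core idea.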

In some industries the volume of event data can be significant.


Multi-structured data is often un-modelled and therefore requires exploratory analysis (often conducted by Data Scientists) to determine what subset of data is of value to the business.

Once done, any data identified as being of value can be extracted and put into data structures from where further analysis can take place and new business insight produced.

Popular sources of multi-structured data include web logs and external social network interaction data.

Analysing and extracting data from social networks currently dominates text analytical activity in customer-facing organisations.

The top business applications driving text analysis are:

  • Brand/product/reputation management
  • Voice of the Customer
  • Search, information access or question answering
  • Research
  • Competitive intelligence

Challenges with this type of data include:

  • it can be very large in volume and may contain content in different languages and formats
  • it might contain poor quality data (e.g. spelling errors or abbreviations) and obsolete content.

A key requirement for successful text analytics is to ‘clean’ the content before analysis takes place.

Pre-processing text before analysis involves:

  • extracting, parsing, correcting and detecting meaning from data (using annotators) to understand the context in which the text should be analysed.
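As a crude illustration of the cleaning step (my own sketch; real pipelines use spell-checkers, language detection and annotators rather than a hand-made lookup table):

```python
import re

# Hypothetical abbreviation/typo map -- a stand-in for real spelling
# correction and abbreviation expansion
NORMALISE = {"gr8": "great", "svc": "service", "recieved": "received"}

def clean_text(raw):
    """Minimal pre-processing: strip markup, lowercase, drop punctuation
    noise, expand known abbreviations, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)                 # drop HTML remnants
    text = re.sub(r"[^a-z0-9\s']", " ", text.lower())   # keep word characters
    tokens = [NORMALISE.get(t, t) for t in text.split()]
    return " ".join(tokens)

print(clean_text("Gr8 <b>svc</b>, recieved order fast!!!"))
# great service received order fast
```

Only after this kind of normalisation can annotators reliably detect meaning and context in the text.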

Multi-structured data is also hard to analyse. For example, social interaction data can require multiple analytical passes to determine the insight needed:

  • The first pass involves text analysis (mining) to extract structured customer sentiment and also to extract social network ‘handles’ embedded in interaction text that represent members of a social graph
  • The second pass is to analyse the extracted data for negative and positive sentiment
  • The third pass loads the social network handles into a graph database where new advanced analytic algorithms (e.g. N-Path) can be used to navigate and analyse links to identify contacts, followers and relationships needed to piece together a social network and to identify influencers.
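The first two passes can be sketched in a few lines (a toy illustration of mine; the handle regex and the tiny sentiment lexicons are stand-ins for real annotators and trained models):

```python
import re

POSITIVE = {"love", "great"}
NEGATIVE = {"hate", "broken"}  # toy lexicons, not a trained model

def first_pass(posts):
    """Pass 1: mine each post for @handles and sentiment-bearing words."""
    extracted = []
    for author, text in posts:
        handles = re.findall(r"@\w+", text)
        words = set(re.findall(r"\w+", text.lower()))
        extracted.append({"author": author, "mentions": handles, "words": words})
    return extracted

def second_pass(record):
    """Pass 2: score the extracted words as positive/negative sentiment."""
    score = len(record["words"] & POSITIVE) - len(record["words"] & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = [("@alice", "I love this phone, thanks @bob_support!"),
         ("@carol", "Screen arrived broken. @bob_support no reply.")]
for rec in first_pass(posts):
    # Pass 3 would load (author, mention) edges into a graph database
    print(rec["author"], rec["mentions"], second_pass(rec))
```

The (author, mention) pairs printed here are exactly the edges the third pass would load into a graph DBMS for link analysis.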

Predictive analytics, more sophisticated statistical analysis and new visualization tools might also be needed. Search based analytical tools that use search indexes on multi-structured data may also help with this type of workload.

Content analytics goes beyond text analytics in that it can also handle audio, video and graphics. Digital asset content, e.g. sound and video, is more difficult to parse and derive business value from because the content is not text.

Deriving insight from this kind of content is reliant on sophisticated analytic routines and how well the content has been tagged to describe what the content is and what it is about.

This big data analytical workload involves:

  • Obtaining the necessary un-modeled data
  • Cleaning the data
  • Exploring the data to identify value
  • Producing a model from the exploratory analysis (structure)
  • Interpreting or analysing the model to produce insight


This type of big data analytical workload may be on structured data taken from a data warehouse or from other data sources (e.g. operational transaction systems) for the specific purpose of doing complex analysis on that data.

Predictive and statistical models can be built for deployment in database or in real-time operations.

Some vertical industries are investing heavily in complex analysis to mitigate risk (e.g. Oil and Gas).


Big data systems are being looked at as an inexpensive alternative for storing archive data.


With so many new sources of data now available to business and data arriving faster than business can consume it, there is a need to push analytics down into ETL processing to automatically analyse un-modelled data.

The purpose of this is to speed up the consumption of un-modelled data to enrich existing analytical systems. This improves agility and opens the way for more timely production of business insight.



Stream processing software supports automatic analysis of data in-motion in real-time or near real-time.

It identifies meaningful patterns in data streams and triggers action to respond to them as quickly as possible.

This software:

  • provides the ability to build real-time analytic applications that keep different parts of a business operation optimised.
  • must be capable of automated analysis of event data streams containing either multi-structured data (e.g. Twitter streams or video streams) or structured data or both.

Predictive and/or statistical models deployed in real-time analytical workflows provide automated analytical capability in stream processing software. A rules engine is also needed to automate decision-making and action taking.

The software must cope with high-velocity ‘event storms’ where events arrive out of sequence at very high rates, and must also integrate multi-modal data.


Apart from data-in-motion, data needs to be stored prior to analysis taking place.

There are multiple storage options available to support big data analytics on data at rest. These options include:

  • Analytical RDBMSs
  • Hadoop solutions
  • NoSQL DBMSs such as graph DBMSs

Analytical RDBMS Appliances

  • Relational DBMS systems that typically run on their own special-purpose hardware, optimised for analytical processing
  • Often known as an appliance; a workload-optimised system

Hadoop Solutions

Apache Hadoop is an open source software stack designed to support data intensive distributed applications.

  • Hadoop HDFS – A distributed file system that partitions large files across multiple machines for high-throughput access to data
  • Hadoop MapReduce – A programming framework for distributed batch processing of large data sets distributed across multiple servers
  • Chukwa – A platform for distributed data (log) collection and analysis
  • Hive – A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into Map/Reduce programs
  • HBase – An open-source, distributed, versioned, column-oriented store modelled after Google’s Bigtable
  • Pig – A high-level data-flow language for expressing Map/Reduce programs for analyzing large HDFS distributed data sets
  • Mahout – A scalable machine learning and data mining library
  • Oozie – A workflow/coordination system to manage Apache Hadoop jobs
  • Zookeeper – A high-performance coordination service for distributed applications

Hadoop complements RDBMS technology, and is suited to exploratory analysis of both structured and multi-structured data.

Typically, un-modelled data is stored in the Hadoop HDFS file system where exploratory analysis occurs to derive structure. This can then be stored in Hive for further analysis.
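Deriving structure from un-modelled data often starts with something as simple as parsing raw log lines into records. A small sketch (my own; the line follows the common Apache-style log layout):

```python
import re

# Common Log Format line: the regex derives (ip, timestamp, path, status)
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) [^"]*" (\d{3})')

def parse_log(lines):
    """Derive structured records from un-modelled log text; rows like
    these could then be stored in Hive for SQL-style analysis."""
    rows = []
    for line in lines:
        m = LOG_RE.match(line)
        if m:  # un-parseable lines are simply skipped during exploration
            ip, ts, path, status = m.groups()
            rows.append({"ip": ip, "time": ts, "path": path,
                         "status": int(status)})
    return rows

raw = ['203.0.113.9 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200']
print(parse_log(raw)[0]["path"])  # /index.html
```

At Hadoop scale this parsing would run as the map phase of a MapReduce job, with the derived rows landing in a Hive table.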

Data scientists develop batch analytic applications in languages like Java, Python and R to run in this environment using a style of programming known as MapReduce.

Programs can be copied to thousands of compute nodes where the data is located in order to run in parallel. In addition in-Hadoop analytics in Mahout can run in parallel close to the data to exploit the full power of a Hadoop cluster.
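The MapReduce style itself is easy to sketch in miniature (my own single-machine simulation; the real framework distributes the map and reduce phases across the cluster's compute nodes):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word -- runs close to the data."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)

def mapreduce(documents):
    shuffled = defaultdict(list)  # stands in for the framework's shuffle/sort
    for doc in documents:
        for word, count in map_phase(doc):
            shuffled[word].append(count)
    return dict(reduce_phase(w, c) for w, c in shuffled.items())

docs = ["big data big platform", "data platform"]
print(mapreduce(docs)["big"])  # 2
```

The classic word count above is the "hello world" of MapReduce; the same map/shuffle/reduce shape underlies the analytic jobs Hive and Pig generate.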

Hive is also available to SQL developers and/or tools to access data in Hadoop using the HiveQL language.


As well as Hadoop HDFS, HBase and Hive, there are other NoSQL DBMSs options available as an analytic data store.

These include key value stores, document DBMSs, columnar DBMSs, graph databases and XML DBMSs.

There are no standards in the NoSQL market as yet.

Which Storage Option Is Best?

The following table shows criteria that can be used as a guideline as to where data should be stored for a big data analytical workload.

Analytical RDBMS:

  • Data analysis and reporting, or complex analysis
  • Data is well understood
  • Schema is defined and known
  • Batch and on-line analysis
  • Access via BI tools that generate SQL and that can run predictive/statistical models in the database
  • Scalable to hundreds of terabytes on purpose-built MPP clusters

Hadoop / NoSQL DBMS:

  • Data exploration followed by analysis, or a very specific type of analysis a NoSQL DBMS is designed to excel at (e.g. graph analysis in a graph database)
  • Data is NOT well understood
  • Schema is not defined and variant
  • Batch analysis with some on-line capability via Hive or Lucene
  • Development of MapReduce applications in Java, R, Python, Pig etc.
  • Scalable to petabytes on purpose-built appliances or in the cloud

Looking at the workloads for big data at rest, the following matches each workload to the appropriate data storage platform.

  • Exploratory analysis of un-modelled multi-structured data (e.g. web logs, unstructured content, filtered sensor data, email) – Hadoop
  • Complex analysis of structured data, or data warehouses that have ‘light’ mixed workloads – Analytic RDBMS appliance
  • Storage and re-processing of archived data – Hadoop
  • Accelerating ETL processing of structured and un-modelled data – Hybrid: Hadoop and analytical DBMS
  • Social graph link analysis – NoSQL graph DBMS


Consistent high quality data across multiple analytical data stores is important. This includes Data Warehouse RDBMSs, Analytical RDBMS Appliances, Hadoop Clusters/Appliances and NoSQL DBMSs.

There are a number of options for data management that range from different data management tools for each different platform to a common suite of tools supplying data to all platforms.

Moving data between platforms as part of an analytical process includes:

  • Moving master data from an MDM system into a data warehouse, an analytical DBMS, or Hadoop
  • Moving derived structured data from Hive to a data warehouse
  • Moving filtered event data into Hadoop or an analytical RDBMS
  • Moving dimension data from a data warehouse to Hadoop
  • Moving social graph data from Hadoop to a graph database
  • Moving data from a graph database to a data warehouse

Architecting A Big Data Platform for Enterprises

To make this possible, information management software needs to:

  • support ELT processing on Hadoop (multi-structured data) and/or analytical RDBMSs (structured data)
  • interface with event processing to ingest filtered event stream data
  • load data into Hadoop and NoSQL DBMSs
  • parse data in Hadoop
  • clean data in Hadoop
  • generate HiveQL, Pig or JAQL to process multi-structured data in Hive or Hadoop HDFS
  • perform automated analysis on data in Hadoop
  • extract data from Hadoop and NoSQL DBMSs

It must also support master data management.


Options include:

  • Custom Hadoop MapReduce batch analytic applications using ‘in-Hadoop’ custom or Mahout analytics
    • Examples would be text, clickstream data, images etc.
    • Likely to be an option for a Data Scientist involved in exploratory analysis, writing their own analytics (in R, for example) or using the pre-built Mahout library of analytics from within their application
  • MapReduce-based BI tools and applications that generate MapReduce applications
    • Pre-built analytic application solutions and new BI tools are available that generate MapReduce applications exploiting the parallelism in Hadoop to analyse multi-structured data such as large corpuses of content or customer interaction data
  • In-database analytics on analytical DBMSs
    • The deployment of custom-built or pre-built analytics within an analytical RDBMS to analyse structured data; an example of complex analytics on structured data
  • Traditional BI tools analysing data in Hadoop Hive and analytical RDBMSs, in addition to data warehouses and cubes
  • Search-based BI tools on Hadoop and analytical RDBMSs
    • New search-based BI tools are emerging to permit free-form analysis of multi-structured and structured data in Hadoop and/or in data warehouse appliances. These tools can crawl structured data in analytical RDBMSs and also use MapReduce to build indexes on data in Hadoop. It then becomes possible to build analytic applications on top of these indexes to support free-form exploratory analysis on multi-structured data and/or structured data. The tools may exploit the Hadoop Lucene search engine indexes or other indexes, which themselves may be stored in Hadoop
  • In-flight analytics of data-in-motion in event data streams


This new environment is also known as the ‘enterprise analytical ecosystem’ or ‘logical data warehouse’.

  • Event processing of data-in-motion can be done on sensor data or any other event data source
  • When variations in event data occur, event processing software analyses the business impact and can take action if required
  • Filtered events can be loaded by information management software into Hadoop for subsequent historical analysis
  • Using batch map/reduce analytical processing, discovered insight can be fed into a data warehouse
  • Un-modelled multi-structured data can be loaded directly into Hadoop using information management software
  • Data scientists can conduct exploratory analysis using custom map/reduce applications, or map/reduce tools that generate HiveQL, Pig or JAQL
  • Alternatively, search-based BI tools can be used to analyse the data using indexes built in Hadoop with map/reduce utilities
    • (E.g. if the multi-structured data is Twitter data, then Twitter handles could be extracted and moved by information management software from Hadoop into a NoSQL graph database for further social network link analysis)

Complex analysis of structured data is undertaken on analytical DBMS appliances using in-database analytics.

Data virtualisation software presents data as if it were all available in a single data store.

(Diagram: the Big Data environment)


One of the major strengths of Information management software is its ability to define data quality and data integration transforms in graphical workflows.

These workflows can be built and re-used regularly for common analytical processing of both structured and un-modelled multistructured data to speed up the rate at which organisations can consume, analyse and act on data.


Key requirements for the new enterprise analytical environment include:

  • Support for multiple analytical data stores, including:
    • Apache Hadoop or a commercial distribution of Hadoop for cost-effective storage and indexing of unstructured data
    • An MPP analytical DBMS offering pre-built in-database analytics and the ability to run custom-built analytics written in various languages (e.g. R) in parallel
    • A data warehouse RDBMS to support business-oriented, repeatable analysis and reporting
    • A graph DBMS
  • Support for stream processing to analyse data in motion
  • Information management tool suite support for loading Hadoop HDFS or Hive, graph DBMSs, analytical RDBMSs, the data warehouse and master data management
  • Ability for the information management tool suite to generate HiveQL, Pig or JAQL to exploit the power of Hadoop processing
  • Integration between stream processing and information management tools to take filtered event data and store it in Hadoop or an analytical RDBMS for further analysis
  • Support for seamless data flows across multiple SQL and NoSQL data stores
  • Data virtualisation to hide the complexity of multiple analytical data stores
  • Query re-direction to run analytical queries on the analytical system best suited to the analysis required, i.e. data warehouse, analytical RDBMS, Hadoop platform, event stream processing engine etc.
  • Ability to develop predictive and statistical models and deploy them in one or more workload-optimised systems as well as in a data warehouse, e.g. in a Hadoop system, an analytical RDBMS and event stream processing workflows for real-time predictive analytics
  • Ability to run in-database and in-Hadoop analytics in information management workflows for automated analysis during data transformation and data movement
  • Integration between information management workflows and a rules engine to support automated decisions during workflow execution
  • Nested workflows to support multi-pass analytical query processing
  • Exploitation of Hadoop parallel clusters during ETL processing
  • Ability to create sandboxes for exploratory analysis across one or more underlying analytical data stores
  • Ability to support data science project management and collaboration among data scientists working in a sandbox
  • Ability to source, load and prepare data for exploratory analysis of huge corpuses of content in an MPP sandbox environment by associating data sources and information management workflows with a sandbox
  • Ability to integrate third-party analytical tools into a specific sandbox
  • Ability to control access to sandboxes
  • Tools to develop MapReduce applications that can be deployed on Hadoop platforms, graph databases or analytical RDBMSs as SQL MR functions
  • Parallel execution of text analytics in Hadoop clusters or in real-time stream processing clusters
  • Ability to build search indexes across all analytical data stores in the new enterprise analytical ecosystem to facilitate search-based analysis of structured and multi-structured data
  • Ability to connect traditional BI tools to Hadoop via Hive
  • End-to-end single-console systems management across the entire analytical ecosystem
  • End-to-end workload management across the entire analytical ecosystem



  • Big data projects need to be aligned to business strategy
  • Identify candidate Big data projects and prioritise them based on business benefit


  • Match the analytical workload with the analytical platform best suited for the job


  • Data Scientists are a new breed of people who need to be recruited
  • Data Scientists are self-motivated, analytically inquisitive people with a strong mathematical background and a thirst for data
  • Traditional ETL developers and business analysts need to broaden their skills to embrace big data platforms as well as data warehouses


  • Governed sandboxes are needed by data scientists wishing to conduct investigative analysis on big data


  • Event stream processing and Hadoop based analytics are often upstream from data warehouses
  • Use big data insights to enrich data in a data warehouse


  • Technologies need to be added to, and integrated with, traditional data warehouse environments to create a new enterprise analytical ecosystem that caters for all analytical workloads

Related articles:

  • Big Data: Most Annoying Buzzword Of The Year
  • The Ecommerce Guide to Big Data [Infographic]
  • Big Data: The ABCs of Analytics
  • Big Data Infographic | How Big is Big Data? | Domo | Blog
  • Why data scientists are in demand and how they enable big data