The Great Mixing Bowl of Big Data

Big Data is often described by its attributes: volume, velocity, and veracity. It is less often described by its sources or by the value of some pre-analytic steps. That is our focus here.

First, the Sources 

Enterprise Data.  Most organizations are attuned to the data inside their organizations.  They know they have a lot of it; it’s varied and growing all the time.  Less often do they know this enterprise data with any exactitude.  Less often still do they know what to do with it.

mixed data

Public Data.  More mysterious still is Public Data.  There is so much of it (e.g. in the form of the internet) and it is not well formed.  The information there is mostly written text, and so it requires special handling, called Natural Language Processing (NLP), to get anything useful out of it.

Social Media.  Ah, then there is Social Media.  Its value has become a bit clearer, of late, to most decision-makers, but it is no less problematic than Internet data in terms of handling text.  And oh, those pesky hashtags and shorthand nomenclature present some unique challenges.

Sensor Data.  The (coming) Internet of Things (IoT) is all about the data that embedded sensors are throwing off from your refrigerator, thermostat, car, wristwatch and far more.  A good many of these sensors you will never know of, some you will be asked permission to switch on, and others still will be switched on by default and you will be asked if you want to turn them off.  This last option, by the way, is likely to be done in a way it is hoped you will miss or ignore.

Transactions.  Here is a real treasure trove of clicks, timing, movement, card swipes, prices, buying, selling, and so much more.  When a company or person makes a transaction, a computer somewhere is likely to be writing it to a log file or database.

Imputed Data.  Often overlooked, this type of data is created rather than collected, based on comparisons or logic applied to the other data categories above.  For example, a purchase transaction at a retail location is recorded with a time-stamp of 6:04 pm.  The imputed data created here might be “night purchase” as opposed to “day purchase”, recorded as a new data input.
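The time-stamp example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 6:00 am and 6:00 pm cutoffs, the field names, and the function name `impute_day_part` are all hypothetical, and any real rule would depend on the business question being asked.

```python
from datetime import datetime, time

# Assumed cutoffs (not from the article): "day" runs from 06:00
# up to, but not including, 18:00. Everything else counts as "night".
DAY_START = time(6, 0)
DAY_END = time(18, 0)


def impute_day_part(timestamp: datetime) -> str:
    """Create a new label from the time of a recorded transaction."""
    if DAY_START <= timestamp.time() < DAY_END:
        return "day purchase"
    return "night purchase"


# A hypothetical transaction record with a 6:04 pm time-stamp.
purchase = {"store": "retail-001", "timestamp": datetime(2015, 3, 2, 18, 4)}

# The imputed value is stored alongside the collected data as a new input.
purchase["day_part"] = impute_day_part(purchase["timestamp"])
print(purchase["day_part"])
```

The point of the sketch is that the new field was never collected by anyone; it exists only because a rule was applied to data that was.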

Ready, and Mix

Many organizations do not adequately explore the value of mixing data from these various sources.  It becomes a bit of a sub-specialty to hunt down and creatively find connection points where one type of data can be folded into another and still maintain integrity.  


But doing so has important benefits:

Data Completion.  We don’t know of any organizations that are satisfied with the completeness of any single data source.  There are always holes.  Many times, it is another source of data that can fill in the gaps and allow previously planned work to proceed.  A corollary to this is finding analogous data.  When the data you want doesn’t exist and there isn’t the right data to impute what you need, you can sometimes find analogous data that fills the gap and is analytically valid.  This often comes from other sources.
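As a rough sketch of data completion, the Python below fills holes in one source from a second source that shares a connection point. The record layout, the `customer_id` key, and the function name `complete_records` are assumptions made up for illustration; real sources rarely line up this cleanly.

```python
def complete_records(primary, secondary, key):
    """Fill missing (None) fields in primary using secondary rows with the same key."""
    lookup = {row[key]: row for row in secondary}
    completed = []
    for row in primary:
        filled = dict(row)  # copy so the original source is left untouched
        donor = lookup.get(row[key], {})
        for field, value in row.items():
            # Only fill true holes; never overwrite data we already have.
            if value is None and donor.get(field) is not None:
                filled[field] = donor[field]
        completed.append(filled)
    return completed


# Hypothetical enterprise records with gaps.
enterprise = [
    {"customer_id": 1, "zip": "30301", "email": None},
    {"customer_id": 2, "zip": None, "email": "b@example.com"},
]

# Hypothetical transaction-derived records that happen to hold the missing values.
transactions = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "zip": "60601"},
]

completed = complete_records(enterprise, transactions, "customer_id")
for record in completed:
    print(record)
```

Filling only the holes, rather than overwriting, is one way to fold a second source in while keeping the integrity of the first.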

Enriching Models.  From an analytical point of view, mixing data types makes models of markets, companies or consumers “holistic”.  If transaction data tells me about a consumer’s purchase, social media tells me about them personally.  The two data types enrich each other for a more complete picture of the individual.

Further Imputing.  Once mixed together, two or more data types may reveal new data that can be imputed.  When John knows Mary, and Bob knows Mary, then it is likely John knows Bob.  Though you may not have direct data evidence of it, this is a reasonable and valuable conclusion.
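The John, Mary and Bob example can be sketched as a search for pairs of people who share an acquaintance but have no recorded link of their own. This is a minimal illustration, assuming relationships arrive as simple name pairs; the function name `impute_links` is hypothetical, and a shared acquaintance makes a link likely, not certain.

```python
from collections import defaultdict
from itertools import combinations


def impute_links(known_pairs):
    """Suggest unobserved links between people who share an acquaintance.

    Returns a dict mapping each suggested pair to the shared acquaintances
    that support the suggestion.
    """
    neighbors = defaultdict(set)
    for a, b in known_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    observed = {frozenset(pair) for pair in known_pairs}
    candidates = defaultdict(set)
    for person, friends in neighbors.items():
        # Any two acquaintances of the same person are a candidate pair.
        for a, b in combinations(sorted(friends), 2):
            pair = frozenset((a, b))
            if pair not in observed:
                candidates[pair].add(person)

    return {tuple(sorted(pair)): sorted(shared) for pair, shared in candidates.items()}


known = [("John", "Mary"), ("Bob", "Mary")]
imputed = impute_links(known)
print(imputed)
```

Keeping the supporting acquaintances with each suggested link leaves a trail of evidence, so the imputed data can be weighed rather than taken on faith.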

Like any good recipe, all ingredients together enhance the whole.  Do the same with a variety of data, exploring, mixing, imputing, mixing some more – all with pleasure and for profit. 


Big Data is Like Planet Hunting

20 years ago no one in astronomy thought it was possible to detect planets outside our solar system, or even knew whether they existed (mostly because as a scientist you cannot conclude something you cannot observe).  But along came better, more powerful telescopes of a wide variety.  Then smart scientists coupled novel analytical techniques to the increased amount of data collected by the telescopes.  Namely, they looked for the faint gravitational “wobble” a planet puts on the light we observe from its distant star, and for the slight dimming of that light as the planet passes in front of its sun.  Brilliant.

planet hunting

To be sure, we are not observing the planets themselves, even though they were there all the time.  The only thing we are actually observing is the data exhaust of a planet-sun system, from which we infer that the planet exists.

Big data is like that.  We improved data collectors in our more immediate world.  These come in the form of log files, cell phone location data, click streams, and embedded machine sensors everywhere you look.  We also made a good deal of that data interoperable with APIs (Application Programming Interfaces) so we could collect and meld ever-larger sets of data.  To this we added ever more sophisticated machine learning techniques to tease more understanding out of the collected data.  Just like the planet hunters.

Daring an oversimplification, let’s see the difference between the old way of modeling and predicting systems and the new big data way, which again is not unlike the planet hunters of 20 years ago.  Predictive analytics of old relied on collecting as many observable inputs and outputs of a system as possible in order to model it.  The trouble was always the lack of data, because no one bothered to collect it or it was proprietary and so protected from prying eyes.  It worked, but it wasn’t always pretty and it was limiting.  Just like the planet hunters who lacked data or were frustrated by being earth-bound and having observations obscured by the atmosphere.

old data analysis

But along the way someone noticed the increasing amount of observable “exhaust” from systems and wondered if there was enough signal in that data to conclude something useful.  Sure enough, there was.  And a corner was turned.

big data analysis

Exhaust from a system generally can’t be hidden.  It is free for all to see and to use to understand the inner workings of a system.  And once the system is understood, better policies, stronger competition, and faster innovation are the result.  This is the benefit of Big Data Analysis writ large.  Likewise, someday soon, the planet hunters will devise a way to detect one of those planets they found blinking a signal that hopefully says, “Welcome! Let’s talk!”