Big Data is often described by its attributes: volume, velocity, variety, and veracity.  It is described less often by its sources, or by the value of some pre-analytic steps.  That is our focus here.

First, the Sources 

Enterprise Data.  Most organizations are attuned to the data inside their own walls.  They know they have a lot of it, and that it is varied and growing all the time.  Less often do they know precisely what that data contains.  Less often still do they know what to do with it.


Public Data.  More mysterious still is Public Data.  There is so much of it (the internet, for example) and it is not well formed.  Most of the information is written text, so it requires special handling, called Natural Language Processing (NLP), to get anything useful out of it.

Social Media.  Ah, then there is Social Media.  Its value has become a bit clearer to most decision-makers of late, but it is no less problematic than Internet data when it comes to handling text.  And oh, those pesky hashtags and shorthand nomenclature present some unique challenges.

Sensor Data.  The (coming) Internet of Things (IoT) is all about the data that embedded sensors throw off from your refrigerator, thermostat, car, wristwatch and far more.  A good many of these sensors you will never know of; for some you will be asked permission to switch them on; others still will be switched on by default and you will be asked if you want to turn them off.  This last option, by the way, is likely to be presented in a way it is hoped you will miss or ignore.

Transactions.  Here is a real treasure trove of clicks, timing, movement, card swipes, prices, buying, selling, and so much more.  When a company or person makes a transaction, a computer somewhere is likely to be writing it to a log file or database.

Imputed Data.  Often overlooked, this type of data is created rather than collected, based on comparisons or logic applied to the other data categories above.  For example, a purchase transaction at a retail location is recorded with a time-stamp of 6:04 pm.  The imputed data created here might be “night purchase” as opposed to “day purchase”, recorded as a new data input.
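To make the time-stamp example concrete, here is a minimal Python sketch.  The 6:00 am and 6:00 pm cutoffs, the function name and the record layout are all assumptions made for illustration; a real business would choose its own boundaries.

```python
from datetime import datetime, time

# Assumed cutoffs: "day" runs from 06:00 up to (not including) 18:00.
DAY_START = time(6, 0)
DAY_END = time(18, 0)


def impute_purchase_period(timestamp: datetime) -> str:
    """Derive a 'day purchase' / 'night purchase' label from a time-stamp."""
    if DAY_START <= timestamp.time() < DAY_END:
        return "day purchase"
    return "night purchase"


# Hypothetical transaction record, as it might come out of a sales log.
transaction = {
    "store": "retail-001",
    "amount": 42.50,
    "timestamp": datetime(2015, 3, 14, 18, 4),  # 6:04 pm
}

# Record the imputed value as a new field next to the collected ones.
transaction["purchase_period"] = impute_purchase_period(transaction["timestamp"])
print(transaction["purchase_period"])  # night purchase
```

The new field was never collected at the till; it exists only because a rule was applied to data that was.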

Ready, and Mix

Many organizations do not adequately explore the value of mixing data from these various sources.  It becomes a bit of a sub-specialty to hunt down and creatively find connection points where one type of data can be folded into another and still maintain integrity.  


But doing so has important benefits:

Data Completion.  We don’t know of any organization that is satisfied with the completeness of any single data source.  There are always holes.  Many times, another source of data can fill in the gaps and allow previously planned work to proceed.  A corollary to this is finding analogous data.  When the data you want doesn’t exist, and there isn’t the right data from which to impute what you need, you can sometimes find analogous data that fills the gap and is analytically valid.  This often comes from other sources.
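As a rough illustration of gap filling, the sketch below completes one record from a second source.  The field names, the two sources and the rule that the primary source wins whenever it has a value are assumptions, not a prescription.

```python
def fill_gaps(primary: dict, secondary: dict) -> dict:
    """Return a copy of `primary` with missing (None or absent) fields
    filled in from `secondary`. Existing primary values are never overwritten."""
    completed = dict(primary)
    for field, value in secondary.items():
        if completed.get(field) is None:
            completed[field] = value
    return completed


# Hypothetical records for the same customer held in two different systems.
crm_record = {"customer_id": 17, "name": "Mary", "postcode": None,
              "email": "mary@example.com"}
loyalty_record = {"customer_id": 17, "postcode": "90210",
                  "email": "old@example.com"}

completed = fill_gaps(crm_record, loyalty_record)
print(completed["postcode"])  # 90210
print(completed["email"])     # mary@example.com
```

Keeping the primary value when both sources disagree is one simple way to protect integrity; which source deserves that trust is a judgment to make per field.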

Enriching Models.  From an analytical point of view, mixing data types makes models of markets, companies or consumers “holistic”.  If transaction data tells me about a consumer’s purchase, social media tells me about them personally.  The two data types enrich each other for a more complete picture of the individual.
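A small sketch of that enrichment, assuming the two sources already share a connection point (here a hypothetical customer_id).  Finding that shared key is usually the hard part; the data and field names below are invented for illustration.

```python
# Hypothetical transaction data.
transactions = [
    {"customer_id": 1, "item": "running shoes", "amount": 89.00},
    {"customer_id": 2, "item": "espresso machine", "amount": 249.00},
]

# Hypothetical social media profiles, keyed by the same customer_id.
social_profiles = {
    1: {"interests": ["marathons", "trail running"]},
}


def enrich(transactions: list, social_profiles: dict) -> list:
    """Attach each customer's social interests to their transactions.
    Customers with no known profile get an empty interest list."""
    enriched = []
    for tx in transactions:
        profile = social_profiles.get(tx["customer_id"], {})
        enriched.append({**tx, "interests": profile.get("interests", [])})
    return enriched


for row in enrich(transactions, social_profiles):
    print(row["item"], row["interests"])
```

The first row now says something about the person as well as the purchase; the second shows that a missing profile leaves the transaction intact rather than dropping it.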

Further Imputing.  Once mixed together, two or more data types may reveal new data that can be imputed.  When John knows Mary, and Bob knows Mary, then it is likely John knows Bob.  Though you may not have direct data evidence of it, this is a reasonable and valuable conclusion.
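The John, Mary and Bob reasoning can be sketched as a simple shared-acquaintance rule.  The function name and the input format are assumptions, and the links it produces are likely candidates rather than established facts.

```python
from itertools import combinations


def impute_acquaintances(known_pairs):
    """Suggest pairs of people who share an acquaintance
    but have no directly recorded link."""
    known = {frozenset(pair) for pair in known_pairs}

    # Build each person's set of recorded acquaintances.
    contacts = {}
    for a, b in known_pairs:
        contacts.setdefault(a, set()).add(b)
        contacts.setdefault(b, set()).add(a)

    # Any two people who both know the same person become a candidate pair.
    imputed = set()
    for people in contacts.values():
        for x, y in combinations(sorted(people), 2):
            pair = frozenset((x, y))
            if pair not in known:
                imputed.add(pair)
    return imputed


known_pairs = [("John", "Mary"), ("Bob", "Mary")]
imputed = impute_acquaintances(known_pairs)
print(sorted(tuple(sorted(pair)) for pair in imputed))  # [('Bob', 'John')]
```

In practice you would probably want more than one shared acquaintance, or some supporting signal from another data type, before recording such a link as new data.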

As in any good recipe, the ingredients together enhance the whole.  Do the same with a variety of data: exploring, mixing, imputing, mixing some more, all with pleasure and for profit.
