Here is why Spark is taking over

Here is a nice post about the advantages of Apache Spark over more traditional Hadoop.  They are numerous and at Big Data Lens we are adopting this important technology into our work. 

While you are at it, be sure to check out the folks at Databricks.  The founders helped invent Spark, so they know what they are doing in simplifying and managing spin up / down of Spark clusters and then tacking on a nice GUI for developing your scripts and machine learning models.  Do yourself a favor and invest 20 minutes of your time watching this video to see how it all ties together.



Stop Hiring Data Scientists If You Are Not Ready to Do Data Science

Here is an excellent post about testing your readiness, or more appropriately your willingness, to do true data science.  Our two cents: make sure "the other half" of data science includes a clear understanding of predictive analytics, machine learning, and for good measure, probability.  Otherwise your data science will end up being a dressed-up business intelligence (BI) effort (e.g. reporting historical facts) that scales better and has a fancy new name.



The Intelligence Cycle in a Big Data Era

The intelligence cycle is a formal, step-wise process employed since roughly World War II by government agencies to give high-level decision makers foreknowledge of events so they can make advantageous decisions.  Later, an adaptation of the same cycle was employed by businesses to understand their competitors in legal and ethical ways.  But in an era of Big Data this cycle is changing.  Here is how.

The traditional intelligence cycle uses 5 steps as shown in the diagram below.  Two key areas are easy to overlook. 

The intelligence cycle

First is the need.  Resources are not engaged or deployed until and unless they are necessary.  They are expensive, prone to risk and if lost or discovered cause damage.  Likewise in the business equivalent intelligence cycle, the need drives a decision.  When money, time and/or strategic advantage are at stake companies use the intelligence cycle.

Planning & Direction and Collection are also critical steps.  These are about breaking the problem down into analytical and collection sub-components.  Doing so focuses the problem as well as sets it up for analytical success. 

Intelligence is about a preponderance of evidence.  When the problem set is logically and thoroughly broken down, analysts have a better chance of seeing such a preponderance.  Analysts can also stress test what they believe by removing some evidence and checking whether they reach the same conclusion.
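That stress test can be sketched in a few lines.  The toy rule below (a hypothetical threshold of three corroborating items, not a real analytic standard) concludes only on a preponderance of evidence, then checks whether the conclusion survives removing any single item:

```python
# Toy "stress test": conclude only on a preponderance of evidence,
# then check the conclusion survives removing any single item.
def conclusion_holds(evidence, threshold=3):
    """Hypothetical rule: at least `threshold` corroborating items."""
    return len([e for e in evidence if e]) >= threshold

evidence = [True, True, True, True]  # four corroborating items

# Leave-one-out check: does the conclusion hold without each item?
robust = all(
    conclusion_holds(evidence[:i] + evidence[i + 1:])
    for i in range(len(evidence))
)
print(robust)  # True: the conclusion survives removing any one item
```

If removing any single piece of evidence flips the conclusion, the analyst knows it rests on too narrow a base.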

Big Data practice can both learn from this cycle and be bound by it.  The learning comes from the clarity and proof of method the intelligence cycle provides as a test over time.  Big Data efforts often get bogged down by a lack of a priori goals, over-collection of information, and over-fitting of data models.  But Big Data efforts are improved when boundaries to the problem are set, reasonable choices about data and analytical techniques are tied to goals, and an unwavering focus on decisions is maintained.

On the other hand, some of the promise of Big Data lies in the discovery of new things not previously seen or imagined.  It is helpful to chart this out.  There is a body of knowledge about the world, yet there is still much to learn about how the world works.  Our personal or corporate body of knowledge is a subset of the world's.

knowledge map

When our knowledge matches that of the world we have certainty.  But there are two classes of uncertainty.  One is when we know we don’t know something.  This becomes a call for effort to acquire the knowledge.  The other is when we don’t know that we already know something.  This is correctable through discovery.

The last box in the diagram represents pure uncertainty.  In the famous words of Donald Rumsfeld these are the things we don’t know that we don’t know. 

Much information technology in the last 30 years has been devoted to correcting uncertainty through effort: acquiring knowledge we knew we did not have but that existed.  Big Data, through machine learning and predictive analytics, can be thought of as addressing the discovery type of correctable uncertainty.  We need the volume and speed of information processing only available to us with computer processing power to enact this discovery.

The larger picture still is to use both effort and discovery to squeeze the pure uncertainty box ever smaller.  As we do so the certainty box grows larger as all other boxes shrink. 

The intelligence cycle in government or business is impacted by Big Data in the same way.  Uncertainty in any form can become an intelligence need in and of itself, where the volume of data and the analytical horsepower may have previously precluded it.  Likewise, Big Data predictive analytics now opens new decision-making avenues not previously considered or seen.  Finally, good intelligence practice always included after-action evaluation and feedback.  Now, the practice of Big Data provides its own machine-learning-based feedback and correction, delivering insights and subtleties previously missed.

Whether for national security or for optimal corporate resource allocation, Big Data is impacting intelligence creation for the better.  As the 2010 report to Congress and the President succinctly stated:

“Data volumes are growing exponentially.  There are many reasons for this growth, including the creation of all data in digital form, a proliferation of sensors and new data sources such as high-resolution imagery and video.  The collection, management and analysis of data is a fast growing concern.  Automated analysis techniques such as data mining and machine learning facilitate the transformation of data into knowledge and knowledge into action.  Every federal agency [and business] needs to have a ‘big data’ strategy.” 



44 Influential Big Data Articles

It seems Data Science Central (DSC), the record of data science best practice, also picked our article on using Natural Language Processing to find connections in Big Data as one of the 44 most influential articles.  Nice to get correct attribution this time.

Moreover - look through the rest of these articles.  Plenty of good ideas, links to sample code, books, and more.  And if you are not signed up with DSC, do so for the daily / weekly digest delivered right to your email inbox.



21 Great Big Data Graphs

We made the list on Data Science Central's "21 Great Big Data Graphs" for communication power (being able to quickly tell a powerful message with a simple visual) rather than for artistic qualities.  Ours is the patent analysis pic - though not properly attributed to us - Big Data Lens does this all the time.

Simple two-dimensional sorting and grouping can be a powerful way to see connections, similarities, and differences.  In the example given (the data labels are not real, to protect the client) the effort looks at where patent claims are the same between two companies (where the same items cluster) versus where they are different (where the items do not cluster).
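A minimal sketch of that two-dimensional grouping idea, using made-up claim topics for two hypothetical companies (not the real client data behind the graph):

```python
from collections import Counter

# Toy patent-claim topics for two hypothetical companies.
company_a = ["battery", "battery", "antenna", "display", "battery"]
company_b = ["battery", "antenna", "antenna", "camera"]

counts_a = Counter(company_a)
counts_b = Counter(company_b)

# Simple two-dimensional grouping: topic -> (A count, B count).
topics = sorted(set(counts_a) | set(counts_b))
table = {t: (counts_a.get(t, 0), counts_b.get(t, 0)) for t in topics}

# Topics where both companies cluster (overlap) vs. where they differ.
overlap = [t for t, (a, b) in table.items() if a and b]
unique_a = [t for t, (a, b) in table.items() if a and not b]
unique_b = [t for t, (a, b) in table.items() if b and not a]

print(overlap)   # shared claim areas
print(unique_a)  # areas only company A claims
print(unique_b)  # areas only company B claims
```

Plotting `table` as a grid is exactly the kind of simple visual the article describes: the clustered topics jump out before any modeling begins.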

This kind of visualization gives insight even before any machine learning begins.  In fact it can help set the stage for the type of analysis to conduct, the variables to include or exclude, generate ideas for additional data collection, and so on.  In this sense the analytical work becomes much more focused and takes less time.

Visualizations are about telling the story.  But they are also about good data science - before, in the middle and at the end of the process. 



Use Predictive Analytics to influence the Zero Moment of Truth (ZMOT)

Now there is a mouthful.  The "Zero Moment of Truth", or ZMOT.  It all started with consumer products giant Procter & Gamble, who coined the term "First Moment of Truth" based on research suggesting shoppers make up their minds on a purchase within 3-7 seconds of looking at brands on a store shelf.  Not a lot of time to make an impression.  So you'd better make it right.

Not to be outdone, the big brainy guys at Google went one to the left in digital terms to describe what happens even before you get to the store.  That is the ZMOT.  Turns out this is important stuff.  Once an impression is formed, consumers stick with it.  It is harder to get them to change their mind than to get them to make it up in the first place.  So Zero is higher than First.

Back to the geeky stuff, and you discover the importance of good Big Data and Predictive Analytics practice.  Without it there is no ZMOT.  Or more succinctly, no accurate impression based on the consumer's needs, desires, and wants.  Knowing as much as you can about the customer before choosing the message to put in front of them has power, and power that lasts.

If you make consumer products of any type, your brand, its image, and what it portrays are your lifeblood.  Know how to make sure each opportunity to influence the shopper is used wisely.



How to Spot Trends and the Value of Big Data

As this article suggests, even the folks who create lots of content need help "seeing" the big picture.  Kudos to the New York Times for making the investment, in a tough publishing environment, to enhance their products and services for the sake of their readers.

It is also a fine example of what Big Data boils down to.  Get past all the techno-speak and ask what big data does, and how to make it specific for your staff.  Both of these are ultimately the measure of the value of Big Data and Predictive Analytics.

To spot some trends of your own, be sure to try the Big Data Lens demo here -->




Using NLP to discover connections in Big Data

We've been eating our own dog food lately.  By that we mean using our demonstration tool located here --> where you can search over a broad (but limited) set of internet-based news and have the search results further processed to discover connections.  So we've been using this to monitor developments in the Ebola outbreak in West Africa.  Below are just a couple of the ways Natural Language Processing (NLP) of internet content helps find ties inside and between documents.

While looking at these, think about the connections you might discover for your area of interest.  Think about the connections you might discover between internal AND external documents.  Think about the connections you might discover between English-language documents and documents in another language.  And so on.

Big Data is about getting up, across and around massive amounts of information, to see and know the patterns that make our world work where these things may have been hidden before.  We think our statement about the power of Big Data and Predictive Analytics says it best "Big Data Predictive Analytics Algorithms tell you what consumers will do, where markets will be or how technology will develop. The outcome boosts sales, improves operations, reduces risk and forms better plans."  Use it to your advantage.

Here are the ties between all 7936 documents processed in < 9 seconds.

Holding your cursor over one entity clears the graph to show what else it is tied to.  Here the connection is between the CDC (Centers for Disease Control) in Atlanta and African Countries.  This makes sense given the search for Ebola that started the process.

Here we can see connections between how to combat the disease with ethical treatment (meaning professional medical and scientifically proven treatment), the administration of drugs, and the President of Liberia (Ellen Johnson-Sirleaf).  The interpretation is that the highest government official in at least one Ebola-affected country is promoting the correct health treatment to combat the outbreak.

And here is a surprise.  What is the connection between Morocco and the Seychelles - other than they are both African countries, though many thousands of miles apart and not yet affected by Ebola?  Digging deeper you discover they have banned travel for their sports teams because of the Ebola outbreak.  So while the outbreak is tragic in the lives it destroys there are also economic, cultural and entertainment costs not typically accounted for.  A connection worth exploring further.



The Role of NLP in Big Data

We were asked by the head of the LinkedIn group named Predictive Analytics, Big Data, Business Analytics, Data Science in Oil and Gas and Energy for an interview on how Natural Language Processing (NLP) plays a role in Big Data.  The questions were great, so we thought to reproduce the interview transcript here.  The interviewer is Francisco Sanchez, Data Scientist at Swift Energy.

In this month's question, we would like to focus our attention on Natural Language Processing (NLP) and its uses in the Energy industry. The energy industry is inundated with land contracts, PSCs, legal files, etc. - mainly unstructured data. Human language and the relationships it has with other variables is a very tough nut to crack. Here to shed light on this is Mr. J. Brooke Aker, Chief Data Scientist at Big Data Lens.

Q: For those not too familiar with NLP, could you describe what it is and its applications? Is it what Watson uses?

Sure thing.  NLP is the science of getting a machine to read and understand words in context. For example, take the sentence “My jaguar eats gas when I step on the gas”.  Lots of keyword-based technology (including Google) cannot distinguish the first gas (as in gasoline) from the second gas (as in gas pedal).  It may also confuse jaguar as a cat rather than a car.  So NLP adds a layer of processing to assign meaning to words rather than treat them as simple tokens (e.g. strings of letters).
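A toy illustration of the point.  A real NLP engine uses part-of-speech tagging and parsing; this rule-based sketch (the context words are invented rules, not a real lexicon) only shows that nearby words, not the token itself, carry the meaning:

```python
# Toy word-sense disambiguation: a keyword match treats both "gas"
# tokens identically; looking at the preceding words recovers the sense.
def sense_of_gas(tokens, i):
    """Guess the sense of tokens[i] == 'gas' from the words before it."""
    before = tokens[max(0, i - 3): i]
    if "step" in before or "pedal" in before:
        return "gas-pedal"   # "...step on the gas"
    if "eats" in before or "burns" in before:
        return "gasoline"    # "...jaguar eats gas"
    return "unknown"

sentence = "my jaguar eats gas when I step on the gas".split()
senses = [sense_of_gas(sentence, i)
          for i, tok in enumerate(sentence) if tok == "gas"]
print(senses)  # ['gasoline', 'gas-pedal']
```

A plain keyword index would return both occurrences of "gas" as the same thing; even this crude context window tells them apart.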

NLP also adds a layer of logic between words so you can understand actions, directions and connections between objects, as well as a categorization scheme. 

After 30-some years, practitioners agree the best NLP approach mimics the way we were taught to read and understand as children.  That is what we have done at Big Data Lens.

And while I don’t know exactly how Watson works, the general approach is likely the same.  Note, however, that Watson works because the domain it operates in is constrained, which will always boost your chances of getting good resolution over unstructured data.

Q: Is it a possible that Earnings Release webcasts and scripts, Google Finance, company's own website, or their SEC filings and the way they word these items could be affecting their stock price?

Oh yes.  In the early days of sentiment analysis (a sub-field of NLP), even with only crude positive or negative detection algorithms, you could easily show that such news had about a 24-hour lag before impacting stock prices.  The lag was the time it took for good or bad news to filter through to all those who had their fingers on their stock trade buttons.

Today, with social media, I am sure the lag time is shorter.  But along the way we also discovered simple positive / negative algorithms were not very good.  Suppose a tweet said “The features on that lousy BP fracking drill are actually really good”.  So what is bad and what is good?  Even for some human readers it’s tough to untangle.  Sophisticated NLP will be able to know it’s BP that is lousy and the drill features that are good.

Q: The Energy industry is engulfed with legal papers, contracts, written logs, etc. How can NLP put these together and find insight?

When you have good NLP technology you can begin to connect logic between sentences, then paragraphs, whole documents and finally between documents. 

Then one of the most important innovations is the ability to conduct inference across all documents.  This is typically done with a graph database, a different kind of database that records relationships and not just items and attributes.  With it you can infer things.  For example, if John knows Mary and Tom knows Mary, we can infer John may know Tom as well.  The more instances of each, the stronger the inference.
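The inference itself can be sketched without a graph database at all.  This plain-Python toy records "knows" relationships, then infers links between people who share acquaintances, counting shared acquaintances as evidence strength (a production system would use a graph database such as Neo4j; "Sue" is a made-up addition to the article's example):

```python
from itertools import combinations

# Known "knows" relationships (undirected).
knows = [("John", "Mary"), ("Tom", "Mary"), ("Tom", "Sue")]

# Index: person -> set of people they know.
neighbors = {}
for a, b in knows:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Infer links: any two people with a common neighbor, weighted by
# the number of shared acquaintances (more shared = stronger inference).
inferred = {}
for a, b in combinations(sorted(neighbors), 2):
    shared = (neighbors[a] & neighbors[b]) - {a, b}
    if shared and b not in neighbors[a]:
        inferred[(a, b)] = len(shared)

print(inferred)  # {('John', 'Tom'): 1, ('Mary', 'Sue'): 1}
```

The weight is the evidence count the interview describes: the more shared connections two entities have, the stronger the inferred tie.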

Q: Could it be possible to keep tabs on what your competitors are saying or posting in social media such as Facebook, Twitter, or other web postings, and predict items such as divesting or acquiring companies, lower/higher earnings, or anything else that could be useful? Or is this cynical?

Not at all.  This is a very good use of NLP but also requires some predictive modeling. For example, you might look backwards in time and see that Exxon-Mobil nearly always forms a partnership with a company, then later upgrades that to a joint venture, and finally markets and sells a product together prior to acquiring the company. 

This pattern of behavior relies on NLP to find enough evidence across the web; machine learning is then used to model it.  This becomes a predictive analytics algorithm you simply feed a constant stream of news and social media to, and then graphically show the probability of a new M&A event rising as the pattern-of-behavior evidence piles up.
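A highly simplified sketch of that idea: event types extracted by NLP feed a score that rises as the acquisition pattern accumulates.  The event names and weights here are illustrative stand-ins for a trained model, not real parameters:

```python
# Hypothetical weights for stages of the acquisition pattern
# (partnership -> joint venture -> joint marketing -> acquisition).
PATTERN_WEIGHTS = {
    "partnership": 0.2,
    "joint_venture": 0.3,
    "joint_marketing": 0.4,
}

def acquisition_score(events):
    """Probability-like score in [0, 1] from observed events."""
    score = sum(PATTERN_WEIGHTS.get(e, 0.0) for e in events)
    return min(score, 1.0)

# A stream of NLP-extracted events arriving over time.
stream = ["partnership", "joint_venture", "joint_marketing"]
scores = [acquisition_score(stream[: i + 1]) for i in range(len(stream))]
print(scores)  # rises as evidence accumulates
```

A real model would learn the weights from historical M&A outcomes; the point is only that the graphed probability climbs as each stage of the pattern is observed.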

Q: What do you think the future will hold for NLP? 

We are working all the time to advance the state of NLP, machine learning, modeling, and predictive analytics.  Using this combination of technologies you can take many years of internal company information and turn it into predictive algorithms.

For example, input the many years of drilling reports from the field, almost all of which are written longhand.  They represent the collective knowledge of field experience.  Using the above tools you could turn this into a real-time guidance system, as pressure, temperature, viscosity, and salinity readings indicate all is clear or trouble lies ahead the deeper you drill.  It is like having hundreds of long-retired drill chiefs looking over your shoulder, providing a lifetime of experience so disaster is avoided.

There is also the “Internet of Things” (IoT) movement, which is all about a huge increase in the number of small sensors located on pipelines, drills, engines, refineries, etc., reporting a constant stream of data.  Using Big Data techniques on this data is also about predicting and heading off trouble or improving operations.  There is less NLP required in this scenario but plenty of machine learning and prediction.

Q: Could you provide some examples of what your company has done in the Energy industry?

We have applied the drilling report example I described above.  We have also used similar techniques to qualify vendors and verify they are not headed for some kind of internal legal or financial trouble that would cause risk downstream for our customer.  The same goes for monitoring and predicting the outcome of regulations at the state and federal level, or even using these techniques to compare and contrast vast numbers of patents between yourself and a competitor to manage your IP portfolio.

And so on.  Once you get the basic recipe down and have good technology you can apply this to all kinds of business issues with good success.  

Q: Anything else you would like to add?

Only that an aspect of this kind of work you don’t always see talked about is creativity.  Often companies we work for have a desire to execute on some predictive work as described above.  They hand us data to work with, and as often as not we find the data is thin and not sufficient to prove anything.  So we have to put on our creativity hat to find public data, or proxy data, we can use to supplement the sample data.  We have gotten very good at this, and it is a different kind of skill than the NLP or statistical manipulation inherent in machine learning.  Be sure to ask any vendor you might use if they are good and creative data sleuths.



The Great Mixing Bowl of Big Data

Big Data is often described by its attributes – volume, velocity, & veracity.  But less often is it described by its source or by the value of some pre-analytic steps.  That is our focus here. 

First, the Sources 

Enterprise Data.  Most organizations are attuned to the data inside their organizations.  They know they have a lot of it; it’s varied and growing all the time.  Less often do they know this enterprise data with any exactitude.  Less often still do they know what to do with it.

mixed data

Public Data.  More mysterious still is Public Data.  There is so much of it (e.g. in the form of the internet) and it is not well formed.  The information there is mostly written text and so requires special handling, called Natural Language Processing (NLP), to get anything useful out of it.

Social Media.  Ah, then there is Social Media.  A bit clearer, of late, to most decision-makers in terms of its value, but no less problematic than internet data in terms of handling text.  And oh, those pesky hashtags and shorthand nomenclature present some unique challenges.

Sensor Data.  The (coming) Internet of Things (IoT) is all about the data embedded sensors throw off from your refrigerator, thermostat, car, wristwatch, and far more.  A good deal of these sensors you will never know of; some you will be asked permission to switch on; others still will be switched on by default, and you will be asked if you want to turn them off.  This last option, by the way, is likely to be done in a way it is hoped you will miss or ignore.

Transactions.  Here is a real treasure trove of clicks, timing, movement, card swipes, prices, buying, selling, and so much more.  When a company or person makes a transaction a computer somewhere is likely to be writing it to a log file or database.

Imputed Data.  Often overlooked, this type of data is created rather than collected, based on comparisons or logic applied to the other data categories above.  For example, a purchase transaction at a retail location is recorded with a time-stamp of 6:04 pm.  The imputed data created here might be “night purchase” as opposed to “day purchase”, recorded as a new data input.
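The time-stamp example can be sketched directly; the 6 am / 6 pm cutoff below is an illustrative business rule, not a standard:

```python
from datetime import datetime

def impute_day_part(timestamp):
    """Impute a 'day purchase' / 'night purchase' label from an
    ISO-format transaction timestamp (illustrative 6am-6pm rule)."""
    hour = datetime.fromisoformat(timestamp).hour
    return "day purchase" if 6 <= hour < 18 else "night purchase"

print(impute_day_part("2014-08-01T18:04:00"))  # night purchase
print(impute_day_part("2014-08-01T10:30:00"))  # day purchase
```

The new label becomes an input in its own right, available to any downstream model alongside the collected fields.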

Ready, and Mix

Many organizations do not adequately explore the value of mixing data from these various sources.  It becomes a bit of a sub-specialty to hunt down and creatively find connection points where one type of data can be folded into another and still maintain integrity.  


But doing so has important benefits:

Data Completion.  We don’t know of any organization that is satisfied with the completeness of a single data source.  There are always holes.  Many times it is another source of data that can fill in the gaps and allow previously planned work to proceed.  A corollary to this is finding analogous data.  When the data you want doesn’t exist, and there isn’t the right data to impute what you need, you can sometimes find analogous data that fills the gap and is analytically valid.  This often comes from other sources.

Enriching Models.  From an analytical point of view, mixing data types makes models of markets, companies, or consumers “holistic”.  If transaction data tells me about a consumer’s purchase, social media tells me about them personally.  Each data type enriches the other for a more complete picture of the individual.

Further Imputing.  Once mixed together, two or more data types may reveal new data that can be imputed.  When John knows Mary, and Bob knows Mary, then it is likely John knows Bob.  Though you may not have direct data evidence of it, this is a reasonable and valuable conclusion.

Like any good recipe, all ingredients together enhance the whole.  Do the same with a variety of data, exploring, mixing, imputing, mixing some more – all with pleasure and for profit. 



Big Data is Like Planet Hunting

Twenty years ago no one in astronomy thought detection of other planets was possible, or knew whether they even existed outside our solar system (mostly because as a scientist you cannot conclude something you cannot observe).  But along came better, more powerful telescopes of a wide variety.  Then smart scientists coupled novel analytical techniques to the increased amount of data collected by the telescopes.  Namely, they looked for the faint gravitational “wobble” a planet would put on the light we can observe from distant stars as the planet passes in front of its sun.  Brilliant.

planet hunting

To be sure, we are not observing the planet itself, even though it was there all the time.  The only thing we are actually observing is the data exhaust of a planet-sun interaction system, from which we infer the planet exists.

Big data is like that.  We improved data collectors in our more immediate world.  These come in the form of log files, cell phone location data, click streams, and embedded machine sensors everywhere you look.  We also made a good deal of that data interoperable with APIs (Application Programming Interfaces) so we could collect and meld ever-larger sets of data.  To this is added ever more sophisticated machine learning techniques to tease more understanding out of the collected data.  Just like the planet hunters.

Daring an over-simplification, let’s see the difference between the old way of modeling and predicting systems and the new big data way, which again is not unlike the planet hunters of 20 years ago.  Predictive analytics of old relied on collecting as many observable inputs and outputs to a system as possible in order to model it.  The trouble was always the lack of data, because no one bothered to collect it or it was proprietary and so protected from prying eyes.  It worked, but it wasn’t always pretty, and it was limiting.  Just like the planet hunters who lacked data or were frustrated by being earth-bound, with observations obscured by the atmosphere.

old data analysis

But along the way someone noticed the increasing amount of observable “exhaust” from systems and wondered if there was enough efficacy in the data to conclude something useful.  Sure enough there was.  And a corner was turned.

big data analysis

Exhaust from a system generally can’t be hidden.  It is free for all to see and use to understand the inner workings of a system.  And once a system is understood, better policies, stronger competition, and faster innovation result.  This is the benefit of Big Data analysis writ large.  Likewise, someday soon, the planet hunters will detect one of those planets they found blinking a signal that hopefully says, “Welcome - let’s talk!”



Old Hat - Traditional Relational Databases and Hadoop/MapReduce

I attended the Federal Big Data Working Group Meetup last Monday night.  There were three speakers from MIT’s Big Data / Database group.  Profs. Sam Madden and Mike Stonebraker are excellent at laying out the case, the architecture, and the uses for Big Data and Predictive Analytics.

Prof. Stonebraker is now well known for his controversial views on the "Big Elephant" database vendors like Oracle and IBM.  He contends they are legacy, row-based systems running on millions of lines of 30-year-old code, and therefore very, very slow.  He should know, since he has helped build every successive wave of database systems since the beginning.  He is a big proponent of column-based systems and, more recently, of array-based systems like SciDB (which we recommend).

He also believes the time has passed for Hadoop and MapReduce as the central architectural components of Big Data practice.

I did not record his talk but found a slightly older version of what looks like the same material here.




First Mover Advantage in your Predictive Model

The New York Times posted this useful article yesterday about the power and peril of predictive analytics.  The author is not quite right when he states "If Netflix can predict which movies I like, surely they can use the same analytics to create better TV shows. But it doesn’t work that way. Being good at prediction often does not mean being better at creation." 

Any good predictive algorithm can find the features of movies I like as part of a recommendation engine.  And yes, it cannot create the next movie I will like, but it surely can inform writers about making sure the features I like are in the next movie they create.

And the author is certainly not right in his contention, "For example, we may find the number of employees formatting their résumés is a good predictor of a company’s bankruptcy. But stopping this behavior hardly seems like a fruitful strategy for fending off creditors."  Here he assumes understanding the outcome of a system is the same as changing it.  No more so than when your doctor tells you to take aspirin for the headache you have due to a virus.  The aspirin relieves a symptom but not the disease.

But the author is dead on with this statement "Rarity and novelty often contribute to interestingness — or at the least to drawing attention. But once an algorithm finds those things that draw attention and starts exploiting them, their value erodes. When few people do something, it catches the eye; when everyone does it, it is ho-hum." 

In short, if you are the first to discover signal from what looked like noise then use it to your advantage.  But recognize that once everyone else notices how well you are doing by manipulating the system they will copy you.  The entire system will change and your advantage will disappear. 

But all is not lost.  You simply have to build your model again and find a new first mover advantage.  Don't rest on your laurels. 




On Trusting Big Data(2) and Being “Predicate Free”

There seems to be a divide between Big Data sources when it comes to trust.  Big Data efforts based on internal sources (e.g. log data, transactions, etc.) are trusted.  But apparently Big Data efforts based on external sources, in the form of social media etc., are not.  See this IBM study for more details.

But this may be due to one of two things.  The first is simple inexperience with the external data.  Yes, it comes from the wild and woolly world of the internet, but like the internal data, that does not mean there isn't gold in there.

Which leads to the second point - technique.  There is a wide array of techniques you can use to analyze your Big Data.  Finding the right size, shape, and analytical technique in order to find that gold is not easy.  It takes trial and error, and great skill in interpreting the results.

Like all marketing efforts, Big Data is now suffering from "the gloss over effect".  We are given the impression that Big Data value is as simple as hooking a pipe of Big Data into a magic analytical box and out pops the gold you were told is hiding there.  Sorry - not that easy.  

Which is where the Predicate Free expression comes in.  Big Data is sometimes associated with the idea that you can learn a lot about a system simply by examining its output.  “If there are fumes coming from a tailpipe it must be a car”, or “if you smell food when you are near a factory it must be a bakery” kind of thinking.  The work is predicate free, meaning you don’t need to be bothered with hypothesizing, testing, and proving.  A system observed over time will always be the same system observed.

This level of trust can end up being misplaced.  Lots of exogenous shocks to systems change them forever.  Many times those exogenous shocks have never been seen before and so are not accounted for in any Big Data model.  A re-estimate is in order at that point and being Predicate Free trips you up.      

So the realism that sets in after giving your Big Data this initial try might also contribute to the skepticism.  Big Data should be thought of more like a science project - embarking on a journey of discovery.  It takes many dead ends and blind alleys before you reach the promised land.  But it is there.    



Two Virtuous Models - The Power of Deduction and Induction

At a Big Data conference last summer I sat puzzled while listening to a speech about collecting and processing expense spreadsheets.  "How is this a Big Data practice?", I wondered.  "Looks and smells like more traditional Business Intelligence to me", I thought.  There was no analysis, only reporting of facts from a very small set of data relative to all the data floating around that large enterprise. 

It's bothered me ever since.  We throw around the words 'Big Data' and slap them onto old practices to make them seem new and exciting.  This waters down Big Data and confuses those who are new to it.

But there is a proper way to understand how the older Business Intelligence, statistical modeling and data warehousing efforts fit together with true Big Data practices. 

They fit together because Business Intelligence (and its cousins) is deductive, or hypothesis based, while Big Data is inductive, or observational, and each informs the other.  Here is how.

Deductive models work like the diagram below.  Start with a hypothesis about how a system, market, consumer or patient acts.  Then collect data to represent the stimulus.  Traditionally, the amount of data collected was small since it rarely already existed, had to be generated with surveys, or perhaps imputed through analogies.  Finally statistical methods established enough causality to arrive at enough truth to represent the system.  So deductive models are forward running – they end up representing a system heretofore not observed.
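To make the deductive flow concrete, here is a minimal sketch: a stated hypothesis, a small invented survey sample, and a statistical test.  The scenario, the numbers and the acceptance threshold are all illustrative assumptions, not a real study.

```python
import statistics

# Deductive sketch: start from a hypothesis, gather a small sample,
# then test whether the data supports it.  All numbers are invented.
def two_sample_t(sample_a, sample_b):
    """Welch's t-statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    return (mean_a - mean_b) / (var_a + var_b) ** 0.5

# Hypothesis: customers shown the new design rate the product higher.
control = [6.1, 5.8, 6.4, 5.9, 6.0, 6.2]   # small, survey-generated data
treated = [6.9, 7.1, 6.6, 7.3, 6.8, 7.0]

t = two_sample_t(treated, control)
# A large positive t supports the hypothesis; causality is argued
# from the experiment's design, not from the data volume.
print(round(t, 2))
```

Note the shape of the exercise: the data was generated on purpose, is small, and exists only to confirm or reject an idea held in advance.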


Inductive models work the opposite way.  They start by observing a system already in place and one that is putting out data as a by-product of its operation.  Like exhaust from a car tailpipe, many modern systems are digitally based and thus put out “data exhaust”. 

And thus Big Data is called Big since the collection of exhaust can be huge.  Your cell phone alone broadcasts application usage, physical movement, URL consumption, talk time, geographic location, speed, distance, etc.  It is the same with your computer, your car and soon your home. 


With inductive models you arrive at some level of system understanding but not truth, since the data you have collected is the outcome of the system, not its input.  On the other hand, the deductive model is never the truth either, since it suffers from small and incomplete data.
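The inductive flow can be sketched the opposite way: no hypothesis up front, just a fit to data a running system already emits.  The (load, response time) pairs below are invented stand-ins for real "data exhaust".

```python
# Inductive sketch: fit a relationship to observed output only.
# The (load, response_ms) pairs stand in for data exhaust.
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b from observed pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

exhaust = [(10, 120), (20, 135), (30, 152), (40, 168), (50, 181)]
slope, intercept = fit_line(exhaust)
# The fitted slope *describes* the system's observable behavior;
# it is not proof of what drives the system internally.
print(round(slope, 2), round(intercept, 1))
```

The fit summarizes behavior without explaining it, which is exactly the "understanding but not truth" trade-off described above.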

Now put these two together to arrive at a virtuous support of one another.  Inductive models reveal possible dynamics in a system because of the size of the data and because that data is output based.  This can be fed back to the deductive model so it may be improved, thus getting closer to truth.  Likewise, the improved deductive model points the way to new types of data to look for and use in the inductive model.  This is a virtuous cycle to be exploited.

Virtuous Cycle

There is often a disagreement in Big Data circles about one type of model over the other, or about how Big Data spoils decades of work by statisticians.  This need not be the case.  Both models can and should live together.  They are both better for it. 



Of Football & Politics (in Brazil)

Brazil is the 5th biggest country in the world with 202 million people.  It's a fantastic place for food, culture, business and sport.  The World Cup of Football (soccer to us Americans) is running there now for a month.  The Summer Olympics follow in 2016.  With no rest for the weary in between, presidential elections begin in earnest in July and end in October. 

As in the US, keeping tabs on the public mood when it comes to politics is a tough task.  So many people and such a vast territory mean many opinions.  Surveys are costly and take time.  By the time a traditional survey is complete it may be obsolete. 

Digital measurement of mood, sentiment, desires, suggestions, etc., however, is done in real time.  This Big Data analysis of social media forms a supplement to traditional surveys.  But it can also be done from publications, and then sorted by state, time or other dimensions. 
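The bucketing behind an hour-by-hour mood tracker can be sketched in a few lines.  The tiny message feed and the keyword-based scorer below are invented stand-ins for a real social media stream and a real sentiment model.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative lexicons -- a real sentiment model would be far richer.
POSITIVE = {"great", "win", "love"}
NEGATIVE = {"corrupt", "bad", "lose"}

def score(text):
    """Crude sentiment: positive hits minus negative hits."""
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def hourly_mood(messages):
    """Average sentiment per (candidate, hour) bucket."""
    totals, counts = defaultdict(float), defaultdict(int)
    for candidate, ts, text in messages:
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
        key = (candidate, hour)
        totals[key] += score(text)
        counts[key] += 1
    return {k: totals[k] / counts[k] for k in totals}

feed = [
    ("Candidate A", "2014-07-01T14:05:00", "great speech, love the plan"),
    ("Candidate A", "2014-07-01T14:40:00", "bad answer on the economy"),
    ("Candidate B", "2014-07-01T14:20:00", "corrupt deal, bad look"),
]
mood = hourly_mood(feed)
print(mood)
```

The same grouping key could carry state or publication instead of candidate, which is all "sorted by state, time or other dimensions" amounts to structurally.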

We pulled this together to keep tabs on the presidential elections in Brazil.  The result is an application that makes the allocation of electioneering resources quick and responsive.  A couple of the hour-by-hour mood trackers by parties and candidates are shown below. 

Now imagine doing the same to track how customers feel about your products, services and brands, and then being as fast as a World Cup footballer in changing direction, getting around a competitor and scoring a goal. 




What Big Data Patent Analysis Looks Like

Every company must seek to know something of the opportunities and threats that can impact its operations.  The risk of being caught by surprise is just too great.  This includes understanding government regulation, technology advances and competitors.  It's these last two where patent analysis comes to the fore. 

A company reveals a lot about itself, its technologies and its plans for future products or services in a patent.  If the company wishes to protect the Intellectual Property (IP) of its inventions it prepares and submits a patent application.  This becomes a matter of public record. 

Many large companies churn out hundreds of patents per year.  Collecting and comparing large groups of patents between companies becomes a daunting challenge when each can run to hundreds of pages of dense technical and legal prose. 

Below is a small movie showing a quick way to begin analyzing such a mountain of data.  It starts by using a specially tuned version of a Natural Language Processing engine to look for the claims, novelties, authors, prior art, etc. resident in each application.  The construction of a co-occurrence matrix of several dimensions lends itself to the modern equivalent of "descriptive statistics" about what is in the patents and how they compare between two companies.
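The co-occurrence step can be sketched simply: count how often extracted terms appear together in the same patent.  The term lists below are invented; in practice they would come from the NLP engine's pass over claims, methods, materials and so on.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Counter mapping unordered term pairs to co-occurrence counts."""
    matrix = Counter()
    for terms in docs:
        # Each unique pair of terms in a document co-occurs once.
        for a, b in combinations(sorted(set(terms)), 2):
            matrix[(a, b)] += 1
    return matrix

# Invented term extractions from three hypothetical patents.
patents = [
    ["battery", "anode", "lithium"],
    ["battery", "cathode", "lithium"],
    ["anode", "graphene"],
]
m = cooccurrence(patents)
print(m[("battery", "lithium")])  # terms appearing together in 2 patents
```

Sorting such a matrix by count is the "descriptive statistics" move: the heaviest pairs show where a portfolio concentrates.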

To protect the confidentiality of customers, the example in the movie uses false data.

Note how simple sorting, once the dimensions of interest are built, reveals similarities and differences between patents or patent groups.  This is vital to understanding areas of concentration from a competitive point of view, and shows where technological investment occurs.  Similarities also point out where you will likely be in legal conflict in getting a patent approved.  Differences, on the other hand, show uniqueness and potential areas of advantage both legally and in the marketplace.

Beyond this simple examination, more sophisticated patent modeling work can include areas like the following:

  • Patent Vectors: Claims, methods, prior art, etc. are modeled into vectors, or directional motion, to help understand the speed and aim of a company's technological advance. 
  • Patent Licensing: Patents can represent up to 85% of a company’s value.  Use of the vectors and modeling other industry participants or even competitors can be used to identify new candidates for out-licensing, or cross licensing, or suppliers for in-licensing and open innovation.
  • Patent Litigation:  Patent co-occurrence matrices and vector models can help locate other patents that may invalidate prior art and avoid lengthy litigation.
  • M&A Due Diligence:  Vector models and graph database connection analysis of patents can establish chain-of-title to determine if the assets you are buying or selling are encumbered by unforeseen entanglements and are worth the price.
  • Competitive Intelligence:  These models look at competitors’ IP strategy, portfolio strength, strongest patents, filing trends, and trademarks. This gives a concise view of where opportunities and threats lie.
  • Innovation & Patentability:  Using the vectors to uncover areas of new and unique novelties is the goal in this kind of model.  In conjunction with the Competitive Intelligence from above, you form your own IP strategy.
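A minimal reading of the "patent vector" idea: represent each portfolio as a bag of extracted terms and compare them with cosine similarity.  The portfolios below are invented, and a production system would use weighted, multi-dimensional features rather than raw term counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term lists (bag-of-terms vectors)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

# Invented portfolio term extractions for three hypothetical companies.
company_x = ["battery", "lithium", "anode", "battery"]
company_y = ["battery", "lithium", "cathode"]
company_z = ["antenna", "radio", "modulation"]

print(round(cosine(company_x, company_y), 2))  # overlapping portfolios
print(round(cosine(company_x, company_z), 2))  # disjoint portfolios -> 0.0
```

High similarity flags likely legal conflict or licensing candidates; low similarity flags uniqueness, mirroring the similarities/differences reading above.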



Technology Readiness Level: Excellent Big Data Methodology

In the 1980s NASA developed an assessment tool for grading technologies and their readiness for application or market, which they appropriately enough named the Technology Readiness Level (TRL).  See the reference on this method here.

At Big Data Lens we have pulled this technique into the modern Big Data era and embedded it into our core semantic processing platform.  There we use it to hunt down and score new technologies and products for a customer's desired uses. 

In the old days the technology in question would go through an exhaustive technical review in labs, in the field, and finally in composite reports.  Now we turn the entire internet loose, grabbing and measuring patents, news reports, social media and a ton more to arrive at a TRL level automatically.  This means scale.  So if you don't know who is making what, or how advanced or applicable it is, you can use the internet as a database to find out.
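As a purely hypothetical sketch of the shape of such a scorer: weight counts of web-gathered evidence and clamp the result to the 1-9 TRL scale.  NASA's nine-level scale is real, but the evidence categories, weights and thresholds here are invented and are not Big Data Lens's actual scoring model.

```python
# Hypothetical evidence weights -- invented for illustration only.
WEIGHTS = {"patents": 2.0, "news": 1.0, "products_shipping": 5.0}

def trl_estimate(evidence):
    """Map weighted evidence counts to a 1-9 TRL-style score."""
    raw = sum(WEIGHTS.get(k, 0) * v for k, v in evidence.items())
    # Clamp to NASA's 1 (basic principles) .. 9 (flight proven) range.
    return max(1, min(9, int(raw // 10) + 1))

lab_stage = {"patents": 2, "news": 1}                           # sparse evidence
in_market = {"patents": 8, "news": 12, "products_shipping": 6}  # heavy evidence
print(trl_estimate(lab_stage), trl_estimate(in_market))
```

The point is the architecture, not the numbers: noisy, heterogeneous internet signals are reduced to a single comparable readiness score at scale.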




Why Predictive Analytics Go Wrong

You collected the sample data, you cleaned it, organized it and fed it to R or your favorite estimator.  You applied 23 different models and found the best fit.  You tested with new data and confirmed your model choice with a review of the confusion matrix.  And voilà, your predictions are sound and strong.  Ready to go live.  Nice work. 

But a week later larger errors are creeping into your predictions.  Your customers are complaining.  What went wrong?  New conditions cropped up in the data - that's what went wrong.  In other words, the world changed but your model did not.  It's the bane of all good data scientists: data is not static and dull; changes not observed when you estimated your model come into being and throw a monkey wrench into things.  So how do you deal with it?

Add probability!  You can think forward to the kinds of events that may occur and include them in your algorithm by specifying their probability, timing and impact.  Then, when estimating the model, you can do so in simulation fashion to see how, when and where one or more of the events you worried about might impact the predictions.  Because you have, in effect, practiced the future, your algorithms and thus your predictions become much more resilient. 
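One way this "practicing the future" can be sketched is a small Monte Carlo wrapper around a point forecast, where each possible shock fires with a stated probability and impact.  The base forecast and the shock list are invented for illustration.

```python
import random

def simulate(base_forecast, events, runs=10_000, seed=42):
    """Average forecast across simulated exogenous-event scenarios.

    events: list of (probability, proportional_impact) pairs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        value = base_forecast
        for prob, impact in events:
            if rng.random() < prob:   # the shock occurs in this run
                value *= (1 + impact)
        total += value
    return total / runs

# E.g. a 10% chance demand drops 30%, and a 5% chance it jumps 20%.
shocks = [(0.10, -0.30), (0.05, 0.20)]
expected = simulate(100.0, shocks)
print(round(expected, 1))  # below 100: the downside shock dominates
```

Instead of shipping the raw point forecast of 100, you ship (or at least monitor against) the shock-adjusted expectation, so a shock arriving in week two degrades you gracefully rather than by surprise.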

Adding this type of thinking and technique to your predictive analysis bag of tricks helps get you out of the quandary of constantly monitoring, re-estimating and re-deploying.  You'll be a stronger data scientist for it. 



Handwritten Notes in Electronic Health Records

Lots of folks are now in the game of slicing and dicing Electronic Health Records (EHR) or Electronic Medical Records (EMR).  There is much to be learned about diagnoses, outcomes, populations and more when we take the fielded or structured portions of the EHR and apply Big Data techniques and Machine Learning models.  But what of the handwritten notes doctors, PAs and nurses make in the record?  What can they tell us?  How do they influence our thinking about the diagnoses, outcomes and populations?

In short, those notes provide the "color" that fielded, rigid forms cannot.  The EHR wasn't designed to capture everything, so sometimes something the medical professional sees goes into the notes.  If the medical professional expresses "worry" or some other emotion in the notes, they are in effect taking a first step toward predicting an outcome before it happens.  This is the analytical side of the medical professional's brain at work - tallying all the indicators from the exam (or multiple exams and tests) into a written expression. 

Analyzing those written notes turns out to be a real goldmine for medicine.  Using good Natural Language Processing (NLP) to learn what those notes say and what they mean, and then mixing those results with the fielded information in the rest of the EHR, opens new avenues of understanding and prediction.  Heart disease is difficult to predict.  But recent work in mining both the notes and the EHR suggests a prediction rate of over 80% for heart failure. 
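The mixing step can be sketched as joining note-derived signals to structured fields, yielding one feature vector per patient.  The worry lexicon, field names and record below are invented for illustration; a real pipeline would use a full clinical NLP engine, not keyword matching.

```python
# Invented lexicon of "worry" expressions a clinician might write.
WORRY_TERMS = {"concerned", "worried", "deteriorating", "monitor closely"}

def note_features(note):
    """Derive simple features from free-text clinical notes."""
    text = note.lower()
    return {"worry_signals": sum(term in text for term in WORRY_TERMS)}

def combine(record, note):
    """Merge structured EHR fields with NLP-derived note features."""
    features = dict(record)          # structured: age, blood pressure, etc.
    features.update(note_features(note))
    return features

ehr_record = {"age": 67, "systolic_bp": 158, "diabetic": 1}
doctor_note = "Patient worried about shortness of breath; monitor closely."
combined = combine(ehr_record, doctor_note)
print(combined)
```

The combined vector, not either half alone, is what a downstream prediction model would consume, which is the whole point of bringing the two sets of information together.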

Which boxes are ticked and which menu choices are made isn't the whole story.  What medical professionals write down, however, does complete the story.  NLP is the way to bring these two sets of information together.  Making medicine both preventative and curative is the result.