We were asked by the head of the LinkedIn group named Predictive Analytics, Big Data, Business Analytics, Data Science in Oil and Gas and Energy for an interview on how Natural Language Processing (NLP) plays a role in Big Data. The questions were great, so we thought we would reproduce the interview transcript here. The interview is with Francisco Sanchez, Data Scientist at Swift Energy.
In this month's question, we would like to focus our attention on Natural Language Processing (NLP) and its uses in the Energy industry. The energy industry is inundated with land contracts, PSCs, legal files, etc., mainly unstructured data. Human language, and the relationships it has with other variables, is a very tough nut to crack. Here to shed light on this is Mr. J. Brooke Aker (email@example.com), Chief Data Scientist at Big Data Lens. www.bigdatalens.com
Q: For those not too familiar with NLP, could you describe what it is and its applications? Is it what Watson uses?
Sure thing. NLP is the science of getting a machine to read and understand words in context. For example, take the sentence “My jaguar eats gas when I step on the gas”. Lots of keyword-based technology (including Google) cannot distinguish between the first gas (as in gasoline) and the second gas (as in gas pedal). It may also mistake jaguar for a cat rather than a car. So NLP adds a layer of processing to assign meaning to words rather than treat them as simple tokens (e.g. strings of letters).
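To make the jaguar example concrete, here is a minimal sketch of one simple way to assign a sense to a word: count how many nearby words match a list of cue words for each sense (a much-simplified form of the Lesk idea). The `SENSES` inventory and the function names are hypothetical and hand-written for illustration; this is not a description of how Big Data Lens or Watson works.

```python
import re

# Hypothetical, hand-written sense inventory: each sense lists cue words
# that tend to appear near the word when that sense is intended.
SENSES = {
    "jaguar": {
        "car": {"gas", "drive", "engine", "pedal", "step", "mileage"},
        "cat": {"jungle", "prey", "spots", "hunt", "paws"},
    },
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def disambiguate(word, sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    context = set(tokenize(sentence)) - {word}
    scores = {sense: len(cues & context) for sense, cues in SENSES[word].items()}
    return max(scores, key=scores.get), scores

sense, scores = disambiguate("jaguar", "My jaguar eats gas when I step on the gas")
print(sense, scores)  # car {'car': 2, 'cat': 0}
```

A keyword index would treat both occurrences of "gas" and the word "jaguar" as plain strings; even this toy overlap count uses the surrounding words to choose a meaning.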
NLP also adds a layer of logic between words so you can understand actions, directions and connections between objects, as well as a categorization scheme.
After 30-some years, practitioners agree that the best NLP approach mimics the way we were taught to read and understand as children. That is what we have done at Big Data Lens.
And while I don’t know the exact way Watson works, the general approach is likely the same. Note, however, that Watson works because the domain it operates in is constrained, which will always boost your chances of getting good resolution over unstructured data.
Q: Is it possible that earnings release webcasts and scripts, Google Finance, a company's own website, or its SEC filings, and the way these items are worded, could be affecting its stock price?
Oh yes. In the early days of sentiment analysis (a sub-field of NLP), even with only crude positive or negative detection algorithms, you could easily show that such news had about a 24-hour lag before it affected stock prices. The lag was the time it took for good or bad news to filter through to all those who had their fingers on their stock trade buttons.
Today, with social media, I am sure the lag time is shorter. But along the way we also discovered that simple positive / negative algorithms were not very good. Suppose a tweet said “The features on that lousy BP fracking drill are actually really good”. So what is bad and what is good? Even for some human readers it’s tough to untangle. Sophisticated NLP will be able to know it’s BP that is lousy and the drill features that are good.
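The weakness of a crude positive / negative algorithm can be seen in a few lines. The sketch below sums word polarities from a tiny made-up lexicon (`LEXICON` is an assumption, not a real sentiment resource) and ignores what each opinion word is describing.

```python
import re

# Tiny illustrative lexicon (an assumption, not a real sentiment resource).
LEXICON = {"good": 1, "great": 1, "lousy": -1, "bad": -1}

def naive_sentiment(text):
    """Sum word polarities over the whole text, ignoring what each word describes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

tweet = "The features on that lousy BP fracking drill are actually really good"
print(naive_sentiment(tweet))  # 0: "lousy" and "good" cancel out
```

The score of 0 says the tweet is neutral, when it actually contains one negative opinion and one positive opinion about two different targets. Attaching each opinion word to its target is the part that needs real language understanding.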
Q: The Energy industry is engulfed with legal papers, contracts, written logs, etc. How can NLP put these together and find insight?
When you have good NLP technology you can begin to connect logic between sentences, then paragraphs, whole documents and finally between documents.
Then one of the most important innovations is the ability to conduct inference across all documents. This is typically done with a graph database. This is a different kind of database, since it records relationships and not just items and attributes. With it you can infer things. For example, if John knows Mary and Tom knows Mary, we can infer that John likely knows Tom as well. The more instances of each, the stronger the inference is.
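The John, Mary and Tom inference can be sketched without a graph database at all. The example below keeps relationships in a plain in-memory adjacency map and scores each unconnected pair by how many acquaintances they share; the function names and the extra person "Ann" are made up for illustration, and a production system would run an equivalent query inside a graph database instead.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(edges):
    """Store an undirected 'knows' relationship as an adjacency map."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def infer_links(graph):
    """Score each unconnected pair by the number of acquaintances they share."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already directly connected, nothing to infer
        shared = graph[a] & graph[b]
        if shared:
            scores[(a, b)] = len(shared)
    return scores

edges = [("John", "Mary"), ("Tom", "Mary"), ("John", "Ann"), ("Tom", "Ann")]
print(infer_links(build_graph(edges)))
# {('Ann', 'Mary'): 2, ('John', 'Tom'): 2}
```

With only the two Mary edges, John and Tom would score 1. Adding Ann as a second shared acquaintance raises the score to 2, which is the "more instances, stronger inference" effect.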
Q: Could it be possible to keep tabs on what your competitors are saying or posting in social media such as Facebook, Twitter, or other web postings, and predict items such as divesting or acquiring companies, lower or higher earnings, or anything else that could be useful? Or is this cynical?
Not at all. This is a very good use of NLP but also requires some predictive modeling. For example, you might look backwards in time and see that Exxon-Mobil nearly always forms a partnership with a company, then later upgrades that to a joint venture, and finally markets and sells a product together prior to acquiring the company.
This pattern of behavior relies on the NLP to find enough evidence across the web, and then machine learning is used to model it. This becomes a predictive analytics algorithm that you simply feed a constant stream of news and social media to, and which then graphically shows the probability of a new M&A event rising as the pattern-of-behavior evidence piles up.
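As a rough sketch of how the probability can rise as evidence piles up, the example below estimates the chance of an acquisition as a simple frequency over past cases that started with the same stages. The `HISTORY` data is entirely made up, the stage names are hypothetical, and a frequency count stands in here for the machine learning model; a real system would learn from evidence extracted by NLP.

```python
# Made-up historical cases for illustration only: the ordered relationship
# stages observed between an acquirer and a target, and whether an
# acquisition eventually followed.
HISTORY = [
    (("partnership", "joint_venture", "co_marketing"), True),
    (("partnership", "joint_venture", "co_marketing"), True),
    (("partnership", "joint_venture"), True),
    (("partnership", "joint_venture"), False),
    (("partnership",), False),
    (("partnership",), False),
]

def acquisition_probability(observed, history=HISTORY):
    """Share of past cases that began with `observed` and ended in an acquisition."""
    observed = tuple(observed)
    matches = [acquired for stages, acquired in history
               if stages[:len(observed)] == observed]
    if not matches:
        return None  # no comparable history, so no estimate
    return sum(matches) / len(matches)

for seen in (["partnership"],
             ["partnership", "joint_venture"],
             ["partnership", "joint_venture", "co_marketing"]):
    print(seen[-1], acquisition_probability(seen))
# partnership 0.5
# joint_venture 0.75
# co_marketing 1.0
```

Each newly detected stage narrows the set of comparable past cases, so the estimate moves as the stream of news confirms more of the pattern.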
Q: What do you think the future will hold for NLP?
We are working all the time to advance the state of NLP, machine learning, modeling and predictive analytics. Using this combination of technologies you can use many years of internal company information and turn it into predictive algorithms.
For example, take the many years of drilling reports from the field, almost all of which are written longhand. They represent the collective knowledge of field experience. Using the above tools you could turn this into a real-time guidance system in which pressure, temperature, viscosity and salinity readings tell you whether all is clear or trouble is coming as you drill deeper. It is like having hundreds of long-retired drill chiefs looking over your shoulder, providing a lifetime of experience so disaster is avoided.
There is also the “Internet of Things” (IoT) movement, which is all about a huge increase in the number of small sensors located on pipelines, drills, engines, refineries, etc., reporting a constant stream of data. Using Big Data techniques on this data is also about predicting and heading off trouble or improving operations. There is less NLP required in this scenario but plenty of machine learning and prediction.
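One very simple way to head off trouble from a sensor stream is to flag readings that sit far from the recent rolling average. The sketch below is only an illustration of that idea: the pressure values are invented, and the window size and threshold are arbitrary assumptions rather than recommended settings.

```python
from collections import deque
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling average."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Flag the reading if it is more than `threshold` standard
            # deviations away from the mean of the previous `window` readings.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
        recent.append(value)

# Invented pressure readings with one obvious spike.
pressure = [100.0, 101.0, 99.0, 100.0, 102.0, 101.0, 100.0, 140.0, 101.0, 100.0]
print(list(flag_anomalies(pressure)))  # [(7, 140.0)]
```

A learned model would replace the fixed threshold, but the shape is the same: compare each new reading with what recent history says is normal.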
Q: Could you provide some examples of what your company has done in the Energy industry?
We have applied the drilling report example I described above. We have also used similar techniques to qualify vendors and check that they are not headed for some kind of internal legal or financial trouble that would cause risk downstream for our customer. The same goes for monitoring and predicting the outcome of regulations at the state and federal level, or even using these techniques to compare and contrast vast numbers of patents between yourself and a competitor to manage your IP portfolio.
And so on. Once you get the basic recipe down and have good technology you can apply this to all kinds of business issues with good success.
Q: Anything else you would like to add?
Only that an aspect of this kind of work you don’t always see talked about is creativity. Often companies we work for have a desire to execute on some predictive work as described above. They hand us data to work with and we find as often as not that the data is thin and not sufficient to prove anything. So we have to put on our creativity hat to find public data, or proxy data we can use to supplement the sample data. We have gotten very good at this and it is a different kind of skill than the NLP or statistical manipulation inherent in machine learning. Be sure to ask any vendor you might use if they are good and creative data sleuths.