
What does human comprehension mean?

When we consider how to make better Natural Language Processing (NLP) technology, we are always thinking of human comprehension as the benchmark.  When I read something, how do I make sense of it?  What does my brain do, and how fast does it do it?  What kind of processing power does it take to do that?

We came across some interesting figures to answer that last question.  One widely cited estimate puts the processing-power equivalent of the human brain at 100 million million-instructions-per-second (MIPS) of speed and 100 million megabytes (MB) of memory.  No wonder you need 8 hours of sleep per night to clear out all the leftover junk from processing that much information all day long.
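For a sense of scale, here is the back-of-the-envelope conversion behind those figures (taking them at face value):

100 million MIPS = 10^8 x 10^6 = 10^14 instructions per second
100 million MB = 10^8 x 10^6 bytes = 10^14 bytes, or about 100 terabytes

That memory figure is roughly six times the 16 terabytes Watson needed, mentioned below.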

And everyone loves to talk about IBM's Watson winning Jeopardy.  But as this article reminds us, "Watson needed 16 terabytes of memory, 90 powerful servers, a total of 2880 processor cores, and mind-boggling quantities of electrical power just to wrap its big computery head around the concept of wordplay."  So it was no single chip on a board that beat Ken Jennings.

But we are headed in the right direction with the workaday solutions we use here at Big Data Lens - smarter algorithms, more efficient code, faster processing using cloud infrastructure and more.  We may not reach the magic "100 and 100" benchmark even in our lifetime, but with each new leap there are excellent new vistas to explore and lots of new insight to be gained.


Machine Learning Tools Compared

At Big Data Lens we use a lot of tools to estimate Machine Learning models.  Namely, we have employed R, Weka, Orange and an online service called BigML.  We won't get into a detailed technical comparison here - just the good, the bad and the ugly of when and how to use one tool over another.

BigML is an online service for uploading data and creating Machine Learning models in a fairly automated way.  It is fantastically easy to use, very graphically driven and works quickly to get you answers.  The number of estimation techniques is limited, though they seem to be adding more all the time.  What we really like is the ability to have the model export itself directly as code that can be deployed right away - it saves a lot of time.  Small data sets and models are free.  When you need something fast and don't need to mull over a dozen estimation techniques, BigML works great.  We like these guys, so give them a try.

Orange is a slick desktop app that lets you flow your data through a number of steps and estimation techniques.  It uses a widget-and-connector construction in the GUI to move you from raw data to finished model.  The number of estimation techniques here is greater than with BigML.  Estimation reporting is good, but the true statistician in you may find it frustrating to compare models beyond summary measures.  Orange includes modules for bioinformatics, making it unique and rich for the life-science crowd.  It also includes a text mining module, but as experts in this field we can tell you to avoid that part of Orange.  So overall it is a bit richer and a bit more sophisticated than BigML, and appropriate for quick comparisons across estimation techniques.

Weka is a university-based project with an accompanying MOOC.  Now in its 3rd generation, it includes far more estimation techniques than our first two tools.  But for all its power in estimation techniques, it lacks in the user-interface department.  In our testing and use, stability sometimes seemed not up to par even though our data sets were relatively small.  But if you like a community approach with lots of plug-ins and don't care about the GUI, then Weka can be a good choice.  Additionally, for the stats maven in you, Weka will report all the dirty laundry you want.

R is like Weka, only better in our estimation.  The community seems bigger (some will probably argue this), the product seems more stable, and the available IDEs to ride on top of R are good - and there is more than one choice; we use RStudio, for example.  But for us the most important differentiator is the seemingly endless estimation techniques you can find and deploy within R.  While writing this blog we checked the number of packages on the CRAN mirror you can install inside R: 5,527.  Checking the last 3 days suggested more than 10 new or updated packages each day.  For difficult problems you would be hard-pressed not to find an available estimation package to try.  Now, it's true the graphics require further plug-ins, you have to write your own (sometimes lengthy) scripts, and you have to read somewhat verbose logs to do cross-comparisons of model outputs.  But the control and variety in solving the toughest Machine Learning problems make R hard to beat.  Now they just need a better logo.


Search and Destroy

We added a demo to our site today.  Finally.  It takes a long time to agree on and then build a demo application that highlights the core of who you are and what you do.  Plus, we have to satisfy customers in the meantime.

But today we launched SNAP - the Social News Analysis Platform.  See it under the new demo tab.  There you add one or more search terms, just as with any search engine.  Using the slider you can ask for more or fewer results by setting how many days back to look.  Click Search and SNAP not only fetches what you are looking for but analyzes it.

This is what a Whole New Web looks like.  What do 1,267 results look like?  What do that many web pages, news stories and social media contributions tell you?  How are things related?  Where are things happening, in total?  What are people talking about, categorically?  Who in society do things affect?  And how do they feel about it?

All of that was always present in the way in which you used the web.  But now you can see it.  And use it.  So go ahead.  Search and destroy the boundaries that made the web difficult to use.  Go ahead and use SNAP. 

Then tell us what you think.  We'd love to hear from you and make improvements. 


2014 with the Go Language

Usually we don't write much about the underlying technology that makes Big Data Lens work.  We decided to change that here in 2014.  We're doing it for several reasons.

One is the contribution that sharing experiences with different technologies makes to the tech community as a whole.  We benefit from others' experience, so we should make similar contributions in return.  It's just part of the tech ethos.

Another reason is that we just get excited about certain technology when it works well.  The Go language is a case in point.  Now, we know not too many readers get excited about a programming language, so we'll keep it high level.

First, the Go Language http://golang.org/ was started by Google.  We don't know for sure, but we assume some of Google is built on this language.  So it must be good.  Turns out it is.  As the website states:

"Go is expressive, concise, clean, and efficient. Its concurrency mechanisms make it easy to write programs that get the most out of multicore and networked machines, while its novel type system enables flexible and modular program construction. Go compiles quickly to machine code yet has the convenience of garbage collection and the power of run-time reflection. It's a fast, statically typed, compiled language that feels like a dynamically typed, interpreted language."

So that is a mouthful, but what it really means is that this language is FAST.  We have written our core Natural Language Processing (NLP) in Java, Kotlin and now Go.  Go gives us about a 30% speed advantage over the other compiled languages.  We no longer measure our processing in milliseconds (a thousandth of a second) - we measure only in microseconds (a millionth of a second).

Fast processing is the key to making web services work on an economic and customer-service level.  A 30% time savings on a small batch of processing is nice.  But a 30% savings at Big Data scale is huge: it means lower cloud costs, less time customers wait for results, and savings that can be passed on in lower prices.
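To make the concurrency point concrete, here is a minimal sketch - not our production pipeline, and the tagTokens step is just a stand-in for real NLP work - showing how goroutines let a batch of documents be processed in parallel and timed in microseconds:

package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// tagTokens is a stand-in for a real NLP step; here it just splits a document into tokens.
func tagTokens(doc string) []string {
	return strings.Fields(doc)
}

func main() {
	docs := []string{
		"Go compiles quickly to machine code",
		"Concurrency is built into the language",
		"Fast processing keeps cloud costs down",
	}

	start := time.Now()
	results := make([][]string, len(docs))

	var wg sync.WaitGroup
	for i, d := range docs {
		wg.Add(1)
		go func(i int, d string) { // one goroutine per document
			defer wg.Done()
			results[i] = tagTokens(d)
		}(i, d)
	}
	wg.Wait()

	// Report elapsed time in microseconds rather than milliseconds.
	fmt.Printf("processed %d documents in %d microseconds\n", len(docs), time.Since(start).Microseconds())
}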

[Image: the Go gopher mascot]

We'll detail other aspects of the Go language over the course of the year.  But there is one more aspect worth mentioning - the icon.  Every really good programming language has some icon to dress it up.  These used to be very techy kinds of things, meant to show the seriousness of the effort.  In later years the fashion has switched to friendly-looking animals or robots.  Go has this happy-looking gopher, ready to chew through the toughest problems.


  


Not Long From Now ...

Big Data holds a lot of promise.  Some is current.  Some far off.  At Big Data Lens we look at the middle distance.  

We don't believe slapping the label on small sets of data using traditional techniques makes it Big Data.  I am reminded of this when thinking about a presentation I sat through at a Big Data conference this past summer, listening to a big-company finance guy crow about his wondrous automatic manipulation of expense spreadsheets.  This he called Big Data.  It is not.  The presenter who followed, however, had it right.  His definition stands: Big Data is about doing what could NOT be done before.  That means the volume of data processed (in a reasonable amount of time) using techniques never applied before.

So the middle distance is not about processing expense spreadsheets that could have been done with a small Python script in 10 minutes' work.  Nor is Big Data about curing cancer in the short term - though it certainly may play a part in that over the long term.

The middle distance is about using the whole internet in new ways: finding products that protect soldiers from risks they have not been applied to before, as we do for the Department of Defense.  Or finding technologies to reduce risk and improve planning, as we do for the Army.  Or finding and analyzing disparate sources of data for compliance and optimization, as we do for a healthcare company.

In the slightly longer middle distance, Big Data is about using the internet to understand competitors' patterns of behavior and predict their next moves.  Or to understand market dynamics and customer groups in ways unknown before, to streamline operations.  Or to let social media inform product design in ways that obviate the need for expensive surveys, research and so on.

The middle distance for Big Data is about being practical.  The middle distance for Big Data is about getting the most for this new science without having to wait a long long time.  The middle distance is the sweet spot for Big Data.    

 


Big Data From Twitter

I both like and dislike this article from Smithsonian.com today about an analysis of Twitter data.  

I like the analysis because it adheres to the right principle about Big Data - namely, that simply collecting the data without analysis does nothing for you.  So the study authors did do a lot of good analysis on place, language, time of day, distance between tweets and retweets and so on.  Done properly, it gives an interesting picture of the way humans communicate - early and often, as the saying goes for voting.

I like even better the added data the investigators brought into the examination.  They studied the time of each tweet so they could know whether natural or artificial light was present when it was sent.  Subtracting these and mapping the result shows either a lack of rural electrification or that Twitter is a banned political substance in some parts of the world (e.g. Iran and China).  The great thing the investigators demonstrated here is that data, properly analyzed, tells you a lot about things you were not expecting to see or even asking questions about.  Something that is self-evident and powerful once it is revealed.  That is the essence of good Big Data.

What I don't like is what they could have done with the Twitter data.  Certainly I care about the novel insights they can glean from the data.  But in the end they never tried to look at what people were saying and what they meant - what fears and pains, what successes and joys are all those millions and millions of people talking about?  That's how we would have approached this endeavor here at Big Data Lens.

Maybe it's just drivel.  But as with the electrification and politics findings, you can probably be sure there is something deep and powerful in all that human communication waiting to be found.


Correlation & Causality

Yesterday we caught the columnist David Brooks' piece on Big Data.  You can find it here.  He is right to point out that efforts by Big Data proponents to break the long-standing axiom that "correlation does not equal causality" in data science are misguided.


The phrase "correlation does not equal causality" was never meant to downplay or whitewash any of the analytical techniques that can be misused to establish causality.  Rather, it was meant to make sure data scientists added the narrative to the statistics.

The narrative is the explanation of why the correlation appears.  Twenty years ago, diapers and beer being bought at the same time in the convenience store was more than a correlation.  The narrative that went with it - young fathers sent out into the dark, cold night to fetch diapers might as well pick up some new-family stress relief along the way - launched an entire new field of Information Technology called Business Intelligence.  Without the narrative it's just statistics, and might well be a spurious correlation with little understanding in it.  The beer and diapers would have stayed in separate aisles.

Policy makers in government change course when they hear a story backed up by numbers.  Business leaders reallocate resources when they hear a story backed up by numbers.  If you are part of Big Data do the numbers, build the machine learning algorithm, but tell the story also.  Else your efforts will be far less impactful than they should be.  


Cool Old Techniques (part 2)

Big Data is breathing some new life into another analytical technique we dusted off recently.  It's about understanding the development of technology over time into finished products.

Every company that dreams up a new product is faced with choices along the way in terms of inputs.  Buy or build, buy this one or that one, buy that one but modify it, and so on.  Each input choice influences the timing and features of the finished product.  Input A may be better for the finished product but might take longer to complete, holding everything up, so perhaps the lesser input B is sufficient.

Big Data comes in because it helps inform the choices, capabilities and timing of the inputs into the finished new product.  The analytical technique then is one of specifying each input in terms of the probability of completion and capabilities by a certain time.  Now you have a network you can simulate.  It looks (in a simplified form) something like this:

[Figure: a simplified network of inputs combining into sub-systems and then into the finished product]

Inputs are specified by probabilities of time to completion and capabilities.  Inputs combine into sub-systems, which are then also specified by probabilities of time to completion and capabilities.  Note also that choices of one input over another are present, and in some cases a sub-system requires two inputs to both be complete rather than a choice of one over the other.

These networks can get very big very fast.  In the past, this technique might be used to model a future enemy weapons system.  A sub or a plane is fiendishly complex, so simulation techniques are used to estimate likely outcomes, outcomes with the shortest timeframe, outcomes with the most robust capabilities and so on.  But also in the past the networks were so complex and time-consuming to estimate that they were rarely kept up to date.  Big Data comes into play now by keeping such networks alive and useful - both through changes to the inputs and through the faster machine learning re-estimations that make the simulations useful.
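As a rough illustration of the simulation step - a toy network we invented for this post, with made-up time ranges rather than real input data - each input draws a random completion time, a sub-system that needs two inputs finishes when the later one does, and repeating the draws many times gives a distribution of completion dates:

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// draw samples a completion time (in months) uniformly from a range - a stand-in
// for whatever distribution the real input data would suggest.
func draw(lo, hi float64) float64 {
	return lo + rand.Float64()*(hi-lo)
}

func main() {
	const trials = 100000
	onTime := 0
	for i := 0; i < trials; i++ {
		inputA := draw(6, 14) // hypothetical input A: 6-14 months
		inputB := draw(4, 10) // hypothetical input B: 4-10 months
		inputC := draw(8, 12) // hypothetical input C: 8-12 months

		// Sub-system 1 needs both A and B; sub-system 2 is a choice of B or C (take the faster).
		sub1 := math.Max(inputA, inputB)
		sub2 := math.Min(inputB, inputC)

		// The finished system needs both sub-systems complete.
		system := math.Max(sub1, sub2)
		if system <= 12 {
			onTime++
		}
	}
	fmt.Printf("P(system complete within 12 months) = %.2f\n", float64(onTime)/trials)
}

Swap in better distributions and real capability scores for each input, and the same loop answers the timing and capability questions above.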


Cool Old Techniques

Recently we've been looking back at some old analytical techniques to see if they apply in a big data era.  Some do and some don't.  Those that do share some common traits ... they are insightful in ways that stand the test of time, and they are just plain useful.

One such technique looks at the kinds of things that throw off every forecast - things that have not yet happened and so were never built into a model or equation.  There are lots of things like this.  In the old days you would seek out subject matter experts and gather a range of opinions on things that could bugger up the future.  Now we plug in Big Data techniques to survey and account for things that have not yet happened.

Gather enough of these and you then must simulate the outcomes to see how things combine to make multiple futures.  Hedging our bets, you say?  No, no.  It's how you use the range of outcomes that makes it powerful.

Planning is always about managing uncertainty, and that is exactly what a range of outcomes gives us.  Ask "how likely is it that market demand for our goods will reach level X?" and you get an answer like 73%.  That is certainly a better answer than a point forecast you know is wrong before you have even heard it.
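Here is a minimal sketch of how that kind of answer gets produced - the demand drivers and their ranges below are invented purely for illustration: simulate many plausible futures and report the share in which demand clears the level you care about.

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const trials = 100000
	const targetDemand = 120000.0 // hypothetical "level X", in units

	hits := 0
	for i := 0; i < trials; i++ {
		// Each uncertain driver is drawn from an assumed range.
		marketSize := 900000 + rand.Float64()*200000 // addressable market: 900k-1.1M units
		share := 0.08 + rand.Float64()*0.07          // our share: 8-15%
		churn := rand.Float64() * 0.02               // demand lost to churn: 0-2%

		demand := marketSize * share * (1 - churn)
		if demand >= targetDemand {
			hits++
		}
	}
	fmt.Printf("P(demand >= %.0f units) = %.0f%%\n", targetDemand, 100*float64(hits)/trials)
}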

Smart companies use this old technique in the big data era to be more thorough in their planning, risk assessments and resource allocations.  Governments use this technique to measure the likelihood a foe will acquire a capability or advance a technology.

In the end, this old technique lives again.  In fact it gets better with more data: Big Data collection and analysis techniques now inform the range of futures where we used to rely on only a small number of subject matter experts.  Welcome to Big Data.


Big Data Ideas that Oppose

“The more data that you have, the better the model that you will be able to build. Sure, the algorithm is important, but whoever has the most data will win.”—Gil Elbaz, Factual

“With any Big Data project, if you don’t spend time thinking about analysis, you’re wasting your money. You must have a structured idea of what you want to get out of unstructured data”—Ron Gill, NetSuite

So let's see if I understand.  With Big Data, any lack of precision can be beaten by adding more data to the algo.  Pour more into the sausage grinder and you get better sausage.  Simple, right?

But wait.  Our second quote tells us to carefully plan and understand the nature of Big Data so you know what algo to apply to it.  Or all is lost.  

So here are two Big Data ideas that oppose each other.  In reality you need a bit of both.  Or, as the saying goes, this is a case of each idea being necessary but not sufficient on its own.  But take them together and you do have the makings of good Big Data practice.

In any project, many iterations are needed to discover the inflection point where more data generates a diminishing return in terms of the quality of the algorithmic outcome.  You must seek that point, no matter how large the input data turns out to be, since to do otherwise sacrifices accuracy.

But likewise, not having a theoretical underpinning - that structured plan to handle the unstructured data - is a fool's errand.  Too many Big Data projects rely on simply adding more data to goose the algo's precision.  Spurious is the term typically used when you can observe high correlation values but cannot explain them logically.  Without this "story-telling" understanding of the algorithm you are on shaky ground.

Use both these ideas as good practice for good Big Data - every day.


Microsoft getting into Big Data ... or they always were

News items like this one from the New York Times are great for detailing how companies large and small are getting into Big Data.  So now Microsoft is getting into the game as well.  Or is it?  Seems to us they always were.  Even if they never called it Big Data.

What is striking is how Microsoft portrays the use of Big Data - as a power behind their current products (e.g. Excel or Outlook email) and not for stand-alone use.  Their chief data scientist is portrayed as working on "incremental" steps for the last 20 years.  Really?  And 100 of their PhDs come together to brainstorm new ways to use Big Data, and still it looks like it is all hidden inside the current crop of products.

Plenty of cash, bright minds and no deadlines sounds like a recipe for middle-of-the-road Big Data nothingness to us.  Nothing focuses the mind like a bootstrapped, small-budget, gotta-get-it-done-yesterday Big Data project.  That's where we have found the most success.  You should too.


A need for Trust in Big Data

There seems to be a divide between Big Data sources when it comes to trust.  Big Data efforts based on internal sources (e.g. log data, transactions, etc.) are trusted.  But apparently Big Data efforts based on external sources, in the form of social media and the like, are not.  See this IBM study for more details.

But this may be due to one of two things.  The first is simple experience with the external data.  Yes, it comes from the wild and woolly world of the internet, but as with the internal data, that does not mean there isn't gold in there.

Which leads to the second point - technique.  There is a wide array of techniques you can use to analyze your Big Data.  Finding the right size, shape and analytical technique in order to find that gold is not easy.  It takes trial and error and great skill in interpreting the results.

Like all marketing efforts, Big Data is now suffering from "the gloss over effect".  We are given the impression that Big Data value is as simple as hooking a pipe of Big Data into a magic analytical box and out pops the gold you were told is hiding there.  Sorry - not that easy.  

So the realism that sets in after giving your Big Data this initial try might also contribute to the skepticism.  Big Data should be thought of more like a science project - embarking on a journey of discovery.  It takes many dead ends and blind alleys before you reach the promised land.  But it is there.    


Why Not Make Your Own Big Data?

Lots of studies look at the sources of Big Data.  The infographic from IBM shown here suggests that the majority of Big Data comes from six sources.  You might call these sources "the usual suspects".  But to our way of thinking, what percentage of all the world's information do these six sources represent?  Not much is the answer.  Think 80/20.  80% of the world's information is in written (i.e. word) form.  None of the six sources listed here are words - well, social media is, but it is limited in length.

[Infographic from IBM: the six most common sources of Big Data]

So where is the rest of the world's information, and does it constitute Big Data?  It is the long-form written word: news, blogs (like this one), journals, books and so on.  And yes, that too is Big Data.  Or it can be, once it is processed.

That is where Statistical Natural Language Processing (sNLP) comes in.  Transform the written word into machine-understood data and voilà, you have Big Data too.  Now mix and mash it with the transactional and log data listed above and you've got a rich stew from which you can model and predict.
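As a toy illustration of that transformation - a bare-bones sketch, nowhere near a full sNLP pipeline - here is free text being turned into structured counts a model can consume:

package main

import (
	"fmt"
	"strings"
)

// termCounts turns free text into a term-frequency map - about the simplest
// "machine-understood" representation of the written word there is.
func termCounts(text string) map[string]int {
	counts := make(map[string]int)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		tok = strings.Trim(tok, ".,;:!?\"'()")
		if tok != "" {
			counts[tok]++
		}
	}
	return counts
}

func main() {
	doc := "The regulator announced new rules today; competitors reacted to the rules quickly."
	for term, n := range termCounts(doc) {
		fmt.Printf("%-12s %d\n", term, n)
	}
}

Real sNLP goes much further - entities, relationships, sentiment, topics - but even counts like these can be joined to transactional and log data for modeling.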

Competitors' actions - they're in the text.  Scientific discoveries - they're in the text.  Changes in regulation - they're in the text, not in log files and transactional data.  Include the written stuff by using sNLP and you open up a whole new world of Big Data for your use.

So if the above chart looks pretty limiting in terms of what you are going to get out of Big Data, because what goes in is pretty limited, think again and start generating your own Big Data - and profit by it.


Caution - messy landscape ahead

A landscape is a fashionable thing in the tech world these days.  It is an attempt to put all important players and their contribution to the ecosystem into a single picture.  We found a couple of these recently online for Big Data.  One of these is below.

We cannot for the life of us figure out if a landscape like this is needed because a complicated Big Data ecosystem needs some 'splaining or if such a complicated landscape is meant to impress.  The cynic in us says both. 

Our read on landscapes like this is twofold.  One, it says the industry is not mature, so caution: lots of consolidation ahead.  But two, and much more importantly, if you are a newbie to Big Data this picture is going to scare the hell out of you.  You'll retreat to the safety of your business intelligence (BI) tools (which you could argue don't belong on any Big Data landscape) and never give Big Data a try.  Too complex.  Methods aren't yet formed.  Too much trial and error and no real ROI yet.  The boss won't give you any budget to buy from this spaghetti western.  So a landscape really doesn't advance Big Data into the mainstream.

At Big Data Lens we take a different approach.  We try to provide an end-to-end service that takes the worry and hassle out of data design, data acquisition, data massaging, data analytics, data prediction and, in the end, the production of VALUE from Big Data.

Ignore scary landscapes.  Stick to simple steps with simple tools.  That is how you get value from Big Data.


Politics & Poll of Polls = Big Data

Even if you are not a fan of The New York Times, Nate Silver's blog FiveThirtyEight - Political Calculus is an excellent example of Big Data at work.  It contains all the elements you should expect from any Big Data effort.  Lots of input - in this case, the various polls conducted around the country.  Then lots of analysis of that data, including easy-to-understand charts and graphs that tell a story.  But also the written explanation - pointing out things you might miss in a graph, or detailing things a graph can never display.

So whatever your political leanings wade into this and see what Big Data looks like from a polished professional effort.


What is more important: data or analytics?

Big Data is such a catchall phrase.  Besides the ongoing arguments about the definition of Big Data, we'd like to inject a different debate.  Namely: what is, and what should be, the balance between the data side of Big Data and the analytics side?

So here is a crude measure of the current balance between data and analytics: the search index trend over the last 18 months or so for the terms "Big Data" and "Big Data Analytics".  That picture is shown here, where the blue line is the trend for "Big Data" and the red line is the trend for "Big Data Analytics".

[Chart: search index trend over roughly the last 18 months - "Big Data" in blue, "Big Data Analytics" in red]

Two things pop out.  One, Big Data started before any thought was given to the analytics side of the equation.  Second, the analytics side of Big Data is not gathering as much speed as Big Data broadly.

This is troubling.  Why?  Because in the end Big Data should not be about the acquisition, manipulation and storage of just more data.  You have to know why you need to collect it, what you intend to do with it, and what question the data should help answer.  And that has everything to do with the kind of analytics you are going to subject the data to.

The real work of Big Data comes before you ever start collecting one kilobyte of data.  More on that next time.


Sold out - Yikes NYC Strata / Hadoop Conf.

Only in its second year, the O'Reilly Strata / Hadoop Conference is sold out?  Really?  Says a lot about either the popularity or the myth of Big Data.  We choose to believe the crush of attendees is about the popularity and importance of Big Data and not people trying to add to the mythology of it all.

Then if you check the Data Business Meetup happening at Bloomberg headquarters during DataWeek NYC, you find 390 attendees who want to hear from the all-star VC casting call.  No spots left to attend this either.

Lots of Big Data talk and walk next week in NYC.  We'll be making the rounds - look for us!


Data Emanation

From our devices, our check-ins, our emails and interactions - even our bodies - we emanate data.  A lot of it.  Get your hands on enough of it and let a machine learn from it, dispensing with theories or preconceived notions, and that's what Big Data is.

It's why Big Data is different from Business Intelligence.  It's why Big Data is different from the last IT buzzword and why it has staying power.  Because the data is not going away.  It's only going to grow and move faster.  We needed new tools, new thinking and new velocity to utilize this raw resource from the internet to gain insight, make customers happy, avoid mistakes, beat the competition, discover new medicines and make government work better.  So we got Big Data.

Long live Big Data!


Big Data Spending to Double by 2016

Gartner does a bang-up job racking and stacking the numbers to size up a market.  Here they have done it by looking at all aspects of spending on Big Data.  This includes the continued need to understand customers, the use of social media to do so, and buying Big Data expertise even as companies build those capabilities in-house.

Rest of the article is here.


Big Data Geeks - hiding in plain sight

Our guess is you have not been handed a business card that declares someone's title as "Big Data Scientist" or "VP of Big Data" just yet.  No worries.  Professionals who wield power over data have long been among us - just perhaps not in the most glamorous of roles.  Until now.

Turns out the new rock stars are those who can whiz through mountains of data.  And typically they are called actuaries, economists, mathematicians or, to pull a really old-ish title out of the bin, operations researchers.  You gotta love this interactive graph from McKinsey that shows how many Big Data pros there are already and where they work.  Snatch them up if you can!

Find the rest of the report here.
