At Big Data Lens we use a lot of tools to estimate Machine Learning models.  Specifically, we have employed R, Weka, Orange, and an online service called BigML.  We won't get into deep technical comparisons here; just the good, the bad, and the ugly of when and how to use one tool over another.

BigML is an online service for uploading data and creating Machine Learning models in a fairly automated way.  It is fantastically easy to use, very graphically driven, and works quickly to get you answers.  The number of estimation techniques is limited, though they seem to be adding more all the time.  What we really like is the ability to have the model spit itself out directly as code that can be deployed right away, which saves a lot of time.  Small data sets and models are free.  When you need something fast and don't need to mull over a dozen estimation techniques, BigML works great.  We like these guys, so give them a try.

Orange is a slick desktop app that lets you flow your data through a number of steps and estimation techniques.  It uses widgets and connectors on the GUI to move you from raw data to finished model.  The number of estimation techniques here is greater than with BigML.  Estimation reporting is good, but the true statistician in you may find it frustrating to compare models beyond summary measurements.  Orange includes modules for bioinformatics, making it unique and rich for the life-science crowd.  It also includes a text-mining module, but as experts in this field we can tell you to avoid that part of Orange.  Overall it is a bit richer and more sophisticated than BigML, and appropriate for quick comparisons across estimation techniques.

Weka is a university-based project that also runs a MOOC.  Now in its third generation, it includes far more estimation techniques than our first two tools.  But for all its power in estimation techniques, it lacks in the user-interface department.  In our testing and use, stability sometimes seemed below par even though our data sets were relatively small.  Still, if you like a community approach with lots of plug-ins and don't care about the GUI, Weka can be a good choice.  Additionally, for the stats maven in you, Weka will report all the dirty laundry you want.

R is like Weka, only better in our estimation.  The community seems bigger (some will probably argue this), the product seems more stable, and there is more than one good IDE to ride on top of R; we use RStudio, for example.  But for us the most important differentiator is the seemingly endless list of estimation techniques you can find and deploy within R.  While writing this post we counted 5,527 packages on the CRAN mirror that you can install inside R, and checking the last 3 days suggested more than 10 new or updated packages each day.  For difficult problems you would be hard pressed not to find an available estimation package to try.  It's true the graphics require further packages, you have to write your own (sometimes lengthy) scripts, and you have to read somewhat verbose logs to do cross-comparisons of model outputs.  But the control and variety in solving the toughest Machine Learning problems make R hard to beat.  Now they just need a better logo.
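To give a feel for the workflow described above, here is a minimal sketch of pulling a package from CRAN and fitting a model inside R. The package (rpart) and the built-in iris data set are just illustrative choices; any of the thousands of CRAN packages would slot in the same way:

```r
# Install an estimation package from a CRAN mirror (one-time step)
install.packages("rpart")
library(rpart)

# Fit a decision tree on R's built-in iris data set
model <- rpart(Species ~ ., data = iris)

# Print the cross-validation table used for pruning decisions --
# an example of the verbose-but-complete reporting we mention above
printcp(model)
```

Swapping in a different technique usually means changing only the package and the fitting call, which is what makes trying many estimation approaches so cheap in R.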
