Friday, February 6, 2015

Data Scientist Toolkit



The history of technology is the history of the invention of tools and techniques, and is similar in many ways to the history of humanity. And since data scientists are mere mortals, they too need tools to make their work more productive and even enjoyable (though that last part may be just me). In this article we'll look at the main languages and tools used by data scientists. For those who have recently entered the field, it should serve as a useful overview of the most commonly used tools.

R


A great advantage of R is that scientists have adopted it as their de facto standard. As a consequence, the latest cutting-edge techniques often become available in R first. It also seems to be the preference of most Kaggle competition winners. Most R practitioners use RStudio, and while it comes in a free community version, the enterprise edition is a bit expensive. I use the community version only when I compete on Kaggle (haven't won anything yet :( ). For commercial work I find Sublime Text a good alternative, since I use R only for proofs of concept; I'll return to this point later in my Python discussion. It's a very good and much cheaper editor that supports R's syntax and comes bundled with lots of top-notch features.

Matlab/Octave


Matlab has been a top choice of algorithm developers for years. It's a full-blown solution with packages for nearly every scientific field. The pricing, however, is what drove many away from it, as even a home-use licence costs around $200. Octave fills that gap as an open source alternative for running Matlab code. It's completely free, but you get what you pay for. It has traditionally shipped without a full IDE and is normally used through its interactive command-line interface. Paired with Sublime Text, however, it provides a decent alternative to Matlab if you don't need anything too fancy.

One thing to be aware of is that Octave's developers try to make Octave's syntax "superior" to Matlab's. If it tries to be "better", it also tries to be different, which is not in line with the reasons most people use it. In my experience, code developed in Matlab never runs in Octave on the first try, except for really simple, really short scripts. For any sizeable function, I always have to translate a lot of code before it works in Octave, if not rewrite it from scratch.

Scala


The data boom was sparked by the appearance of the Hadoop ecosystem. Since Hadoop and most of its supporting tools are developed in Java, it sort of makes sense to analyse the data using the same language. Java, however, is not a very intuitive or easy language to learn, especially for non-programmers. Scala, which runs on the same Java Virtual Machine, was created specifically with the goal of being a better language, shedding those aspects of Java which it considered restrictive, overly tedious, or frustrating for the developer. Despite some presence in the data science community, it is still used mostly by data engineers, and with Python's brisk advance in the Big Data domain, it has lost even more of its relevance.

Python


According to Gartner:
"the need for data scientists growing at about 3x those for statisticians and BI analysts, and an anticipated 100,000+ person analytic talent shortage through 2020"
And as enterprises struggle to put data to work, they're also struggling to find qualified data scientists. More often than not, however, such data scientists may already work for them and likely have some familiarity with Python. It is also much easier, from a development perspective, to implement everything in one language, since those algorithms will need to make their way into a working product some day.

As Tal Yarkoni pointed out:
"Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses"
While R and Matlab production servers do exist, they are immensely expensive. That's why, for decades, Matlab programs were reimplemented in C++ or Java to cut costs. Python, however, can be used on any machine, and with services like Amazon EC2 you can pay per hour of usage, making it affordable even for a budget-constrained start-up. Moreover, because Python is a general-purpose, object-oriented programming language, it's easier to write large-scale, maintainable, and robust code with it than with R or Matlab. Using Python, the prototype code that you write on your own computer can be used as production code if needed, thus cutting time to market enormously.
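To make the "one language" point concrete, here is a minimal sketch of parsing raw text and analysing it in the same Python script, with no intermediate file handed off to another tool. The sample data, the column names, and the function names `parse_readings` and `summarise` are all made up for illustration, and only the standard library is used.

```python
import csv
import io
import statistics

# Hypothetical raw input: one "sensor,reading" record per line,
# with a header row and one malformed reading that should be skipped.
RAW_TEXT = """sensor,reading
a,1.5
b,2.5
a,oops
b,4.0
a,3.5
"""


def parse_readings(text):
    """Parse the raw text into {sensor: [float, ...]}, skipping bad rows."""
    readings = {}
    for row in csv.DictReader(io.StringIO(text)):
        try:
            value = float(row["reading"])
        except ValueError:
            continue  # ignore rows whose reading is not a number
        readings.setdefault(row["sensor"], []).append(value)
    return readings


def summarise(readings):
    """Compute the mean reading per sensor directly on the parsed data."""
    return {sensor: statistics.mean(values) for sensor, values in readings.items()}


if __name__ == "__main__":
    parsed = parse_readings(RAW_TEXT)
    print(summarise(parsed))
```

The same two functions could be imported unchanged by a production service, which is roughly what reusing prototype code as production code looks like in practice.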

Python still lacks some of R's richness for data analysis, but it is closing the gap really fast. There are plenty of packages for every taste, many of them implemented in C, which makes them very fast.
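As a rough illustration of why C-backed code matters, the sketch below times a hand-written Python loop against the built-in `sum()`, which CPython implements in C. It is only a stand-in: the article does not name specific packages, and the assumption here is that numeric libraries such as NumPy benefit from the same principle on a larger scale. The function names are invented for the example, and actual timings will vary by machine.

```python
import timeit


def python_loop_sum(values):
    """Add the values one by one in interpreted Python bytecode."""
    total = 0
    for value in values:
        total += value
    return total


def c_backed_sum(values):
    """Delegate the same work to the built-in sum(), implemented in C in CPython."""
    return sum(values)


if __name__ == "__main__":
    data = list(range(1_000_000))
    # Both approaches must agree before their speed is compared.
    assert python_loop_sum(data) == c_backed_sum(data)
    loop_time = timeit.timeit(lambda: python_loop_sum(data), number=10)
    builtin_time = timeit.timeit(lambda: c_backed_sum(data), number=10)
    print(f"pure-Python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```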

I hope you enjoyed the article. Feel free to share and comment.