Friday, February 20, 2015

R dynamic report generation with Knitr


Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.

              Donald E. Knuth, Literate Programming, 1984


Overview and Motivation


So what is dynamic documentation, and why do we need it? As opposed to usual programs, R programs are often intended to be read as reports by people who are not developers, whether they are data scientists, statisticians or managers. Moreover, by nature, R programs don't tend to be huge, spanning hundreds of thousands of lines of code. All this led to a huge demand for a good documentation framework. But how do you document a report? One way, of course, is to write a passage and paste a copied graph into it; however, once something changes you must re-copy every graph, which is a tedious and unrewarding procedure.

Knitr


Knitr is an R package, developed by Yihui Xie, that allows straightforward integration of R code into written reports. It is very powerful, easy to get started with, and has a lot of potential uses.
All you need is to create an .rmd file and then use regular Markdown syntax with some additional features. For example:
## Loading and preprocessing the data
```{r echo = TRUE}
data <- read.csv("activity.csv")  # load the raw dataset
summary(data)                     # quick look at its structure
```
Here we create a header and then insert a code snippet wrapped in ```{r} ```. echo = TRUE tells knitr to render both the code and its results. That's it - it's that easy. The variables you create in one section are visible in the others, so you can write your program as usual, using any packages or functions you want. In the end, the knitting process parses your document, runs all the R snippets and appends the code and the results, where needed, to the generated HTML or PDF file.
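Because every chunk runs in the same R session, a later chunk can build on an earlier one, and you can also embed computed values inline with `r `. Here is a minimal sketch (the steps column is my assumption about the activity dataset - adapt it to your own data):
The mean number of steps is `r mean(data$steps, na.rm = TRUE)`.
```{r histogram, echo = TRUE}
# Reuses the `data` variable created in the earlier chunk
hist(data$steps, main = "Steps per interval")
```
When knitted, the inline expression is replaced by its computed value, and the chunk produces both the code listing and the plot.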

Knitr is fully integrated with RStudio; however, if you're using another IDE or are just a fan of the console, you can always knit your program by running the following command:
Rscript -e "library(knitr); knit('./file-here.rmd')"
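If you'd rather produce the HTML in a single step, knitr also provides knit2html, which knits the file and then converts the resulting Markdown (shown here with default settings - adjust the path to your file):
Rscript -e "library(knitr); knit2html('./file-here.rmd')"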
For more advanced options and more detailed examples, please read the documentation and demo sections on the knitr site.

Publish


Sharing your report has always been an obstacle. An endless email thread with PDF attachments - sounds familiar? One of the best features of RStudio is the ability to publish Knitr reports at rpubs.com. You can then share the link with everyone who needs to view the report. Here is an example of my weather events analysis - Simple weather events analysis
Bear in mind that reports published on RPubs are publicly available, so you probably shouldn't publish anything classified there.
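If you prefer to publish from the console instead of RStudio's Publish button, the markdown package ships an rpubsUpload() helper. Treat the snippet below as a rough sketch - argument names and returned fields may vary between package versions, so check ?rpubsUpload first:
library(markdown)
# Sketch only: arguments may differ across markdown package versions
result <- rpubsUpload("Simple weather events analysis", "file-here.html")
if (!is.null(result$continueUrl))
  browseURL(result$continueUrl)  # finish the upload in the browser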

Hope you enjoyed the article and stay tuned for the next one, of course :)
Friday, February 6, 2015

Data Scientist Toolkit



The history of technology is the history of the invention of tools and techniques, and in many ways it mirrors the history of humanity. Since data scientists are mere mortals, they too need tools to make their work more productive and even enjoyable - but that's just me. In this article we'll be talking about the main languages and tools used by data scientists. For those who have recently entered the field, it should be a good overview of the most commonly used tools.

R


A great advantage of R is that scientists adopted it as their de facto standard. As a consequence, the latest cutting-edge techniques are usually available in R first. It also seems to be the preference of most Kaggle competition winners. Most R practitioners use RStudio; while it offers a free community version, the enterprise edition is a bit expensive. I use the community version only when I compete on Kaggle (haven't won anything yet :( ). Commercially, I find Sublime Text a good alternative, since I use R only for proofs of concept - I'll return to this point later in the Python discussion. It's a very good and much cheaper editor that fully supports R's syntax and comes bundled with lots of top-notch features.

Matlab/Octave


Matlab has been a top preference of algorithmists for years. It's a full-blown solution with packages for every possible scientific field. The pricing, however, is what drove many away from it, as even a home-use licence costs around $200. As a result, Octave was created to provide an open-source alternative that can run Matlab code. It's completely free, but you get what you pay for: there is no IDE, and it is normally used through its interactive command-line interface. Paired with Sublime Text, however, it provides a decent alternative to Matlab if you don't need anything too fancy.

One thing to be aware of is that Octave's developers try to make Octave's syntax "superior" to Matlab's. But trying to be "better" means being different, which goes against the very reason most people use it. In my experience, code developed in Matlab never runs in Octave in one go, except for the really simple, really short stuff. For any sizeable function I always have to translate a lot before it works in Octave, if not rewrite it from scratch.

Scala


The data boom was sparked by the appearance of the Hadoop ecosystem. Since Hadoop and all its supporting tools are developed in Java, it sort of makes sense to analyse the data in the same language. Java, however, is not a very intuitive or easy-to-learn language, especially for non-programmers. Scala was created specifically with the goal of being a better language, shedding those aspects of Java its designers considered restrictive, overly tedious, or frustrating for the developer. Despite some adoption in the data science community, it remains mostly a language for data engineers, and with the brisk pace of Python in the Big Data domain it has lost even more of its relevance.

Python


According to Gartner:
the need for data scientists is growing at about 3x that for statisticians and BI analysts, with an anticipated 100,000+ person analytic talent shortage through 2020
And as enterprises struggle to put data to work, they're also struggling to find qualified data scientists. More often than not, however, such data scientists may already work for them, and they likely have some familiarity with Python. It is also much easier from the development perspective to implement everything in one language, since those algorithms will need to make their way into a working product some day.

As Tal Yarkoni pointed out:
Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses
While R and Matlab production servers do exist, they are immensely expensive. That's why, for decades, Matlab programs were reimplemented in C++ or Java to cut costs. Python, however, can be used on any machine, and with services like Amazon EC2 you pay per hour of usage, making it affordable for any budget-strapped start-up. Moreover, because Python is an object-oriented programming language, it's easier to write large-scale, maintainable, and robust code with it than with R or Matlab. The prototype code you write on your own computer can be used as production code if needed, cutting time to market enormously.

Python still lacks some of R's richness for data analysis, but it is closing the gap really fast. There are plenty of packages for every flavour of analysis, many of them implemented in C, which makes them extremely fast.

Hope you enjoyed the article and feel free to share and comment.