Thursday, February 7, 2013

The R Learning Curve

R is meant for statistical computing. Developed in New Zealand by two professors of statistics, it is often referred to as the language written by statisticians for statisticians. R is a GNU Project and is available as free software. Recently, it has found favor with many data analysts as big data takes center stage in many businesses and there is an increased appetite for flexible tools that can be fine-tuned to match individual requirements.

The R manuals, available on the R Project website, have very clear explanations of how to install R and guidance on the different R packages. The manuals also offer some basic tutorials on using R for statistical computing and plotting graphs. The Internet is a great resource for insights on how to get things done in R. Places like Stack Overflow often offer more than one technique for the same task.

I first tried R last spring before starting grad school. It was easy to set up and install. The R Project website has very straightforward information on setting up R. There are several videos on YouTube as well that help one install R.

The initial learning experience is fun, especially if one is familiar with statistics. You don't have to type print to get an answer. The basic syntax felt like typing into a calculator. Most questions that pop into your head have an answer in a manual or one of the numerous websites out there. But that is where the honeymoon ends.

Once I got hooked on R, I decided it was time for some formal learning. Coursera was offering Computing for Data Analysis with R. I signed up. I blogged about the class experience recently. From my experience, the most challenging areas once one gets the hang of R are:

Cleaning the data: This takes time and it can be annoying. I mean thorn-in-the-flesh annoying. For me, it was trial and error. I found that regular expressions in R are a great way to isolate the string one is looking for and to get a data set with values that can be worked on to tackle the problem at hand.
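As a rough illustration of the kind of regular-expression cleanup described above, here is a minimal sketch in base R. The raw values are made up for the example and are not from the class data; the pattern simply strips everything that is not a digit or a decimal point.

```r
# Made-up raw values, as they might arrive in a messy text column
raw <- c("Price: $1,200.50", "price: $89", "PRICE:  $15.75 ", "n/a")

# grepl() flags the entries that contain at least one digit
has_price <- grepl("[0-9]", raw)

# Start with all-missing values, then fill in the entries we can parse
price <- rep(NA_real_, length(raw))

# gsub() deletes every character that is not a digit or a decimal point,
# and as.numeric() converts what is left into a number
price[has_price] <- as.numeric(gsub("[^0-9.]", "", raw[has_price]))

price
```

Entries without any digits stay as NA, so they can be counted or dropped later rather than silently turned into zeros.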

Finding the right package: This one is tricky. Not everything you want to do in R can be done with the basic download. You will need to download packages. Reading what others have to say about a package's functionality and matching it to your needs is the best way to go about this. Once you know the name of the package, a Google search can easily help you locate it, and most packages can be easily installed and loaded.
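Once the package name is known, installing and loading it usually takes two calls, install.packages() and library(). The sketch below wraps them in a small helper; the name load_or_install is hypothetical, chosen for this example, and installing assumes an Internet connection and a configured CRAN mirror.

```r
# Hypothetical helper: load a package, installing it from CRAN first if needed
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call does not download anything
load_or_install("stats")
```

For a package that is not yet installed, the same call would fetch it first and then attach it.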

Writing Functions, Using Loops and Control Structures: As with other languages, this is purely the product of deliberate learning and practice. Unlike with other languages, I found it hard to come across snippets of code that did exactly what I was looking for. For me, this was the single most challenging part of learning R. Most help communities assume that you have some understanding of how R code works and that you are familiar with the commands. My solution was to keep searching till I found it. It was not easy. I ground away on my computer trying different ways to extract the information I wanted. When I finally nailed it after several iterations, I felt a sense of accomplishment that would have escaped me had I just copied and tidied up the code.
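For readers who have not seen these pieces together, here is a small sketch that combines a function, a for loop, and if/else in base R. The function name label_values and the scores are invented for illustration; this is not code from the class.

```r
# Hypothetical function: label each value relative to the mean of the vector
label_values <- function(x) {
  avg <- mean(x, na.rm = TRUE)      # ignore missing values when averaging
  labels <- character(length(x))    # pre-allocate the result vector
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      labels[i] <- "missing"
    } else if (x[i] > avg) {
      labels[i] <- "above"
    } else {
      labels[i] <- "at or below"
    }
  }
  labels                            # the last expression is the return value
}

scores <- c(70, 85, NA, 92, 60)     # made-up data
result <- label_values(scores)
result
```

The loop is written out for clarity; the same result could also be produced with vectorized functions such as ifelse(), which is often the more idiomatic route in R.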

Graphs and Plots: I found this exciting, but I still have a lot to learn here. The class touched on many aspects of graphing in R, but there is much more to explore.

Overall, learning R has definitely been time-consuming and frustrating, but ultimately rewarding.

5 comments:

  1. I think I'll have to go back and play with R. I tried the Coursera course, but the lectures confused me quite a bit.

  2. You might want to look at this course as well
    http://www.codeschool.com/courses/try-r

  3. If you keep going further in formal learning, you will soon ask yourself, "How do I create my own plot and summary methods?" Then the three class system of R will really blow your mind.

    I have enough experience with R to say I don't know where R fits in the language niche. R is super slow if you consider building a model inside it, and greedy..."yom yom I like your 32 GB of RAM" <- uncommon if you work with a large DB.
    As you said, "R is meant for statistical computing", so I expected the language to be "fast" and "efficient". Check some benchmarks at http://julialang.org/

    So I restate: R is meant to make it easy for you to program and solve statistical problems, at the cost of performance.

  4. There are much better applications for the extract, transform, and load (ETL) of data, including Pentaho, which ease the burden of cleaning data.

  5. Coursera is running another data analysis course in R through Johns Hopkins: https://www.coursera.org/course/dataanalysis
