Wednesday, January 22, 2014

Motion Chart 2

[Motion chart: JobsData. Data: march. Generated with R version 2.15.2 Patched (2013-01-31 r61797) and googleVis-0.4.2.]

Tuesday, May 7, 2013

A GoogleVis Motion Chart in R

Here is a motion chart I made using R's googleVis package. The x axis represents each state's per capita spending on elementary and secondary education, and the y axis represents each state's high school graduation rate.

Monday, March 4, 2013

The R Learning Curve in progress

About a month ago I completed Computing for Data Analysis from Coursera.org and blogged about my learning experience. I have continued to play with R and its various packages. I am taking a Data Analytics and Visualization course as part of my Master's program, which is keeping me motivated to learn more R.

Over most of last month, I've been looking for R code I could learn from. Here are some projects/examples in R with code:

Making sense of the data from the Lending Club 
http://www.dataspora.com/2011/10/mining-lending-clubs-goldmine-of-loan-data-part-i-of-ii-visualizations-by-state/

Analysis of a dataset from Prosper.com
http://www.dataspora.com/2011/12/visualization-of-prosper-com%E2%80%99s-loan-data-part-i-compare-and-contrast-with-lending-club/

Visualizing connections
http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/

Predictive Analytics 
Preparing Data
http://horicky.blogspot.com/2012/05/predictive-analytics-overview-and-data.html

Sampling to ensure the training data is representative and fits within the machine's processing capacity
http://horicky.blogspot.com/2012/05/predictive-analytics-data-preparation.html

Machine learning techniques to build a predictive model
http://horicky.blogspot.com/2012/05/predictive-analytics-generalized-linear.html

More machine learning techniques
http://horicky.blogspot.com/2012/06/predictive-analytics-neuralnet-bayesian.html
http://horicky.blogspot.com/2012/06/predictive-analytics-decision-tree-and.html

Data Analysis Examples
http://www.ats.ucla.edu/stat/dae/

Please feel free to add more examples or projects.

Sunday, February 10, 2013

Resources to speed the R learning curve

I recently blogged about the learning curve in R and posted it on Hacker News. The response was overwhelming, with lots of suggestions on where to go for more resources and further learning. I have compiled them for those looking for resources to speed up the R learning curve.

https://github.com/hadley/devtools/wiki  A place to learn R like a programming language, focusing on cross-cutting concerns and general concepts

http://www.r-bloggers.com/ An R blogging platform that has done a great job of promoting R and encouraging the community. It gives a good sense of the state of R

http://tryr.codeschool.com/ Free R Tutorials from Code School and O'Reilly

http://cbio.ensmp.fr/~thocking/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf Fast, named capture regular expressions in R

http://www.win-vector.com/blog/2009/09/survive-r/ Survival guide

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf A manual by Patrick Burns for R developers, with lots of useful tips and tricks for reducing memory usage, improving performance, and avoiding errors in computational analysis

http://blog.revolutionanalytics.com/ Blog from the staff of Revolution Analytics on using R for big data analysis, predictive modeling, data science and more
 
More Tutorials



http://cran.r-project.org/manuals.html Docs 


http://onepager.togaware.com/ Hands-on Data Science with R
http://stackoverflow.com/questions/tagged/r?sort=votes&p... StackOverflow


Cheat sheets



R Journal


http://rseek.org/ Rseek search engine
  
http://www.r-chart.com/ Experiences of a web application and database developer whose toolkit includes R
 
Books
http://www.amazon.com/gp/product/0387981403 ggplot2: Elegant Graphics for Data Analysis  by Hadley Wickham
http://www.amazon.com/dp/1449316956/ref=cm_sw_su_dp R Graphics Cookbook  by Winston Chang
 
For more books relating to R
http://www.r-project.org/doc/bib/R-books.html

Thursday, February 7, 2013

The R Learning Curve

R is meant for statistical computing. Developed in New Zealand by two professors of statistics, it is often referred to as the language written by statisticians for statisticians. R is a GNU project and is available as free software. Recently, it has found favor with many data analysts as big data takes center stage in many businesses and there is an increased appetite for flexible tools that can be fine-tuned to match individual requirements.

The R manuals, available on the R project website, have very clear explanations of how to install R and guidance on the different R packages. The manuals also offer some basic tutorials on using R for statistical computing and plotting graphs. The Internet is a great resource for insights on how to get things done in R. Places like Stack Overflow often offer more than one technique for the same task.

I first tried R last spring before starting grad school. It was easy to install and set up. The R project website has very straightforward information on setting up R. There are also several videos on YouTube that walk through the installation.

The initial learning experience is fun, especially if one is familiar with statistics. You don't have to type print to get an answer. The basic syntax felt like typing into a calculator. Most questions that pop into your head have an answer in a manual or one of the numerous websites out there. But that is where the honeymoon ends.

Once I got hooked on R, I decided it was time for some formal learning. Coursera was offering Computing for Data Analysis with R, so I signed up. I blogged about my class experience recently. From my experience, the most challenging areas once one gets the hang of R are:

Cleaning the data - This takes time and it can be annoying. I mean thorn-in-the-flesh annoying. For me, it was trial and error. I found that regular expressions in R are a great way to isolate the string one is looking for and to get a data set with values that can be worked on to tackle the problem at hand.
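
A minimal sketch of this kind of regular-expression cleanup, using only base R. The sample strings and variable names below are made up for illustration and are not taken from the course data.

```r
# Made-up messy values, standing in for a column read from a raw file.
raw <- c("$1,250.50", " 980 ", "N/A", "$3,400")

# Drop everything that is not a digit or a decimal point, then convert.
# Entries with no digits become empty strings, which as.numeric() turns into NA.
cleaned <- gsub("[^0-9.]", "", raw)
values <- as.numeric(cleaned)

# Made-up free-text notes; isolate the number that sits just before a "%".
notes <- c("grad rate: 87.5%", "rate unknown", "grad rate: 91%")
m <- regexpr("[0-9]+(\\.[0-9]+)?(?=%)", notes, perl = TRUE)

# regexpr() returns -1 where nothing matched, so fill those slots with NA.
rates <- rep(NA_real_, length(notes))
rates[m > 0] <- as.numeric(regmatches(notes, m))
```

Here gsub() rewrites each string, while regexpr() with regmatches() pulls out only the part that matched, which is often the safer choice when the surrounding text varies.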

Finding the right package - This one is tricky. Not everything you want to do in R can be done with the base installation, so you will need to download packages. Reading what others have to say about a package's functionality and matching it to your needs is the best way to go about this. Once you know the name of the package, a Google search will help you locate it, and most packages are easy to install and load.
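
Installing and loading a package usually comes down to two base R calls, install.packages() and library(). The helper below is a hypothetical convenience wrapper (the name load_or_install is my own, not part of R) that installs a package only when it is missing. It is a sketch and assumes a CRAN mirror is reachable whenever an install is actually needed.

```r
# Hypothetical helper: install a package only if it is not already
# available, then attach it to the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

# "tools" ships with R, so this call only loads it and installs nothing.
load_or_install("tools")
```

Replacing "tools" with the name of a contributed package, for example "googleVis", would trigger a download the first time it is called.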

Writing Functions, Using Loops and Control Structures: Like other languages, this is purely the product of deliberate learning and practice. Unlike other languages, it is hard to come across snippets of code that do exactly what you are looking for. For me, this was the single most challenging part of learning R. Most help communities assume that you have some understanding of how R code works and that you are familiar with the commands. My solution was to keep searching until I found what I needed. It was not easy. I slogged away at my computer, trying different ways to extract the information I wanted. When I finally nailed it after several iterations, I felt a sense of accomplishment that would have escaped me had I just copied and tidied up someone else's code.
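
For readers at the same stage, here is a small sketch that combines a function, a for loop, and an if statement in base R. The function name column_means and the tiny data frame are invented for illustration. This is not code from the course.

```r
# Hypothetical example: return the mean of every numeric column in a
# data frame, skipping non-numeric columns and ignoring missing values.
column_means <- function(df) {
  result <- numeric(0)
  for (name in names(df)) {        # loop over the column names
    column <- df[[name]]
    if (is.numeric(column)) {      # control structure: numeric columns only
      result[name] <- mean(column, na.rm = TRUE)
    }
  }
  result
}

# Made-up data with one text column and one missing value.
scores <- data.frame(id = c("a", "b", "c"),
                     x = c(1, 2, NA),
                     y = c(10, 20, 30))
means <- column_means(scores)
```

Experienced R users would often write this with sapply() instead of an explicit loop, but the loop makes each step visible, which helps while learning.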

Graphs and Plots: I found this exciting but there is a lot more to learn here. The class touched many aspects of graphing in R but there is a lot more to do.

Overall, learning R has definitely been time-consuming and frustrating, but ultimately rewarding.

Wednesday, January 30, 2013

My most recent MOOC experience

I just finished a four-week course, Computing for Data Analysis, from Coursera. It was both exciting and challenging. Exciting because I was learning how to use R in a formal manner. I've played with some code in R, but I've always wanted to learn it in a more structured way, and this was a great opportunity. Challenging because I had to put in 5 to 6 hours of work every week to listen to the videos, play around with enough code to do the assignments, and attempt the quizzes. I am glad I was not taking any grad coursework during the J-Term.

The videos were excellent. I love the way Dr. Roger D. Peng makes you feel like he is talking exclusively to you. The discussion forums were excellent too, and they were a great resource to turn to when I was stuck.

The MOOCs I attended earlier were on topics I was doing as part of my coursework, or had taken courses on as a student. So, the learning experience here was different.

The learning process is iterative: you keep going back to the drawing board, trying a new way, and then suddenly you nail it. Like any college-level course, some parts are very interesting and some not so much.

The sheer amount of self-motivation required to keep yourself going is exhausting. Independent learning and some understanding of the topic are key ingredients for success in MOOCs. I have taken several classes in statistics, so my struggle was with the computing part of the course. If I were not familiar with statistics, I might have struggled even more, or maybe just given up. All the same, there were times I wanted to just walk away and say 'I am not made for this', but I hung on and I've done it!

My only complaint has more to do with R than with the coursework. My greatest challenge was deciphering the online exchanges and discussions on Stack Overflow and in other communities.

To finish off, I would recommend trying a relatively new topic through a MOOC. It is a great way to challenge yourself and to make use of the opportunity to take courses offered by some of the best universities in the world and taught by some of the best minds in the field.

Friday, November 30, 2012

More places to learn Hadoop

This is a follow-up to my earlier post on starting to learn Hadoop. I came across a couple of interesting links to learning Hadoop and increasing understanding on how it works -


Learning “Machine Learning” by Example is a Meetup which you can join off-site. It is a good starting place for someone who is new to analytics. It is a wonderful opportunity to learn the concepts underlying Machine Learning and is based on the “learning by example” principle.

Here is a well-constructed diagram of how Big Data gets stored in and retrieved from a distributed file system like Hadoop.

12/3

Just came across these tutorials and think they are awesome:
http://www.michael-noll.com/tutorials/
Yahoo's Hadoop Tutorial http://developer.yahoo.com/hadoop/tutorial/module1.html