According to an IBM report, openings
for data analytics jobs in the US will rise to 2.72 million by 2020. It
is not surprising that quite a number of people still use
spreadsheets like Excel or Google Sheets to crunch numbers, while
others use proprietary statistical software such as SAS or Stata.
While Excel and SAS are powerful tools, they have their limitations.
For example, Excel cannot handle data sets above a certain size. Tools
like SAS are closed source, so there are no outside contributors who
can add newer features to them. This leaves a big gap for people who
want to do complex analytics and customize it to their needs. The next
step for people who have reached the limits of these programs is to
learn R or Python.
Data analysts and data scientists use R
and Python extensively. Both are
open source. For anyone interested in machine learning,
working with large datasets, or creating complex data visualizations, R
and Python come in handy. R is geared more toward statistical analysis,
while Python is more of a general purpose programming language.
People often ask which one is better to
learn: R or Python? Python is
better for data manipulation and repeated tasks, while R is good
for ad hoc analysis and exploring datasets. For example, take text
analysis, where you want to deconstruct paragraphs into words or
phrases and then identify patterns. In this use case R is better suited
and makes it simple. On the other hand, if the work runs from pulling
the data, to running automated analyses over and over, to producing
visualizations like maps and charts from the results, then Python is
the better fit.
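As a rough illustration of the text analysis task mentioned above
(deconstructing a paragraph into words and looking for patterns), here
is a minimal sketch using only Python's standard library. The function
name word_frequencies and the sample sentence are made up for this
example. It only shows the mechanics of the task and says nothing about
which language handles it better.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Split text into lowercase words and return the top_n most common.

    Hypothetical helper for illustration only.
    """
    # Keep runs of letters and apostrophes; everything else is a separator.
    words = re.findall(r"[a-z']+", text.lower())
    # Counter tallies each word; most_common returns (word, count) pairs.
    return Counter(words).most_common(top_n)


sample = "R is good for ad hoc analysis. Python is good for repeated tasks."
print(word_frequencies(sample, top_n=3))
```

A real analysis would usually add steps such as removing very common
words or grouping words into phrases before counting.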
Comparing learning curves, Python
is relatively easy to learn,
whereas R may be a bit intimidating for beginners. Another
advantage of Python is that it is a general purpose programming
language, which makes it easy to use for things other than analytics.
While Python is more like a programming language and is suited to
programmers, R is more of a statistical language and may be confusing.
But for data analysis, the differences
between R and Python are
starting to diminish. Most of the common tasks once associated with one
language or the other are now doable in both, so choosing between them
is largely a matter of personal preference. As you can see, Python and
R both have their pros and cons. Selecting one over the other will
depend on the use cases, the cost of learning, and other common tools.
When to use R?
R is mainly used when the data analysis
task requires standalone computing or analysis on individual servers.
When getting started with R, a good first step is to install the
RStudio IDE.
For easy quickstart analysis, you can use the following
When to use Python?
If you need to integrate data analysis tasks with web apps, or if
statistics code needs to be incorporated into a production database,
then you should probably use Python. Being a full-fledged programming
language, it’s a great tool for implementing algorithms for production
use.
Being a general purpose language, Python
did not have many data analysis
packages in the past, but this has improved
significantly over the years. To get started with Python for data
analytics, install NumPy/SciPy (scientific
computing) and pandas (data manipulation) to make
Python usable for data analysis. Also have a look at matplotlib
for graphics and scikit-learn for machine learning.
As for Python IDEs, have a look at Spyder,
IPython Notebook and Rodeo to see which one best fits your needs.
Given this background, there is a growing
group of individuals using a
combination of both languages when appropriate. If you’re planning to
start a career in data science, you would do well to learn both
languages. Job trends indicate an increasing demand for both skills,
and wages are well above average.
R: Pros and Cons
Pro: A picture says more than a thousand words
Visualized data can often be understood more efficiently and
effectively than the raw numbers alone. R and visualization are a
perfect match. Some must-see visualization packages are ggplot2, ggvis,
googleVis and rCharts.
R has a rich ecosystem of cutting-edge packages and an active
community. Packages are available at CRAN, Bioconductor and GitHub. You
can search through all R packages at Rdocumentation.
R lingua franca of
R is developed by statisticians for statisticians. They can communicate
ideas and concepts through R code and packages, and you don’t
necessarily need a computer science background to get started.
Furthermore, it is increasingly adopted outside of academia.
R is slow
R was developed to make the life of statisticians easier, not the life
of your computer. R can also feel slow because of poorly written code,
but there are multiple projects that aim to improve R’s performance:
pqR, Renjin, FastR, Riposte and many more.
R has a steep learning curve
R’s learning curve is non-trivial, especially if you are coming from a
GUI-based tool for your statistical analysis. Even finding packages can
be time consuming if you’re not familiar with R.
Python: Pros and Cons
Pro: IPython Notebook
The IPython Notebook makes it easier to work with Python and data. You
can easily share notebooks with colleagues without them having to
install anything. This drastically reduces the overhead of
organizing code, output and notes files, and allows you to spend
more time doing real work.
A general purpose language
Python is a general purpose language that is easy and intuitive. This
gives it a relatively flat learning curve, and it increases the speed
at which you can write a program. In short, you need less
time to code and you have more time to play around with it!
Furthermore, Python’s testing framework
is built in and has a low barrier to entry, which encourages good test
coverage. This helps keep your code reusable and dependable.
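Here is a minimal sketch of what such a test can look like, assuming the
built-in framework referred to above is the standard-library unittest
module. The mean function and the MeanTest class are hypothetical names
made up for this illustration.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        # An empty list has no mean, so the function should raise.
        with self.assertRaises(ValueError):
            mean([])


# In a script you would normally call unittest.main(); building and
# running the suite explicitly also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Nothing needs to be installed to run this, which is what makes the
barrier to entry low.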
A multi purpose language
Python brings people with different backgrounds together. Because it is
a common, easy to understand language that programmers already know and
statisticians can easily learn, you can build a single tool that
integrates with every part of your workflow.
Visualizations are an important criterion when choosing data analysis
software. Although Python has some nice visualization libraries, such
as Seaborn, Bokeh and Pygal, there may be too many options to choose
from. Moreover, compared to R, visualizations are usually more
convoluted to produce, and the results are not always as pleasing to
the eye.
Python is a challenger
Python is a challenger to R. It does not yet offer an alternative to
the hundreds of essential R packages. It is, however, catching up.