By: Ashley J
According to an IBM report, openings for data analytics jobs in the US will rise to 2.72 million by 2020. It is not surprising that quite a number of people still use spreadsheets such as Excel or Google Sheets to crunch numbers, while others use proprietary statistical software such as SAS, Stata or SPSS.
While Excel and SAS are powerful tools, they have their limitations. For example, Excel cannot handle data sets above a certain size. Tools like SAS are closed source, so there are no outside contributors who can add newer features to them. This leaves a big gap for people who want to do complex analytics and customize it to their needs. For those who have reached the limits of these programs, the next step is to learn R or Python.
Data analysts and data scientists use R and Python extensively, and both are open source. For anyone interested in machine learning, working with large datasets, or creating complex data visualizations, R and Python come in handy. R is geared more toward statistical analysis, while Python is geared more toward general purpose programming.
People often ask which one is better to learn: R or Python? Python is better for data manipulation and repeated tasks, while R is good for ad hoc analysis and exploring datasets. Take text analysis, for example, where you want to deconstruct paragraphs into words or phrases and then identify patterns. R is better suited to this use case and makes it simple. On the other hand, if the job runs from pulling the data, to running automated analyses over and over, to producing visualizations such as maps and charts from the results, then Python is better suited.
As for the learning curve, Python is relatively easy to learn, whereas R may be a bit intimidating for beginners. Another advantage of Python is that it is a general purpose programming language, which makes it easy to do things other than analytics. Python reads like a conventional programming language and suits programmers, whereas R is more of a statistical language and may be confusing for some.
For data analysis, however, the differences between R and Python are starting to diminish. Most of the common tasks once associated with one language or the other are now doable in both, so choosing between them is largely a matter of personal preference. Python and R both have their pros and cons, and selecting one over the other will depend on the use cases, the cost of learning, and the other common tools required.
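To make the automation case concrete, here is a minimal sketch of a repeatable analysis in Python using only the standard library. The sales figures, the column names `region` and `amount`, and the function name `summarize_sales` are all invented for illustration; a real pipeline would pull the data from a file or database rather than an in-memory string.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical sample data; in practice this would come from a file or database.
SAMPLE_CSV = """region,amount
north,120
south,80
north,100
south,60
"""


def summarize_sales(csv_text):
    """Return the mean amount per region for CSV text with 'region' and 'amount' columns."""
    amounts = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        amounts[row["region"]].append(float(row["amount"]))
    return {region: statistics.mean(values) for region, values in amounts.items()}


if __name__ == "__main__":
    # The same function can be re-run unchanged whenever new data arrives.
    for region, mean_amount in sorted(summarize_sales(SAMPLE_CSV).items()):
        print(f"{region}: {mean_amount:.1f}")
```

Because the analysis lives in a function, it can be scheduled, re-run on fresh data, or fed into a charting step without manual work in between.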
When to use R?
R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. When getting started with R, a good first step is to install the RStudio IDE. For easy quickstart analysis, you can use the following popular packages:
- dplyr, plyr and data.table to easily manipulate data,
- stringr to manipulate strings,
- zoo to work with regular and irregular time series,
- ggvis, lattice, and ggplot2 to visualize data, and
- caret for machine learning
When to use Python?
If you need to integrate data analysis tasks with web apps, or if statistics code needs to be incorporated into a production database, then you should probably use Python. Because it is a full-fledged programming language, it’s a great tool for implementing algorithms for production use.
Because Python is a general purpose language, it did not have data analysis packages in the past. This has improved significantly over the years. To get started with Python for data analytics, install NumPy and SciPy (scientific computing) and pandas (data manipulation). Also have a look at matplotlib for graphics and scikit-learn for machine learning.
Given this background, a growing group of people use a combination of both languages as appropriate. If you’re planning to start a career in data science, you will do well to learn both languages. Job trends indicate an increasing demand for both skills, and wages are well above average.
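As a rough illustration of what these packages add, the sketch below uses pandas and NumPy to group and summarize a small table. The data and column names are made up for this example, and it assumes both libraries are already installed (for instance with `pip install numpy pandas`).

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 14.0],
})

# pandas: split the rows by group and compute the mean of each group.
group_means = df.groupby("group")["value"].mean()

# NumPy: vectorized arithmetic on the whole column at once, with no explicit loop.
values = df["value"].to_numpy()
centered = values - np.mean(values)

print(group_means)
print(centered)
```

The same grouping and centering could be written with plain Python loops, but the library versions are shorter and operate on whole columns at a time.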
R: Pros and Cons
Pro: A picture says more than a thousand words
Visualized data can often be understood more efficiently and effectively than the raw numbers alone. R and visualization are a perfect match. Some must-see visualization packages are ggplot2, ggvis, googleVis and rCharts.
R has a rich ecosystem of cutting-edge packages and an active community. Packages are available at CRAN, Bioconductor and GitHub. You can search through all R packages at Rdocumentation.
Pro: R is the lingua franca of statistics
R is developed by statisticians for statisticians. They can communicate ideas and concepts through R code and packages, and you don’t necessarily need a computer science background to get started. Furthermore, it is increasingly adopted outside of academia.
Con: R is slow
R was developed to make the life of statisticians easier, not the life of your computer. Although R can be experienced as slow due to poorly written code, there are several projects that aim to improve R’s performance: pqR, renjin, FastR, Riposte and many more.
Con: R has a steep learning curve
R’s learning curve is non-trivial, especially if you are used to a GUI for your statistical analysis. Even finding packages can be time consuming if you’re not familiar with the ecosystem.
Python: Pros and Cons
Pro: IPython Notebook
The IPython Notebook makes it easier to work with Python and data. You can easily share notebooks with colleagues without them having to install anything. This drastically reduces the overhead of organizing code, output and notes files, which allows you to spend more time doing real work.
Pro: A general purpose language
Python is a general purpose language that is easy and intuitive. This gives it a relatively flat learning curve, and it increases the speed at which you can write a program. In short, you need less time to code and you have more time to play around with it!
Furthermore, Python ships with a built-in, low-barrier-to-entry testing framework that encourages good test coverage. This helps keep your code reusable and dependable.
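As a small illustration of that built-in testing support, the sketch below uses `unittest` from Python's standard library. The function `mean`, the class `MeanTests` and the helper `run_tests` are hypothetical names chosen for this example, not part of any particular project.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTests(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(mean([2, 4, 6]), 4)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


def run_tests():
    """Load and run the tests above, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Saved in a file, tests like these can also be discovered and run from the command line with `python -m unittest`.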
Pro: A multi purpose language
Python brings people with different backgrounds together. As a common, easy to understand language that is known by programmers and that can easily be learnt by statisticians, you can build a single tool that integrates with every part of your workflow.
Visualizations are an important criterion when choosing data analysis software. Although Python has some nice visualization libraries, such as Seaborn, Bokeh and Pygal, there are perhaps too many options to choose from. Moreover, compared to R, visualizations are usually more convoluted, and the results are not always so pleasing to the eye.
Con: Python is a challenger
Python is a challenger to R. It does not yet offer an alternative to the hundreds of essential R packages. It is, however, catching up.