By: Karthik Janar
This tutorial discusses some basic concepts of data science. A data scientist's job always starts with a question that needs to be answered. There are a few different kinds of questions a data scientist can ask, and the kind of question defines the goal of the analysis. The kinds start with descriptive and then go on to exploratory, inferential, predictive, causal, and mechanistic.
The goal of descriptive analysis is simply to describe a set of data. You are not trying to make any decisions based on it. It was the first kind of data analysis ever performed, and it is most commonly applied to census data. Describing the data and interpreting it are separate steps: first you describe the data, then you interpret what you have seen.
So a descriptive analysis focuses on describing what the data is, not necessarily on predicting or inferring anything from it. For example, if the data is the Singapore census, the analysis merely describes and interprets the population rather than trying to predict something. Another example is Google Ngrams (http://books.google.com/ngrams), which takes one or two words and plots how often those words are mentioned in books over time. It is merely an observation.
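As a rough illustration, a descriptive analysis can be as small as a handful of summary statistics. The sketch below uses only Python's standard library. The `ages` list and the `describe` helper are made up for this example and are not real census figures.

```python
import statistics

# Hypothetical ages from a tiny made-up census extract (not real data).
ages = [23, 35, 41, 29, 62, 35, 18, 47]

def describe(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Nothing here predicts or infers anything. The output only restates what is already in the data.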
The goal of exploratory analysis is to analyse some data and find new relationships that you did not previously know existed. Exploratory analysis usually discovers hidden connections and correlations but does not confirm them. So it is good for discovering new connections, and it is also useful for defining future data science projects in which you actually try to confirm what the exploration suggested. The important point is one you have probably heard before: correlation does not imply causation. So you should not necessarily say, based on exploratory analysis alone, that the relationship you have discovered between two variables is the critical one.
For example, if you are looking at space data such as images captured over time, you may discover new stars or celestial objects that you did not know existed before. That data is used for exploration, but not necessarily for confirming anything that you discover.
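One common way to explore a data set is to screen pairs of variables for correlation. The following is a minimal sketch, assuming a small table of made-up measurements. The column names and numbers are hypothetical, and `pearson` is a hand-written helper rather than a library function.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for five observed objects (made-up numbers).
columns = {
    "brightness": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [2.1, 3.9, 6.2, 8.1, 9.8],
    "catalog_id": [4.0, 1.0, 5.0, 2.0, 3.0],
}

# Screen every pair of columns and report how strongly they move together.
for a, b in combinations(columns, 2):
    print(f"{a} vs {b}: r = {pearson(columns[a], columns[b]):+.2f}")
```

A high correlation found this way is a lead worth following up, not a confirmed relationship.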
The goal of inferential analysis is to take a small amount of data, from a small number of observations, and extrapolate or generalize that information to a larger population. Inference is the most common goal of statistical models and of most statistics you may have heard about.
Inference depends heavily on both the population you are looking at (the group of people or objects that you care about) and the way you sampled from it. For example, if you would like to see whether air pollution in Malaysia affects the death rate of Malaysians, you would probably take air pollution samples from, say, two cities (KL and Malacca) along with the death rates of those two cities. What you have done is analyse a subset of the cities in Malaysia and use it to infer something about the general relationship between air pollution and death rate.
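The step from a sample to a statement about the population can be sketched with a confidence interval for a mean. The example below assumes a normal approximation (z = 1.96 for roughly 95%). For a sample this small a t-based interval would be more appropriate. The death rates are invented for illustration and `mean_confidence_interval` is a hypothetical helper.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean."""
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical death rates (per 1,000 people) in eight sampled districts.
sampled_rates = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9]

low, high = mean_confidence_interval(sampled_rates)
print(f"sample mean: {statistics.mean(sampled_rates):.2f}")
print(f"approximate 95% interval for the population mean: {low:.2f} to {high:.2f}")
```

The interval is only meaningful if the sample is representative of the population, which is why the sampling matters so much.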
The goal of predictive analysis is to use data on some objects to predict values for other objects. The idea is to use the data you have collected on some objects to predict the value for the next observation that comes through the door. It is good to note that if x predicts y, it does not mean x causes y.
An example would be to take sample data from polling and use it to predict who will win an election. The blog at http://fivethirtyeight.blogs.nytimes.com explores how such a predictive analysis was done.
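A very small predictive model can be sketched as a least-squares line fitted to past observations and then applied to a new one. This is only an illustration of the idea and not the method used by any real election forecaster. The poll and vote figures are invented, and `fit_line` and `predict` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predicted y for a new observation x."""
    return slope * x + intercept

# Hypothetical past elections: final poll share and actual vote share (percent).
poll_share = [42.0, 48.0, 51.0, 55.0, 60.0]
vote_share = [43.5, 47.0, 52.0, 54.0, 61.5]

slope, intercept = fit_line(poll_share, vote_share)
new_poll = 53.0  # the next observation, for which the outcome is not yet known
print(f"predicted vote share: {predict(new_poll, slope, intercept):.1f}%")
```

The fitted line says that poll share is useful for predicting vote share. It says nothing about whether polls cause votes.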
Causal analysis is something like a what-if analysis, where the goal is to find out what happens to one variable when you make changes to another variable. The gold standard for identifying causation is a randomized study or randomized controlled trial. You can also try to do it from purely observed data that you have saved in a database, but that is considerably harder.
An example would be to test a new medicine on people assigned to it at random and then study whether it really worked, thereby establishing a causal relationship between the drug and the cure.
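The logic of a randomized trial can be sketched with a simulation in which the true effect is known in advance. Everything below is simulated: the patients, the outcome scale, and the effect size of 5.0 are assumptions made for this example, and `simulate_trial` is a hypothetical helper.

```python
import random
import statistics

def simulate_trial(n_patients, true_effect, seed=0):
    """Simulate a randomized trial and return the estimated treatment effect."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_patients):
        # Outcome the patient would have without the drug (made-up scale).
        baseline = rng.gauss(50, 10)
        # A coin flip, not the patient or the doctor, decides who gets the drug.
        if rng.random() < 0.5:
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    # Difference in group means estimates the causal effect of the drug.
    return statistics.mean(treated) - statistics.mean(control)

estimate = simulate_trial(n_patients=20000, true_effect=5.0)
print(f"estimated effect of the drug: {estimate:.2f} (true simulated effect: 5.00)")
```

Because assignment is random, the two groups differ on average only in whether they received the drug, so the difference in their mean outcomes can be read as the effect of the drug.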
Mechanistic analysis is very rarely the goal of an analysis. The idea is to understand the exact changes in variables that lead to exact changes in other variables. It is incredibly hard to infer exact changes unless there is a confirmed equation or theory that can be applied. The most common applications where this is possible are in the physical or engineering sciences, where fairly simple models can describe much of what is happening.
Examples of mechanistic analysis tend to come from physics or engineering applications.
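To make the idea concrete, here is a minimal sketch of a mechanistic model, using free fall as an assumed example (it is not taken from the tutorial itself). The equation d = ½·g·t² states exactly how a change in time changes the distance fallen, ignoring air resistance. The function name `fall_distance` is hypothetical.

```python
G = 9.81  # gravitational acceleration near Earth's surface, in m/s^2

def fall_distance(seconds):
    """Distance in metres fallen from rest after `seconds`, ignoring air resistance."""
    return 0.5 * G * seconds ** 2

for t in (0.5, 1.0, 2.0):
    print(f"after {t} s: {fall_distance(t):.2f} m")
```

Unlike the statistical approaches above, nothing is estimated from data here. The relationship comes from an established physical theory.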