Missing Values in R
By: Karthik Janar in data-science Tutorials on 2018-05-01
Missing values play an important role in statistics and data analysis. Often, missing values must not be ignored, but rather they should be carefully studied to see if there's an underlying pattern or cause for their missingness.
In R, NA is used to represent any value that is "not available" or "missing" (in the statistical sense). In this tutorial, we'll explore missing values further.
Any operation involving NA generally yields NA as the result. To illustrate, let's create a vector c(44, NA, 5, NA) and assign it to a variable x.
x <- c(44, NA, 5, NA)
Now, let's multiply x by 3.
x * 3
## [1] 132 NA 15 NA
Notice that the elements of the resulting vector that correspond with the NA values in x are also NA.
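The word "generally" matters here. As a minimal sketch (the variable names below are made up for illustration, and x is redefined so the block runs on its own), summary functions such as sum() also return NA unless you pass na.rm = TRUE, while a few logical operations have a definite answer regardless of what the missing value might have been.

```r
x <- c(44, NA, 5, NA)

# Summaries propagate NA just like element-wise arithmetic does
total <- sum(x)                        # NA

# na.rm = TRUE drops the NAs before summing: 44 + 5
total_no_na <- sum(x, na.rm = TRUE)    # 49

# Exceptions: the result is known whatever the missing value is
always_true  <- NA | TRUE              # TRUE
always_false <- NA & FALSE             # FALSE
```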
To make things a little more interesting, let's create a vector containing 1000 draws from a standard normal distribution with y <- rnorm(1000).
y <- rnorm(1000)
Next, let's create a vector containing 1000 NAs with z <- rep(NA, 1000).
z <- rep(NA, 1000)
Finally, let's select 100 elements at random from these 2000 values (combining y and z), so that we don't know how many NAs we'll wind up with or what positions they'll occupy in our final vector: my_data <- sample(c(y, z), 100).
my_data <- sample(c(y, z), 100)
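Because rnorm() and sample() are random, your numbers will differ from the output shown in this tutorial. If you want the same draw every time, one option is to call set.seed() first. The sketch below uses an arbitrary seed (42) and hypothetical names (y_demo, z_demo, first_draw, second_draw) so that it does not overwrite the variables used in the tutorial.

```r
# First run: fix the seed, then generate and sample
set.seed(42)                      # arbitrary seed, chosen only for illustration
y_demo <- rnorm(1000)
z_demo <- rep(NA, 1000)
first_draw <- sample(c(y_demo, z_demo), 100)

# Second run: resetting the same seed replays the same random sequence
set.seed(42)
y_demo <- rnorm(1000)
second_draw <- sample(c(y_demo, z_demo), 100)

# Both draws have identical values and identical NA positions
identical(first_draw, second_draw)
```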
Let's first ask the question of where our NAs are located in our data. The is.na() function tells us whether each element of a vector is NA. Call is.na() on my_data and assign the result to my_na.
my_na <- is.na(my_data)
Now, print my_na to see what you came up with.
my_na
## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## [12] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## [34] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [45] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [56] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [67] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [78] FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## [89] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## [100] TRUE
Everywhere you see a TRUE, you know the corresponding element of my_data is NA. Likewise, everywhere you see a FALSE, you know the corresponding element of my_data is one of our random draws from the standard normal distribution.
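If you would rather have the positions of the NAs than a long logical vector, which() converts the TRUEs into indices, and negating is.na() selects the observed values. A small sketch on a made-up vector v (hypothetical, standing in for my_data):

```r
v <- c(0.3, NA, 1.2, NA, NA)      # small made-up vector

# Indices of the missing elements
na_positions <- which(is.na(v))   # 2 4 5

# Keep only the elements that are not missing
observed <- v[!is.na(v)]          # 0.3 1.2
```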
In the logical vectors tutorial, we introduced the == operator as a method of testing for equality between two objects. So, you might think the expression my_data == NA yields the same results as is.na().
my_data == NA
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [70] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [93] NA NA NA NA NA NA NA NA
The reason you got a vector of all NAs is that NA is not really a value, but just a placeholder for a quantity that is not available. Therefore the logical expression is incomplete and R has no choice but to return a vector of the same length as my_data that contains all NAs.
So be cautious when using logical expressions anytime NAs might creep in, since a single NA value can derail the entire thing.
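To see how an NA can creep into a logical expression, here is a small sketch on a made-up vector v (hypothetical, not part of the tutorial data). A comparison against a missing element is itself NA, and subsetting with that result carries the NA through unless you guard with is.na().

```r
v <- c(2, NA, -1, 5)              # small made-up vector

# The comparison is NA wherever v is NA
positive <- v > 0                 # TRUE NA FALSE TRUE

# Subsetting with an NA index yields an NA element
unguarded <- v[v > 0]             # 2 NA 5

# Guarding with !is.na() keeps only real matches
guarded <- v[!is.na(v) & v > 0]   # 2 5
```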
Now that we have a vector, my_na, that has a TRUE for every NA and FALSE for every numeric value, we can compute the total number of NAs in our data.
The trick is to recognize that R treats TRUE as the number 1 and FALSE as the number 0 whenever a logical vector is used in arithmetic. Therefore, if we take the sum of a bunch of TRUEs and FALSEs, we get the total number of TRUEs.
We can simply call the sum() function on my_na to count the total number of TRUEs in my_na, and thus the total number of NAs in my_data.
sum(my_na)
## [1] 47
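The same coercion gives you the share of missing values: the mean of a logical vector is the proportion of TRUEs. A minimal sketch on a made-up logical vector (flags is a hypothetical name):

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # made-up logical vector

# TRUE becomes 1 and FALSE becomes 0 when coerced to numbers
as_numbers <- as.numeric(flags)       # 1 0 1 1

n_true <- sum(flags)                  # 3
share_true <- mean(flags)             # 0.75
```

Applied to the tutorial data, mean(is.na(my_data)) would give the fraction of my_data that is missing.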
Finally, let's take a look at the data to convince ourselves that everything "adds up". Print my_data to the console.
my_data
## [1] 0.308640011 0.152658355 0.838526330 NA 1.829938956
## [6] NA 1.053580667 -2.015911304 NA 1.515250166
## [11] NA NA NA 0.004442903 NA
## [16] NA 1.328032209 NA NA -0.566152489
## [21] NA -1.209065883 NA NA NA
## [26] NA NA NA 0.587858796 NA
## [31] 0.578621401 1.000485002 -0.631205098 NA 0.834306319
## [36] NA NA 0.260352853 -0.738725548 -0.843865535
## [41] -1.246845096 0.064383360 0.408013650 NA 0.515279838
## [46] 0.486700980 NA NA NA NA
## [51] 1.076405928 NA NA -1.595102395 0.513133511
## [56] NA -1.545437626 NA -0.532494817 0.982962820
## [61] 0.369021274 NA -2.376474638 1.207950474 -1.176368630
## [66] NA 0.461151764 0.470368402 -0.689016269 0.120188849
## [71] NA -1.297659193 -0.671169981 0.554428281 NA
## [76] NA NA -0.648146122 NA NA
## [81] 1.015285724 0.594523212 NA -3.535239605 -0.925335815
## [86] NA 0.795023723 NA NA 0.380860630
## [91] -1.438414853 NA NA 0.191796328 NA
## [96] NA 0.594333902 -0.737094722 0.263012469 NA
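Counting by eye is error-prone, so one way to check the bookkeeping is to confirm that the missing and observed counts sum to the length of the vector. The sketch below uses a small made-up vector v (hypothetical, standing in for my_data).

```r
v <- c(1.5, NA, -0.2, NA, NA, 0.7)   # small made-up vector

n_missing  <- sum(is.na(v))          # 3
n_observed <- sum(!is.na(v))         # 3

# Every element is either missing or observed, so the counts must add up
adds_up <- (n_missing + n_observed) == length(v)   # TRUE
```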
Now let's look at a second type of missing value, NaN, which stands for "not a number". To generate NaN, try dividing (using a forward slash) 0 by 0 now.
0/0
## [1] NaN
Let's do one more, just for fun. In R, Inf stands for infinity. What happens if you subtract Inf from Inf?
Inf - Inf
## [1] NaN
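NA and NaN are related but not interchangeable. As a short sketch using base R's is.na(), is.nan(), and is.finite() (the variable names are made up for illustration): is.na() reports TRUE for both NA and NaN, while is.nan() reports TRUE only for NaN.

```r
# is.na() treats NaN as missing too
nan_is_na <- is.na(NaN)                      # TRUE

# is.nan() is stricter: a plain NA is not NaN
na_is_nan <- is.nan(NA)                      # FALSE

# Both expressions from the tutorial are NaN
both_nan <- is.nan(c(0 / 0, Inf - Inf))      # TRUE TRUE

# is.finite() is FALSE for Inf, NA, and NaN alike
finite <- is.finite(c(1, Inf, NA, NaN))      # TRUE FALSE FALSE FALSE
```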