I'd like to remove the lines in this data frame that:
a) contain NA
s across all columns. Below is my example data frame.
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA NA
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA NA NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Basically, I'd like to get a data frame such as the following.
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
b) contain NA
s in only some columns, so I can also get this result:
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Also check complete.cases
:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit
is nicer for just removing all NA
's. complete.cases
allows partial selection by including only certain columns of the dataframe:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na
, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases
is quite a lot more clear, and faster.
Try na.omit(your.data.frame)
. As for the second question, try posting it as another question (for clarity).
tidyr
has a new function drop_na
:
library(tidyr)
df %>% drop_na()
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 6 ENSG00000221312 0 1 2 3 2
df %>% drop_na(rnor, cfam)
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 4 ENSG00000207604 0 NA NA 1 2
# 6 ENSG00000221312 0 1 2 3 2
rnor
doesn't exist. why ?
rnor
is supposed to be a column name in your table
I prefer following way to check whether rows contain any NAs:
row.has.na <- apply(final, 1, function(x){any(is.na(x))})
This returns logical vector with values denoting whether there is any NA in a row. You can use it to see how many rows you'll have to drop:
sum(row.has.na)
and eventually drop them
final.filtered <- final[!row.has.na,]
For filtering rows with certain part of NAs it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.
rowSum(!is.na(final))
seems better suited than apply()
final[rowSum(is.na(final)),]
?
If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:
delete.na <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n,]
}
By default, it will eliminate all NAs:
delete.na(final)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
Or specify the maximum number of NAs allowed:
delete.na(final, 2)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Another option if you want greater control over how rows are deemed to be invalid is
final <- final[!(is.na(final$rnor)) | !(is.na(rawdata$cfam)),]
Using the above, this:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Becomes:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
...where only row 5 is removed since it is the only row containing NAs for both rnor
AND cfam
. The boolean logic can then be changed to fit specific requirements.
If performance is a priority, use data.table and na.omit() with optional param cols=.
na.omit.data.table
is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
If you don't want to use data.table, use complete.cases().
On a vanilla data.frame
, complete.cases
is faster than na.omit()
or dplyr::drop_na()
. Notice that na.omit.data.frame
does not support cols=
.
Benchmark result
Here is a comparison of base (blue), dplyr
(pink), and data.table
(yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.
Your results may vary based on length, width, and sparsity of your particular dataset.
Note log scale on y axis.
https://i.stack.imgur.com/EtBWe.png
Benchmark script
#------- Adjust these assumptions for your own use case ------------
row_size <- 1e6L
col_size <- 20 # not including ID column
p_missing <- 0.05 # likelihood of missing observation (except ID col)
col_subset <- 18:21 # second part of question: filter on select columns
#------- System info for benchmark ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr); packageVersion('dplyr') # 0.7.4
library(tidyr); packageVersion('tidyr') # 0.8.0
library(microbenchmark)
#------- Example dataset using above assumptions --------------------
fakeData <- function(m, n, p){
set.seed(123)
m <- matrix(runif(m*n), nrow=m, ncol=n)
m[m<p] <- NA
return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)),
stringsAsFactors = FALSE),
data.frame(fakeData(row_size, col_size, p_missing) )
)
dt <- data.table(df)
par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
microbenchmark(
df[complete.cases(df), ],
na.omit(df),
df %>% drop_na,
dt[complete.cases(dt), ],
na.omit(dt)
), xlab='',
main = 'Performance: Drop any NA observation',
col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
microbenchmark(
df[complete.cases(df[,col_subset]), ],
#na.omit(df), # col subset not supported in na.omit.data.frame
df %>% drop_na(col_subset),
dt[complete.cases(dt[,col_subset,with=FALSE]), ],
na.omit(dt, cols=col_subset) # see ?na.omit.data.table
), xlab='',
main = 'Performance: Drop NA obs. in select cols',
col=c('lightblue','salmon',rep('beige',2))
)
Using dplyr package we can filter NA as follows:
dplyr::filter(df, !is.na(columnname))
drop_na()
drop_na
uses "any" logic and filter
uses "all" logic. So if you need more flexiblity in expression, filter has more possibilities.
This will return the rows that have at least ONE non-NA value.
final[rowSums(is.na(final))<length(final),]
This will return the rows that have at least TWO non-NA value.
final[rowSums(is.na(final))<(length(final)-1),]
For your first question, I have a code that I am comfortable with to get rid of all NAs. Thanks for @Gregor to make it simpler.
final[!(rowSums(is.na(final))),]
For the second question, the code is just an alternation from the previous solution.
final[as.logical((rowSums(is.na(final))-5)),]
Notice the -5 is the number of columns in your data. This will eliminate rows with all NAs, since the rowSums adds up to 5 and they become zeroes after subtraction. This time, as.logical is necessary.
One approach that's both general and yields fairly-readable code is to use the filter()
function and the across()
helper functions from the {dplyr} package.
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Filter a specific list of columns to keep only non-missing entries
df %>%
filter(across(one_of(vars_to_check),
~ !is.na(.x)))
# Filter all the columns to exclude NA
df %>%
filter(across(everything(),
~ !is.na(.)))
# Filter only numeric columns
df %>%
filter(across(where(is.numeric),
~ !is.na(.)))
Similarly, there are also the variant functions in the dplyr package (filter_all
, filter_at
, filter_if
) which accomplish the same thing:
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Filter a specific list of columns to keep only non-missing entries
df %>%
filter_at(.vars = vars(one_of(vars_to_check)),
~ !is.na(.))
# Filter all the columns to exclude NA
df %>%
filter_all(~ !is.na(.))
# Filter only numeric columns
df %>%
filter_if(is.numeric,
~ !is.na(.))
across
We can also use the subset function for this.
finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))
This will give only those rows that do not have NA in both mmul and rnor
Assuming dat
as your dataframe, the expected output can be achieved using
1.rowSums
> dat[!rowSums((is.na(dat))),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
2.lapply
> dat[!Reduce('|',lapply(dat,is.na)),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
I am a synthesizer:). Here I combined the answers into one function:
#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
if (!is.null(col)) {
# R converts a single row/col to a vector if the parameter col has only one col
# see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
df.temp = df[,col,drop=FALSE]
} else {
df.temp = df
}
if (length(n)==1){
if (n==0) {
# simply call complete.cases which might be faster
result = df[complete.cases(df.temp),]
} else {
# credit: http://stackoverflow.com/a/30461945/2292993
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) sum(x) == n)
result = df[logindex, ]
}
}
if (length(n)==2){
min = n[1]; max = n[2]
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
result = df[logindex, ]
}
return(result)
}
dplyr 1.0.4 introduced two companion functions to filter
: they are if_any()
and if_all()
. The if_all()
companion function will be particularly useful in this case:
a) To remove rows that contain NAs across all columns
df %>%
filter(if_all(everything(), ~ !is.na(.x)))
This line will keep only those rows where none of the columns have NAs.
b) To remove rows that contain NAs in only some columns
cols_to_check = c("rnor", "cfam")
df %>%
filter(if_all(cols_to_check, ~ !is.na(.x)))
This line will check if any of the specified columns (cols_to_check) have NAs, and only keep those rows where this is not the case.
delete.dirt <- function(DF, dart=c('NA')) {
dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
DF <- DF[dirty_rows, ]
}
mydata <- delete.dirt(mydata)
Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA
and ?
change dart=c('NA')
in function param to dart=c('NA', '?')
My guess is that this could be more elegantly solved in this way:
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)
library(dplyr)
df %>%
filter_all(any_vars(is.na(.)))
#> X1 X2 X3 X4 X5
#> 1 NA NA 11 16 21
#> 2 3 8 NA 18 23
#> 3 5 10 15 20 NA
NA
. I think what the OP wants is: df %>% filter_all(all_vars(!is.na(.)))
Success story sharing
final[complete.cases(final),]
?complete.cases(final)
returns a boolean of rows where there are noNA
like(TRUE, FALSE, TRUE)
. The trailing comma says all columns. Thus, before the comma you filter on the rows but after the comma you column you do no filtering and ask for everythingcomplete.cases
statement.