I have a data frame. Let's call him bob:
> head(bob)
phenotype exclusion
GSM399350 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399351 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399352 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399353 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399354 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399355 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
I'd like to concatenate the rows of this data frame (this will be another question). But look:
> class(bob$phenotype)
[1] "factor"
Bob's columns are factors. So, for example:
> as.character(head(bob))
[1] "c(3, 3, 3, 6, 6, 6)" "c(3, 3, 3, 3, 3, 3)"
[3] "c(29, 29, 29, 30, 30, 30)"
I don't begin to understand this, but I guess these are indices into the levels of the factors of the columns (of the court of king caractacus) of bob? Not what I need.
Strangely I can go through the columns of bob by hand, and do
bob$phenotype <- as.character(bob$phenotype)
which works fine. And, after some typing, I can get a data.frame whose columns are characters rather than factors. So my question is: how can I do this automatically? How do I convert a data.frame with factor columns into a data.frame with character columns without having to manually go through each column?
Bonus question: why does the manual approach work?
Just following on Matt and Dirk. If you want to recreate your existing data frame without changing the global option, you can recreate it with an apply statement:
bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)
This will convert all variables to class "character". If you only want to convert factors, see Marek's solution below.
As @hadley points out, the following is more concise.
bob[] <- lapply(bob, as.character)
In both cases, lapply outputs a list; however, owing to the magical properties of R, the use of [] in the second case keeps the data.frame class of the bob object, thereby eliminating the need to convert back to a data.frame using as.data.frame with the argument stringsAsFactors = FALSE.
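To make the difference concrete, here is a minimal sketch using a small stand-in for bob. The name bob_demo and its values are made up for illustration. It compares the plain list that lapply returns with the result of assigning through [].

```r
# Hypothetical two-row stand-in for bob (values invented for illustration)
bob_demo <- data.frame(
  phenotype = factor(c("25- 44+", "25+ 44+")),
  exclusion = factor(c("19- Gr1-", "19- TER119-")),
  row.names = c("GSM399350", "GSM399353")
)

# lapply on its own returns a plain list
as_list <- lapply(bob_demo, as.character)
is.data.frame(as_list)   # FALSE

# Assigning into kept[] replaces the columns but keeps the data.frame shell
kept <- bob_demo
kept[] <- lapply(kept, as.character)
is.data.frame(kept)      # TRUE
rownames(kept)           # "GSM399350" "GSM399353"
```

The row names survive in the second form because only the column contents are replaced, not the object itself.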
To replace only factors:
i <- sapply(bob, is.factor)
bob[i] <- lapply(bob[i], as.character)
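As a quick check that the selective version leaves other column types alone, here is a small sketch on a made-up data frame. The name mixed and its columns are assumptions for illustration only.

```r
# Hypothetical data frame with one factor, one integer and one double column
mixed <- data.frame(
  phenotype = factor(c("25-", "25+", "25+")),
  replicate = 1:3,
  score = c(0.1, 0.2, 0.3)
)

# Logical vector marking the factor columns: TRUE FALSE FALSE
i <- sapply(mixed, is.factor)

# Only the marked columns are converted
mixed[i] <- lapply(mixed[i], as.character)

classes <- sapply(mixed, class)
print(classes)
```

Only phenotype changes class here; replicate stays integer and score stays numeric.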
In dplyr version 0.5.0, a new function, mutate_if, was introduced:
library(dplyr)
bob %>% mutate_if(is.factor, as.character) -> bob
...and in version 1.0.0 it was superseded by across:
library(dplyr)
bob %>% mutate(across(where(is.factor), as.character)) -> bob
Package purrr from RStudio gives another alternative:
library(purrr)
bob %>% modify_if(is.factor, as.character) -> bob
The global option
stringsAsFactors: The default setting for arguments of data.frame and read.table.
may be something you want to set to FALSE in your startup files (e.g. ~/.Rprofile). Please see help(options).
If you understand how factors are stored, you can avoid using apply-based functions to accomplish this. Which isn't at all to imply that the apply solutions don't work well.
Factors are structured as numeric indices tied to a list of 'levels'. This can be seen if you convert a factor to numeric. So:
> fact <- as.factor(c("a","b","a","d"))
> fact
[1] a b a d
Levels: a b d
> as.numeric(fact)
[1] 1 2 1 3
The numbers returned in the last line correspond to the levels of the factor.
> levels(fact)
[1] "a" "b" "d"
Notice that levels() returns a character vector. You can use this fact to easily and compactly convert factors to strings or numerics like this:
> fact_character <- levels(fact)[as.numeric(fact)]
> fact_character
[1] "a" "b" "a" "d"
This also works for numeric values, provided you wrap your expression in as.numeric().
> num_fact <- factor(c(1,2,3,6,5,4))
> num_fact
[1] 1 2 3 6 5 4
Levels: 1 2 3 4 5 6
> num_num <- as.numeric(levels(num_fact)[as.numeric(num_fact)])
> num_num
[1] 1 2 3 6 5 4
as.character(f) is better in both readability and efficiency than levels(f)[as.numeric(f)]. If you wanted to be clever, you could use levels(f)[f] instead. Note that when converting a factor with numeric values, you do get some benefit from as.numeric(levels(f))[f] over, e.g., as.numeric(as.character(f)), but this is because you only have to convert the levels to numeric and then subset. as.character(f) is just fine as it is.
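The variants discussed above can be compared side by side. This is a small sketch on a made-up factor f; it also shows why calling as.numeric() directly on a factor is a trap, since that returns the internal codes rather than the values.

```r
# Hypothetical factor holding numeric-looking values
f <- factor(c(10, 20, 10, 40))

codes <- as.numeric(f)                   # 1 2 1 3: the internal codes, not the values
via_levels <- as.numeric(levels(f))[f]   # 10 20 10 40: convert 3 levels, then index
via_char <- as.numeric(as.character(f))  # 10 20 10 40: convert all 4 elements
as_strings <- levels(f)[f]               # "10" "20" "10" "40"
```

Both conversions give the same numbers. The levels-based form only has to parse each distinct level once, which is where the benefit mentioned above comes from.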
If you want a new data frame bobc where every factor vector in bobf is converted to a character vector, try this:
bobc <- rapply(bobf, as.character, classes="factor", how="replace")
If you then want to convert it back, you can create a logical vector of which columns are factors, and use that to selectively apply factor:
f <- sapply(bobf, is.factor)
bobc[f] <- lapply(bobc[f], factor)
I typically make this function a part of all my projects. Quick and easy.
unfactorize <- function(df) {
  for (i in which(sapply(df, class) == "factor")) df[[i]] <- as.character(df[[i]])
  return(df)
}
Another way is to convert it using apply:
bob2 <- apply(bob,2,as.character)
And a better one (the previous is of class 'matrix'):
bob2 <- as.data.frame(as.matrix(bob), stringsAsFactors = FALSE)
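One caveat with both matrix-based forms is that every column ends up as character, not just the factors. A minimal sketch, assuming a made-up data frame bob_demo with one factor and one integer column:

```r
# Hypothetical data frame: one factor column, one integer column
bob_demo <- data.frame(
  phenotype = factor(c("25-", "25+")),
  count = c(1L, 2L)
)

# apply() coerces the data frame to a matrix first, so the result is a matrix
m <- apply(bob_demo, 2, as.character)
is.matrix(m)        # TRUE

# Going through as.matrix() gives a data frame back, but all columns are character
bob2 <- as.data.frame(as.matrix(bob_demo), stringsAsFactors = FALSE)
is.data.frame(bob2) # TRUE
class(bob2$count)   # "character": the integer column was converted too
```

If the data frame has non-factor columns you want to keep as they are, the selective lapply approach is probably the safer choice.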
Update: Here's an example of something that doesn't work. I thought it would, but I think that the stringsAsFactors option only works on character strings - it leaves the factors alone.
Try this:
bob2 <- data.frame(bob, stringsAsFactors = FALSE)
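A short check of this behaviour, using a made-up one-column data frame (bob_demo is an assumed name for illustration):

```r
# Hypothetical data frame whose column is already a factor
bob_demo <- data.frame(phenotype = factor(c("25-", "25+")))

# stringsAsFactors only affects character vectors being added,
# so an existing factor column passes through unchanged
bob2 <- data.frame(bob_demo, stringsAsFactors = FALSE)
class(bob2$phenotype)   # "factor"
```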
Generally speaking, whenever you're having problems with factors that should be characters, there's a stringsAsFactors setting somewhere to help you (including a global setting).
Or you can try transform:
newbob <- transform(bob, phenotype = as.character(phenotype))
Just be sure to list every factor you'd like to convert to character.
Or you can do something like this and kill all the pests with one blow:
newbob_char <- as.data.frame(lapply(bob[sapply(bob, is.factor)], as.character), stringsAsFactors = FALSE)
newbob_rest <- bob[!(sapply(bob, is.factor))]
newbob <- cbind(newbob_char, newbob_rest)
It's not a good idea to shove the data in code like this; I could do the sapply part separately (actually, it's much easier to do it like that), but you get the point... I haven't checked the code, 'cause I'm not at home, so I hope it works! =)
This approach, however, has a downside... you must reorganize columns afterwards, while with transform you can do whatever you like, but at the cost of "pedestrian-style-code-writing"...
So there... =)
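The column reordering mentioned above can be undone by indexing with the original column names. A minimal sketch, assuming a made-up data frame bob_demo with factor and non-factor columns interleaved:

```r
# Hypothetical data frame with factor and non-factor columns interleaved
bob_demo <- data.frame(
  id = 1:2,
  phenotype = factor(c("25-", "25+")),
  score = c(0.5, 1.5),
  exclusion = factor(c("19-", "Gr1-"))
)

is_fac <- sapply(bob_demo, is.factor)

# cbind puts the converted factor columns first, then the rest
newbob <- cbind(
  as.data.frame(lapply(bob_demo[is_fac], as.character), stringsAsFactors = FALSE),
  bob_demo[!is_fac]
)
order_after_cbind <- names(newbob)   # "phenotype" "exclusion" "id" "score"

# Indexing by the original names restores the original column order
newbob <- newbob[names(bob_demo)]
names(newbob)                        # "id" "phenotype" "score" "exclusion"
```

This relies on the column names being syntactically valid, since as.data.frame may alter names that are not.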
When you first create your data frame, include stringsAsFactors = FALSE to avoid all misunderstandings.
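For instance, a small sketch with made-up values, setting the argument explicitly both ways so the result does not depend on the R version's default:

```r
# Character input converted to factor when the argument is TRUE
with_factors <- data.frame(phenotype = c("25-", "25+"), stringsAsFactors = TRUE)
class(with_factors$phenotype)      # "factor"

# Character input kept as character when the argument is FALSE
without_factors <- data.frame(phenotype = c("25-", "25+"), stringsAsFactors = FALSE)
class(without_factors$phenotype)   # "character"
```

The same argument is accepted by read.table and its wrappers such as read.csv.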
If you use the data.table package for operations on a data.frame, the problem does not arise.
library(data.table)
dt = data.table(col1 = c("a","b","c"), col2 = 1:3)
sapply(dt, class)
# col1 col2
#"character" "integer"
If you already have factor columns in your dataset and you want to convert them to character, you can do the following.
library(data.table)
dt = data.table(col1 = factor(c("a","b","c")), col2 = 1:3)
sapply(dt, class)
# col1 col2
# "factor" "integer"
upd.cols = sapply(dt, is.factor)
dt[, names(dt)[upd.cols] := lapply(.SD, as.character), .SDcols = upd.cols]
sapply(dt, class)
# col1 col2
#"character" "integer"
In [<-.data.table(*tmp*, sapply(bob, is.factor), : Coerced 'character' RHS to 'double' to match the column's type. Either change the target column to 'character' first (by creating a new 'character' vector length 1234 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'double' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
It's easier to fix the DF and recreate the DT.
This works for me. I finally figured out a one-liner:
df <- as.data.frame(lapply(df, function(y) if (is.factor(y)) as.character(y) else y), stringsAsFactors = FALSE)
A new function, across, was introduced in dplyr version 1.0.0. It supersedes the scoped variants (_if, _at, _all). See the official dplyr documentation for details.
library(dplyr)
bob <- bob %>%
mutate(across(where(is.factor), as.character))
This function does the trick
df <- stacomirtools::killfactor(df)
You should use convert in hablar, which gives readable syntax compatible with tidyverse pipes:
library(dplyr)
library(hablar)
df <- tibble(a = factor(c(1, 2, 3, 4)),
b = factor(c(5, 6, 7, 8)))
df %>% convert(chr(a:b))
which gives you:
a b
<chr> <chr>
1 1 5
2 2 6
3 3 7
4 4 8
Maybe a newer option?
library("tidyverse")
# group_by_if also groups by the converted columns, so ungroup() afterwards
bob <- bob %>% group_by_if(is.factor, as.character) %>% ungroup()
With the dplyr package loaded, use
bob <- bob %>% mutate_at("phenotype", as.character)
if you only want to change the phenotype column specifically.
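This works by converting every column to character and then converting the numeric-looking columns back to numeric:
makenumcols <- function(df) {
  df <- as.data.frame(df)
  # Convert every column to character first
  df[] <- lapply(df, as.character)
  # A column counts as numeric if all of its non-NA values parse as numbers
  cond <- apply(df, 2, function(x) {
    x <- x[!is.na(x)]
    all(suppressWarnings(!is.na(as.numeric(x))))
  })
  numeric_cols <- names(df)[cond]
  # df[cols] avoids the dimension drop that df[, cols] causes for a single column
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  return(df)
}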
Adapted from: Get column types of excel sheet automatically
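A similar round trip can probably be had from base R's type.convert, which is a different technique from the function above. The sketch below assumes a reasonably recent R in which type.convert accepts a data frame, and uses a made-up data frame named demo.

```r
# Hypothetical data frame in which every column is stored as a factor
demo <- data.frame(
  phenotype = factor(c("25- 44+", "25+ 44+", "25+ 44+")),
  count = factor(c("1", "2", "3"))
)

# Step 1: convert every column to character
demo[] <- lapply(demo, as.character)

# Step 2: let type.convert pick a type per column;
# as.is = TRUE keeps non-numeric columns as character instead of factor
demo <- type.convert(demo, as.is = TRUE)

col_classes <- sapply(demo, class)
print(col_classes)
```

Unlike makenumcols, this may also produce integer or logical columns where the values allow it.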
type.convert after casting everything to character, then recast factors back to character again.
bob[] <- in the example or bob <-?; the first keeps the data.frame; the second changes the data.frame to a list, dropping the rownames. I will update the answer
iris[] <- lapply(iris, function(x) if (is.factor(x)) as.character(x) else {x})