I'm trying to initialize a data.frame without any rows. Basically, I want to specify the data types for each column and name them, but not have any rows created as a result.
The best I've been able to do so far is something like:
df <- data.frame(Date=as.Date("01/01/2000", format="%m/%d/%Y"),
File="", User="", stringsAsFactors=FALSE)
df <- df[-1,]
Which creates a data.frame with a single row containing all of the data types and column names I wanted, but also creates a useless row which then needs to be removed.
Is there a better way to do this?
Just initialize it with empty vectors:
df <- data.frame(Date=as.Date(character()),
File=character(),
User=character(),
stringsAsFactors=FALSE)
Here's another example with different column types:
df <- data.frame(Doubles=double(),
Ints=integer(),
Factors=factor(),
Logicals=logical(),
Characters=character(),
stringsAsFactors=FALSE)
str(df)
'data.frame': 0 obs. of 5 variables:
$ Doubles : num
$ Ints : int
$ Factors : Factor w/ 0 levels:
$ Logicals : logi
$ Characters: chr
N.B.: Initializing a data.frame with an empty column of the wrong type does not prevent further additions of rows having columns of different types.
This method is just a bit safer in the sense that you'll have the correct column types from the beginning, so if your code relies on some column type checking, it will work even with a data.frame with zero rows.
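A minimal sketch of both points in the note above, using a hypothetical one-column data frame (the names typed, filled and x are illustrative only): type checks succeed on zero rows, yet rbind still lets a later row change the column type.

```r
# Zero-row data frame with an integer column (illustrative names).
typed <- data.frame(x = integer(), stringsAsFactors = FALSE)

# Type checks already work, even though there are no rows.
is.integer(typed$x)   # TRUE

# rbind() drops zero-row arguments, so the type of the new row wins.
filled <- rbind(typed, data.frame(x = "a", stringsAsFactors = FALSE))
class(filled$x)       # "character"
```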
If you already have an existing data frame, let's say df, that has the columns you want, then you can create an empty data frame by removing all the rows:
empty_df = df[FALSE,]
Notice that df still contains the data, but empty_df doesn't.
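As a small illustration, assuming a hypothetical three-row data frame, the logical index FALSE selects no rows while the column names and types carry over:

```r
# Hypothetical source data frame with two typed columns.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Selecting with FALSE keeps the structure but none of the rows.
empty_df <- df[FALSE, ]

nrow(empty_df)           # 0
sapply(empty_df, class)  # same classes as in df
nrow(df)                 # 3, the original is untouched
```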
I found this question looking for how to create a new instance with empty rows, so I think it might be helpful for some people.
df[NA,] will affect the index as well (which is unlikely to be what you want); I would instead use df[TRUE,] = NA. However, notice that this will overwrite the original. You will need to copy the data frame first with copy_df = data.frame(df) and then run copy_df[TRUE,] = NA.
empty_df with empty_df[0:nrow(df),] <- NA.
You can do it without specifying column types
df = data.frame(matrix(vector(), 0, 3,
dimnames=list(c(), c("Date", "File", "User"))),
stringsAsFactors=FALSE)
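A quick check of what "without specifying column types" means in practice. This sketch reuses the call above and only adds the inspection step: because vector() returns an empty logical vector, every column starts out as logical.

```r
# Build the empty data frame from a 0 x 3 matrix with column names only.
df <- data.frame(matrix(vector(), 0, 3,
                        dimnames = list(c(), c("Date", "File", "User"))),
                 stringsAsFactors = FALSE)

dim(df)            # 0 rows, 3 columns
sapply(df, class)  # every column is "logical" until data is added
```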
You could use read.table with an empty string for the input text as follows:
colClasses = c("Date", "character", "character")
col.names = c("Date", "File", "User")
df <- read.table(text = "",
colClasses = colClasses,
col.names = col.names)
Alternatively, specifying the col.names as a string:
df <- read.csv(text="Date,File,User", colClasses = colClasses)
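To confirm that both variants give the requested structure, here is a short self-contained check (the variable names df_table and df_csv are illustrative, not from the answer):

```r
colClasses <- c("Date", "character", "character")
col.names  <- c("Date", "File", "User")

# Variant 1: empty input text, names and classes passed as arguments.
df_table <- read.table(text = "", colClasses = colClasses,
                       col.names = col.names)

# Variant 2: the header line supplies the names.
df_csv <- read.csv(text = "Date,File,User", colClasses = colClasses)

# Both have zero rows and the requested column classes.
sapply(df_table, class)
sapply(df_csv, class)
```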
Thanks to Richard Scriven for the improvement
Use read.table(text = "", ...) so you don't need to explicitly open a connection.
The read.csv approach also works with readr::read_csv, as in read_csv("Date,File,User\n", col_types = "Dcc"). This way you can directly create an empty tibble of the required structure.
Just declare
table = data.frame()
When you try to rbind the first line, it will create the columns.
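A minimal sketch of this behaviour, using the illustrative names tab and first_row (tab is used instead of table to avoid masking the base function). Note that the columns and their types come entirely from the first row that is bound, so nothing is enforced up front.

```r
# Start with a data frame that has no rows and no columns.
tab <- data.frame()
dim(tab)    # 0 0

# Hypothetical first row; its names and types define the columns.
first_row <- data.frame(id = 1L, name = "a", stringsAsFactors = FALSE)
tab <- rbind(tab, first_row)

names(tab)  # "id" "name"
```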
rbind this would work well, if not... rbind().
The most efficient way to do this is to use structure to create a list that has the class "data.frame":
structure(list(Date = as.Date(character()), File = character(), User = character()),
class = "data.frame")
# [1] Date File User
# <0 rows> (or 0-length row.names)
To put this into perspective compared to the presently accepted answer, here's a simple benchmark:
s <- function() structure(list(Date = as.Date(character()),
File = character(),
User = character()),
class = "data.frame")
d <- function() data.frame(Date = as.Date(character()),
File = character(),
User = character(),
stringsAsFactors = FALSE)
library("microbenchmark")
microbenchmark(s(), d())
# Unit: microseconds
# expr min lq mean median uq max neval
# s() 58.503 66.5860 90.7682 82.1735 101.803 469.560 100
# d() 370.644 382.5755 523.3397 420.1025 604.654 1565.711 100
A data.table usually contains a .internal.selfref attribute, which cannot be faked without calling the data.table functions. Are you sure you are not relying on an undocumented behavior here?
data.table and assumed that Google did find what I wanted and everything here is data.table-related.
data.frame() provides checks on naming, rownames, etc.
If you are looking for brevity:
read.csv(text="col1,col2")
so you don't need to specify the column names separately. You get the default column type logical until you fill the data frame.
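A short sketch of the "logical until you fill it" behaviour. The row added here is hypothetical and only serves to show the types changing:

```r
# Column names come from the header line; no types are given.
df <- read.csv(text = "col1,col2")
sapply(df, class)   # both columns are "logical"

# After binding a first (hypothetical) row, its types take over.
df <- rbind(df, data.frame(col1 = 1.5, col2 = "x", stringsAsFactors = FALSE))
sapply(df, class)   # "numeric" and "character"
```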
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 2
readr can do it: read_csv2("a;b;c;d;e\n", col_types = "icdDT"). There needs to be a \n for it to recognize the input as a string, not a file (or use c("a;b;c;d;e", "")). As a bonus, column names won't be modified (e.g. col-1 or why spaces).
I created an empty data frame using the following code:
df = data.frame(id = numeric(0), jobs = numeric(0))
and tried to bind some rows to populate it as follows:
newrow = c(3, 4)
df <- rbind(df, newrow)
but it started giving incorrect column names as follows
X3 X4
1 3 4
The solution is to make newrow a data.frame with matching column names, as follows:
newrow = data.frame(id=3, jobs=4)
df <- rbind(df, newrow)
This now gives the correct data frame, displayed with column names as follows:
id jobs
1 3 4
To create an empty data frame, pass in the number of rows and columns needed into the following function:
create_empty_table <- function(num_rows, num_cols) {
frame <- data.frame(matrix(NA, nrow = num_rows, ncol = num_cols))
return(frame)
}
To create an empty frame while specifying the class of each column, simply pass a vector of the desired data types into the following function:
create_empty_table <- function(num_rows, num_cols, type_vec) {
frame <- data.frame(matrix(NA, nrow = num_rows, ncol = num_cols))
for (i in seq_len(ncol(frame))) {
if(type_vec[i] == 'numeric') {frame[,i] <- as.numeric(frame[,i])}
if(type_vec[i] == 'character') {frame[,i] <- as.character(frame[,i])}
if(type_vec[i] == 'logical') {frame[,i] <- as.logical(frame[,i])}
if(type_vec[i] == 'factor') {frame[,i] <- as.factor(frame[,i])}
}
return(frame)
}
Use as follows:
df <- create_empty_table(3, 3, c('character','logical','numeric'))
Which gives:
X1 X2 X3
1 <NA> NA NA
2 <NA> NA NA
3 <NA> NA NA
To confirm your choices, run the following:
lapply(df, class)
#output
$X1
[1] "character"
$X2
[1] "logical"
$X3
[1] "numeric"
If you want to create an empty data.frame with dynamic names (colnames in a variable), this can help:
names <- c("v","u","w")
df <- data.frame()
for (k in names) df[[k]]<-as.numeric()
You can change the types as well if you need to, like this:
names <- c("u", "v")
df <- data.frame()
df[[names[1]]] <- as.numeric()
df[[names[2]]] <- as.character()
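If both the names and the types are held in variables, one possible variation (not from the answer above) is to keep them in a single named character vector and build the columns with vector(). The names types and cols are illustrative:

```r
# Hypothetical specification: column name -> storage mode.
types <- c(u = "numeric", v = "character", w = "logical")

# vector(mode, 0) returns an empty vector of that mode;
# lapply() keeps the names, so they become the column names.
cols <- lapply(types, function(t) vector(t, 0))
df <- data.frame(cols, stringsAsFactors = FALSE)

sapply(df, class)
```

This only covers basic modes that vector() understands, so classes such as Date or factor would still need to be set separately.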
If you don't mind not specifying data types explicitly, you can do it this way:
headers<-c("Date","File","User")
df <- as.data.frame(matrix(,ncol=3,nrow=0))
names(df)<-headers
#then bind incoming data frame with col types to set data types
df<-rbind(df, new_df)
Using data.table, we can specify data types for each column.
library(data.table)
data=data.table(a=numeric(), b=numeric(), c=numeric())
If you want to declare such a data.frame with many columns, it'll probably be a pain to type all the column classes out by hand. Especially if you can make use of rep, this approach is easy and fast (about 15% faster than the other solution that can be generalized like this):
If your desired column classes are in a vector colClasses, you can do the following:
library(data.table)
setnames(setDF(lapply(colClasses, function(x) eval(call(x)))), col.names)
lapply will result in a list of desired length, each element of which is simply an empty typed vector like numeric() or integer().
setDF converts this list by reference to a data.frame.
setnames adds the desired names by reference.
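To see the three steps on a small input without installing data.table, here is a base-R sketch. It swaps setDF and setnames for as.data.frame and names<-, so it copies instead of working by reference, and the column names are hypothetical:

```r
# Hypothetical inputs: one character column followed by three numeric ones.
colClasses <- c("character", rep("numeric", 3))
col.names  <- c("id", paste0("x", 1:3))

# eval(call("numeric")) is the same as numeric(): an empty typed vector.
cols <- lapply(colClasses, function(x) eval(call(x)))
names(cols) <- col.names

# Base-R stand-in for setDF(); this makes a copy of the list.
df <- as.data.frame(cols, stringsAsFactors = FALSE)
sapply(df, class)
```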
Speed comparison:
classes <- c("character", "numeric", "factor",
"integer", "logical","raw", "complex")
NN <- 300
colClasses <- sample(classes, NN, replace = TRUE)
col.names <- paste0("V", 1:NN)
setDF(lapply(colClasses, function(x) eval(call(x))))
library(microbenchmark)
microbenchmark(times = 1000,
read = read.table(text = "", colClasses = colClasses,
col.names = col.names),
DT = setnames(setDF(lapply(colClasses, function(x)
eval(call(x)))), col.names))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# read 2.598226 2.707445 3.247340 2.747835 2.800134 22.46545 1000 b
# DT 2.257448 2.357754 2.895453 2.401408 2.453778 17.20883 1000 a
It's also faster than using structure
in a similar way:
microbenchmark(times = 1000,
DT = setnames(setDF(lapply(colClasses, function(x)
eval(call(x)))), col.names),
struct = eval(parse(text=paste0(
"structure(list(",
paste(paste0(col.names, "=",
colClasses, "()"), collapse = ","),
"), class = \"data.frame\")"))))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# DT 2.068121 2.167180 2.821868 2.211214 2.268569 143.70901 1000 a
# struct 2.613944 2.723053 3.177748 2.767746 2.831422 21.44862 1000 b
If you already have a data frame, you can extract the metadata (column names and types) from it, e.g. if you are tracking down a bug that is only triggered with certain inputs and need an empty dummy data frame:
columns_and_types <- sapply(df, class)
# prints the column names, e.g. c("col1", "col2")
dput(as.character(names(columns_and_types)))
# prints the column types, e.g. c("integer", "factor")
dput(as.character(as.vector(columns_and_types)))
And then use read.table to create the empty data frame:
read.table(text = "",
colClasses = c('integer', 'factor'),
col.names = c('col1', 'col2'))
I keep this function handy for whenever I need it, and change the column names and classes to suit the use case:
make_df <- function() { data.frame(name=character(),
profile=character(),
sector=character(),
type=character(),
year_range=character(),
link=character(),
stringsAsFactors = FALSE)
}
make_df()
[1] name profile sector type year_range link
<0 rows> (or 0-length row.names)
If your column names are dynamic, you can create an empty row-named matrix and transform it to a data frame.
nms <- sample(LETTERS, sample(1:10, 1))
as.data.frame(t(matrix(nrow=length(nms),ncol=0,dimnames=list(nms))))
This question didn't specifically address my concerns (outlined here) but in case anyone wants to do this with a parameterized number of columns and no coercion:
> require(dplyr)
> dbNames <- c('a','b','c','d')
> emptyTableOut <-
data.frame(
character(),
matrix(integer(), ncol = 3, nrow = 0), stringsAsFactors = FALSE
) %>%
setNames(nm = c(dbNames))
> glimpse(emptyTableOut)
Observations: 0
Variables: 4
$ a <chr>
$ b <int>
$ c <int>
$ d <int>
As divibisan states on the linked question,
...the reason [coercion] occurs [when cbinding matrices and their constituent types] is that a matrix can only have a single data type. When you cbind 2 matrices, the result is still a matrix and so the variables are all coerced into a single type before converting to a data.frame
data.frames have typed columns, so yes, if you want to initialize a data.frame you must decide the type of the columns... data.frame(Doubles=rep(as.double(NA),numberOfRow), Ints=rep(as.integer(NA),numberOfRow))
data has 0 rows error?