
Extracting specific columns from a data frame

I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

 data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?


Joshua Ulrich

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame. But df["A"] will always return a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1
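
As a minimal sketch of why character-vector subsetting is convenient inside functions: the column names can be passed around as ordinary data. The helper name select_cols is hypothetical and is not part of the original answer.

```r
# Hypothetical helper (an illustration, not from the original answer):
# keep only the columns named in `cols`.
select_cols <- function(data, cols) {
  stopifnot(is.data.frame(data), is.character(cols))
  data[cols]  # no comma, so the result is a data frame even for one column
}

# Same reproducible data as above
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

wanted <- c("A", "B", "E")   # column names held in a variable
res <- select_cols(df, wanted)
names(res)                   # "A" "B" "E"

single <- select_cols(df, "A")
is.data.frame(single)        # TRUE
```

Because cols is a plain character vector, it can come from a function argument, a config file, or user input without any non-standard evaluation.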

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector

That gives the error "object of type 'closure' is not subsettable".
@ArenCambre: then your data.frame isn't really named df. df is also a function in the stats package.
@Cina: Because -"A" is an error (invalid argument to unary operator). And ?Extract says, "i, j, ... can also be negative integers, indicating elements/slices to leave out of the selection."
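
A small sketch of two ways to leave columns out, given that a minus sign cannot be applied to a character name: drop by name with setdiff(), or drop by position with negative integers. The choice of columns C and D here is only for illustration.

```r
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

# Exclude by name: keep every column whose name is not listed
by_name <- df[setdiff(names(df), c("C", "D"))]

# Exclude by position: negative integers drop those columns
by_pos <- df[-c(3, 4)]

names(by_name)  # "A" "B" "E"
names(by_pos)   # "A" "B" "E"
```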
There is an issue with this syntax: if we extract only one column, R returns a vector instead of a data frame, which may be unwanted: > df[,c("A")] [1] 1. Using subset() doesn't have this disadvantage.
Sam Firke

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)

Given the considerable evolution of the Tidyverse since posting my question, I've switched the answer to you.
Given the furious rate of change in the tidyverse, I would caution against using this pattern. This is in addition to my strong preference against treating column names as if they are object names when writing code for functions, packages, or applications.
It has been over four years since this answer was submitted, and the pattern hasn't changed. Piped expressions can be quite intuitive, which is why they are appealing.
You'd chain together a pipeline like: df1 %>% select(A, B, E) %>% rowMeans(.). See the documentation for the %>% pipe by typing ?magrittr::`%>%`
This is a useful solution, but for the example given in the question, Josh's answer is more readable, faster, and dependency free. I hope new users learn square bracket subsetting before diving into the tidyverse :)!
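
For the concern raised above about treating column names as object names, here is a sketch of selecting with a character vector in dplyr. It assumes a reasonably recent dplyr that exports all_of(); the data frame df1 below is made up for illustration.

```r
library(dplyr)

# Made-up example data (assumption: not from the original post)
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10)

cols <- c("A", "B", "E")   # column names stored as strings

# all_of() tells select() to treat `cols` as a vector of names;
# it raises an error if any requested column is missing.
res <- df1 %>% select(all_of(cols))
names(res)                 # "A" "B" "E"
```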
Uli Köhler

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4

When I try this, with my data, I get the error: " Error in x[j] : invalid subscript type 'list' " But if c("A", "B") isn't a list, what is it?
@Rafael_Espericueta Hard to guess without viewing your code... But c("A", "B") is a vector, not a list.
It converts the data frame to a list.
Henry

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8

so860

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
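
The post does not say why the other forms failed. One possible explanation (an assumption, not confirmed here) is that a requested name did not exactly match a column name, for example because of stray whitespace. Indexing by name then raises an error, while %in% silently skips the non-matching name. A sketch with a deliberately mangled column name:

```r
df <- data.frame(A = c(1, 2), B = c(3, 4), E = c(8, 8))
names(df)[3] <- "E "          # hypothetical: trailing space in the name

wanted <- c("A", "B", "E")

# Which requested names are not actual column names?
missing_cols <- setdiff(wanted, names(df))
missing_cols                  # "E"

# Indexing by name fails when any requested column is absent
err <- tryCatch(df[, wanted], error = function(e) conditionMessage(e))
err

# The %in% form only keeps matching columns, so it does not fail,
# but it also quietly returns fewer columns than requested
silently <- df[, names(df) %in% wanted]
names(silently)               # "A" "B"
```

Checking setdiff(wanted, names(df)) before subsetting makes this kind of mismatch visible instead of hiding it.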


Arthur Yip

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))

This doesn't use dplyr. It uses base::subset, and is identical to Stephane Laurent's answer except that you use column numbers instead of column names.
Gilad Green

You can also use the sqldf package, which runs SQL select statements on R data frames:

library(sqldf)
df1 <- sqldf("select A, B, E from df")

This returns a data frame df1 with columns A, B, and E.


moodymudskipper

You can use with():

with(df, data.frame(A, B, E))

Mohamed Rahouma

df <- dplyr::select(df, A, B, C)

Also, you can assign a different name to the newly created data frame:

data <- dplyr::select(df, A, B, C)

This was already in the accepted answer
fxi

[ and subset() are not interchangeable:

[ returns a vector if only one column is selected, whereas subset() returns a data frame.

df = data.frame(a="a",b="b")    

identical(
  df[,c("a")], 
  subset(df,select="a")
) 

identical(
  df[,c("a","b")],  
  subset(df,select=c("a","b"))
)

Not if you set drop=FALSE. Example: df[,c("a"),drop=F]
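
The effect of drop = FALSE mentioned in the comment above can be checked with a short sketch, using the same two-column data frame as the answer:

```r
df <- data.frame(a = "a", b = "b")

# With a comma and a single column, `[` simplifies the result to a vector
as_vector <- df[, "a"]
is.data.frame(as_vector)       # FALSE

# drop = FALSE switches that simplification off
as_frame <- df[, "a", drop = FALSE]
is.data.frame(as_frame)        # TRUE
names(as_frame)                # "a"
```

Writing FALSE in full rather than F avoids surprises if a variable named F has been defined elsewhere.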