
For each row in an R dataframe

I have a dataframe, and for each row in that dataframe I have to do some complicated lookups and append some data to a file.

The dataFrame contains scientific results for selected wells from 96 well plates used in biological research so I want to do something like:

for (well in dataFrame) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

In my procedural world, I'd do something like:

for (row in dataFrame) {
    #look up stuff using data from the row
    #write stuff to the file
}

What is the "R way" to do this?

What is your question here? A data.frame is a two-dimensional object and looping over the rows is a perfectly normal way of doing things as rows are commonly sets of 'observations' of the 'variables' in each column.
What I end up doing is: for (index in 1:nrow(dataFrame)) { row = dataFrame[index, ]; # do stuff with the row } which never seemed very pretty to me.
Does getWellID call a database or anything? Otherwise, Jonathan is probably right and you could vectorize this.

Ken Williams

You can use the by() function:

by(dataFrame, seq_len(nrow(dataFrame)), function(row) dostuff)

But iterating over the rows directly like this is rarely what you want to do; you should try to vectorize instead. Can I ask what the actual work in the loop is doing?
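To make the "function(row) dostuff" placeholder concrete, here is a minimal sketch. The sample data and the body of the function are assumptions made up for illustration; only by() and seq_len() come from the answer itself.

```r
# Hypothetical data standing in for the question's dataFrame
dataFrame <- data.frame(name   = c("H1", "H2"),
                        plate  = c("plate67", "plate68"),
                        value1 = c(1.5, 2.5),
                        stringsAsFactors = FALSE)

# Each call receives a one-row data.frame, so columns are accessed by name
res <- by(dataFrame, seq_len(nrow(dataFrame)), function(row) {
  paste(row$plate, row$name, row$value1, sep = ",")
})

# Strip the "by" class and attributes to get a plain character vector
csvLines <- as.vector(res)
```

Because every call returns a single string here, the result simplifies to one element per row; a function returning something more complex would give a list-like "by" object instead.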


This will not work well with 1:nrow(dataFrame) if the data frame has 0 rows, because 1:0 is not empty (it evaluates to c(1, 0)).
An easy fix for the 0-row case is to use seq_len(): insert seq_len(nrow(dataFrame)) in place of 1:nrow(dataFrame).
How do you actually implement (row)? Is it dataframe$column? dataframe[somevariableNamehere]? How do you actually say it's a row? The pseudocode "function(row) dostuff": how would that actually look?
@Mike, change dostuff in this answer to str(row) You'll see multiple lines printed in the console beginning with " 'data.frame': 1 obs of x variables." But be careful, changing dostuff to row does not return a data.frame object for the outer function as a whole. Instead it returns a list of one row data-frames.
Not everything should be vectorized. But in this case it would make sense I guess.
Uli Köhler

You can try this, using the apply() function:

> d
  name plate value1 value2
1    A    P1      1    100
2    B    P2      2    200
3    C    P3      3    300

> f <- function(x, output) {
 wellName <- x[1]
 plateName <- x[2]
 wellID <- 1
 print(paste(wellID, x[3], x[4], sep=","))
 cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

> apply(d, 1, f, output = 'outputfile')

Be careful, as the dataframe is converted to a matrix, and what you end up with (x) is a vector. This is why the above example has to use numeric indexes; the by() approach gives you a data.frame, which makes your code more robust.
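The coercion described above can be checked with a small sketch. The two-row data frame is made up for illustration; the point is only that apply() hands each row to the function as a character vector when the columns are of mixed type.

```r
# Hypothetical mixed-type data frame
d <- data.frame(name = c("A", "B"), value1 = c(1, 2), stringsAsFactors = FALSE)

# apply() first calls as.matrix(d); with a character column present,
# every cell becomes a string
types <- apply(d, 1, function(x) class(x))

# Numeric values therefore have to be converted back explicitly
doubled <- apply(d, 1, function(x) as.numeric(x["value1"]) * 2)
```

Indexing by name, as in x["value1"], at least protects against column reordering, but the type information is still lost.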
Did not work for me. The apply function treated every x given to f as a character value and not a row.
Note too that you can refer to the columns by name. So: wellName <- x[1] could also be wellName <- x["name"].
When Darren mentioned robust, he meant something like changing the order of the columns. This answer would not work in that case, whereas the one with by() would still work.
Shane

First, Jonathan's point about vectorizing is correct. If your getWellID() function is vectorized, then you can skip the loop and just use cat or write.csv:

write.csv(data.frame(wellid=getWellID(dataFrame$name, dataFrame$plate), 
         value1=dataFrame$value1, value2=dataFrame$value2), file=outputFile)
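One way the lookup itself might be vectorized, sketched under the assumption that the well IDs are available as a table (for example, fetched from the database in a single query). The names wellTable and wellid and all values are hypothetical.

```r
# Hypothetical lookup table with one row per (name, plate) pair
wellTable <- data.frame(name   = c("H1", "A2", "B3"),
                        plate  = c("plate67", "plate68", "plate67"),
                        wellid = c(101L, 102L, 103L),
                        stringsAsFactors = FALSE)

# Hypothetical results for selected wells
dataFrame <- data.frame(name   = c("A2", "H1"),
                        plate  = c("plate68", "plate67"),
                        value1 = c(1, 2),
                        value2 = c(10, 20),
                        stringsAsFactors = FALSE)

# A single merge() replaces the per-row lookup; note that merge()
# sorts the result by the key columns by default
merged <- merge(dataFrame, wellTable, by = c("name", "plate"))
result <- merged[, c("wellid", "value1", "value2")]

outputFile <- tempfile(fileext = ".csv")
write.csv(result, file = outputFile, row.names = FALSE)
```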

If getWellID() isn't vectorized, then Jonathan's recommendation of using by or knguyen's suggestion of apply should work.

Otherwise, if you really want to use for, you can do something like this:

for(i in seq_len(nrow(dataFrame))) {  # seq_len() is safe when there are 0 rows
    row <- dataFrame[i,]
    # do stuff with row
}

You can also try to use the foreach package, although it requires you to become familiar with that syntax. Here's a simple example:

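library(foreach)
library(iterators)  # provides iter()
d <- data.frame(x=1:10, y=rnorm(10))
# %dopar% needs a registered parallel backend; without one it warns
# and runs sequentially (use %do% for an explicitly sequential loop)
s <- foreach(row=iter(d, by='row'), .combine=rbind) %dopar% row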

A final option is to use a function out of the plyr package, in which case the convention will be very similar to the apply function.

library(plyr)
# The comment must sit on its own line, otherwise it swallows the closing "})"
ddply(dataFrame, .(x), function(x) {
    # do stuff
})

Shane, thank you. I'm not sure how to write a vectorized getWellID. What I need to do right now is to dig into an existing list of lists to look it up or pull it out of a database.
Feel free to post the getWellID question (i.e. can this function be vectorized?) separately, and I'm sure I (or someone else) will answer it.
Even if getWellID is not vectorized, I think you should go with this solution, and replace getWellId with mapply(getWellId, well$name, well$plate).
Even if you pull it from a database, you can pull them all at once and then filter the result in R; that will be faster than an iterative function.
+1 for foreach - I'm going to use the hell out of that one.
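The mapply() suggestion from the comments can be sketched as follows. The getWellID() below is a made-up stand-in that is treated as if it only accepted one well at a time; the data and the output path are also assumptions.

```r
# Hypothetical scalar lookup standing in for the real getWellID()
getWellID <- function(wellName, plateName) {
  paste(plateName, wellName, sep = "_")
}

# Hypothetical data standing in for the question's dataFrame
dataFrame <- data.frame(name   = c("H1", "A2"),
                        plate  = c("plate67", "plate68"),
                        value1 = c(1, 2),
                        value2 = c(10, 20),
                        stringsAsFactors = FALSE)

# mapply() calls getWellID() once per (name, plate) pair
wellIDs <- mapply(getWellID, dataFrame$name, dataFrame$plate,
                  USE.NAMES = FALSE)

# Write everything in one call instead of appending row by row
outputFile <- tempfile(fileext = ".csv")
write.csv(data.frame(wellid = wellIDs,
                     value1 = dataFrame$value1,
                     value2 = dataFrame$value2),
          file = outputFile, row.names = FALSE)
```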
Capt.Krusty

I think the best way to do this with base R is:

for( i in rownames(df) )
   print(df[i, "column1"])

The advantage over the for( i in 1:nrow(df))-approach is that you do not get into trouble if df is empty and nrow(df)=0.
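A short sketch of the empty-data-frame case, using a made-up zero-row df, shows the difference between the three ways of generating the loop index:

```r
df <- data.frame(column1 = numeric(0))   # hypothetical empty data frame

unsafe <- 1:nrow(df)        # counts down from 1 to 0, so the loop would run twice
safe   <- seq_len(nrow(df)) # empty, so the loop body would never run

# Looping over rownames() is also safe: there is nothing to visit
visited <- 0
for (i in rownames(df)) visited <- visited + 1
```

With 1:nrow(df) the loop would run for i = 1 and i = 0, indexing rows that do not exist.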


Ł Łaniewski-Wołłk

I use this simple utility function:

rows = function(tab) lapply(
  seq_len(nrow(tab)),
  function(i) unclass(tab[i,,drop=F])
)

Or a faster, less clear form:

rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,"[",i))

This function just splits a data.frame to a list of rows. Then you can make a normal "for" over this list:

tab = data.frame(x = 1:3, y=2:4, z=3:5)
for (A in rows(tab)) {
    print(A$x + A$y * A$z)
}        

Your code from the question will work with a minimal modification:

for (well in rows(dataFrame)) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

It's faster to access a straight list than a data.frame.
Just realized it's even faster to make the same thing with double lapply: rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,function(c) c[i]))
So the inner lapply iterates over the columns of the entire dataset x, giving each column the name c, and then extracting the ith entry from that column vector. Is this correct?
Very nice! In my case, I had to convert from "factor" values to the underlying value: wellName <- as.character(well$name).
Ferran E

I was curious about the time performance of the non-vectorised options. For this purpose, I have used the function f defined by knguyen

f <- function(x, output) {
  wellName <- x[1]
  plateName <- x[2]
  wellID <- 1
  print(paste(wellID, x[3], x[4], sep=","))
  cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

and a dataframe like the one in his example:

n = 100; #number of rows for the data frame
d <- data.frame( name = LETTERS[ sample.int( 25, n, replace=T ) ],
                  plate = paste0( "P", 1:n ),
                  value1 = 1:n,
                  value2 = (1:n)*10 )

I included two vectorised functions (for sure quicker than the others) in order to compare the cat() approach with a write.table() one...

library("ggplot2")
library( "microbenchmark" )
library( foreach )
library( iterators )

tm <- microbenchmark(S1 =
                       apply(d, 1, f, output = 'outputfile1'),
                     S2 = 
                       for(i in 1:nrow(d)) {
                         row <- d[i,]
                         # do stuff with row
                         f(row, 'outputfile2')
                       },
                     S3 = 
                       foreach(d1=iter(d, by='row'), .combine=rbind) %dopar% f(d1,"outputfile3"),
                     S4= {
                       print( paste(wellID=rep(1,n), d[,3], d[,4], sep=",") )
                       cat( paste(wellID=rep(1,n), d[,3], d[,4], sep=","), file= 'outputfile4', sep='\n',append=T, fill = F)                           
                     },
                     S5 = {
                       print( (paste(wellID=rep(1,n), d[,3], d[,4], sep=",")) )
                       write.table(data.frame(rep(1,n), d[,3], d[,4]), file='outputfile5', row.names=F, col.names=F, sep=",", append=T )
                     },
                     times=100L)
autoplot(tm)

https://i.stack.imgur.com/hHQ3M.png


RobinL

You can use the by_row function from the package purrrlyr for this:

myfn <- function(row) {
  #row is a tibble with one row, and the same 
  #number of columns as the original df
  #If you'd rather it be a list, you can use as.list(row)
}

purrrlyr::by_row(df, myfn)

By default, the returned value from myfn is put into a new list column in the df called .out.

If this is the only output you desire, you could write purrrlyr::by_row(df, myfn)$.out


Amogh Borkar

Well, since you asked for R equivalent to other languages, I tried to do this. Seems to work though I haven't really looked at which technique is more efficient in R.

> myDf <- head(iris)
> myDf
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nRowsDf <- nrow(myDf)
> for(i in 1:nRowsDf){
+ print(myDf[i,4])
+ }
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.4

For the categorical columns though, it would fetch you a factor rather than a plain string, which you could convert using as.character() if needed.
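As a quick check on the categorical case, here is what indexing a single cell of the Species column of the built-in iris data returns:

```r
myDf <- head(iris)

species <- myDf[1, 5]                  # a factor of length 1, not a string
speciesText <- as.character(species)   # convert to the underlying label
```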


Seyma Kalay

You can do something similar for a list of data frames:

data("mtcars")
rownames(mtcars)
data <- list(mtcars ,mtcars, mtcars, mtcars);data

out1 <- NULL 
for(i in seq_along(data)) { 
  out1[[i]] <- data[[i]][rownames(data[[i]]) != "Volvo 142E", ] } 
out1

Or for a single data frame:

data("mtcars")
df <- mtcars
out1 <- NULL 
for(i in 1:nrow(df)) {
  row <- rownames(df[i,])
  # do stuff with row
  out1 <- df[rownames(df) != "Volvo 142E",]
  
}
out1