I have a dataframe with multiple columns. For each row in the dataframe, I want to call a function whose inputs come from several columns of that row. For example, let's say I have this data and this testFunc, which accepts two args:
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> testFunc <- function(a, b) a + b
Let's say I want to apply this testFunc to columns x and z. So, for row 1 I want 1+5, and for row 2 I want 2 + 6. Is there a way to do this without writing a for loop, maybe with the apply function family?
I tried this:
> df[,c('x','z')]
x z
1 1 5
2 2 6
> lapply(df[,c('x','z')], testFunc)
Error in a + b : 'b' is missing
But I got an error. Any ideas?
EDIT: the actual function I want to call is not a simple sum, but it is power.t.test. I used a+b just for example purposes. The end goal is to be able to do something like this (written in pseudocode):
df = data.frame(
delta=c(delta_values),
power=c(power_values),
sig.level=c(sig.level_values)
)
lapply(df, power.t.test(delta_from_each_row_of_df,
power_from_each_row_of_df,
sig.level_from_each_row_of_df
))
where the result is a vector of outputs for power.t.test for each row of df.
You can call apply on a subset of the original data:
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
apply(dat[,c('x','z')], 1, function(x) sum(x) )
or if your function is just sum use the vectorized version:
rowSums(dat[,c('x','z')])
[1] 6 8
If you want to use testFunc:
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(x) testFunc(x[1],x[2]))
EDIT To access columns by name and not index you can do something like this:
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(y) testFunc(y['z'],y['x']))
A data.frame is a list, so ...
For vectorized functions, do.call is usually a good bet. But the names of the arguments come into play. Here your testFunc is called with args x and z in place of a and b. The ... allows irrelevant args (such as y) to be passed without causing an error:
do.call( function(x,z,...) testFunc(x,z), df )
For non-vectorized functions, mapply will work, but you need to match the ordering of the args or explicitly name them:
mapply(testFunc, df$x, df$z)
Sometimes apply will work, as when all args are of the same type, so coercing the data.frame to a matrix does not cause problems by changing data types. Your example was of this sort.
If your function is to be called within another function into which the arguments are all passed, there is a much slicker method than these. Study the first lines of the body of lm() if you want to go that route.
Vectorize can also be used as a wrapper around mapply to vectorize functions.
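To illustrate the Vectorize remark, here is a minimal sketch. The function clampedDiff is a hypothetical example invented for illustration (it is not from the question); it only works on scalars because if() needs a length-one condition. Vectorize wraps it in mapply so it can be called on whole columns.

```r
# Hypothetical scalar-only function: `if` needs a length-one condition,
# so calling it directly on whole columns would not work as intended.
clampedDiff <- function(a, b) if (a > b) a - b else 0

# Vectorize() returns a wrapper that calls mapply() under the hood.
vClampedDiff <- Vectorize(clampedDiff)

df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# One result per row: z - x where z > x, otherwise 0.
res <- vClampedDiff(df$z, df$x)
print(res)
```

This is still a loop internally, so for functions that are already vectorized a direct call is likely to be faster.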
Use mapply
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> mapply(function(x,y) x+y, df$x, df$z)
[1] 6 8
> cbind(df,f = mapply(function(x,y) x+y, df$x, df$z) )
x y z f
1 1 3 5 6
2 2 4 6 8
New answer with dplyr package
If the function that you want to apply is vectorized, then you could use the mutate function from the dplyr package:
> library(dplyr)
> myf <- function(tens, ones) { 10 * tens + ones }
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mutate(x, value = myf(tens, ones))
hundreds tens ones value
1 7 1 4 14
2 8 2 5 25
3 9 3 6 36
Old answer with plyr package
In my humble opinion, the tool best suited to the task is mdply from the plyr package.
Example:
> library(plyr)
> x <- data.frame(tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
tens ones V1
1 1 4 14
2 2 5 25
3 3 6 36
Unfortunately, as Bertjan Broeksema pointed out, this approach fails if you don't use all the columns of the data frame in the mdply call. For example,
> library(plyr)
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
Error in (function (tens, ones) : unused argument (hundreds = 7)
dplyr::mutate_each. For example: iris %>% mutate_each(funs(half = . / 2), -Species).
Others have correctly pointed out that mapply is made for this purpose, but (for the sake of completeness) a conceptually simpler method is just to use a for loop.
for (row in 1:nrow(df)) {
df$newvar[row] <- testFunc(df$x[row], df$z[row])
}
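As a variation on the loop above, here is a small sketch that iterates over row indices with vapply instead. It assumes the df and testFunc from the question. Two choices are mine rather than the answer's: seq_len(nrow(df)) is used so that an empty data frame yields zero iterations (1:nrow(df) would give c(1, 0)), and numeric(1) declares that each call must return a single number.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
testFunc <- function(a, b) a + b

# Iterate over row indices; vapply checks that every result is one numeric value.
df$newvar <- vapply(
  seq_len(nrow(df)),
  function(i) testFunc(df$x[i], df$z[i]),
  numeric(1)
)
print(df)
```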
Many functions are already vectorized, so there is no need for any iteration (neither for loops nor *pply functions). Your testFunc is one such example. You can simply call:
testFunc(df[, "x"], df[, "z"])
In general, I would recommend trying such vectorized approaches first and seeing whether they give you the intended results.
Alternatively, if you need to pass multiple arguments to a function which is not vectorized, mapply might be what you are looking for:
mapply(power.t.test, df[, "x"], df[, "z"])
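For the power.t.test case from the question's edit, a sketch along these lines may be closer to what is needed. It assumes the columns are named delta and power (as in the question's pseudocode) and passes them by name, because power.t.test takes n as its first positional argument. SIMPLIFY = FALSE keeps each result as a full power.htest object in a list. The names results and n_per_group are just illustrative.

```r
# Example data with columns named after the power.t.test arguments.
df <- data.frame(delta = c(1, 1, 2, 2), power = c(0.90, 0.85, 0.75, 0.45))

# Pass arguments by name; positional matching would fill `n` first.
results <- mapply(
  function(delta, power) power.t.test(delta = delta, power = power),
  df$delta, df$power,
  SIMPLIFY = FALSE  # keep each power.htest object intact
)

# Pull one component (the per-group sample size) out of every result.
n_per_group <- sapply(results, function(r) r$n)
print(n_per_group)
```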
Here is an alternate approach. It is more intuitive.
One key aspect that I feel some of the answers did not take into account, and which I point out for posterity, is that apply() lets you do row calculations easily, but only for matrix-like (e.g. all-numeric) data. Operations on columns are still possible for dataframes:
as.data.frame(lapply(df, myFunctionForColumn))
To operate on rows, we make the transpose first.
tdf<-as.data.frame(t(df))
as.data.frame(lapply(tdf, myFunctionForRow))
The downside is that I believe R will make a copy of your data table, which could be a memory issue. (This is truly sad, because it is programmatically simple for tdf to just be an iterator to the original df, thus saving memory, but R does not allow pointer or iterator referencing.)
A related question is how to operate on each individual cell in a dataframe:
newdf <- as.data.frame(lapply(df, function(x) sapply(x, myFunctionForEachCell)))
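To make the transpose idea concrete, here is a small sketch comparing it with plain iteration over row indices. rowRange is a hypothetical stand-in for myFunctionForRow. The index-based version does not build the transposed copy, although it does create a small one-row data frame on each step, so I would not assume it is faster.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Hypothetical row function: spread between the largest and smallest value.
rowRange <- function(row) max(row) - min(row)

# Approach from the answer: transpose, then treat each original row as a column.
tdf <- as.data.frame(t(df))
viaTranspose <- sapply(tdf, rowRange)

# Alternative: walk over row indices and flatten each row to a vector.
viaIndex <- sapply(seq_len(nrow(df)), function(i) rowRange(unlist(df[i, ])))

print(viaTranspose)
print(viaIndex)
```

Both approaches coerce a row to a single type, so they share the mixed-type caveat that applies to apply().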
data.table has a really intuitive way of doing this as well:
library(data.table)
sample_fxn = function(x,y,z){
return((x+y)*z)
}
df = data.table(A = 1:5,B=seq(2,10,2),C = 6:10)
> df
A B C
1: 1 2 6
2: 2 4 7
3: 3 6 8
4: 4 8 9
5: 5 10 10
The := operator can be called within brackets to add a new column using a function:
df[,new_column := sample_fxn(A,B,C)]
> df
A B C new_column
1: 1 2 6 18
2: 2 4 7 42
3: 3 6 8 72
4: 4 8 9 108
5: 5 10 10 150
It's also easy to pass constants as arguments using this method:
df[,new_column2 := sample_fxn(A,B,2)]
> df
A B C new_column new_column2
1: 1 2 6 18 6
2: 2 4 7 42 12
3: 3 6 8 72 18
4: 4 8 9 108 24
5: 5 10 10 150 30
df[,new_column := Vectorize(sample_fxn)(A,B,C)]
@user20877984's answer is excellent. Since they summed it up far better than my previous answer, here is my (possibly still shoddy) attempt at an application of the concept:
Using do.call in a basic fashion:
powvalues <- list(power=0.9,delta=2)
do.call(power.t.test,powvalues)
Working on a full data set:
# get the example data
df <- data.frame(delta=c(1,1,2,2), power=c(.90,.85,.75,.45))
#> df
# delta power
#1 1 0.90
#2 1 0.85
#3 2 0.75
#4 2 0.45
lapply the power.t.test function to each of the rows of specified values:
result <- lapply(
split(df,1:nrow(df)),
function(x) do.call(power.t.test,x)
)
> str(result)
List of 4
$ 1:List of 8
..$ n : num 22
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.9
..$ alternative: chr "two.sided"
..$ note : chr "n is number in *each* group"
..$ method : chr "Two-sample t test power calculation"
..- attr(*, "class")= chr "power.htest"
$ 2:List of 8
..$ n : num 19
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.85
... ...
I came here looking for the tidyverse function name, which I knew existed. Adding this for (my) future reference and for tidyverse enthusiasts: purrrlyr::invoke_rows (purrr::invoke_rows in older versions).
With connection to standard stats methods as in the original question, the broom package would probably help.
If the data.frame columns are of different types, apply() has a problem. A subtlety of row iteration is that apply(a.data.frame, 1, ...) implicitly converts everything to character when the columns have different types, e.g. a factor and a numeric column. Here's an example, using a factor in one column to modify a numeric column:
mean.height = list(BOY=69.5, GIRL=64.0)
subjects = data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY"))
, height = c(71.0, 59.3, 62.1, 62.1))
apply(subjects, 1, function(x) x[2] - mean.height[[x[1]]])
The subtraction fails because the columns are converted to character types.
One fix is to back-convert the second column to a number:
apply(subjects, 1, function(x) as.numeric(x[2]) - mean.height[[x[1]]])
But the conversions can be avoided by keeping the columns separate and using mapply():
mapply(function(x,y) y - mean.height[[x]], subjects$gender, subjects$height)
mapply() is needed because [[ ]] does not accept a vector argument. Alternatively, the column iteration can be done before the subtraction by passing a vector to [ ], at the cost of slightly uglier code:
subjects$height - unlist(mean.height[subjects$gender])
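The coercion described above can be checked directly. The sketch below reuses the subjects and mean.height data from this answer. One detail is my own assumption, not part of the original code: the factor is converted with as.character() so the lookup in mean.height is by name rather than by the factor's underlying integer code. The names coerced, rowTypes and diffs are illustrative.

```r
mean.height <- list(BOY = 69.5, GIRL = 64.0)
subjects <- data.frame(
  gender = factor(c("BOY", "GIRL", "GIRL", "BOY")),
  height = c(71.0, 59.3, 62.1, 62.1)
)

# apply() first converts the data frame to a matrix; with mixed column
# types that matrix holds character strings.
coerced <- as.matrix(subjects)
rowTypes <- typeof(coerced)
print(rowTypes)

# Keeping the columns separate preserves their types. as.character() makes
# the list lookup use the level name ("BOY"/"GIRL") explicitly.
diffs <- mapply(
  function(g, h) h - mean.height[[g]],
  as.character(subjects$gender), subjects$height,
  USE.NAMES = FALSE
)
print(diffs)
```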
A really nice function for this is adply from plyr, especially if you want to append the result to the original dataframe. This function and its cousin ddply have saved me a lot of headaches and lines of code!
df_appended <- adply(df, 1, mutate, sum=x+z)
Alternatively, you can call the function you desire.
df_appended <- adply(df, 1, mutate, sum=testFunc(x,z))
If you use apply on big data.frames, it will copy the entire object (to convert it to a matrix). This will also cause problems if you have objects of different classes within the data.frame.