I have a data frame with two columns. The first column contains categories such as "First", "Second", and "Third", and the second column has numbers that represent the number of times I saw each group from "Category".
For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum all the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
In the example above, multiple grouping dimensions can be specified in the list passed to by. Multiple aggregated metrics of the same data type can be incorporated via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3), by=list(Category=x$Category), FUN=sum)
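For instance, here is a runnable sketch of both ideas, using the x defined below plus a hypothetical second grouping column Site and a hypothetical metric Metric2 that are not part of the original data:
x2 <- transform(x,
                Site    = c("a", "a", "b", "b", "a", "b", "a"),  # hypothetical second grouping column
                Metric2 = Frequency * 2)                         # hypothetical second metric
# several grouping dimensions in the by= list, several metrics via cbind
aggregate(cbind(x2$Frequency, x2$Metric2),
          by = list(Category = x2$Category, Site = x2$Site),
          FUN = sum)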
Embedding @thelatemail's comment, aggregate has a formula interface too:
aggregate(Frequency ~ Category, x, sum)
Or if you want to aggregate multiple columns, you could use the . notation (it works for one column too):
aggregate(. ~ Category, x, sum)
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
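tapply returns a named vector rather than a data frame; if, as in the question, a data frame is wanted, here is a small conversion sketch (using the same x as below):
tab <- tapply(x$Frequency, x$Category, FUN = sum)
data.frame(Category = names(tab), Frequency = as.vector(tab))
#   Category Frequency
# 1    First        30
# 2   Second         5
# 3    Third        34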
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
You can also use the dplyr package for that purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (works with one column too):
x %>%
group_by(Category) %>%
summarise(across(everything(), sum))
Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars:
# several summary columns with arbitrary names
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
# summarise all columns except grouping columns using "sum"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), sum))
# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
# multiple grouping columns
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
# summarise specific variables, not all
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))
# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
group_by(gear) %>%
summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
For more information, including the %>% operator, see the introduction to dplyr.
Note: in dplyr versions before 1.0.0, the same was written with the funs() argument of summarise_all and its related verbs (summarise_at, summarise_if); these are now superseded by across().
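For reference, a sketch of the superseded spellings (they still run in current dplyr, but across() is the recommended replacement):
# dplyr < 1.0 style; summarise_all()/summarise_at() are superseded by across()
x %>% group_by(Category) %>% summarise_all(sum)
x %>% group_by(Category) %>% summarise_at(vars(Frequency), sum)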
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that to the same operation using a data.frame and aggregate as above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the column name, this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference will become more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
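If only some of the non-grouping columns should be summed, .SDcols controls what .SD contains; a small sketch using the same data.table:
# restrict .SD to selected columns (here just Frequency)
data[, lapply(.SD, sum), by = Category, .SDcols = "Frequency"]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34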
A note on data[, sum(Frequency), by = Category]: if you only need per-group row counts rather than column totals, .N can substitute for sum(), as in data[, .N, by = Category]. Here is a useful cheatsheet: s3.amazonaws.com/assets.datacamp.com/img/blog/…
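A sketch contrasting the two, using the same data.table as above:
# sum() totals a column; .N counts rows per group --
# they agree only when every row should count as 1
data[, .(total = sum(Frequency), n_rows = .N), by = Category]
#    Category total n_rows
# 1:    First    30      3
# 2:   Second     5      2
# 3:    Third    34      2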
You can also use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
Several years later, just to add another simple base R solution that isn't present here for some reason: xtabs.
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or if you want a data.frame back:
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
With plyr:
library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
If x is a data frame with your data, then the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things. Here is an example of how this question can be answered with sqldf:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
EDIT: this is a very old answer. Now I would recommend the use of group_by and summarise from dplyr, as in @docendo's answer.
Another solution, which returns sums by group in a matrix or a data frame and is short and fast:
rowsum(x$Frequency, x$Category)
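Note that rowsum() returns a matrix with the group labels as row names; a minimal sketch to turn that back into a data frame:
rs <- rowsum(x$Frequency, x$Category)
data.frame(Category = rownames(rs), Frequency = rs[, 1], row.names = NULL)
#   Category Frequency
# 1    First        30
# 2   Second         5
# 3    Third        34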
I find ave very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must/want to stick to base R).
For example, given this input:
DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))
> DF
Categ1 Categ2 Samples Freq
1 A X 1 10
2 A Y 2 30
3 B X 4 45
4 B X 3 55
5 A X 5 80
6 B Y 6 65
7 A Y 7 50
we want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq.
Here's a possible solution using ave:
# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]
# add sum of Samples by Categ1,Categ2 to DF2
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)
# add mean of Freq by Categ1,Categ2 to DF2
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)
# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]
Result:
> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1 A X 6 45
2 A Y 9 40
3 B X 7 50
6 B Y 6 65
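For comparison (not part of the original ave approach), the same result as a single call with dplyr, assuming the same DF:
library(dplyr)
DF %>%
  group_by(Categ1, Categ2) %>%
  summarise(GroupTotSamples = sum(Samples),   # sum of Samples per group
            GroupAvgFreq    = mean(Freq),     # mean of Freq per group
            .groups = "drop")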
You could use the function group.sum from the package Rfast:
Category <- Rfast::as_integer(as.character(x$Category), result.sort = FALSE) # recode the labels as integers; base R's as.numeric() on characters produces NAs
result <- Rfast::group.sum(x$Frequency, Category)
names(result) <- Rfast::Sort(unique(as.character(x$Category)))
result
#  First Second  Third
#     30      5     34
Rfast has many group functions, and group.sum is one of them.
Since dplyr 1.0.0, the across() function can be used:
df %>%
group_by(Category) %>%
summarise(across(Frequency, sum))
Category Frequency
<chr> <int>
1 First 30
2 Second 5
3 Third 34
If interested in multiple variables:
df %>%
group_by(Category) %>%
summarise(across(c(Frequency, Frequency2), sum))
Category Frequency Frequency2
<chr> <int> <int>
1 First 30 55
2 Second 5 29
3 Third 34 190
And the selection of variables using select helpers:
df %>%
group_by(Category) %>%
summarise(across(starts_with("Freq"), sum))
Category Frequency Frequency2 Frequency3
<chr> <int> <int> <dbl>
1 First 30 55 110
2 Second 5 29 58
3 Third 34 190 380
Sample data:
df <- read.table(text = "Category Frequency Frequency2 Frequency3
1 First 10 10 20
2 First 15 30 60
3 First 5 15 30
4 Second 2 8 16
5 Third 14 70 140
6 Third 20 120 240
7 Second 3 21 42",
header = TRUE,
stringsAsFactors = FALSE)
Using cast instead of recast (note that 'Frequency' is now 'value'):
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")
library(reshape)
result <- cast(df, Category ~ ., fun.aggregate = sum)
to get:
Category (all)
First 30
Second 5
Third 34
Using count() with a weight:
library(tidyverse)
x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'),
Frequency = c(10, 15, 5, 2, 14, 20, 3))
count(x, Category, wt = Frequency)
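Here wt = Frequency makes count() sum that column per group instead of counting rows; it is shorthand for the longer group_by()/summarise() form:
# longhand equivalent of count(x, Category, wt = Frequency)
x %>%
  group_by(Category) %>%
  summarise(n = sum(Frequency))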
A good way to sum a variable by group is rowsum(numericToBeSummedUp, groups) from base R. Here only collapse::fsum and Rfast::group.sum have been faster.
Regarding speed and memory consumption, collapse::fsum(numericToBeSummedUp, groups) was the best in the given example, and it can be sped up further by using a grouped data frame:
GDF <- collapse::fgroup_by(DF, g) #Create a grouped data.frame with group g
#GDF <- collapse::gby(DF, g) #Alternative
collapse::fsum(GDF) #Calculate sum per group
This comes close to the timings obtained when the dataset is split into sub-datasets per group.
A benchmark of the different methods shows that for summing up a single column, collapse::fsum was two times faster than Rfast::group.sum and seven times faster than rowsum. They were followed by tapply, data.table, by and dplyr. xtabs and aggregate were the slowest.
When aggregating two columns, collapse::fsum is again the fastest, three times faster than Rfast::group.sum and five times faster than rowsum. They are followed by data.table, tapply, by and dplyr. Again, xtabs and aggregate are the slowest.
Benchmark
set.seed(42)
n <- 1e5
DF <- data.frame(g = as.factor(sample(letters, n, TRUE))
, x = rnorm(n), y = rnorm(n) )
library(magrittr)
Some methods allow preparatory work that can speed up the aggregation itself.
DT <- data.table::as.data.table(DF)
data.table::setkey(DT, g)
DFG <- collapse::gby(DF, g)
DFG1 <- collapse::gby(DF[c("g", "x")], g)
# Optimized dataset for this aggregation task
# This will also consume time!
DFS <- lapply(split(DF[c("x", "y")], DF["g"]), as.matrix)
DFS1 <- lapply(split(DF["x"], DF["g"]), as.matrix)
Summing up one column.
bench::mark(check = FALSE
, "aggregate" = aggregate(DF$x, DF["g"], sum)
, "tapply" = tapply(DF$x, DF$g, sum)
, "dplyr" = DF %>% dplyr::group_by(g) %>% dplyr::summarise(sum = sum(x))
, "data.table" = data.table::as.data.table(DF)[, sum(x), by = g]
, "data.table2" = DT[, sum(x), by = g]
, "by" = by(DF$x, DF$g, sum)
, "xtabs" = xtabs(x ~ g, DF)
, "rowsum" = rowsum(DF$x, DF$g)
, "Rfast" = Rfast::group.sum(DF$x, DF$g)
, "base Split" = lapply(DFS1, colSums)
, "base Split Rfast" = lapply(DFS1, Rfast::colsums)
, "collapse" = collapse::fsum(DF$x, DF$g)
, "collapse2" = collapse::fsum(DFG1)
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
# 1 aggregate 20.43ms 21.88ms 45.7 16.07MB 59.4 10 13
# 2 tapply 1.24ms 1.39ms 687. 1.53MB 30.1 228 10
# 3 dplyr 3.28ms 4.81ms 209. 2.42MB 13.1 96 6
# 4 data.table 1.59ms 2.47ms 410. 4.69MB 87.7 145 31
# 5 data.table2 1.52ms 1.93ms 514. 2.38MB 40.5 190 15
# 6 by 2.15ms 2.31ms 396. 2.29MB 26.7 148 10
# 7 xtabs 7.78ms 8.91ms 111. 10.54MB 50.0 31 14
# 8 rowsum 951.36µs 1.07ms 830. 1.15MB 24.1 378 11
# 9 Rfast 431.06µs 434.53µs 2268. 2.74KB 0 1134 0
#10 base Split 213.42µs 219.66µs 4342. 256B 12.4 2105 6
#11 base Split Rfast 76.88µs 81.48µs 10923. 65.05KB 16.7 5232 8
#12 collapse 121.03µs 122.92µs 7965. 256B 2.01 3961 1
#13 collapse2 85.97µs 88.67µs 10749. 256B 4.03 5328 2
Summing up two columns.
bench::mark(check = FALSE
, "aggregate" = aggregate(DF[c("x", "y")], DF["g"], sum)
, "tapply" = list2DF(lapply(DF[c("x", "y")], tapply, list(DF$g), sum))
, "dplyr" = DF %>% dplyr::group_by(g) %>% dplyr::summarise(x = sum(x), y = sum(y))
, "data.table" = data.table::as.data.table(DF)[,.(sum(x),sum(y)), by = g]
, "data.table2" = DT[,.(sum(x),sum(y)), by = g]
, "by" = lapply(DF[c("x", "y")], by, list(DF$g), sum)
, "xtabs" = xtabs(cbind(x, y) ~ g, DF)
, "rowsum" = rowsum(DF[c("x", "y")], DF$g)
, "Rfast" = list2DF(lapply(DF[c("x", "y")], Rfast::group.sum, DF$g))
, "base Split" = lapply(DFS, colSums)
, "base Split Rfast" = lapply(DFS, Rfast::colsums)
, "collapse" = collapse::fsum(DF[c("x", "y")], DF$g)
, "collapse2" = collapse::fsum(DFG)
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
# 1 aggregate 25.87ms 26.36ms 37.7 20.89MB 132. 4 14
# 2 tapply 2.65ms 3.23ms 312. 3.06MB 22.5 97 7
# 3 dplyr 4.27ms 6.02ms 164. 3.19MB 13.3 74 6
# 4 data.table 2.33ms 3.19ms 309. 4.72MB 57.0 114 21
# 5 data.table2 2.22ms 2.81ms 355. 2.41MB 19.8 161 9
# 6 by 4.45ms 5.23ms 190. 4.59MB 22.5 59 7
# 7 xtabs 10.71ms 13.14ms 76.1 19.7MB 145. 11 21
# 8 rowsum 1.02ms 1.07ms 850. 1.15MB 23.8 393 11
# 9 Rfast 841.57µs 846.88µs 1150. 5.48KB 0 575 0
#10 base Split 360.24µs 368.28µs 2652. 256B 8.16 1300 4
#11 base Split Rfast 113.95µs 119.81µs 7540. 65.05KB 10.3 3661 5
#12 collapse 201.31µs 204.83µs 4724. 512B 2.01 2350 1
#13 collapse2 156.95µs 161.79µs 5408. 512B 2.02 2683 1
Increasing n to 1e7 and re-running the benchmark for the top performers gives mostly the same order: rowsum is unbeatable, with data.table2 in second and dplyr not far behind. On data that big, dplyr actually beats data.table with the class conversion included in the benchmark. collapse::fsum is also fast, at least on larger data with more groups:
set.seed(42); n <- 1e7
DF <- data.frame(g = as.factor(sample(1e4, n, TRUE)), x = rnorm(n), y = rnorm(n))
system.time(Rfast::group.sum(DF$x, DF$g))
system.time(collapse::fsum(DF$x, DF$g))
A precomputed grouping object can also be used: gr <- collapse::GRP(DF, ~ g); collapse::fsum(DF, gr). In short, collapse::fsum is currently the fastest.
Given an ID column such as 1:nrow(df), is it possible to keep the starting position of each category after aggregating? So the ID column would end up as, for example, 1, 3, 4, 7 after collapsing with aggregate. In my case I like aggregate because it works over many columns automatically.