I wonder how to add regression line equation and R^2 on the ggplot
. My code is:
library(ggplot2)
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
p <- ggplot(data = df, aes(x = x, y = y)) +
geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
geom_point()
p
Any help will be highly appreciated.
latticeExtra::lmlineq()
.
Error: 'lmlineq' is not an exported object from 'namespace:latticeExtra'
Here is one solution
# GET EQUATION AND R-SQUARED AS STRING
# SOURCE: https://groups.google.com/forum/#!topic/ggplot2/1TgH-kG5XMA
lm_eqn <- function(df){
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(unname(coef(m)[1]), digits = 2),
b = format(unname(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
p1 <- p + geom_text(x = 25, y = 300, label = lm_eqn(df), parse = TRUE)
EDIT. I figured out the source from where I picked this code. Here is the link to the original post in the ggplot2 google groups
https://i.stack.imgur.com/7aq55.png
Statistic stat_poly_eq()
in my package ggpmisc
makes it possible add text labels based on a linear model fit.
This answer has been updated for 'ggpmisc' (>= 0.4.0) and 'ggplot2' (>= 3.3.0) on 2022-06-02. In the examples I use stat_poly_line()
instead of stat_smooth()
as it has the same defaults as stat_poly_eq()
for method
and formula
. I have omitted in all code examples the additional arguments to stat_poly_line()
as they are irrelevant to the question of adding labels.
library(ggplot2)
library(ggpmisc)
#> Loading required package: ggpp
#>
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#>
#> annotate
# artificial data
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
df$yy <- 2 + 3 * df$x + 0.1 * df$x^2 + rnorm(100, sd = 40)
# using default formula, label and methods
ggplot(data = df, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq() +
geom_point()
https://i.imgur.com/RDZ2XTj.png
# assembling a single label with equation and R2
ggplot(data = df, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq(aes(label = paste(after_stat(eq.label),
after_stat(rr.label), sep = "*\", \"*"))) +
geom_point()
https://i.imgur.com/1moN0zF.png
# adding separate labels with equation and R2
ggplot(data = df, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq(aes(label = after_stat(eq.label))) +
stat_poly_eq(label.y = 0.9) +
geom_point()
https://i.imgur.com/Gv3qsgl.png
# regression through the origin
ggplot(data = df, aes(x = x, y = y)) +
stat_poly_line(formula = y ~ x + 0) +
stat_poly_eq(formula = y ~ x + 0, aes(label = after_stat(eq.label))) +
geom_point()
https://i.imgur.com/xux4jvx.png
# fitting a polynomial
ggplot(data = df, aes(x = x, y = yy)) +
stat_poly_line(formula = y ~ poly(x, 2, raw = TRUE)) +
stat_poly_eq(formula = y ~ poly(x, 2, raw = TRUE),
aes(label = after_stat(eq.label))) +
geom_point()
https://i.imgur.com/ss2mXll.png
# adding a hat as asked by @MYaseen208 and @elarry
ggplot(data = df, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq(eq.with.lhs = "italic(hat(y))~`=`~",
aes(label = paste(after_stat(eq.label),
after_stat(rr.label), sep = "*\", \"*"))) +
geom_point()
https://i.imgur.com/Eb2kLV9.png
# variable substitution as asked by @shabbychef
# same labels in equation and axes
ggplot(data = df, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq(eq.with.lhs = "italic(h)~`=`~",
eq.x.rhs = "~italic(z)",
aes(label = after_stat(eq.label))) +
labs(x = expression(italic(z)), y = expression(italic(h))) +
geom_point()
https://i.imgur.com/S6xxz2v.png
# grouping as asked by @helen.h
dfg <- data.frame(x = c(1:100))
dfg$y <- 20 * c(0, 1) + 3 * df$x + rnorm(100, sd = 40)
dfg$group <- factor(rep(c("A", "B"), 50))
ggplot(data = dfg, aes(x = x, y = y, colour = group)) +
stat_poly_line() +
stat_poly_eq(aes(label = paste(after_stat(eq.label),
after_stat(rr.label), sep = "*\", \"*"))) +
geom_point()
https://i.imgur.com/lTEynGy.png
ggplot(data = dfg, aes(x = x, y = y, linetype = group, grp.label = group)) +
stat_poly_line() +
stat_poly_eq(aes(label = paste(after_stat(grp.label), "*\": \"*",
after_stat(eq.label), "*\", \"*",
after_stat(rr.label), sep = ""))) +
geom_point()
https://i.imgur.com/TfBBBCu.png
# a single fit with grouped data as asked by @Herman
ggplot(data = dfg, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq(aes(label = paste(after_stat(eq.label),
after_stat(rr.label), sep = "*\", \"*"))) +
geom_point(aes(colour = group))
https://i.imgur.com/BlWBPRk.png
# facets
ggplot(data = dfg, aes(x = x, y = y)) +
stat_poly_line() +
stat_poly_eq(aes(label = paste(after_stat(eq.label),
after_stat(rr.label), sep = "*\", \"*"))) +
geom_point() +
facet_wrap(~group)
https://i.imgur.com/6BIxWGj.png
Created on 2022-06-02 by the reprex package (v2.0.1)
x
and y
in the formula refer to the x
and y
data in the layers of the plot, and not necessarily to those in scope at the time my.formula
is constructed. Thus the formula should always use x and y variables?
aes(label = paste(..eq.label.., ..rr.label.., sep = "*plain(\",\")~"))
does the job.
stat_poly_eq()
. You can use stat_fit_glance()
, also from package 'ggpmisc', which returns R2 as a numeric value. See examples in the help page, and replace stat(r.squared)
by sqrt(stat(r.squared))
.
I changed a few lines of the source of stat_smooth
and related functions to make a new function that adds the fit equation and R squared value. This will work on facet plots too!
library(devtools)
source_gist("524eade46135f6348140")
df = data.frame(x = c(1:100))
df$y = 2 + 5 * df$x + rnorm(100, sd = 40)
df$class = rep(1:2,50)
ggplot(data = df, aes(x = x, y = y, label=y)) +
stat_smooth_func(geom="text",method="lm",hjust=0,parse=TRUE) +
geom_smooth(method="lm",se=FALSE) +
geom_point() + facet_wrap(~class)
https://i.stack.imgur.com/onecx.png
I used the code in @Ramnath's answer to format the equation. The stat_smooth_func
function isn't very robust, but it shouldn't be hard to play around with it.
https://gist.github.com/kdauria/524eade46135f6348140. Try updating ggplot2
if you get an error.
stat_smooth_func(mapping=aes(group=cut(x.val,c(-70,-20,0,20,50,130))),geom="text",method="lm",hjust=0,parse=TRUE)
, in combination with EvaluateSmooths from stackoverflow.com/questions/19735149/…
source
the entire file in your script.
xpos
and ypos
arguments of the function in the Gist. So if you wanted all the equations to overlap, just set xpos
and ypos
. Otherwise, xpos
and ypos
are calculated from the data. If you want something fancier, it shouldn't be too hard to add some logic inside the function. For example, maybe you could write a function to determine what part of the graph has the most empty space and put the function there.
I've modified Ramnath's post to a) make more generic so it accepts a linear model as a parameter rather than the data frame and b) displays negatives more appropriately.
lm_eqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
} else {
eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
}
as.character(as.expression(eq));
}
Usage would change to:
p1 = p + geom_text(aes(x = 25, y = 300, label = lm_eqn(lm(y ~ x, df))), parse = TRUE)
p1 = p + annotate("text", x = 25, y = 300, label = lm_eqn(lm(y ~ x, df)), colour="black", size = 5, parse=TRUE)
edit: this also resolves any issues you might have with letters showing up in your legend.
"cannot coerce class "lm" to a data.frame"
. This alternative works: df.labs <- data.frame(x = 25, y = 300, label = lm_eqn(df))
and p <- p + geom_text(data = df.labs, aes(x = x, y = y, label = label), parse = TRUE)
lm_eqn(lm(...))
with Ramnath's solution. You probably tried this one after trying that one but forgot to ensure that you had redefined lm_eqn
Here's the most simplest code for everyone
Note: Showing Pearson's Rho and not R^2.
library(ggplot2)
library(ggpubr)
df <- data.frame(x = c(1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
p <- ggplot(data = df, aes(x = x, y = y)) +
geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
geom_point()+
stat_cor(label.y = 35)+ #this means at 35th unit in the y axis, the r squared and p value will be shown
stat_regline_equation(label.y = 30) #this means at 30th unit regresion line equation will be shown
p
https://i.stack.imgur.com/f4FGl.png
stat_cor(aes(label = ..rr.label..))
stat_regline_equation()
to plot the regression equation
stat_regline_equation()
and in stat_cor()
was just copied without acknowledgement from my package 'ggpmisc'. It was taken from stat_poly_eq()
which is actively maintained and has gained several new features since it was copied. The example code needs minimal edits to work with 'ggpmisc'.
stat_poly_eq()
to be more cumbersome. The fact that I can't easily separate the label=..eq.label..
and label=..rr.label..
onto separate lines and intuitively place them on the grid means I will continue to prefer stat_cor()
and stat_regline_equation()
. This is certainly my own personal opinion and may not be shared by other users but just some things to consider since you're still actively updating ggpmisc
Using ggpubr:
library(ggpubr)
# reproducible data
set.seed(1)
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
# By default showing Pearson R
ggscatter(df, x = "x", y = "y", add = "reg.line") +
stat_cor(label.y = 300) +
stat_regline_equation(label.y = 280)
https://i.stack.imgur.com/Xb8KN.png
# Use R2 instead of R
ggscatter(df, x = "x", y = "y", add = "reg.line") +
stat_cor(label.y = 300,
aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~"))) +
stat_regline_equation(label.y = 280)
## compare R2 with accepted answer
# m <- lm(y ~ x, df)
# round(summary(m)$r.squared, 2)
# [1] 0.85
https://i.stack.imgur.com/lbtt5.png
label.y
?
label.y = max(df$y) * 0.8
really love @Ramnath solution. To allow use to customize the regression formula (instead of fixed as y and x as literal variable names), and added the p-value into the printout as well (as @Jerry T commented), here is the mod:
lm_eqn <- function(df, y, x){
formula = as.formula(sprintf('%s ~ %s', y, x))
m <- lm(formula, data=df);
# formating the values into a summary string to print out
# ~ give some space, but equal size and comma need to be quoted
eq <- substitute(italic(target) == a + b %.% italic(input)*","~~italic(r)^2~"="~r2*","~~p~"="~italic(pvalue),
list(target = y,
input = x,
a = format(as.vector(coef(m)[1]), digits = 2),
b = format(as.vector(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
# getting the pvalue is painful
pvalue = format(summary(m)$coefficients[2,'Pr(>|t|)'], digits=1)
)
)
as.character(as.expression(eq));
}
geom_point() +
ggrepel::geom_text_repel(label=rownames(mtcars)) +
geom_text(x=3,y=300,label=lm_eqn(mtcars, 'hp','wt'),color='red',parse=T) +
geom_smooth(method='lm')
https://i.stack.imgur.com/m3G4L.png
ggplot(mtcars, aes(x = wt, y = mpg, group=cyl))+
before the geom_point()? A semi-related question - if we refer to hp and wt in the aes()
for ggplot, can we then grab them to use in the call to lm_eqn
, so then we only have to code in one place? I know we could set up xvar = "hp"
before the ggplot() call, and use xvar in both locations to replace hp, but this feels like it ought to be unnecessary.
Another option would be to create a custom function generating the equation using dplyr
and broom
libraries:
get_formula <- function(model) {
broom::tidy(model)[, 1:2] %>%
mutate(sign = ifelse(sign(estimate) == 1, ' + ', ' - ')) %>% #coeff signs
mutate_if(is.numeric, ~ abs(round(., 2))) %>% #for improving formatting
mutate(a = ifelse(term == '(Intercept)', paste0('y ~ ', estimate), paste0(sign, estimate, ' * ', term))) %>%
summarise(formula = paste(a, collapse = '')) %>%
as.character
}
lm(y ~ x, data = df) -> model
get_formula(model)
#"y ~ 6.22 + 3.16 * x"
scales::percent(summary(model)$r.squared, accuracy = 0.01) -> r_squared
Now we need to add the text to the plot:
p +
geom_text(x = 20, y = 300,
label = get_formula(model),
color = 'red') +
geom_text(x = 20, y = 285,
label = r_squared,
color = 'blue')
https://i.stack.imgur.com/YwvAi.png
Inspired by the equation style provided in this answer, a more generic approach (more than one predictor + latex output as option) can be:
print_equation= function(model, latex= FALSE, ...){
dots <- list(...)
cc= model$coefficients
var_sign= as.character(sign(cc[-1]))%>%gsub("1","",.)%>%gsub("-"," - ",.)
var_sign[var_sign==""]= ' + '
f_args_abs= f_args= dots
f_args$x= cc
f_args_abs$x= abs(cc)
cc_= do.call(format, args= f_args)
cc_abs= do.call(format, args= f_args_abs)
pred_vars=
cc_abs%>%
paste(., x_vars, sep= star)%>%
paste(var_sign,.)%>%paste(., collapse= "")
if(latex){
star= " \\cdot "
y_var= strsplit(as.character(model$call$formula), "~")[[2]]%>%
paste0("\\hat{",.,"_{i}}")
x_vars= names(cc_)[-1]%>%paste0(.,"_{i}")
}else{
star= " * "
y_var= strsplit(as.character(model$call$formula), "~")[[2]]
x_vars= names(cc_)[-1]
}
equ= paste(y_var,"=",cc_[1],pred_vars)
if(latex){
equ= paste0(equ," + \\hat{\\varepsilon_{i}} \\quad where \\quad \\varepsilon \\sim \\mathcal{N}(0,",
summary(MetamodelKdifEryth)$sigma,")")%>%paste0("$",.,"$")
}
cat(equ)
}
The model
argument expects an lm
object, the latex
argument is a boolean to ask for a simple character or a latex-formated equation, and the ...
argument pass its values to the format
function.
I also added an option to output it as latex so you can use this function in a rmarkdown like this:
```{r echo=FALSE, results='asis'}
print_equation(model = lm_mod, latex = TRUE)
```
Now using it:
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
df$z <- 8 + 3 * df$x + rnorm(100, sd = 40)
lm_mod= lm(y~x+z, data = df)
print_equation(model = lm_mod, latex = FALSE)
This code yields: y = 11.3382963933174 + 2.5893419 * x + 0.1002227 * z
And if we ask for a latex equation, rounding the parameters to 3 digits:
print_equation(model = lm_mod, latex = TRUE, digits= 3)
https://i.stack.imgur.com/D38Kz.png
Similar to @zx8754 and @kdauria answers except using ggplot2
and ggpubr
. I prefer using ggpubr
because it does not require custom functions such as the top answer to this question.
library(ggplot2)
library(ggpubr)
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
ggplot(data = df, aes(x = x, y = y)) +
stat_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
geom_point() +
stat_cor(aes(label = paste(..rr.label..)), # adds R^2 value
r.accuracy = 0.01,
label.x = 0, label.y = 375, size = 4) +
stat_regline_equation(aes(label = ..eq.label..), # adds equation to linear regression
label.x = 0, label.y = 400, size = 4)
https://i.stack.imgur.com/uT8oz.jpg
Could also add p-value to the figure above
ggplot(data = df, aes(x = x, y = y)) +
stat_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
geom_point() +
stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")), # adds R^2 and p-value
r.accuracy = 0.01,
p.accuracy = 0.001,
label.x = 0, label.y = 375, size = 4) +
stat_regline_equation(aes(label = ..eq.label..), # adds equation to linear regression
label.x = 0, label.y = 400, size = 4)
https://i.stack.imgur.com/d7fvx.jpg
Also works well with facet_wrap()
when you have multiple groups
df$group <- rep(1:2,50)
ggplot(data = df, aes(x = x, y = y)) +
stat_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
geom_point() +
stat_cor(aes(label = paste(..rr.label.., ..p.label.., sep = "~`,`~")),
r.accuracy = 0.01,
p.accuracy = 0.001,
label.x = 0, label.y = 375, size = 4) +
stat_regline_equation(aes(label = ..eq.label..),
label.x = 0, label.y = 400, size = 4) +
theme_bw() +
facet_wrap(~group)
https://i.stack.imgur.com/or08N.jpg
Success story sharing
annotate
was correct on my machine.aes(
and the corresponding)
.aes
is for mapping dataframe variables to visual variables - that's not needed here, since there's just one instance, so you can put it all in the maingeom_text
call. I'll edit this in to the answer.