Keywords: ggplot2 | Linear Regression | R² Value | Data Visualization | Statistical Graphics
Abstract: This article explores methods for adding the regression equation and the coefficient of determination (R²) to linear regression plots made with R's ggplot2 package. It covers an approach built on base R functions and one using the ggpmisc extension package, with complete code examples that progress from simple text annotations to richer statistical labels, and discusses formula parsing, position adjustment, and grouped data handling.
Introduction
Linear regression is among the most frequently used statistical methods in data analysis, and displaying the fitted equation and a goodness-of-fit metric directly on a plot lets readers see the quantitative relationship between variables at a glance. ggplot2, R's most popular plotting system, offers flexible tools for constructing graphics and can draw a fitted line with geom_smooth(), but it has no built-in layer that prints the regression equation or R² on the plot.
Basic Implementation Approach
The most direct solution is a custom annotation function built from base R functions: fit a linear model with lm(), extract the coefficients and R², then use substitute() and as.expression() to generate a mathematical expression that ggplot2 can parse.
library(ggplot2)
# Create sample data (fixed seed, chosen arbitrarily, so the example is reproducible)
set.seed(123)
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)
# Define regression equation generation function
lm_eqn <- function(df) {
  m <- lm(y ~ x, df)
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(unname(coef(m)[1]), digits = 2),
                        b = format(unname(coef(m)[2]), digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq))
}
# Construct plot and add annotation
# annotate() draws the label once; geom_text() would redraw it for every row of df
p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
  geom_point() +
  annotate("text", x = 25, y = 300, label = lm_eqn(df), parse = TRUE)
p
Function Implementation Details
The core of lm_eqn is the construction of the mathematical expression. substitute() inserts the formatted values into an unevaluated plotmath expression, replacing the placeholders a, b, and r2, while italic() sets variable names in italics, following mathematical typesetting conventions. as.character(as.expression(eq)) then turns the expression into a string, and parse = TRUE tells ggplot2 to parse that string back into an expression, so the label is rendered as a formula rather than as literal text.
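The round trip between expression and string can be sketched with base R alone. This is a minimal illustration that assumes fixed placeholder values ("2.0" and "3.0") instead of coefficients from a fitted model; the names eq, label, and parsed are chosen here for illustration only.

```r
# Values are passed as strings, as format() would return them,
# so trailing zeros survive in the rendered label.
eq <- substitute(italic(y) == a + b %.% italic(x),
                 list(a = "2.0", b = "3.0"))

# The result is an unevaluated call, not a string
class(eq)

# Deparse the expression into a single string usable as a plot label
label <- as.character(as.expression(eq))
label

# With parse = TRUE, ggplot2 performs roughly this step before drawing
parsed <- parse(text = label)
```

Because the values are embedded as quoted strings, plotmath draws them exactly as formatted instead of reformatting them as numbers.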
Coefficients are formatted with format(), which controls the number of significant digits and returns character strings; unname() first strips the coefficient names so that only the bare values are embedded in the label. R² is taken from the model summary (summary(m)$r.squared) and gives the proportion of variance in y that the model explains.
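The individual extraction steps can be inspected one at a time. The sketch below uses the same kind of simulated data as the example above, with an arbitrary seed assumed for reproducibility; the variable names (slope, slope_bare, slope_txt, r2_check) are illustrative.

```r
set.seed(42)  # arbitrary seed, assumed here for reproducibility
d <- data.frame(x = 1:100)
d$y <- 2 + 3 * d$x + rnorm(100, sd = 40)
m <- lm(y ~ x, data = d)

slope <- coef(m)[2]
names(slope)                 # "x": the coefficient carries its term name
slope_bare <- unname(slope)  # plain numeric, no name attribute

# format() returns a character string with the requested significant digits
slope_txt <- format(slope_bare, digits = 2)

# R² from the model summary; for a one-predictor model it equals
# the squared Pearson correlation between x and y
r2 <- summary(m)$r.squared
r2_check <- cor(d$x, d$y)^2
all.equal(r2, r2_check)
```

The final comparison is a quick sanity check that the value taken from summary() is the familiar R² of simple linear regression.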
Using the ggpmisc Extension Package
For more complex annotation requirements, the ggpmisc package offers a ready-made solution. Its stat_poly_eq() function is designed specifically for adding polynomial regression statistics to plots and supports multiple label combinations and formatting options.
library(ggplot2)
library(ggpmisc)
# Using default settings
ggplot(data = df, aes(x = x, y = y)) +
  stat_poly_line() +
  stat_poly_eq() +
  geom_point()
# Custom label combinations
ggplot(data = df, aes(x = x, y = y)) +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2"))) +
  geom_point()
# Including additional statistical metrics
ggplot(data = df, aes(x = x, y = y)) +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "adj.R2", "f", "p", "n"))) +
  geom_point()
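The labels requested in the last plot correspond to standard quantities of a fitted lm() model. As a rough cross-check, and without claiming this is how ggpmisc computes them internally, the same statistics can be extracted in base R. The data are simulated with an arbitrary seed, and the variable names are illustrative.

```r
set.seed(42)  # arbitrary seed, assumed here for reproducibility
d <- data.frame(x = 1:100)
d$y <- 2 + 3 * d$x + rnorm(100, sd = 40)

m <- lm(y ~ x, data = d)
s <- summary(m)

adj_r2 <- s$adj.r.squared   # adjusted R²
f <- s$fstatistic           # named vector: value, numdf, dendf
# Upper-tail probability of the F distribution gives the overall p-value
p_val <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
n <- nobs(m)                # number of observations used in the fit

c(adj_r2 = adj_r2, F = unname(f["value"]), p = p_val, n = n)
```

Comparing these numbers with the labels drawn on the plot is a simple way to confirm that the annotation describes the model you expect.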
Advanced Features and Application Scenarios
The ggpmisc package supports various regression types and complex scenarios:
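Polynomial Regression: Supports higher-order polynomial fits through the formula argument, which should be passed identically to stat_poly_line() and stat_poly_eq() so that the line and the label describe the same model.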
# Quadratic polynomial fitting
df$yy <- 2 + 3 * df$x + 0.1 * df$x^2 + rnorm(100, sd = 40)
ggplot(data = df, aes(x = x, y = yy)) +
  stat_poly_line(formula = y ~ poly(x, 2, raw = TRUE)) +
  stat_poly_eq(formula = y ~ poly(x, 2, raw = TRUE), use_label("eq")) +
  geom_point()
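The raw = TRUE argument determines what the fitted coefficients mean, which is presumably why the example above uses it for a printed equation. The sketch below fits noise-free data (an assumption made so that the true coefficients are known) with raw and with orthogonal polynomials in plain lm(); the names raw_fit and orth_fit are illustrative.

```r
x <- 1:100
y <- 2 + 3 * x + 0.1 * x^2   # noise-free, so the true coefficients are known

raw_fit  <- lm(y ~ poly(x, 2, raw = TRUE))  # terms are x and x^2
orth_fit <- lm(y ~ poly(x, 2))              # terms are orthogonal polynomials

# Raw coefficients map directly onto a + b*x + c*x^2 (close to 2, 3, 0.1)
unname(coef(raw_fit))

# Orthogonal coefficients describe the same curve on a different basis,
# so they cannot be read off as a, b and c
unname(coef(orth_fit))

# Both parameterisations give the same fitted values
all.equal(unname(fitted(raw_fit)), unname(fitted(orth_fit)))
```

Both fits draw the same curve, but only the raw coefficients can be substituted directly into an equation written in powers of x.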
Grouped Data Analysis: Supports separate fitting and annotation by grouping variables.
# Create grouped data
dfg <- data.frame(x = 1:100)
# c(0, 1) is recycled along the rows, so every second row (group B below) is shifted up by 20
dfg$y <- 20 * c(0, 1) + 3 * dfg$x + rnorm(100, sd = 40)
dfg$group <- factor(rep(c("A", "B"), 50))
# Grouped fitting and annotation
ggplot(data = dfg, aes(x = x, y = y, colour = group)) +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2"))) +
  geom_point()
Faceted Plots: Automatically adds corresponding statistical information for each panel in faceted layouts.
ggplot(data = dfg, aes(x = x, y = y)) +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2"))) +
  geom_point() +
  facet_wrap(~group)
Technical Considerations and Best Practices
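Label Position Optimization: The label.x and label.y parameters control where the annotation is placed so that it does not overlap the data points. With the default geom, numeric values between 0 and 1 are interpreted as fractions of the plotting area (npc units) rather than as data coordinates.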
# Adjusting label positions
ggplot(data = df, aes(x = x, y = y)) +
  stat_poly_line() +
  stat_poly_eq(use_label("eq"), label.y = 0.9) +
  stat_poly_eq(use_label("R2"), label.y = 0.8) +
  geom_point()
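The base-R annotation from earlier uses data coordinates (x = 25, y = 300), which must be adjusted by hand whenever the data change. A comparable relative placement can be approximated with a small helper. rel_pos() below is a hypothetical function written for this sketch, not part of ggplot2 or ggpmisc, and it works on the observed data range, so it ignores the padding that ggplot2 adds around the panel.

```r
# Hypothetical helper: position at fraction `frac` of the observed range of `v`
rel_pos <- function(v, frac) {
  r <- range(v, na.rm = TRUE)
  r[1] + frac * (r[2] - r[1])
}

x <- 1:100
rel_pos(x, 0.05)   # 5% of the way in from the smallest x value
rel_pos(x, 0.95)   # 95% of the way towards the largest x value

# Possible use with the base approach (assumes df and lm_eqn() from earlier):
# annotate("text", x = rel_pos(df$x, 0.05), y = rel_pos(df$y, 0.95),
#          hjust = 0, label = lm_eqn(df), parse = TRUE)
```

Setting hjust = 0 anchors the label's left edge at the computed x position, which keeps long equations from running off the left side of the panel.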
Mathematical Symbol Customization: Supports custom variable symbols and formula formatting to accommodate different disciplinary representation conventions.
# Custom variable symbols
ggplot(data = df, aes(x = x, y = y)) +
  stat_poly_line() +
  stat_poly_eq(eq.with.lhs = "italic(h)~`=`~",
               eq.x.rhs = "~italic(z)",
               use_label("eq")) +
  labs(x = expression(italic(z)), y = expression(italic(h))) +
  geom_point()
Performance Considerations and Compatibility
The basic method has the fewest dependencies, requiring only ggplot2 and base R. The ggpmisc approach provides richer functionality but requires installing an additional package (use_label() needs ggpmisc 0.5.0 or later), so it is best suited to projects with more complex annotation needs. For large datasets, consider computing the statistics once outside the plot to speed up rendering.
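One way to precompute the statistics is to fit each group once, store the labels in a small data frame, and pass that data frame to a single text layer. The sketch below is one possible arrangement, not a prescribed method: fit_one() is a hypothetical helper, the seed is arbitrary, and the plotting step is guarded so that the label computation runs even where ggplot2 is not installed.

```r
set.seed(1)  # arbitrary seed, assumed here for reproducibility
dfg <- data.frame(x = 1:100, group = factor(rep(c("A", "B"), 50)))
dfg$y <- 20 * (dfg$group == "B") + 3 * dfg$x + rnorm(100, sd = 40)

# Hypothetical helper: fit one group and return a one-row data frame holding
# a plotmath label and the data coordinates at which to draw it
fit_one <- function(d) {
  m <- lm(y ~ x, data = d)
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(unname(coef(m)[1]), digits = 2),
                        b = format(unname(coef(m)[2]), digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 3)))
  data.frame(group = d$group[1],
             x = min(d$x), y = max(d$y),   # top-left corner of the group's data
             label = as.character(as.expression(eq)),
             stringsAsFactors = FALSE)
}

# One model fit and one label row per group
labels <- do.call(rbind, lapply(split(dfg, dfg$group), fit_one))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dfg, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
    # The text layer gets its own two-row data frame, so each label is drawn once
    geom_text(data = labels, aes(label = label),
              hjust = 0, vjust = 1, parse = TRUE) +
    facet_wrap(~group)
}
```

Calling print(p) renders the plot. Because the labels live in an ordinary data frame, they can also be cached, inspected, or reused across several plots without refitting the models.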
Conclusion
There are several ways to add regression equations and R² values in ggplot2, ranging from simple manual annotations to automated solutions. The choice depends on the requirements: the basic approach suits rapid prototyping and simple plots, while ggpmisc provides automated annotation for polynomial fits, grouped data, and faceted layouts. Either technique lets a plot carry the fitted model's key statistics alongside the data.