Creating Grouped Bar Plots with ggplot2: Visualizing Multiple Variables by a Factor

Dec 03, 2025 · Programming

Keywords: ggplot2 | grouped bar plot | data visualization

Abstract: This article provides a comprehensive guide on using the ggplot2 package in R to create grouped bar plots for visualizing average percentages of beverage consumption across different genders (a factor variable). It covers data preprocessing steps, including mean calculation with the aggregate function and data reshaping to long format, followed by a step-by-step demonstration of ggplot2 plotting with geom_bar, position adjustments, and aesthetic mappings. By comparing two approaches (manual mean calculation vs. using stat_summary), the article offers flexible solutions for data visualization, emphasizing core concepts such as data reshaping and plot customization.

Introduction

In data visualization, grouped bar plots are a common chart type used to compare values of multiple variables across different categories. In R, the ggplot2 package offers powerful tools for creating such plots. This article uses a specific case study to demonstrate how to visualize beverage consumption percentages grouped by a gender factor. The original data include four variables (tea, coke, beer, water) and a gender factor (coded as 1 and 2). The goal is a bar plot in which the x-axis represents beverage types, bars are grouped side by side by gender, and the y-axis shows the mean of each variable (a percentage).
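To follow along without the original dataset, a small stand-in data frame can be built as sketched below. The values are invented purely for illustration, and the sample size n is an assumption; only the column names (tea, coke, beer, water, gender) and the 1/2 gender coding come from the case described above.

```r
# Hypothetical stand-in for the original data: the numbers are invented,
# only the column names and the 1/2 gender coding follow the article.
set.seed(42)   # make the random draws reproducible
n <- 20        # assumed number of respondents

df <- data.frame(
  tea    = runif(n, min = 0, max = 100),
  coke   = runif(n, min = 0, max = 100),
  beer   = runif(n, min = 0, max = 100),
  water  = runif(n, min = 0, max = 100),
  gender = rep(c(1, 2), each = n / 2)   # first half coded 1, second half coded 2
)

str(df)   # inspect column names and types
```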

Data Preprocessing

First, it is necessary to calculate the mean of each variable per gender group. The aggregate function can be used conveniently for this purpose. Assuming the data frame is named df with columns tea, coke, beer, water, and gender, run the following code:

means <- aggregate(df, by = list(df$gender), mean)

This generates a data frame in which the Group.1 column identifies the gender groups and the remaining columns contain the means. Because gender is itself one of the averaged columns, and its mean within each group is simply the group code, Group.1 duplicates it and can be removed:

means$Group.1 <- NULL  # drop the duplicate grouping column
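As a possible alternative, the formula interface of aggregate keeps the grouping column under its own name, so no Group.1 column has to be dropped afterwards. The sketch below uses a small made-up data frame with the same column layout as df. Note that the formula interface omits rows containing NA by default, which may differ from the call shown above.

```r
# Small made-up data frame with the article's column layout
df <- data.frame(
  tea    = c(10, 20, 30, 40),
  coke   = c(50, 60, 70, 80),
  beer   = c(5, 15, 25, 35),
  water  = c(90, 80, 70, 60),
  gender = c(1, 1, 2, 2)
)

# Formula interface: average the four beverage columns within each gender.
# The grouping column keeps the name "gender", so there is no Group.1 column.
means <- aggregate(cbind(tea, coke, beer, water) ~ gender, data = df, FUN = mean)

print(means)
```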

Next, to meet the plotting requirements of ggplot2, the data needs to be transformed into long format. Use the melt function from the reshape2 package:

library(reshape2)
means.long <- melt(means, id.vars = "gender")

The transformed data frame has three columns: gender (the group code), variable (the beverage type), and value (the mean). Each row now holds one gender-beverage combination, which is the layout ggplot2 expects for aesthetic mapping.
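To make the wide-to-long step concrete, the following sketch reproduces the same three-column layout in base R with stack(), which can be useful when reshape2 is not installed. The group means used here are made up, and this is an illustrative substitute for melt, not the method used in the article.

```r
# Made-up group means in wide format: one row per gender code
means <- data.frame(
  gender = c(1, 2),
  tea    = c(15, 35),
  coke   = c(55, 75),
  beer   = c(10, 30),
  water  = c(85, 65)
)

beverages <- c("tea", "coke", "beer", "water")

# stack() concatenates the selected columns into "values" and records
# the source column name in the factor "ind"
stacked <- stack(means[beverages])

# Rebuild the gender / variable / value layout described above.
# Columns are stacked one after another, so the gender codes repeat
# once per beverage.
means.long <- data.frame(
  gender   = rep(means$gender, times = length(beverages)),
  variable = stacked$ind,
  value    = stacked$values
)

print(means.long)
```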

Creating Grouped Bar Plots with ggplot2

After loading the ggplot2 package, the plot can be created based on the long-format data. The core code uses the geom_bar geometric object with position = "dodge" to achieve side-by-side bar display. Example code is as follows:

library(ggplot2)
ggplot(means.long, aes(x = variable, y = value, fill = factor(gender))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_discrete(name = "Gender",
                      breaks = c(1, 2),
                      labels = c("Male", "Female")) +
  xlab("Beverage") + ylab("Mean Percentage")

In this code, the aes function defines aesthetic mappings: the x-axis maps to variable (beverage type), the y-axis maps to value (mean), and fill color maps to gender (converted to a factor). The stat = "identity" parameter in geom_bar indicates that the y-values from the data are used directly, without statistical summarization. scale_fill_discrete customizes the legend, labeling gender codes 1 and 2 as "Male" and "Female", respectively. Axis labels are set via xlab and ylab.
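Another option, sketched here under assumptions, is to attach the labels to the data instead of the scale, by converting gender to a labelled factor before plotting. The means are made up, and the mapping of code 1 to "Male" and code 2 to "Female" is taken from the legend labels used above. geom_col() is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up group means in long format (same columns as means.long)
means.long <- data.frame(
  gender   = rep(c(1, 2), times = 4),
  variable = factor(rep(c("tea", "coke", "beer", "water"), each = 2),
                    levels = c("tea", "coke", "beer", "water")),
  value    = c(15, 35, 55, 75, 10, 30, 85, 65)
)

# Label the codes in the data itself (assumed mapping: 1 = Male, 2 = Female)
means.long$gender <- factor(means.long$gender,
                            levels = c(1, 2),
                            labels = c("Male", "Female"))

p <- ggplot(means.long, aes(x = variable, y = value, fill = gender)) +
  geom_col(position = "dodge") +   # bar heights are taken directly from value
  labs(x = "Beverage", y = "Mean Percentage", fill = "Gender")

print(p)
```

With the labels stored in the factor, the legend picks them up automatically and no breaks or labels arguments are needed in the fill scale.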

Alternative Approach: Using the stat_summary Function

Instead of manually calculating means, ggplot2's stat_summary function can perform summarization directly during plotting, simplifying the process. First, transform the original data to long format:

gg <- melt(df, id.vars = "gender")

Then, use stat_summary to create the bar plot:

ggplot(gg, aes(x = variable, y = value, fill = factor(gender))) +
  stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
  scale_fill_discrete("Gender")

This approach avoids the explicit mean calculation but follows the same principle: fun = mean specifies the summary function, and geom = "bar" defines the geometric object. (The fun argument replaces fun.y, which was deprecated in ggplot2 3.3.0.) Error bars can also be added to show data variability, for example using the minimum and maximum values:

ggplot(gg, aes(x = variable, y = value, fill = factor(gender))) +
  stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
  stat_summary(fun = mean, fun.min = min, fun.max = max, geom = "errorbar",
               color = "grey40", position = position_dodge(1), width = 0.2) +
  scale_fill_discrete("Gender")

Here, the second stat_summary call adds error bars, with fun.min and fun.max set to the min and max functions (these replace the deprecated fun.ymin and fun.ymax) and geom = "errorbar" defining the error bar geometric object. fun = mean is kept in this layer as well so that y is defined alongside ymin and ymax. Using the same position_dodge(1) in both layers keeps the error bars aligned with the bars.
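Minimum and maximum values can make the error bars very wide. As one possible variation, the sketch below shows the mean plus or minus one standard error by passing ggplot2's mean_se helper through fun.data. The data are randomly generated for illustration, and the code assumes ggplot2 3.3.0 or later (for the fun argument).

```r
library(ggplot2)

# Made-up long-format data: two beverages, two gender codes, 10 values each
set.seed(1)
gg <- data.frame(
  gender   = rep(c(1, 2), each = 10, times = 2),
  variable = factor(rep(c("tea", "coke"), each = 20), levels = c("tea", "coke")),
  value    = runif(40, min = 0, max = 100)
)

# One shared dodge object keeps bars and error bars aligned
dodge <- position_dodge(width = 0.9)

p <- ggplot(gg, aes(x = variable, y = value, fill = factor(gender))) +
  stat_summary(fun = mean, geom = "bar", position = dodge) +
  # mean_se() returns y, ymin and ymax as mean and mean +/- one standard error
  stat_summary(fun.data = mean_se, geom = "errorbar",
               color = "grey40", position = dodge, width = 0.2) +
  scale_fill_discrete("Gender")

print(p)
```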

Discussion and Conclusion

This article demonstrates two methods for creating grouped bar plots: one based on manual data preprocessing with geom_bar, and another utilizing stat_summary for dynamic summarization. Both methods effectively visualize mean values of multiple variables grouped by a factor, but each has its advantages and disadvantages. The manual method offers finer control, suitable for complex data manipulations, while the stat_summary method is more concise, ideal for quick exploratory analysis. In practice, the choice depends on specific needs and data scale. Key concepts include data reshaping (from wide to long format), ggplot2's aesthetic mapping system, and position adjustment techniques. By mastering these, users can flexibly create various grouped visualizations to support data-driven decision-making.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.