Keywords: R | glm | contrasts error | debugging | factor variables
Abstract: This article provides a detailed guide to debugging the 'contrasts can be applied only to factors with 2 or more levels' error in R. By analyzing common causes, it introduces helper functions and step-by-step procedures to systematically identify and resolve issues with insufficient factor levels. The content covers data preprocessing, model frame retrieval, and practical case studies, with rewritten code examples to illustrate key concepts.
Error Overview
In R, a common error when fitting models with lm or glm is "contrasts can be applied only to factors with 2 or more levels." This typically occurs when factor variables have only one or zero levels in a data subset, often due to missing values, subset selection, or unused levels. For systematic debugging, helper functions such as debug_contr_error, debug_contr_error2, and NA_preproc can be employed.
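As a minimal sketch of how the error arises, the toy data below (hypothetical, not from any real dataset) has a two-level factor g, but a subset condition leaves only one of its levels in the rows that lm actually uses. The error is caught with tryCatch so that the message can be inspected.

```r
# Toy data (hypothetical): 'g' has two levels in the full data set.
dat <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.5),
  x = c(0.5, 1.5, 2.5, 3.5, 4.5),
  g = factor(c("a", "a", "a", "b", "b"))
)

# Fitting on the subset g == "a" leaves 'g' with a single used level,
# so building its contrasts fails.
msg <- tryCatch(
  {
    lm(y ~ x + g, data = dat, subset = g == "a")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Counting levels on the same subset shows the cause directly.
n_lev_g <- nlevels(droplevels(subset(dat, g == "a"))$g)
print(n_lev_g)
```

The same model fits without complaint on the full data, which is why the level count has to be checked on the rows that survive subsetting and NA removal rather than on the raw data frame.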
Debugging Steps
The debug_contr_error function mimics the internal data processing of lm and glm, involving four main steps:
- Explicit Subsetting: If a subset argument is used, first subset the data explicitly.
- Remove Incomplete Cases: Use na.omit to eliminate rows with missing values.
- Mode Checking and Conversion: Check variable modes and convert logical and character variables to factors.
- Drop Unused Factor Levels: Apply droplevels to factor variables.
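The steps listed above can be traced by hand with base R calls before wrapping them in a function. The sketch below uses a hypothetical toy data frame and an arbitrary subset (the first five rows). It illustrates the sequence only and is not the helper function itself.

```r
# Toy data (hypothetical): a numeric predictor with one NA, a factor,
# a character column and a logical column.
dat <- data.frame(
  y  = c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3),
  x  = c(1, 2, NA, 4, 5, 6),
  f  = factor(c("a", "b", "c", "a", "b", "a")),
  ch = c("u", "u", "v", "u", "u", "u"),
  lg = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Explicit subsetting: keep the first five rows.
after_subset <- dat[1:5, ]

# Remove incomplete cases: row 3 is dropped because x is NA.
after_omit <- na.omit(after_subset)

# Mode conversion: turn logical and character columns into factors.
to_convert <- vapply(
  after_omit,
  function(v) is.logical(v) || is.character(v),
  logical(1)
)
after_convert <- after_omit
after_convert[to_convert] <- lapply(after_convert[to_convert], as.factor)

# Drop unused levels from every factor column.
after_drop <- droplevels(after_convert)

# Count the levels that survive all steps.
n_lev <- vapply(Filter(is.factor, after_drop), nlevels, integer(1))
print(n_lev)
```

In this toy case the character column ch is left with a single level once row 3 is removed, which is exactly the condition that triggers the contrasts error.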
The function outputs the number and list of factor levels. Core code is as follows:
debug_contr_error <- function (dat, subset_vec = NULL) {
  ## Step 0: Explicit subsetting
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) stop("'logical' subset_vec length mismatch")
      subset_log_vec <- subset_vec
    } else if (mode(subset_vec) == "numeric") {
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) stop("'numeric' subset_vec out of bound")
      subset_log_vec <- logical(nrow(dat))
      subset_log_vec[as.integer(subset_vec)] <- TRUE
    } else stop("subset_vec must be 'logical' or 'numeric'")
    dat <- base::subset(dat, subset = subset_log_vec)
  }
  ## Step 1: Remove incomplete cases (also after subsetting, as lm and glm do)
  dat <- stats::na.omit(dat)
  if (nrow(dat) == 0L) warning("no complete cases")
  ## Step 2: Mode checking and conversion
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  ## inherits() copes with columns that carry more than one class
  is_asis <- vapply(dat, inherits, logical(1), what = "AsIs")
  if (any(var_mode[is_asis] %in% c("logical", "character"))) stop("matrix variables with 'AsIs' class must be 'numeric'")
  ind1 <- which(var_mode %in% c("logical", "character"))
  dat[ind1] <- lapply(dat[ind1], as.factor)
  ## Step 3: Drop unused factor levels
  fctr <- which(vapply(dat, is.factor, logical(1)))
  if (length(fctr) == 0L) warning("no factor variables to summarize")
  ## columns converted in Step 2 have no unused levels, so skip them by column index
  ind2 <- setdiff(fctr, ind1)
  dat[ind2] <- lapply(dat[ind2], base::droplevels.factor)
  ## Step 4: Summarize factor variables
  lev <- lapply(dat[fctr], base::levels.default)
  nl <- lengths(lev)
  list(nlevels = nl, levels = lev)
}
More Flexible Implementation
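The debug_contr_error2 function first retrieves the model frame by calling lm with method = "model.frame", then applies debug_contr_error to it. Because the model frame holds the variables as the formula evaluates them, this approach handles variable transformations in formulas, such as log transforms, including any missing values those transformations introduce. Key code: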
debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
  ## Explicit subsetting, as in debug_contr_error
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) stop("'logical' subset_vec length mismatch")
      subset_log_vec <- subset_vec
    } else if (mode(subset_vec) == "numeric") {
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) stop("'numeric' subset_vec out of bound")
      subset_log_vec <- logical(nrow(dat))
      subset_log_vec[as.integer(subset_vec)] <- TRUE
    } else stop("subset_vec must be 'logical' or 'numeric'")
    dat <- base::subset(dat, subset = subset_log_vec)
  }
  ## Retrieve the model frame that lm would build, without fitting the model
  dat_internal <- stats::lm(form, data = dat, method = "model.frame")
  attr(dat_internal, "terms") <- NULL
  ## Return the model frame together with the factor level summary
  c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
}
Case Studies
For example, when a data frame contains character variables, str does not report them as factors with levels, so it may not reveal the problem, whereas the debug functions report single-level factors directly. Consider this data:
dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)
lm(y ~ x + f1 + f2, dat)  ## throws the contrasts error
## Use debug_contr_error for debugging
debug_contr_error(dat)$nlevels
## Output: f1: 2 levels, f2: 1 level
This shows that f2 ends up with only one level in the complete cases, causing the error.
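A second kind of case is one where the raw data contain no missing values at all, yet a transformation in the formula creates one. The sketch below uses hypothetical toy data and retrieves the model frame in the same way as debug_contr_error2, via lm with method = "model.frame". It does not call the helper itself, so it runs on its own.

```r
# Toy data (hypothetical): no NA anywhere, but x is negative in the only
# row where f equals "b".
dat <- data.frame(
  y = c(1.0, 2.5, 3.1, 4.7, 5.2),
  x = c(1, 2, 4, 8, -1),
  f = factor(c("a", "a", "a", "a", "b"))
)

# A check on the raw data sees nothing wrong: two levels, no NA.
print(anyNA(dat))
print(nlevels(droplevels(na.omit(dat))$f))

# log(-1) is NaN, so the model frame drops row 5 and with it level "b".
# suppressWarnings() hides the "NaNs produced" warning from log().
mf <- suppressWarnings(
  lm(y ~ log(x) + f, data = dat, method = "model.frame")
)
print(nrow(mf))
print(nlevels(mf$f))
```

Inspecting the model frame rather than the original data frame is what exposes the single-level factor here.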
Data Preprocessing
To retain missing values, the NA_preproc function adds NA as a factor level, preserving more cases. Core code:
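NA_preproc <- function (dat) {
  ## seq_len() also behaves correctly for a data frame with zero columns
  for (j in seq_len(ncol(dat))) {
    x <- dat[[j]]
    ## factor with missing values: add NA as an explicit level
    if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
    ## character: convert to factor, keeping NA (if present) as a level
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
  }
  dat
}
Solutions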
Based on debugging results, solutions include:
- On full datasets, if factor levels are insufficient, drop the variable or use NA_preproc.
- For group-wise fitting, ensure each subset has adequate factor levels or adjust formulas dynamically.
- During prediction, ensure new data has the same factor levels as training data.
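For the group-wise case, one possible way to adjust the formula dynamically is sketched below. The helper fit_one_group, its arguments, and the toy data are hypothetical and serve only as illustration. Within each group the helper drops any factor term that has fewer than two levels left and fits the model with the remaining terms.

```r
# Toy data (hypothetical): 'f' has two levels in group g1 but only one in g2.
dat <- data.frame(
  y     = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0),
  x     = c(1, 2, 3, 4, 5, 6, 7, 8),
  f     = factor(c("a", "b", "a", "b", "a", "a", "a", "a")),
  group = rep(c("g1", "g2"), each = 4)
)

# Hypothetical helper: fit response ~ rhs_vars within one group, keeping
# only the factor terms that still have at least two levels.
fit_one_group <- function(d, response = "y", rhs_vars = c("x", "f")) {
  # Mirror lm's preprocessing: complete cases, then drop unused levels.
  d <- droplevels(na.omit(d[c(response, rhs_vars)]))
  usable <- vapply(
    rhs_vars,
    function(v) !is.factor(d[[v]]) || nlevels(d[[v]]) >= 2L,
    logical(1)
  )
  # Fall back to an intercept-only model if no term is usable.
  rhs <- if (any(usable)) rhs_vars[usable] else "1"
  lm(reformulate(rhs, response = response), data = d)
}

fits <- lapply(split(dat, dat$group), fit_one_group)
print(lapply(fits, formula))
```

Silently dropping a term changes the model being fitted, so in practice it may be preferable to record which terms were removed for each group.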
By applying these methods, the "contrasts" error can be effectively avoided or resolved, facilitating smooth model fitting.