Keywords: Polynomial Regression | R-squared | NumPy | Curve Fitting | Coefficient of Determination
Abstract: This article provides a comprehensive guide on calculating R-squared (coefficient of determination) for polynomial regression using Python and NumPy. It explains the statistical meaning of R-squared, identifies issues in the original code for higher-degree polynomials, and presents the correct calculation method based on the ratio of regression sum of squares to total sum of squares. The article compares implementations across different libraries and provides complete code examples for building a universal polynomial regression function.
Fundamental Concepts of Polynomial Regression and R-squared
In statistics and machine learning, polynomial regression is a widely used technique for curve fitting that models nonlinear relationships between variables using polynomial functions. R-squared (coefficient of determination) is a crucial metric for evaluating the goodness of fit in regression models, ranging from 0 to 1, with values closer to 1 indicating better explanatory power of the model.
Analysis of Issues in Original Code
The original code provided by the user contains a critical flaw in its R-squared calculation:
correlation = numpy.corrcoef(x, y)[0,1]
results['determination'] = correlation**2
This approach only works for linear regression (degree=1) because numpy.corrcoef() computes the Pearson correlation coefficient, which measures only the linear relationship between two variables. For higher-degree polynomial regression, where the fitted relationship is nonlinear, squaring the correlation coefficient no longer yields the model's R-squared.
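A quick numerical check makes the problem concrete: for data that lie exactly on a parabola, a degree-2 fit should report R² = 1, but the squared Pearson correlation falls short (the data here are illustrative):

```python
import numpy as np

x = np.arange(1, 11)
y = x ** 2  # data that lie exactly on a parabola

# Squared Pearson correlation: captures only the linear component
corr_sq = np.corrcoef(x, y)[0, 1] ** 2

# Proper R-squared for a degree-2 fit: 1 - SS_res / SS_tot
coeffs = np.polyfit(x, y, 2)
yhat = np.poly1d(coeffs)(x)
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

print(corr_sq)  # noticeably below 1
print(r2)       # ~1.0, as expected for a perfect quadratic fit
```

The gap between the two numbers is exactly the nonlinear part of the relationship that the correlation coefficient cannot see.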
Correct Method for Calculating R-squared
According to statistical principles, the general formula for R-squared is:
R² = SS_reg / SS_tot = 1 - SS_res / SS_tot
Where:
- SS_tot represents the total sum of squares: Σ(y_i - ȳ)²
- SS_reg represents the regression sum of squares: Σ(ŷ_i - ȳ)²
- SS_res represents the residual sum of squares: Σ(y_i - ŷ_i)²
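These quantities satisfy SS_tot = SS_reg + SS_res for any least-squares fit that includes an intercept term, which is why the two forms of the formula agree. A small sketch verifying this decomposition (the data values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])  # illustrative, roughly quadratic

coeffs = np.polyfit(x, y, 2)   # least-squares fit; includes a constant term
yhat = np.poly1d(coeffs)(x)
ybar = y.mean()

ss_tot = np.sum((y - ybar) ** 2)
ss_reg = np.sum((yhat - ybar) ** 2)
ss_res = np.sum((y - yhat) ** 2)

# SS_tot = SS_reg + SS_res holds up to rounding, so
# SS_reg / SS_tot and 1 - SS_res / SS_tot give the same R-squared
print(ss_tot, ss_reg + ss_res)
print(ss_reg / ss_tot, 1 - ss_res / ss_tot)
```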
Improved Polynomial Regression Function Implementation
Based on these principles, we can rewrite the polyfit function:
import numpy

def polyfit(x, y, degree):
    results = {}
    # Perform polynomial fitting using numpy.polyfit
    coeffs = numpy.polyfit(x, y, degree)
    results['polynomial'] = coeffs.tolist()
    # Create polynomial function
    p = numpy.poly1d(coeffs)
    # Calculate fitted values
    yhat = p(x)
    # Calculate mean
    ybar = numpy.sum(y) / len(y)
    # Calculate regression sum of squares
    ssreg = numpy.sum((yhat - ybar) ** 2)
    # Calculate total sum of squares
    sstot = numpy.sum((y - ybar) ** 2)
    # Calculate R-squared value
    results['determination'] = ssreg / sstot
    return results
Comparison with Other Library Implementations
Beyond manual calculation with NumPy, other scientific computing libraries offer alternative approaches:
Using SciPy's linregress
For linear regression, scipy.stats.linregress can be used:
import scipy.stats
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
r_squared = r_value ** 2
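For the linear case, the linregress route and the sum-of-squares route should agree. A quick sketch comparing the two on illustrative data:

```python
import numpy as np
import scipy.stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly linear, illustrative values

# SciPy's r value is the Pearson correlation; squaring it gives R-squared
res = scipy.stats.linregress(x, y)
r_squared_scipy = res.rvalue ** 2

# The same quantity from a degree-1 fit via the sum-of-squares formula
yhat = np.poly1d(np.polyfit(x, y, 1))(x)
r_squared_manual = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r_squared_scipy, r_squared_manual)  # the two agree for linear fits
```

This agreement is special to degree 1: it is precisely the property that breaks down for higher-degree polynomials.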
Using scikit-learn's r2_score
For arbitrary regression models, sklearn.metrics.r2_score provides a general solution:
from sklearn.metrics import r2_score
# Assuming p is the polynomial function
y_pred = p(x)
coefficient_of_determination = r2_score(y, y_pred)
Practical Application Example
Let's verify the improved function with a concrete example:
import numpy as np
# Generate sample data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
# Quadratic polynomial fit
results_quadratic = polyfit(x, y, 2)
print(f"Quadratic polynomial R-squared: {results_quadratic['determination']:.6f}")
# Cubic polynomial fit
results_cubic = polyfit(x, y, 3)
print(f"Cubic polynomial R-squared: {results_cubic['determination']:.6f}")
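Because each higher-degree least-squares polynomial nests the lower-degree one, R² can only increase (or stay equal) as the degree grows. A self-contained sketch confirming this on the exponential data above (the r_squared helper is illustrative, not part of the original code):

```python
import numpy as np

x = np.arange(1, 11)
y = 2.0 ** x  # exponential data: no low-degree polynomial fits perfectly

def r_squared(x, y, degree):
    # R² = 1 - SS_res / SS_tot for a least-squares polynomial fit
    yhat = np.poly1d(np.polyfit(x, y, degree))(x)
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

r2_quad = r_squared(x, y, 2)
r2_cubic = r_squared(x, y, 3)

# The degree-3 model nests the degree-2 model, so its R² cannot be lower
print(r2_quad, r2_cubic)
```

Note that a monotonically rising R² is not evidence that a higher degree generalizes better; it only reflects the extra flexibility of the larger model.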
Performance Optimization Considerations
When dealing with large-scale data, consider the following optimization strategies:
- Use numpy.mean() instead of manual mean calculation
- Leverage NumPy's vectorized operations to avoid loops
- For very high-degree polynomials, consider more stable numerical algorithms
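For the last point, NumPy's newer numpy.polynomial.Polynomial.fit is one such option: it performs the least-squares fit in a scaled and shifted domain, which is better conditioned than the raw Vandermonde system used by numpy.polyfit. A sketch with illustrative data:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(1, 10, 50)
y = np.sin(x)  # illustrative nonlinear data

# Polynomial.fit maps x into a standard window internally,
# improving numerical conditioning for higher degrees
p = Polynomial.fit(x, y, deg=5)
yhat = p(x)

r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(r2)
```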
Conclusion
This article has detailed the correct method for calculating R-squared in polynomial regression. The key insight is understanding the statistical meaning of R-squared and computing it based on the ratio of regression sum of squares to total sum of squares, rather than simply squaring the correlation coefficient. The improved function properly handles polynomial regression of any degree and produces results consistent with tools like Excel.