Keywords: Polynomial Regression | R-squared | NumPy | Curve Fitting | Coefficient of Determination
Abstract: This article provides a comprehensive guide on calculating R-squared (coefficient of determination) for polynomial regression using Python and NumPy. It explains the statistical meaning of R-squared, identifies issues in the original code for higher-degree polynomials, and presents the correct calculation method based on the ratio of regression sum of squares to total sum of squares. The article compares implementations across different libraries and provides complete code examples for building a universal polynomial regression function.
Fundamental Concepts of Polynomial Regression and R-squared
In statistics and machine learning, polynomial regression is a widely used technique for curve fitting that models nonlinear relationships between variables using polynomial functions. R-squared (coefficient of determination) is a crucial metric for evaluating the goodness of fit in regression models, ranging from 0 to 1, with values closer to 1 indicating better explanatory power of the model.
Analysis of Issues in Original Code
The original code provided by the user contains a critical flaw in its R-squared calculation:
correlation = numpy.corrcoef(x, y)[0,1]
results['determination'] = correlation**2
This approach only works for linear regression (degree=1) because numpy.corrcoef() computes the Pearson correlation coefficient, which measures only the linear relationship between two variables. For higher-degree polynomial regression, where the fitted relationship is nonlinear, squaring the correlation coefficient no longer yields the model's R-squared.
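A quick numerical check makes the problem concrete: for data that lie exactly on a parabola, a degree-2 fit should report R² = 1, but the squared Pearson correlation falls short (the data here are illustrative):

```python
import numpy as np

x = np.arange(1, 11)
y = x ** 2  # data that lie exactly on a parabola

# Squared Pearson correlation: captures only the linear component
corr_sq = np.corrcoef(x, y)[0, 1] ** 2

# Proper R-squared for a degree-2 fit: 1 - SS_res / SS_tot
coeffs = np.polyfit(x, y, 2)
yhat = np.poly1d(coeffs)(x)
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

print(corr_sq)  # noticeably below 1
print(r2)       # ~1.0, as expected for a perfect quadratic fit
```

The gap between the two numbers is exactly the nonlinear part of the relationship that the correlation coefficient cannot see.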
Correct Method for Calculating R-squared
According to statistical principles, the general formula for R-squared is:
R² = SS_reg / SS_tot = 1 - SS_res / SS_tot
Where:
- SS_tot represents the total sum of squares: Σ(y_i - ȳ)²
- SS_reg represents the regression sum of squares: Σ(ŷ_i - ȳ)²
- SS_res represents the residual sum of squares: Σ(y_i - ŷ_i)²
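These quantities satisfy SS_tot = SS_reg + SS_res for any least-squares fit that includes an intercept term, which is why the two forms of the formula agree. A small sketch verifying this decomposition (the data values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])  # illustrative, roughly quadratic

coeffs = np.polyfit(x, y, 2)   # least-squares fit; includes a constant term
yhat = np.poly1d(coeffs)(x)
ybar = y.mean()

ss_tot = np.sum((y - ybar) ** 2)
ss_reg = np.sum((yhat - ybar) ** 2)
ss_res = np.sum((y - yhat) ** 2)

# SS_tot = SS_reg + SS_res holds up to rounding, so
# SS_reg / SS_tot and 1 - SS_res / SS_tot give the same R-squared
print(ss_tot, ss_reg + ss_res)
print(ss_reg / ss_tot, 1 - ss_res / ss_tot)
```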
Improved Polynomial Regression Function Implementation
Based on these principles, we can rewrite the polyfit function:
import numpy

def polyfit(x, y, degree):
    results = {}
    # Perform polynomial fitting using numpy.polyfit
    coeffs = numpy.polyfit(x, y, degree)
    results['polynomial'] = coeffs.tolist()
    # Create polynomial function
    p = numpy.poly1d(coeffs)
    # Calculate fitted values
    yhat = p(x)
    # Calculate mean
    ybar = numpy.sum(y) / len(y)
    # Calculate regression sum of squares
    ssreg = numpy.sum((yhat - ybar) ** 2)
    # Calculate total sum of squares
    sstot = numpy.sum((y - ybar) ** 2)
    # Calculate R-squared value
    results['determination'] = ssreg / sstot
    return results
Comparison with Other Library Implementations
Beyond manual calculation with NumPy, other scientific computing libraries offer alternative approaches:
Using SciPy's linregress
For linear regression, scipy.stats.linregress can be used:
import scipy.stats
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
r_squared = r_value ** 2
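For the linear case, the linregress route and the sum-of-squares route should agree. A quick sketch comparing the two on illustrative data:

```python
import numpy as np
import scipy.stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly linear, illustrative values

# SciPy's r value is the Pearson correlation; squaring it gives R-squared
res = scipy.stats.linregress(x, y)
r_squared_scipy = res.rvalue ** 2

# The same quantity from a degree-1 fit via the sum-of-squares formula
yhat = np.poly1d(np.polyfit(x, y, 1))(x)
r_squared_manual = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r_squared_scipy, r_squared_manual)  # the two agree for linear fits
```

This agreement is special to degree 1: it is precisely the property that breaks down for higher-degree polynomials.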
Using scikit-learn's r2_score
For arbitrary regression models, sklearn.metrics.r2_score provides a general solution:
from sklearn.metrics import r2_score
# Assuming p is the polynomial function
y_pred = p(x)
coefficient_of_determination = r2_score(y, y_pred)
Practical Application Example
Let's verify the improved function with a concrete example:
import numpy as np
# Generate sample data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
# Quadratic polynomial fit
results_quadratic = polyfit(x, y, 2)
print(f"Quadratic polynomial R-squared: {results_quadratic['determination']:.6f}")
# Cubic polynomial fit
results_cubic = polyfit(x, y, 3)
print(f"Cubic polynomial R-squared: {results_cubic['determination']:.6f}")
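Because each higher-degree least-squares polynomial nests the lower-degree one, R² can only increase (or stay equal) as the degree grows. A self-contained sketch confirming this on the exponential data above (the r_squared helper is illustrative, not part of the original code):

```python
import numpy as np

x = np.arange(1, 11)
y = 2.0 ** x  # exponential data: no low-degree polynomial fits perfectly

def r_squared(x, y, degree):
    # R² = 1 - SS_res / SS_tot for a least-squares polynomial fit
    yhat = np.poly1d(np.polyfit(x, y, degree))(x)
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

r2_quad = r_squared(x, y, 2)
r2_cubic = r_squared(x, y, 3)

# The degree-3 model nests the degree-2 model, so its R² cannot be lower
print(r2_quad, r2_cubic)
```

Note that a monotonically rising R² is not evidence that a higher degree generalizes better; it only reflects the extra flexibility of the larger model.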
Performance Optimization Considerations
When dealing with large-scale data, consider the following optimization strategies:
- Use numpy.mean() instead of manual mean calculation
- Leverage NumPy's vectorized operations to avoid loops
- For very high-degree polynomials, consider more stable numerical algorithms
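For the last point, NumPy's newer numpy.polynomial.Polynomial.fit is one such option: it performs the least-squares fit in a scaled and shifted domain, which is better conditioned than the raw Vandermonde system used by numpy.polyfit. A sketch with illustrative data:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(1, 10, 50)
y = np.sin(x)  # illustrative nonlinear data

# Polynomial.fit maps x into a standard window internally,
# improving numerical conditioning for higher degrees
p = Polynomial.fit(x, y, deg=5)
yhat = p(x)

r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(r2)
```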
Conclusion
This article has detailed the correct method for calculating R-squared in polynomial regression. The key insight is understanding the statistical meaning of R-squared and computing it based on the ratio of regression sum of squares to total sum of squares, rather than simply squaring the correlation coefficient. The improved function properly handles polynomial regression of any degree and produces results consistent with tools like Excel.