Linear Regression Analysis and Visualization with NumPy and Matplotlib

Nov 20, 2025 · Programming

Keywords: Linear Regression | NumPy | Matplotlib | Data Visualization | Python Programming

Abstract: This article provides a comprehensive guide to performing linear regression analysis on list data using Python's NumPy and Matplotlib libraries. By examining the core mechanisms of the np.polyfit function, it demonstrates how to convert ordinary list data into formats suitable for polynomial fitting and utilizes np.poly1d to create reusable regression functions. The paper also explores visualization techniques for regression lines, including scatter plot creation, regression line styling, and axis range configuration, offering complete implementation solutions for data science and machine learning practices.

Data Preparation and NumPy Array Conversion

In Python data analysis, handling data in list format is a common requirement. The NumPy library provides powerful array manipulation capabilities that efficiently handle numerical computation tasks. When using np.polyfit for linear regression, the function accepts plain Python lists directly and converts them to arrays internally, so no explicit conversion (for example, via np.asarray) is required beforehand.

Consider the following sample data: two lists x and y containing integer elements. In NumPy, these lists can be directly passed to the polyfit function:

import numpy as np
import matplotlib.pyplot as plt

# Define sample data
x = [1, 2, 3, 4]
y = [3, 5, 7, 10]

# Perform linear regression fitting (degree-1 polynomial)
coef = np.polyfit(x, y, 1)
m, b = coef
print(f"Regression coefficients: slope m = {m}, intercept b = {b}")

The output displays the numerical values of slope m and intercept b. It's noteworthy that even when the input data contains slight noise (here the last y value is 10 rather than the 9 that would fall exactly on the line y = 2x + 1), polyfit still finds the optimal fitting line via the least squares method.
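As a cross-check, the closed-form least-squares formulas for a straight line can be computed by hand and compared against polyfit's output. This is an illustrative sketch; the variable names (x_arr, slope, intercept) are not from the original listing:

```python
import numpy as np

x = [1, 2, 3, 4]
y = [3, 5, 7, 10]

# Closed-form least-squares estimates for a straight line:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x_arr = np.asarray(x, dtype=float)
y_arr = np.asarray(y, dtype=float)
x_dev = x_arr - x_arr.mean()
y_dev = y_arr - y_arr.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y_arr.mean() - slope * x_arr.mean()

# polyfit should agree with the closed-form solution
m, b = np.polyfit(x, y, 1)
print(slope, intercept)  # matches m and b
```

For this data set the formulas give a slope of 2.3 and an intercept of 0.5, identical (up to floating-point precision) to what polyfit returns.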

Application of np.poly1d Function

np.poly1d is a practical tool in NumPy for creating one-dimensional polynomial functions. It converts the coefficients returned by polyfit into callable functions, significantly simplifying the computation and visualization of regression lines.

# Create polynomial function
poly1d_fn = np.poly1d(coef)

# Verify function functionality
print(f"Predicted value at x=2.5: {poly1d_fn(2.5)}")

# Generate prediction sequence
x_pred = np.linspace(0, 5, 50)
y_pred = poly1d_fn(x_pred)

This approach not only makes the code more concise but also supports predictions for arbitrary x values, providing convenience for subsequent analysis.
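A poly1d object also exposes its coefficients and evaluates scalars and arrays alike, and np.polyval produces the same values directly from the raw coefficient array. A short sketch (data repeated here so the snippet is self-contained):

```python
import numpy as np

coef = np.polyfit([1, 2, 3, 4], [3, 5, 7, 10], 1)
poly1d_fn = np.poly1d(coef)

# Coefficients are stored highest power first: [slope, intercept]
print(poly1d_fn.coefficients)

# Evaluation accepts scalars or arrays alike
print(poly1d_fn(2.5))               # scalar prediction
print(poly1d_fn(np.array([0, 5])))  # vectorized prediction

# np.polyval gives the same result from the raw coefficients
print(np.polyval(coef, 2.5))
```

Whether to carry around the poly1d object or the bare coefficient array is largely a matter of taste; the object form reads more naturally when the same fit is evaluated in several places.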

Data Visualization Implementation

The Matplotlib library offers rich data visualization capabilities. Combined with linear regression results, it enables the creation of intuitive scatter plot and regression line combination charts.

# Create figure and axes
fig, ax = plt.subplots(figsize=(8, 6))

# Draw scatter plot
ax.scatter(x, y, color='yellow', marker='o', s=80, 
           edgecolors='black', alpha=0.7, label='Original Data')

# Draw regression line
ax.plot(x_pred, y_pred, '--k', linewidth=2, 
        label='Regression Line')

# Set axis ranges
ax.set_xlim(0, 5)
ax.set_ylim(0, 12)

# Add legend and labels
ax.legend()
ax.set_xlabel('X Variable')
ax.set_ylabel('Y Variable')
ax.set_title('Linear Regression Analysis Results')

plt.tight_layout()
plt.show()

In this visualization, yellow circles mark the original data points, while black dashed lines represent the regression line. By adjusting parameters such as edgecolors, alpha, and linewidth, the readability and aesthetics of the chart can be optimized.
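One optional refinement, not part of the original listing, is to embed the fitted equation directly in the legend label so the chart is self-describing. A minimal sketch (the Agg backend and the file name regression.png are illustrative choices for scripted, headless use):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

x = [1, 2, 3, 4]
y = [3, 5, 7, 10]
m, b = np.polyfit(x, y, 1)

x_pred = np.linspace(0, 5, 50)
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, color='yellow', edgecolors='black', label='Original Data')
# Put the fitted equation directly in the legend label
ax.plot(x_pred, m * x_pred + b, '--k', label=f'y = {m:.2f}x + {b:.2f}')
ax.legend()
fig.savefig('regression.png')
```

With this data the legend reads "y = 2.30x + 0.50", so a reader can recover the model without consulting the code.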

Advanced Applications and Error Analysis

In practical applications, evaluating the quality of the fitted model is crucial. The coefficient of determination R² can be calculated to quantify the goodness of fit:

# Collect actual and predicted values
y_actual = np.array(y)
y_predicted = poly1d_fn(x)

# Calculate R² value
ss_res = np.sum((y_actual - y_predicted) ** 2)
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
r_squared = 1 - (ss_res / ss_tot)

print(f"Coefficient of determination R²: {r_squared:.4f}")
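For a degree-1 fit in particular, R² equals the square of the Pearson correlation coefficient between x and y, which gives an independent check on the computation. A self-contained sketch using np.corrcoef:

```python
import numpy as np

x = [1, 2, 3, 4]
y = [3, 5, 7, 10]

coef = np.polyfit(x, y, 1)
y_predicted = np.polyval(coef, x)
y_actual = np.array(y, dtype=float)

# R² from residual and total sums of squares
ss_res = np.sum((y_actual - y_predicted) ** 2)
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For simple linear regression, R² equals the squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)  # the two values agree
```

For this data set both routes yield R² ≈ 0.9888, confirming that the straight line explains nearly all of the variance.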

For more complex data distributions, consider using random number generators to create simulated datasets:

# Create reproducible random data
rng = np.random.default_rng(42)
x_random = rng.uniform(0, 10, size=50)
y_random = 2 * x_random + 1 + rng.normal(scale=1.5, size=50)

# Perform regression analysis on new data
coef_random = np.polyfit(x_random, y_random, 1)
poly_fn_random = np.poly1d(coef_random)

This method is particularly suitable for verifying the robustness of regression algorithms and testing performance under different data distributions.
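Because the simulated model above has a known true slope of 2 and intercept of 1, the recovered coefficients can be checked against the ground truth, a sanity test that real-world data never permits. A brief sketch continuing the example:

```python
import numpy as np

# Simulate y = 2x + 1 plus Gaussian noise, with a fixed seed
rng = np.random.default_rng(42)
x_random = rng.uniform(0, 10, size=50)
y_random = 2 * x_random + 1 + rng.normal(scale=1.5, size=50)

# The fitted coefficients should land close to the true values 2 and 1
slope_hat, intercept_hat = np.polyfit(x_random, y_random, 1)
print(slope_hat, intercept_hat)
```

With 50 points and noise of standard deviation 1.5, the estimates typically fall within a few tenths of the true parameters.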

Best Practices and Considerations

When using linear regression, several points deserve attention. First, confirm that the data plausibly follows a linear relationship before fitting. Second, outliers can significantly distort regression results, so clean the data appropriately. Finally, for polynomial fitting, higher-degree models are prone to overfitting, so choose the polynomial degree carefully.
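The sensitivity to outliers mentioned above is easy to demonstrate: fitting the same clean line with and without one corrupted point shifts the slope noticeably. The data in this sketch is illustrative:

```python
import numpy as np

# Clean data lying exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y_clean = 2 * x + 1

# Corrupt a single point: replace the last y value (19) with 40
y_outlier = y_clean.copy()
y_outlier[-1] = 40.0

m_clean, b_clean = np.polyfit(x, y_clean, 1)
m_out, b_out = np.polyfit(x, y_outlier, 1)

# One bad point pulls the slope well away from the true value of 2
print(m_clean, m_out)
```

Here a single corrupted observation out of ten moves the estimated slope from exactly 2 to roughly 3.1, which is why outlier inspection belongs before, not after, the fit.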

By combining NumPy's numerical computation capabilities with Matplotlib's visualization functions, Python provides a powerful and flexible toolset for linear regression analysis, applicable across various scenarios from simple educational examples to complex industrial applications.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.