Plotting Decision Boundaries for 2D Gaussian Data Using Matplotlib: From Theoretical Derivation to Python Implementation

Dec 07, 2025 · Programming

Keywords: Decision Boundary | Matplotlib | Gaussian Distribution | Python | Data Visualization

Abstract: This article provides a comprehensive guide to plotting decision boundaries for two-class Gaussian distributed data in 2D space. Starting with a mathematical derivation of the boundary equation, we implement data generation and visualization using Python's NumPy and Matplotlib libraries. The article compares direct analytical solutions, contour plotting methods, and SVM-based approaches from scikit-learn, with complete code examples and implementation details.

Introduction

In pattern recognition and machine learning, decision boundaries are fundamental concepts for separating different classes of data. When data points follow Gaussian distributions, optimal decision boundaries can be derived through probabilistic models. This article uses two-class Gaussian data in 2D space as an example to demonstrate the complete process from mathematical derivation to code implementation for plotting decision boundaries.

Data Generation and Visualization

First, we generate two classes of 2D data points following multivariate normal distributions using NumPy. Class 1 has mean (0,0) and covariance matrix [[2,0],[0,2]]; Class 2 has mean (1,2) and covariance matrix [[1,0],[0,1]]. Each class contains 100 sample points.

import numpy as np
from matplotlib import pyplot as plt

# Generate Class 1 data
mu_vec1 = np.array([0,0])
cov_mat1 = np.array([[2,0],[0,2]])
x1_samples = np.random.multivariate_normal(mu_vec1, cov_mat1, 100)
mu_vec1 = mu_vec1.reshape(1,2).T  # Convert to column vector

# Generate Class 2 data
mu_vec2 = np.array([1,2])
cov_mat2 = np.array([[1,0],[0,1]])
x2_samples = np.random.multivariate_normal(mu_vec2, cov_mat2, 100)
mu_vec2 = mu_vec2.reshape(1,2).T  # Convert to column vector

# Plot scatter points
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(x1_samples[:,0], x1_samples[:,1], marker='o', color='green', 
           s=40, alpha=0.5, label='Class1')
ax.scatter(x2_samples[:,0], x2_samples[:,1], marker='^', color='blue', 
           s=40, alpha=0.5, label='Class2')
plt.legend(loc='upper right')
plt.title('2D Distribution of Two Classes')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

Mathematical Derivation of Decision Boundary

For data following Gaussian distributions, the optimal decision boundary can be obtained by comparing posterior probabilities. Assuming equal prior probabilities for both classes, the decision rule simplifies to comparing class-conditional probability density functions. For multivariate normal distributions, the class-conditional PDF is:

p(x|wi) = (2π)^(-d/2) |Σi|^(-1/2) exp[-½ (x-μi)^T Σi^(-1) (x-μi)]

where d=2 is the dimensionality. The decision boundary satisfies p(x|w1) = p(x|w2). Substituting specific parameters and simplifying with logarithms yields the boundary equation:

g(x) = (x-μ1)^T Σ1^(-1) (x-μ1) - (x-μ2)^T Σ2^(-1) (x-μ2) + ln(|Σ1|/|Σ2|) = 0
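As an illustration (a sketch, not code from the original article), the boundary equation can be evaluated directly for arbitrary means and covariance matrices:

```python
import numpy as np

def g(x, mu1, cov1, mu2, cov2):
    """Decision function for two Gaussian classes with equal priors:
    g(x) < 0 on class 1's side, g(x) > 0 on class 2's side,
    and g(x) = 0 exactly on the boundary."""
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.inv(cov1) @ d1  # (x-mu1)^T Sigma1^-1 (x-mu1)
    q2 = d2 @ np.linalg.inv(cov2) @ d2  # (x-mu2)^T Sigma2^-1 (x-mu2)
    return q1 - q2 + np.log(np.linalg.det(cov1) / np.linalg.det(cov2))

# The article's parameters
mu1, cov1 = np.array([0.0, 0.0]), 2 * np.eye(2)
mu2, cov2 = np.array([1.0, 2.0]), np.eye(2)
```

With these parameters, g evaluated at each class mean has the expected sign, which is a useful check before plotting.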

For the diagonal covariance matrices used here (Σ1 = 2I, Σ2 = I), the equation reduces to a quadratic in x1 and x2, specifically a circle centered at (2, 4) with radius √(10 + ln 16). Solving for x2 in terms of x1 yields two branches; the lower branch is the one that passes between the two clusters and serves as the plottable boundary function.

Python Implementation and Visualization

Based on the above derivation, we can plot the boundary. Note that the expression under the square root is non-negative only for x1 between 2 - √(10 + ln 16) and 2 + √(10 + ln 16), so we restrict the plotting range accordingly to avoid NaN values and runtime warnings:

def decision_boundary(x1):
    """Lower branch of the boundary: x2 = 4 - sqrt(-x1^2 + 4*x1 + 6 + ln 16)"""
    return 4 - np.sqrt(-x1**2 + 4*x1 + 6 + np.log(16))

# Generate boundary points over the valid x1 interval
r = np.sqrt(10 + np.log(16))
x1_range = np.arange(2 - r, 2 + r, 0.05)
x2_boundary = decision_boundary(x1_range)

# Plot the lower branch of the boundary
plt.plot(x1_range, x2_boundary, 'r--', linewidth=3, label='Decision Boundary')
plt.legend()
plt.show()
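As a quick sanity check (a sketch using scipy.stats, which the article does not otherwise use), any point on the plotted curve should give equal class-conditional densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# The two class-conditional distributions from the article
rv1 = multivariate_normal(mean=[0, 0], cov=[[2, 0], [0, 2]])
rv2 = multivariate_normal(mean=[1, 2], cov=[[1, 0], [0, 1]])

x1 = 2.0
x2 = 4 - np.sqrt(-x1**2 + 4*x1 + 6 + np.log(16))  # point on the lower branch
p1, p2 = rv1.pdf([x1, x2]), rv2.pdf([x1, x2])
# p1 and p2 agree up to floating-point error, confirming the derivation
```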

Comparison of Alternative Implementation Methods

Besides the direct analytical approach, several common methods exist for plotting decision boundaries:

Contour Plotting Method

Using Matplotlib's contour function to directly plot the g(x)=0 contour:

def decision_function(x_vec, mu_vec1, mu_vec2):
    """g(x) for Sigma1 = 2I and Sigma2 = I, matching the derivation above"""
    g1 = 0.5 * (x_vec - mu_vec1).T.dot(x_vec - mu_vec1)  # (x-mu1)^T Sigma1^-1 (x-mu1)
    g2 = (x_vec - mu_vec2).T.dot(x_vec - mu_vec2)        # (x-mu2)^T Sigma2^-1 (x-mu2)
    return g1 - g2 + np.log(4)                           # + ln(|Sigma1|/|Sigma2|) = ln 4

# Create mesh grid
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)

# Compute decision function values at every grid point
Z = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        vec = np.array([X[i,j], Y[i,j]]).reshape(2,1)
        Z[i,j] = decision_function(vec, mu_vec1, mu_vec2).item()  # 1x1 array -> scalar

# Plot the g(x) = 0 contour
plt.contour(X, Y, Z, levels=[0], colors='red', linewidths=2)
plt.show()
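The double loop over grid points can also be replaced by a vectorized evaluation of g(x) on the whole mesh at once (a sketch that hard-codes the article's parameters, including the ln(|Σ1|/|Σ2|) = ln 4 term):

```python
import numpy as np

def decision_values(X, Y):
    """Vectorized g(x) for mu1 = (0,0), Sigma1 = 2I, mu2 = (1,2), Sigma2 = I."""
    q1 = 0.5 * (X**2 + Y**2)        # (x-mu1)^T Sigma1^-1 (x-mu1)
    q2 = (X - 1)**2 + (Y - 2)**2    # (x-mu2)^T Sigma2^-1 (x-mu2)
    return q1 - q2 + np.log(4)      # + ln(|Sigma1|/|Sigma2|)

x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = decision_values(X, Y)  # same shape as X, no Python-level loop
```

The resulting Z can be passed to plt.contour exactly as before; on a 100x100 grid the vectorized form avoids 10,000 Python-level function calls.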

SVM-Based Approach Using scikit-learn

For more complex cases, support vector machines can learn a decision boundary directly from the samples. Note that a linear kernel can only produce a linear approximation of the quadratic Bayes boundary derived above:

from sklearn import svm

# Prepare training data
X_train = np.concatenate((x1_samples, x2_samples), axis=0)
y_train = np.array([0]*100 + [1]*100)  # Class labels

# Train linear SVM
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

# Get decision boundary parameters
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# Plot boundary
plt.plot(xx, yy, 'k-', linewidth=2)
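Because the true Bayes boundary here is quadratic, the linear SVM is only an approximation. A seeded end-to-end check of its training accuracy (an illustrative sketch; the exact number depends on the random seed) can be run as follows:

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)  # fixed seed for reproducibility
x1_samples = rng.multivariate_normal([0, 0], [[2, 0], [0, 2]], 100)
x2_samples = rng.multivariate_normal([1, 2], [[1, 0], [0, 1]], 100)

X_train = np.concatenate((x1_samples, x2_samples), axis=0)
y_train = np.array([0] * 100 + [1] * 100)

clf = svm.SVC(kernel='linear', C=1.0).fit(X_train, y_train)
acc = clf.score(X_train, y_train)  # training accuracy of the linear approximation
```

Since the two Gaussians overlap, accuracy well below 100% is expected even for the optimal boundary; the point of the plot is to see where the learned line deviates from the analytical curve.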

Applications and Extensions

Plotting decision boundaries is valuable not only in theoretical analysis but also in practical applications:

  1. Model Evaluation: Visualizing decision boundaries helps intuitively understand classifier performance and behavior.
  2. Feature Analysis: By observing boundary shapes, one can determine which features contribute more to classification.
  3. Anomaly Detection: Points far from decision boundaries may indicate anomalies or noisy data.
  4. Multi-class Extension: For multi-class problems, multiple one-vs-one decision boundaries can be plotted.

For non-linearly separable data, the kernel trick implicitly maps the data into a higher-dimensional space where a linear decision boundary can be found; that boundary corresponds to a non-linear boundary in the original input space.
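As a minimal sketch of this idea (using scikit-learn's RBF kernel on a toy XOR dataset, not data from this article), an RBF SVM separates points that no straight line in 2D can:

```python
import numpy as np
from sklearn import svm

# XOR-style data: not linearly separable in 2D
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# The RBF kernel implicitly maps the points into a higher-dimensional
# space where a separating hyperplane exists; the boundary it induces
# back in 2D is non-linear.
clf = svm.SVC(kernel='rbf', gamma=2.0, C=10.0)
clf.fit(X, y)
preds = clf.predict(X)
```

The same contour-plotting technique shown earlier, applied to clf.decision_function on a mesh grid, would visualize the resulting curved boundary.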

Conclusion

This article systematically introduces methods for plotting decision boundaries for Gaussian distributed data in 2D space. From mathematical derivation to Python implementation, we demonstrate a complete solution workflow. The direct analytical approach suits simple cases with known parameters, while contour plotting and SVM methods offer greater generality. In practical applications, appropriate methods should be selected based on data characteristics and requirements. Effective visualization not only verifies the correctness of theoretical derivations but also provides intuitive insights for model understanding and optimization.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.