Calculating Performance Metrics from Confusion Matrix in Scikit-learn: From TP/TN/FP/FN to Sensitivity/Specificity

Nov 23, 2025 · Programming

Keywords: Confusion Matrix | True Positive | Sensitivity | Scikit-learn | Cross Validation

Abstract: This article provides a comprehensive guide on extracting True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) metrics from confusion matrices in Scikit-learn. Through practical code examples, it demonstrates how to compute these fundamental metrics during K-fold cross-validation and derive essential evaluation parameters like sensitivity and specificity. The discussion covers both binary and multi-class classification scenarios, offering practical guidance for machine learning model assessment.

Introduction

In machine learning classification tasks, while accuracy provides an intuitive evaluation metric, it often fails to comprehensively reflect model performance. Particularly in imbalanced datasets, relying solely on accuracy can lead to misleading conclusions. Therefore, deep understanding of confusion matrices and their derived metrics becomes crucial.
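The pitfall is easy to demonstrate with a minimal sketch: on a hypothetical dataset with 95 negatives and 5 positives, a degenerate model that always predicts the majority class scores 95% accuracy while detecting no positives at all.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority (negative) class
y_pred = [0] * 100

# Accuracy looks excellent...
print(accuracy_score(y_true, y_pred))  # 0.95
# ...but sensitivity (recall for the positive class) is zero
print(recall_score(y_true, y_pred))    # 0.0
```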

Fundamentals of Confusion Matrix

The confusion matrix serves as the core tool for classification model performance evaluation, presenting the correspondence between model predictions and actual labels in tabular form. For binary classification problems, the confusion matrix contains four fundamental elements:

  1. True Positive (TP): samples that are actually positive and predicted as positive
  2. True Negative (TN): samples that are actually negative and predicted as negative
  3. False Positive (FP): samples that are actually negative but predicted as positive
  4. False Negative (FN): samples that are actually positive but predicted as negative

Direct Calculation Method Based on Prediction Results

Within Scikit-learn's cross-validation workflow, we can directly compute these metrics by comparing prediction results with actual labels. Here's a practical function implementation:

def calculate_performance_metrics(y_actual, y_predicted):
    """Count TP, FP, TN, FN for binary labels (positive = 1, negative = 0)."""
    TP = FP = TN = FN = 0
    
    for actual, predicted in zip(y_actual, y_predicted):
        if actual == 1 and predicted == 1:
            TP += 1
        elif actual == 0 and predicted == 1:
            FP += 1
        elif actual == 0 and predicted == 0:
            TN += 1
        elif actual == 1 and predicted == 0:
            FN += 1
    
    return TP, FP, TN, FN

This function iterates over the predictions, incrementing the appropriate counter based on the combination of actual and predicted labels. Note that it assumes a binary classification scenario with the positive class labeled 1 and the negative class labeled 0.
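Scikit-learn can produce the same four counts directly: for a binary problem with labels 0 and 1, `confusion_matrix` returns the layout `[[TN, FP], [FN, TP]]`, so `ravel()` unpacks all four values in one line.

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1, the matrix layout is [[TN, FP], [FN, TP]]
TN, FP, FN, TP = confusion_matrix(y_actual, y_predicted).ravel()
print(TP, FP, TN, FN)  # 3 1 3 1
```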

Integrated Application in K-Fold Cross-Validation

The function above can be integrated into a K-fold cross-validation workflow for robust model evaluation. The example below assumes `trainList` (a list of raw text documents) and `labelList` (the corresponding 0/1 labels) have already been loaded:

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import scale
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Data preprocessing (trainList and labelList are assumed to be defined)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(trainList)
X = scale(X.toarray())  # densify and standardize features

# Configure K-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Store performance metrics for each fold
total_metrics = []

for train_indices, test_indices in kf.split(X):
    # Split training and test sets
    X_train = X[train_indices]
    X_test = X[test_indices]
    y_train = [labelList[i] for i in train_indices]
    y_test = [labelList[i] for i in test_indices]
    
    # Model training
    qda = QuadraticDiscriminantAnalysis()
    trained_model = qda.fit(X_train, y_train)
    
    # Prediction
    predictions = qda.predict(X_test)
    
    # Calculate basic metrics
    accuracy = accuracy_score(y_test, predictions)
    confusion_mat = confusion_matrix(y_test, predictions)
    
    # Calculate TP, FP, TN, FN
    TP, FP, TN, FN = calculate_performance_metrics(y_test, predictions)
    
    # Calculate derived metrics
    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    
    total_metrics.append({
        'accuracy': accuracy,
        'TP': TP, 'FP': FP, 'TN': TN, 'FN': FN,
        'sensitivity': sensitivity,
        'specificity': specificity
    })

Interpretation and Application of Performance Metrics

Based on the computed TP, FP, TN, FN values, we can further derive several important performance metrics:

Sensitivity and Specificity

Sensitivity (Recall) measures the model's ability to identify positive class samples, calculated as TP/(TP+FN). In scenarios like medical diagnosis, high sensitivity means lower missed detection rates for diseases.

Specificity measures the model's ability to identify negative class samples, calculated as TN/(TN+FP). High specificity indicates lower false positive rates, particularly important for screening tests.

Other Related Metrics

Precision, calculated as TP/(TP+FP), measures the proportion of predicted positives that are actually positive. The F1 score, the harmonic mean of precision and sensitivity, balances the two and is often more informative than accuracy on imbalanced datasets.
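Derived metrics such as precision and F1 follow directly from the same four counts. A minimal sketch, using hypothetical values for a single validation fold:

```python
# Hypothetical counts from a single validation fold
TP, FP, TN, FN = 40, 10, 45, 5

precision = TP / (TP + FP) if (TP + FP) > 0 else 0
sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
# F1 is the harmonic mean of precision and sensitivity
f1 = (2 * precision * sensitivity / (precision + sensitivity)
      if (precision + sensitivity) > 0 else 0)

print(precision)     # 0.8
print(sensitivity)   # ~0.889
print(round(f1, 3))  # 0.842
```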

Extension to Multi-Class Scenarios

For multi-class problems, we can approach them as multiple binary classification problems. For each class, treat it as the positive class and all other classes as negative, then compute metrics separately. This approach is known as the "one-vs-rest" strategy.

def multiclass_metrics(y_actual, y_predicted, classes):
    metrics_per_class = {}
    
    for class_label in classes:
        # Treat current class as positive, others as negative
        y_actual_binary = [1 if label == class_label else 0 for label in y_actual]
        y_predicted_binary = [1 if prediction == class_label else 0 for prediction in y_predicted]
        
        TP, FP, TN, FN = calculate_performance_metrics(y_actual_binary, y_predicted_binary)
        
        metrics_per_class[class_label] = {
            'TP': TP, 'FP': FP, 'TN': TN, 'FN': FN,
            'sensitivity': TP/(TP+FN) if (TP+FN) > 0 else 0,
            'specificity': TN/(TN+FP) if (TN+FP) > 0 else 0
        }
    
    return metrics_per_class
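Scikit-learn also ships this one-vs-rest decomposition as a built-in: `multilabel_confusion_matrix` returns an array of shape (n_classes, 2, 2), one `[[TN, FP], [FN, TP]]` matrix per class (ordered by sorted label), whose counts should match the helper above.

```python
from sklearn.metrics import multilabel_confusion_matrix

y_actual    = ['cat', 'dog', 'bird', 'cat', 'dog', 'bird']
y_predicted = ['cat', 'dog', 'cat',  'cat', 'bird', 'bird']

# One 2x2 matrix per class, in sorted label order; each is [[TN, FP], [FN, TP]]
per_class = multilabel_confusion_matrix(y_actual, y_predicted)
for label, matrix in zip(sorted(set(y_actual)), per_class):
    TN, FP, FN, TP = matrix.ravel()
    print(label, TP, FP, TN, FN)
```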

Practical Application Recommendations

In actual projects, it's recommended to encapsulate performance metric calculations as reusable modules and consider the following best practices:

  1. Compute mean and standard deviation of metrics in cross-validation for more robust evaluation
  2. For imbalanced datasets, prioritize sensitivity and specificity over accuracy
  3. Select appropriate evaluation metrics based on business requirements, as different scenarios may emphasize different metrics
  4. Use visualization tools (such as confusion matrix heatmaps) to intuitively display model performance
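Recommendation 1 can be sketched as follows, aggregating per-fold dictionaries shaped like the `total_metrics` list built earlier (the fold values here are hypothetical):

```python
from statistics import mean, stdev

# Hypothetical per-fold results in the same shape as total_metrics above
total_metrics = [
    {'accuracy': 0.90, 'sensitivity': 0.85, 'specificity': 0.92},
    {'accuracy': 0.88, 'sensitivity': 0.80, 'specificity': 0.95},
    {'accuracy': 0.92, 'sensitivity': 0.90, 'specificity': 0.91},
]

# Report mean and standard deviation of each metric across folds
for metric in ('accuracy', 'sensitivity', 'specificity'):
    values = [fold[metric] for fold in total_metrics]
    print(f"{metric}: {mean(values):.3f} +/- {stdev(values):.3f}")
```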

Conclusion

By systematically calculating and analyzing fundamental metrics like TP, TN, FP, FN, we gain deep insights into classification model performance characteristics. Combined with derived metrics like sensitivity and specificity, this provides strong support for model optimization and business decision-making. Within the Scikit-learn framework, these computations can be efficiently integrated into standard machine learning workflows, ensuring accurate and reproducible evaluations.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.