Keywords: Pandas indexing error | iloc vs loc | data shuffling | machine learning data preprocessing | KeyError solution
Abstract: This article provides an in-depth analysis of a common Pandas error: "KeyError: None of [Int64Index...] are in the columns". Through a practical data preprocessing case study, it explains why this error occurs when using np.random.shuffle() with DataFrames that have non-consecutive indices. The article systematically compares the fundamental differences between loc and iloc indexing methods, offers complete solutions, and extends the discussion to the importance of proper index handling in machine learning data preparation. Finally, reconstructed code examples demonstrate how to avoid such errors and ensure correct data shuffling operations.
Problem Context and Error Analysis
In data science and machine learning projects, data preprocessing is a critical step. Shuffling datasets to eliminate potential order bias is a common operation. However, many developers encounter a confusing error when attempting to use np.random.shuffle() with Pandas DataFrames: KeyError: "None of [Int64Index([...])] are in the [columns]". While this error appears to relate to column names, it actually stems from a misunderstanding of Pandas indexing mechanisms.
Error Reproduction and Root Cause
In the provided code example, the developer first creates a DataFrame scaled_inputs_all by selecting specific columns from the original data using the loc method. The critical issue is that when using loc for column selection, the resulting DataFrame retains the original row indices, which may not be consecutive integers starting from 0.
The developer then creates a consecutive integer array for shuffling:
shuffled_indices = np.arange(scaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)
When attempting to access the DataFrame with these shuffled indices:
shuffled_inputs = scaled_inputs_all[shuffled_indices]
With plain bracket indexing, Pandas interprets a list or array of values as column labels, not row positions, which is why the error message mentions columns: none of the integers in shuffled_indices match the DataFrame's column names. Switching to loc would not help either, because loc treats the integers as row labels, and since scaled_inputs_all may have non-consecutive indices, many of those labels don't exist in the DataFrame's index, again triggering a KeyError.
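The failure mode is easy to reproduce with a small, hypothetical three-row frame whose index is non-consecutive (as happens after filtering rows):

```python
import numpy as np
import pandas as pd

# A DataFrame whose index is non-consecutive, e.g. after dropping rows
df = pd.DataFrame({'a': [10, 20, 30]}, index=[0, 5, 9])

shuffled_indices = np.arange(df.shape[0])  # array([0, 1, 2])
np.random.shuffle(shuffled_indices)

try:
    df[shuffled_indices]  # plain [] treats the array as column labels
    raised = False
except KeyError as e:
    raised = True
    print('KeyError:', e)
```

The integers 0, 1, 2 are not column names of df, so the lookup fails with the "None of [...] are in the [columns]" message from the article.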
Fundamental Differences Between loc and iloc
The key to understanding this error lies in distinguishing between Pandas' two indexing methods:
- loc: Label-based indexing. It uses the DataFrame's actual index values (which can be integers, strings, or other types) to select data. When passed a list of integers, Pandas treats them as index labels, not positions.
- iloc: Integer position-based indexing. It strictly uses zero-based integer positions to select data, completely ignoring the DataFrame's actual index values.
In data shuffling scenarios, we care about the relative positions of data rows, not their labels. Therefore, iloc is the correct choice.
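The contrast is clearest on a small illustrative DataFrame whose integer index does not start at 0:

```python
import pandas as pd

# DataFrame with a non-consecutive integer index
df = pd.DataFrame({'val': ['a', 'b', 'c']}, index=[10, 20, 30])

print(df.loc[20, 'val'])   # 20 is an index *label*  -> 'b'
print(df.iloc[1]['val'])   # 1 is a zero-based *position* -> 'b'
print(df.iloc[0]['val'])   # first row regardless of its label -> 'a'
```

Here df.loc[0] would raise a KeyError, because 0 is not a label in the index, while df.iloc[0] always returns the first row.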
Complete Solution
The corrected code should use iloc for position-based indexing:
# Create consecutive shuffle indices
shuffled_indices = np.arange(scaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)
# Use iloc for position-based indexing
shuffled_inputs = scaled_inputs_all.iloc[shuffled_indices]
shuffled_targets = targets_all.iloc[shuffled_indices]
This approach ensures that regardless of the original DataFrame's index, we correctly shuffle the data rows.
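As an alternative to building an index array manually, Pandas' own sample method can shuffle a frame in one call; the sketch below uses a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(5)}, index=[3, 7, 11, 15, 19])

# sample(frac=1) returns all rows in random order, regardless of the index;
# random_state makes the shuffle reproducible
shuffled = df.sample(frac=1, random_state=42)

# reset_index(drop=True) gives the shuffled frame a clean 0..n-1 index
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```

This variant sidesteps the loc/iloc distinction entirely, which makes it harder to misuse; the explicit iloc version remains useful when the same permutation must be applied to several objects (e.g. features and targets).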
Extended Discussion and Best Practices
This error occurs not only in simple data shuffling but also in multiple aspects of machine learning workflows:
- Cross-validation: When using methods like KFold for cross-validation, iloc must be used to ensure proper train-test splits.
- Data sampling: For random sampling or stratified sampling, position-based indexing prevents index mismatch issues.
- Data concatenation: When merging multiple DataFrames, resetting indices (using reset_index(drop=True)) simplifies subsequent indexing operations.
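The cross-validation point can be sketched as follows, assuming scikit-learn is available; the frame and its even-numbered index are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Frame with a non-consecutive index, as often results from filtering
X = pd.DataFrame({'feature': range(10)}, index=np.arange(0, 20, 2))
y = pd.Series(np.arange(10) % 2, index=X.index)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # KFold yields positional indices, so iloc (not loc) must be used
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    fold_sizes.append((len(X_train), len(X_test)))
print(fold_sizes)
```

Using loc here would attempt to look up positions 0..9 as index labels, and the odd positions would be missing from the even-numbered index.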
A robust data preprocessing pipeline should include these steps:
# 1. Data loading and initial processing
df = pd.read_csv('data.csv')
# 2. Feature engineering and target variable definition
# ... (various data processing steps) ...
# 3. Separate features and target
X = df.drop('target', axis=1)
y = df['target']
# 4. Reset indices to ensure continuity (optional but recommended)
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
# 5. Shuffle data
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X_shuffled = X.iloc[indices]
y_shuffled = y.iloc[indices]
# 6. Subsequent processing (standardization, splitting, etc.)
# ...
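For step 6, the splitting can be handed to scikit-learn, which shuffles positionally by default; the two-column frame below is a stand-in for real features:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'f1': np.arange(10), 'f2': np.arange(10) * 2})
y = pd.Series(np.arange(10) % 2)

# train_test_split shuffles internally, so it can replace (or complement)
# the manual shuffle in the pipeline above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Because the split works on positions internally, it is immune to the label/position confusion even when X and y carry non-consecutive indices.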
Code Refactoring and Optimization
Based on the original code, we can refactor a clearer, more robust data preprocessing function:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_and_shuffle_data(filepath):
    """
    Load data, perform preprocessing, and return shuffled features and target.

    Parameters:
        filepath: Path to CSV file

    Returns:
        X_shuffled: Shuffled feature DataFrame
        y_shuffled: Shuffled target Series
    """
    # Load data
    df = pd.read_csv(filepath)

    # Data preprocessing steps (adjust based on actual requirements)
    # 1. Remove unnecessary columns
    if 'ID' in df.columns:
        df = df.drop('ID', axis=1)

    # 2. Create target variable
    median_absence = df['Absenteeism Time in Hours'].median()
    df['Excessive Absenteeism'] = np.where(
        df['Absenteeism Time in Hours'] > median_absence, 1, 0
    )

    # 3. Date feature engineering
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    df['Month Value'] = df['Date'].dt.month
    df['Day of the Week'] = df['Date'].dt.weekday
    df = df.drop('Date', axis=1)

    # 4. Feature scaling
    numeric_cols = ['Month Value', 'Day of the Week',
                    'Transportation Expense', 'Distance to Work',
                    'Age', 'Daily Work Load Average',
                    'Body Mass Index', 'Children', 'Pets',
                    'Absenteeism Time in Hours']
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    # 5. Categorical variable encoding
    # Reason encoding: group the dummy columns into four broad categories.
    # Note that loc is correct here: the slices select by column *label*.
    reason_dummies = pd.get_dummies(df['Reason for Absence'], drop_first=True)
    reason_groups = {
        'Reason_1': reason_dummies.loc[:, 1:14].max(axis=1),
        'Reason_2': reason_dummies.loc[:, 15:17].max(axis=1),
        'Reason_3': reason_dummies.loc[:, 18:21].max(axis=1),
        'Reason_4': reason_dummies.loc[:, 22:].max(axis=1)
    }
    for name, series in reason_groups.items():
        df[name] = series
    df = df.drop('Reason for Absence', axis=1)

    # Education encoding: collapse higher-education levels into one class
    df['Education'] = df['Education'].map({1: 0, 2: 1, 3: 1, 4: 1})

    # Separate features and target
    feature_cols = [col for col in df.columns if col != 'Excessive Absenteeism']
    X = df[feature_cols]
    y = df['Excessive Absenteeism']

    # Reset indices so that labels and positions coincide
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    # Shuffle data using position-based indexing
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_shuffled = X.iloc[indices]
    y_shuffled = y.iloc[indices]

    return X_shuffled, y_shuffled
Conclusion
The distinction between loc and iloc indexing mechanisms in Pandas is fundamental knowledge that data science practitioners must master. When handling data shuffling, cross-validation, or any position-based operations, iloc should be prioritized to avoid index mismatch errors. By understanding the nature of DataFrame indices and adopting best practices such as resetting indices and explicitly using position-based indexing, more robust and maintainable data processing pipelines can be built. Remember: loc for labels, iloc for positions—this simple principle prevents many common Pandas errors.