Keywords: Python | Excel | xlrd | Data Processing | Custom Classes
Abstract: This article explores methods for dynamically extracting data from specific columns in Excel files based on configurable column name formats using Python. By analyzing the xlrd library and custom class implementations, it presents a structured solution that avoids inefficient traditional looping and indexing. The article also integrates best practices in data transformation to demonstrate flexible and maintainable data processing workflows.
Introduction
In data processing tasks, it is often necessary to extract data from specific columns in Excel files, where the column selection may be dynamically configured. Traditional methods involving row-by-row looping and index lookup are feasible but result in verbose and hard-to-maintain code. This article introduces an elegant solution based on the Python xlrd library and custom classes, enabling efficient data extraction by column names with support for flexible format configurations.
Problem Analysis
Consider an Excel file with columns: Arm_id, DSPName, DSPCode, HubCode, PinCode, and PPTL. The user needs to extract data from columns specified by a configurable format list, such as ['Arm_id', 'DSPName', 'Pincode']. The conventional approach reads the entire file into a two-dimensional list and then searches for column name indices to retrieve data, but this method is complex and error-prone.
Solution: Using Custom Classes
By defining a custom class, each row of data can be encapsulated as an object, allowing direct access to column values via attribute names. Below is a complete implementation example:
from xlrd import open_workbook
class Arm(object):
def __init__(self, id, dsp_name, dsp_code, hub_code, pin_code, pptl):
self.id = id
self.dsp_name = dsp_name
self.dsp_code = dsp_code
self.hub_code = hub_code
self.pin_code = pin_code
self.pptl = pptl
def __str__(self):
return("Arm object:\n"
" Arm_id = {0}\n"
" DSPName = {1}\n"
" DSPCode = {2}\n"
" HubCode = {3}\n"
" PinCode = {4} \n"
" PPTL = {5}"
.format(self.id, self.dsp_name, self.dsp_code,
self.hub_code, self.pin_code, self.pptl))
wb = open_workbook('sample.xls')
for sheet in wb.sheets():
number_of_rows = sheet.nrows
number_of_columns = sheet.ncols
items = []
rows = []
for row in range(1, number_of_rows):
values = []
for col in range(number_of_columns):
value = (sheet.cell(row,col).value)
try:
value = str(int(value))
except ValueError:
pass
finally:
values.append(value)
item = Arm(*values)
items.append(item)
for item in items:
print(item)
print("Accessing one single value (eg. DSPName): {0}".format(item.dsp_name))
print()Code Explanation
The above code defines an Arm class whose constructor accepts all column values as parameters. When reading the Excel file, starting from the second row (skipping the header), each row is instantiated as an Arm object. Specific column values can be accessed directly via dot notation (e.g., item.dsp_name), without concern for column order or indices.
Flexible Data Extraction
To support configurable column formats, the code can be extended. For example, given a format list FORMAT = ['Arm_id', 'DSPName', 'Pincode'], the required data can be dynamically generated as follows:
FORMAT = ['Arm_id', 'DSPName', 'Pincode']
selected_data = []
for item in items:
row_data = {}
for col_name in FORMAT:
row_data[col_name] = getattr(item, col_name.lower())
selected_data.append(row_data)
print(selected_data)This approach utilizes Python's getattr function to dynamically retrieve object attributes based on column names, ensuring code flexibility and maintainability.
Best Practices in Data Transformation
In data processing, type conversion is often required. Drawing from PySpark practices, a type mapping dictionary can be defined to dynamically convert column data types. For instance:
data_type_map = {
'Arm_id': int,
'PinCode': str
}
for item in items:
for col_name, data_type in data_type_map.items():
setattr(item, col_name.lower(), data_type(getattr(item, col_name.lower())))
This pattern ensures data type correctness while keeping the code concise.
Conclusion
By employing custom classes and dynamic attribute access, data from specific columns in Excel files can be extracted efficiently and flexibly. This method enhances code readability and maintainability, supporting complex configuration needs. Integrated with best practices in data transformation, it enables the construction of robust data processing pipelines suitable for various practical applications.