Efficiently Retrieving Sheet Names from Excel Files: Performance Optimization Strategies Without Full File Loading

Dec 04, 2025 · Programming

Keywords: Excel | sheet names | performance optimization | xlrd | on_demand

Abstract: When handling large Excel files, traditional methods like pandas or xlrd that load the entire file to obtain sheet names can cause significant performance bottlenecks. This article delves into the technical principles of on-demand loading using xlrd's on_demand parameter, which reads only file metadata instead of all content, thereby greatly improving efficiency. It also analyzes alternative solutions, including openpyxl's read-only mode, the pyxlsb library, and low-level methods for parsing xlsx compressed files, demonstrating optimization effects in different scenarios through comparative experimental data. The core lies in understanding Excel file structures and selecting appropriate library parameters to avoid unnecessary memory consumption and time overhead.

In data processing and analysis, Excel files are a ubiquitous data source. When files are large, however, such as a dataset with 70 columns × 65,000 rows, loading the entire file with libraries like pandas or xlrd just to obtain the sheet names can take up to 14 seconds, which is unacceptable in practice. The root cause is that these libraries default to reading all worksheet content, including cell data, rather than just the metadata.

xlrd's on_demand Parameter: The Core Mechanism of On-Demand Loading

The xlrd library provides a key parameter on_demand=True, which alters the workbook loading behavior. When this parameter is set, xlrd only parses the header information of the Excel file, including sheet names, without immediately reading each worksheet's data into memory. This is similar to a lazy loading pattern, where content is loaded only when a specific worksheet is actually accessed. The following code example demonstrates how to apply this optimization:

import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
print(xls.sheet_names())

Note that sheet_names is a method in xlrd, not a property, so it must be called. This approach reduces loading time from seconds to milliseconds by avoiding the overhead of reading 65,000 rows of data. Experiments show that for the aforementioned large file, using the on_demand parameter can decrease operation time from 14 seconds to less than 1 second, achieving a performance improvement of over 10x.
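To reproduce such measurements on your own files, a small timing helper is sufficient. The helper below is a generic sketch (it is not part of the original benchmark) that wraps any loader call with time.perf_counter:

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `names, seconds = time_call(lambda: xlrd.open_workbook(path, on_demand=True).sheet_names())` times the on-demand variant, assuming `path` points at an existing .xls file; repeating the call without `on_demand=True` gives the baseline for comparison.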

Optimization Solutions with Other Libraries

Beyond xlrd, other libraries offer similar mechanisms. The openpyxl library supports the read_only=True parameter, which loads only the workbook structure without populating cell values. For xlsx files, this can yield about a 14x performance improvement, as benchmarked: from 3.25 seconds to 225 milliseconds. Code implementation is as follows:

from openpyxl import load_workbook
def get_sheetnames_xlsx(filepath):
    wb = load_workbook(filepath, read_only=True, keep_links=False)
    try:
        return wb.sheetnames
    finally:
        wb.close()  # read-only mode keeps the file handle open until explicitly closed

For xlsb format (Excel Binary files), the pyxlsb library is an efficient choice. It optimizes loading automatically via the open_workbook context manager, with measured performance gains up to 56x, from 5.36 seconds to 96.4 milliseconds. Example code:

from pyxlsb import open_workbook
def get_sheetnames_xlsb(filepath):
    with open_workbook(filepath) as wb:
        return wb.sheets

Low-Level Parsing: Direct Handling of xlsx File Structure

For xlsx files, a lower-level optimization exploits their ZIP-based nature. An xlsx file is essentially a compressed package of XML documents, in which xl/workbook.xml stores the worksheet metadata. By reading and parsing just that one XML member, one can bypass the overhead of the high-level libraries entirely. The following function illustrates the process:

import zipfile
import xmltodict

def get_sheet_details(file_path):
    # Read xl/workbook.xml straight out of the archive; no extraction to disk needed
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.read('xl/workbook.xml')
    dictionary = xmltodict.parse(xml)
    sheets = dictionary['workbook']['sheets']['sheet']
    if not isinstance(sheets, list):
        sheets = [sheets]  # a workbook with a single sheet parses to a dict, not a list
    # xmltodict exposes XML attributes under keys prefixed with '@'
    return [{'id': sheet['@sheetId'], 'name': sheet['@name']} for sheet in sheets]

This method takes only 0.4 seconds on a 6MB xlsx file, compared to 12 seconds for pandas or xlrd and 24 seconds for openpyxl. However, it increases code complexity and is only applicable to xlsx format.
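If pulling in the xmltodict dependency is undesirable, the same idea can be sketched with the standard library alone. The function name below is illustrative, and it assumes a standard OOXML file that keeps its sheet metadata in xl/workbook.xml under the usual SpreadsheetML namespace:

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by xl/workbook.xml in standard OOXML files
NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

def get_sheet_names_stdlib(file_path):
    """Read sheet names from an .xlsx file using only zipfile and ElementTree."""
    with zipfile.ZipFile(file_path) as zf:
        root = ET.fromstring(zf.read('xl/workbook.xml'))
    return [sheet.get('name') for sheet in root.iter(f'{NS}sheet')]
```

Because zipfile.ZipFile accepts either a path or a file-like object, this also works on in-memory buffers, which makes the approach easy to unit-test.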

Performance Comparison and Selection Recommendations

Considering all solutions, xlrd's on_demand parameter is the most straightforward choice for legacy .xls files. Note, however, that since version 2.0 (released in December 2020) xlrd supports only the .xls format, so it cannot be used with newer file types. For xlsx files, openpyxl's read-only mode is the recommended alternative, while xlsb files should use pyxlsb. The low-level parsing method, though fast, relies on internal details of the file format and is best reserved for scenarios with extreme performance requirements.

In practical applications, it is advisable to choose based on file format and library maintenance status. For example, in mixed-format environments, one can first detect the file extension and then dispatch to the corresponding optimized function. This ensures maximum performance benefits while maintaining code clarity.
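The dispatch idea above can be sketched as a small table-driven helper. The `readers` mapping and the function names in the usage note are illustrative assumptions, to be filled in with the format-specific functions shown earlier:

```python
import os

def get_sheet_names(filepath, readers):
    """Pick a reader by file extension.

    `readers` maps a lowercase extension such as '.xlsx' to a function
    that takes a file path and returns a list of sheet names.
    """
    ext = os.path.splitext(filepath)[1].lower()
    if ext not in readers:
        raise ValueError(f'unsupported Excel format: {ext!r}')
    return readers[ext](filepath)
```

A typical mapping might be `{'.xlsx': get_sheetnames_xlsx, '.xlsb': get_sheetnames_xlsb}`; keeping the table in one place makes it trivial to add or swap readers as library support changes.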

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.