Analysis and Solutions for "Unsupported Format, or Corrupt File" Error in Python xlrd Library

Keywords: Python | xlrd | Excel file reading | File format error | HTML table parsing

Abstract: This article provides an in-depth analysis of the "Unsupported format, or corrupt file" error encountered when using Python's xlrd library to process Excel files. Through concrete case studies, it reveals the root cause: mismatch between file extensions and actual formats. The paper explains xlrd's working principles in detail and offers multiple diagnostic methods and solutions, including using text editors to verify file formats, employing pandas' read_html function for HTML-formatted files, and proper file format identification techniques. With code examples and principle analysis, it helps developers fundamentally resolve such file reading issues.

Problem Phenomenon and Error Analysis

When using Python's xlrd library to read Excel files, developers often encounter the "Unsupported format, or corrupt file" error. This error is typically accompanied by detailed error messages, such as:

Traceback (most recent call last):
File "Z:\Wilson\tradedStockStatus.py", line 18, in <module>
wb = xlrd.open_workbook("Z:\\Data\\Locates\\3.8 locates.xls")
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 429, in open_workbook
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1545, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1539, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record;
found '<table r'"

Root Cause Analysis

From the error message, the essence of the problem is clear: the xlrd library expects to find Excel's BOF (Beginning of File) record at the file's beginning, but actually finds the string '<table r'. This indicates that the file is not actually in true Excel binary format, but rather in HTML format.

The working principle of the xlrd library is based on parsing Excel files' binary formats. When the open_workbook function is called, the library first checks the file's magic number to determine the file format. For traditional .xls files, it should start with a specific binary signature. However, when the file is actually in HTML format, the beginning typically contains HTML tags, causing xlrd to fail format recognition.

Diagnostic Methods

To confirm the file's actual format, the most direct method is to open the file with a text editor:

# Use a text editor in command line to view file content
# In Windows systems, you can use notepad
notepad "Z:\Data\Locates\3.8 locates.xls"

# Or use Python's built-in methods to read the file beginning
with open("Z:\Data\Locates\3.8 locates.xls", 'r', encoding='utf-8', errors='ignore') as f:
    first_bytes = f.read(100)
    print(f"File beginning content: {first_bytes}")

If the file beginning displays HTML tags (such as <table>, <html>, etc.), it can be confirmed that the file is actually in HTML format, just incorrectly named with a .xls extension.

Solutions

Method 1: Using pandas' read_html Function

For HTML-formatted table files, the pandas library provides specialized reading functions:

import pandas as pd

# Read HTML-formatted table data
try:
    data_frames = pd.read_html("Z:\Data\Locates\3.8 locates.xls")
    
    # read_html returns a list of DataFrames, each table corresponds to one DataFrame
    for i, df in enumerate(data_frames):
        print(f"Table {i+1}:")
        print(df.head())
        print("\n" + "="*50 + "\n")
        
except Exception as e:
    print(f"Error reading file: {e}")

Method 2: Using Other HTML Parsing Libraries

Besides pandas, specialized HTML parsing libraries can also be used:

from bs4 import BeautifulSoup
import requests

# If the file is a local HTML file
with open("Z:\Data\Locates\3.8 locates.xls", 'r', encoding='utf-8') as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
tables = soup.find_all('table')

for i, table in enumerate(tables):
    print(f"Found table {i+1}")
    # Further process table data...

In-depth Understanding of File Formats

Excel files come in various formats, and understanding these differences is crucial for proper file processing:

Traditional Binary Format (.xls): Uses BIFF (Binary Interchange File Format), with specific binary signatures at the file beginning
Office Open XML Format (.xlsx): XML-based file format compressed with ZIP
HTML Tables: Use HTML tags to represent table data, can be opened by Excel but are essentially text files

In practical development, many data sources provide HTML-formatted "Excel files" because this method is simple and cross-platform compatible. Developers need the ability to identify and handle such special cases.

Preventive Measures and Best Practices

To avoid similar issues, the following measures are recommended:

File Format Validation: Verify the file's actual format before reading

import magic  # Requires python-magic library installation

def detect_file_type(file_path):
    file_type = magic.from_file(file_path, mime=True)
    return file_type

file_type = detect_file_type("Z:\Data\Locates\3.8 locates.xls")
print(f"Actual file type: {file_type}")

Error Handling Mechanism: Implement robust error handling with meaningful error messages

def safe_read_excel(file_path):
    try:
        # First attempt to read as Excel file
        import xlrd
        wb = xlrd.open_workbook(file_path)
        return wb
    except xlrd.biffh.XLRDError as e:
        if "Expected BOF record" in str(e):
            # Possibly HTML format, try using pandas
            try:
                import pandas as pd
                return pd.read_html(file_path)
            except Exception:
                raise ValueError("File is neither valid Excel format nor readable HTML table")
        else:
            raise e

File Naming Conventions: Ensure file extensions match actual formats
Documentation: Clearly specify the file's actual format during data exchange

Practical Application Cases

Referring to the case mentioned in the GitHub issue, some .xls files provided by the U.S. Energy Information Administration (EIA) for coal production data are actually in HTML format. This situation is quite common in government agencies and public data sources because HTML format is easier to generate and maintain.

When processing such data, developers should:

First attempt standard Excel reading methods
If that fails, check the file's actual format
Choose appropriate parsing methods based on the actual format
Document such special cases in code for future maintenance

Conclusion

The "Unsupported format, or corrupt file" error is typically not caused by file corruption, but rather by mismatch between file format and extension. By carefully analyzing error messages, verifying actual file formats, and selecting appropriate parsing tools, developers can effectively resolve such issues. Understanding the characteristics and applicable scenarios of different file formats is crucial for building robust data processing pipelines.

In practical development, adopting defensive programming strategies, performing pre-processing validation of file formats, and providing alternative parsing methods are recommended to ensure the stability and reliability of data processing workflows.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.