Keywords: Python | xlrd | Excel file reading | File format error | HTML table parsing
Abstract: This article provides an in-depth analysis of the "Unsupported format, or corrupt file" error encountered when using Python's xlrd library to process Excel files. Through concrete case studies, it reveals the root cause: mismatch between file extensions and actual formats. The paper explains xlrd's working principles in detail and offers multiple diagnostic methods and solutions, including using text editors to verify file formats, employing pandas' read_html function for HTML-formatted files, and proper file format identification techniques. With code examples and principle analysis, it helps developers fundamentally resolve such file reading issues.
Problem Phenomenon and Error Analysis
When using Python's xlrd library to read Excel files, developers often encounter the "Unsupported format, or corrupt file" error. This error is typically accompanied by detailed error messages, such as:
Traceback (most recent call last):
File "Z:\Wilson\tradedStockStatus.py", line 18, in <module>
wb = xlrd.open_workbook("Z:\\Data\\Locates\\3.8 locates.xls")
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 429, in open_workbook
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1545, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 1539, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record;
found '<table r'"
Root Cause Analysis
From the error message, the essence of the problem is clear: the xlrd library expects to find Excel's BOF (Beginning of File) record at the file's beginning, but actually finds the string '<table r'. This indicates that the file is not actually in true Excel binary format, but rather in HTML format.
The working principle of the xlrd library is based on parsing Excel files' binary formats. When the open_workbook function is called, the library first checks the file's magic number to determine the file format. For traditional .xls files, it should start with a specific binary signature. However, when the file is actually in HTML format, the beginning typically contains HTML tags, causing xlrd to fail format recognition.
Diagnostic Methods
To confirm the file's actual format, the most direct method is to open the file with a text editor:
# Use a text editor in command line to view file content
# In Windows systems, you can use notepad
notepad "Z:\Data\Locates\3.8 locates.xls"
# Or use Python's built-in methods to read the file beginning
with open("Z:\Data\Locates\3.8 locates.xls", 'r', encoding='utf-8', errors='ignore') as f:
first_bytes = f.read(100)
print(f"File beginning content: {first_bytes}")
If the file beginning displays HTML tags (such as <table>, <html>, etc.), it can be confirmed that the file is actually in HTML format, just incorrectly named with a .xls extension.
Solutions
Method 1: Using pandas' read_html Function
For HTML-formatted table files, the pandas library provides specialized reading functions:
import pandas as pd
# Read HTML-formatted table data
try:
data_frames = pd.read_html("Z:\Data\Locates\3.8 locates.xls")
# read_html returns a list of DataFrames, each table corresponds to one DataFrame
for i, df in enumerate(data_frames):
print(f"Table {i+1}:")
print(df.head())
print("\n" + "="*50 + "\n")
except Exception as e:
print(f"Error reading file: {e}")
Method 2: Using Other HTML Parsing Libraries
Besides pandas, specialized HTML parsing libraries can also be used:
from bs4 import BeautifulSoup
import requests
# If the file is a local HTML file
with open("Z:\Data\Locates\3.8 locates.xls", 'r', encoding='utf-8') as f:
html_content = f.read()
soup = BeautifulSoup(html_content, 'html.parser')
tables = soup.find_all('table')
for i, table in enumerate(tables):
print(f"Found table {i+1}")
# Further process table data...
In-depth Understanding of File Formats
Excel files come in various formats, and understanding these differences is crucial for proper file processing:
- Traditional Binary Format (.xls): Uses BIFF (Binary Interchange File Format), with specific binary signatures at the file beginning
- Office Open XML Format (.xlsx): XML-based file format compressed with ZIP
- HTML Tables: Use HTML tags to represent table data, can be opened by Excel but are essentially text files
In practical development, many data sources provide HTML-formatted "Excel files" because this method is simple and cross-platform compatible. Developers need the ability to identify and handle such special cases.
Preventive Measures and Best Practices
To avoid similar issues, the following measures are recommended:
- File Format Validation: Verify the file's actual format before reading
- Error Handling Mechanism: Implement robust error handling with meaningful error messages
- File Naming Conventions: Ensure file extensions match actual formats
- Documentation: Clearly specify the file's actual format during data exchange
import magic # Requires python-magic library installation
def detect_file_type(file_path):
file_type = magic.from_file(file_path, mime=True)
return file_type
file_type = detect_file_type("Z:\Data\Locates\3.8 locates.xls")
print(f"Actual file type: {file_type}")
def safe_read_excel(file_path):
try:
# First attempt to read as Excel file
import xlrd
wb = xlrd.open_workbook(file_path)
return wb
except xlrd.biffh.XLRDError as e:
if "Expected BOF record" in str(e):
# Possibly HTML format, try using pandas
try:
import pandas as pd
return pd.read_html(file_path)
except Exception:
raise ValueError("File is neither valid Excel format nor readable HTML table")
else:
raise e
Practical Application Cases
Referring to the case mentioned in the GitHub issue, some .xls files provided by the U.S. Energy Information Administration (EIA) for coal production data are actually in HTML format. This situation is quite common in government agencies and public data sources because HTML format is easier to generate and maintain.
When processing such data, developers should:
- First attempt standard Excel reading methods
- If that fails, check the file's actual format
- Choose appropriate parsing methods based on the actual format
- Document such special cases in code for future maintenance
Conclusion
The "Unsupported format, or corrupt file" error is typically not caused by file corruption, but rather by mismatch between file format and extension. By carefully analyzing error messages, verifying actual file formats, and selecting appropriate parsing tools, developers can effectively resolve such issues. Understanding the characteristics and applicable scenarios of different file formats is crucial for building robust data processing pipelines.
In practical development, adopting defensive programming strategies, performing pre-processing validation of file formats, and providing alternative parsing methods are recommended to ensure the stability and reliability of data processing workflows.