Keywords: Python | requests module | PDF download | binary files | encoding errors
Abstract: This article analyzes the encoding errors that commonly occur when downloading PDF files with the Python requests module and shows how to fix them. By comparing response.text and response.content, it explains why binary and text responses must be handled differently, and it covers streaming as a memory-efficient way to download large files. Code examples and technical analysis are included to help developers avoid common file download pitfalls.
Problem Background and Common Errors
When downloading PDF files with the Python requests module, many developers run into encoding errors or end up with blank or corrupted files. The usual cause is a misunderstanding of how requests exposes the response body.
A typical erroneous code example:
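
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.text)  # wrong: response.text is a decoded string, not bytes

Executing this code fails because response.text returns a decoded string, while a PDF is binary data. On Python 3, writing a str to a file opened in 'wb' mode raises TypeError (a bytes-like object is required, not 'str'). On Python 2, the implicit ASCII encoding of the unicode string raises UnicodeEncodeError. Encoding the string by hand before writing does not reliably help, because the decoding step may already have altered the bytes.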
Correct Solutions
Using response.content Attribute
For binary files such as PDFs, images, and audio, use the response.content attribute, which returns the raw bytes of the response body:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)  # bytes, written unchanged

response.content provides the body as bytes with no text decoding applied, so the bytes written to disk are the bytes of the original file.
Differences Between response.text and response.content
response.text:
- Returns decoded string objects
- Suitable for text files (HTML, TXT, JSON, etc.)
- Decodes using the character set named in the response headers, falling back to a default or detected encoding when the headers name none
response.content:
- Returns raw byte objects
- Suitable for binary files (PDF, images, archives, etc.)
- Preserves original file format without encoding conversion
Advanced Optimization: Streaming Download
For large files, using streaming download can significantly reduce memory consumption:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in response.iter_content(chunk_size=8192):
        fd.write(chunk)

Technical Advantages:
- Reads and writes in chunks, avoiding loading large files into memory at once
- Customizable chunk size (chunk_size) to balance memory usage and I/O efficiency
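- Makes progress monitoring straightforward, since each chunk is seen as it arrives; resuming an interrupted download additionally requires HTTP Range requests and a server that supports them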
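
Streaming alone does not resume an interrupted transfer. Resuming relies on the HTTP Range request header and a server that answers it with a 206 Partial Content response. The client-side bookkeeping could look roughly like the sketch below, where resume_headers is a hypothetical helper (not part of requests) and the partial file is assumed to be left over from an earlier attempt:

```python
import os


def resume_headers(path):
    """Return request headers asking the server for the bytes not yet on disk.

    Hypothetical helper for illustration; it is not part of requests.
    """
    already = os.path.getsize(path) if os.path.exists(path) else 0
    # An empty dict means "start from the beginning".
    return {'Range': f'bytes={already}-'} if already else {}
```

The result can be passed as requests.get(url, headers=resume_headers(path), stream=True). Appending to the file (mode 'ab') is only safe when the response status is 206; a 200 status means the server ignored the range and sent the whole file, so the file should be rewritten from the start.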
Suitable Scenarios:
- Downloading large files exceeding 100MB
- Unstable network environments
- Displaying download progress in real time
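
Progress display can be layered on top of the chunk loop. The following is a minimal sketch rather than a definitive implementation: write_chunks is a hypothetical helper that accepts any iterable of byte chunks, so it can be fed response.iter_content(...) in real use or a plain list when experimenting offline.

```python
import sys


def write_chunks(chunks, path, total=None):
    """Write an iterable of byte chunks to path, reporting progress on stderr.

    Hypothetical helper: `chunks` is meant to be response.iter_content(...),
    and `total` the Content-Length value when the server sends one.
    Returns the number of bytes written.
    """
    written = 0
    with open(path, 'wb') as fd:
        for chunk in chunks:
            fd.write(chunk)
            written += len(chunk)
            if total:
                print(f'\r{written * 100 // total}%', end='', file=sys.stderr)
            else:
                # Size unknown: fall back to a running byte count.
                print(f'\r{written} bytes', end='', file=sys.stderr)
    print(file=sys.stderr)  # finish the progress line
    return written
```

With requests, this might be called as write_chunks(response.iter_content(chunk_size=8192), '/tmp/metadata.pdf', int(response.headers.get('Content-Length', 0))). Content-Length can be absent, and for compressed responses it describes the compressed size, so the percentage is best treated as an estimate.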
Error Handling and Best Practices
Complete Error Handling Mechanism
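
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
stream = True  # set to False to read a small file into memory in one go

try:
    # timeout limits the connection attempt and each read, not the whole download
    response = requests.get(url, stream=stream, timeout=30)
    response.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
    with open('/tmp/metadata.pdf', 'wb') as f:
        if stream:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        else:
            f.write(response.content)
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")
except OSError as e:  # IOError is an alias of OSError on Python 3
    print(f"File write error: {e}")

Performance Optimization Recommendations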
- Use response.content for small files (<10MB) for concise and efficient code
- Use streaming download for large files to prevent memory overflow
- Set reasonable timeout values to avoid prolonged blocking
- Verify file integrity (e.g., check file size or MD5 hash)
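
The integrity check in the last recommendation can be sketched with the standard library. The function names file_md5 and looks_complete are hypothetical, and the expected size and hash are assumed to come from a trusted source such as the publisher's download page. MD5 is used here because the text mentions it; it catches accidental corruption but is not suitable for detecting deliberate tampering.

```python
import hashlib
import os


def file_md5(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()


def looks_complete(path, expected_size=None, expected_md5=None):
    """Compare a downloaded file against whichever reference values are known."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    if expected_md5 is not None and file_md5(path) != expected_md5.lower():
        return False
    return True
```

Reading the file in chunks keeps the check consistent with the streaming approach: the file is never held in memory as a whole.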
In-depth Technical Principle Analysis
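HTTP response bodies are byte streams; the difference lies in how the client interprets them. response.text decodes those bytes into a string using the codec named by response.encoding, substituting a replacement character for any byte sequence it cannot decode. For binary data this step is lossy.
PDF files contain compressed streams and other arbitrary byte values that are not meant to be interpreted as characters. Treating them as a string causes:
- Replacement of undecodable byte sequences, so the original values cannot be recovered
- A change in length when the string is encoded again, which invalidates the byte offsets recorded in the PDF cross-reference table
- Structural damage that leaves viewers unable to render the file correctly
In contrast, response.content returns the raw bytes without decoding them, preserving binary file integrity. This is the correct approach for handling non-text files.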
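
The loss can be reproduced without any network access. The sketch below uses a few made-up bytes that resemble the start of a PDF and assumes the body is decoded as UTF-8 with undecodable bytes replaced, which approximates what response.text does when UTF-8 is the selected encoding:

```python
# Bytes resembling the start of a PDF: an ASCII header followed by
# high-bit bytes of the kind PDF writers use to mark the file as binary.
raw = b'%PDF-1.4\n\xe2\xe3\xcf\xd3\n'

# Approximation of response.text with a UTF-8 encoding:
# undecodable bytes become U+FFFD instead of raising an error.
text = raw.decode('utf-8', errors='replace')

# Encoding the string again does not restore the original bytes.
round_tripped = text.encode('utf-8')

print(round_tripped == raw)          # False
print(len(raw), len(round_tripped))  # 14 22
```

Each of the four high-bit bytes is replaced by U+FFFD, which takes three bytes in UTF-8, so the data both changes and grows. Every byte offset after that point in a real PDF would shift accordingly.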
Conclusion
The key to correctly downloading PDFs and other binary files is understanding the difference between the decoded text and the raw bytes of an HTTP response. Using response.content, or streaming with iter_content for large files, keeps the file intact and the download efficient. Developers should choose between the two based on file size and specific requirements, and wrap the download in proper error handling.