Keywords: Python | wget | resume download | urllib.request | HTTP Range header | network download
Abstract: This article provides an in-depth exploration of implementing wget-like features including resume download, timeout retry, and infinite retry mechanisms in Python. Through detailed analysis of the urllib.request module, it covers HTTP Range header implementation, timeout control strategies, and robust retry logic. The paper compares alternative approaches using requests library and third-party wget module, offering complete code implementations and performance optimization recommendations for building reliable file download functionality.
Introduction
In file download scenarios, the wget command with -c --read-timeout=5 --tries=0 parameters provides robust fault tolerance: supporting resume download, 5-second read timeout retry, and infinite retry capabilities. This article analyzes in depth how to implement this feature combination in Python, focusing on building a complete download solution based on the urllib.request module.
Core Functional Requirements Analysis
To achieve equivalent functionality to the wget command, three key problems must be solved:
Resume Download Mechanism: When download is interrupted, continue from the previously downloaded position to avoid re-downloading already acquired data.
Timeout Retry Strategy: Set 5-second read timeout, automatically retry when no data transmission occurs within the specified time.
Infinite Retry Logic: Continuously retry in unstable network environments until download completion.
urllib.request Implementation Solution
Based on Python's standard library urllib.request module, we can construct a complete downloader. Here's the core implementation logic:
import urllib.request
import urllib.error
import os
import socket
import time

def resilient_download(url, local_path, timeout=5, max_retries=0):
    """
    Implement wget-style resume download.

    Parameters:
        url: Download URL
        local_path: Local save path
        timeout: Read timeout in seconds
        max_retries: Maximum retry attempts (0 means infinite retry)
    """
    retry_count = 0
    while True:
        try:
            # Check if a partial local file exists to determine the resume position
            start_byte = 0
            if os.path.exists(local_path):
                start_byte = os.path.getsize(local_path)

            # Create the request object and set the Range header
            req = urllib.request.Request(url)
            if start_byte > 0:
                req.add_header('Range', f'bytes={start_byte}-')

            # Open the URL connection
            with urllib.request.urlopen(req, timeout=timeout) as response:
                # Get the total file size (for progress display)
                total_size = int(response.headers.get('Content-Length', 0))
                if start_byte > 0:
                    total_size += start_byte

                # Open the local file in append mode when resuming
                mode = 'ab' if start_byte > 0 else 'wb'
                with open(local_path, mode) as local_file:
                    while True:
                        chunk = response.read(8192)  # 8 KB chunk size
                        if not chunk:
                            break
                        local_file.write(chunk)
                        local_file.flush()  # Push data toward disk

            print(f"Download completed: {local_path}")
            break
        # socket.timeout covers read timeouts on Python < 3.10;
        # since 3.10 it is an alias of TimeoutError
        except (urllib.error.URLError, ConnectionError,
                TimeoutError, socket.timeout) as e:
            retry_count += 1
            print(f"Download failed (attempt {retry_count}): {e}")
            # Check whether the maximum retry count has been reached
            if max_retries > 0 and retry_count >= max_retries:
                print("Maximum retry count reached, download terminated")
                break
            time.sleep(2)  # Wait before retrying

# Usage example
if __name__ == "__main__":
    url = "https://example.com/large-file.zip"
    local_file = "downloaded_file.zip"
    resilient_download(url, local_file, timeout=5, max_retries=0)
Key Technical Details Analysis
HTTP Range Header for Resume Download
The core of resume download lies in the HTTP protocol's Range request header. When the server supports range requests, the client can specify the byte range to download using Range: bytes=start-end header. In our implementation:
if start_byte > 0:
    req.add_header('Range', f'bytes={start_byte}-')
When a partial local file is detected, this code asks the server for data starting at byte start_byte. A server that supports range requests responds with a 206 Partial Content status code and only the requested byte range; a server that ignores the Range header responds with 200 OK and resends the entire file, so the status code is worth checking before appending.
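That status-code check can be isolated into a small helper. The sketch below uses a name of our own, choose_write_mode (it is not part of urllib); with urllib.request the status is available as response.status:

```python
def choose_write_mode(status_code, start_byte):
    """Decide how to open the local file based on the server's reply.

    206 Partial Content -> the server honored our Range header: append.
    200 OK              -> Range was ignored; the full body follows: start over.
    """
    if start_byte > 0 and status_code == 206:
        return 'ab'  # append: resume from start_byte
    return 'wb'      # truncate: the server is sending the whole file again
```

Passing the result to open() instead of the bare `start_byte > 0` test prevents appending a full response body onto an existing partial file, which would silently corrupt the download.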
Timeout Control and Retry Mechanism
The timeout parameter in urllib.request.urlopen controls connection and read operation timeout:
with urllib.request.urlopen(req, timeout=timeout) as response:
When no data transmission occurs within the specified time, a timeout exception is raised, triggering the retry logic (on Python < 3.10 a read timeout surfaces as socket.timeout; since 3.10 that is an alias of TimeoutError). Infinite retry is achieved through the max_retries=0 parameter, continuously retrying in the while loop until success.
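A fixed 2-second pause between attempts works, but on a flaky link exponential backoff is gentler on the server. A minimal sketch (the helper name backoff_delay is our own):

```python
def backoff_delay(retry_count, base=2.0, cap=60.0):
    """Exponential backoff with a ceiling.

    Attempt 1 waits `base` seconds, attempt 2 waits 2*base, then 4*base,
    doubling until the delay hits `cap` (here: 2s, 4s, 8s, ... max 60s).
    """
    return min(base * (2 ** (retry_count - 1)), cap)
```

In the retry branch, `time.sleep(2)` would become `time.sleep(backoff_delay(retry_count))`; the cap keeps an infinite-retry loop from sleeping unboundedly long.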
File Operation Optimization
Using appropriate file opening modes ensures data integrity:
mode = 'ab' if start_byte > 0 else 'wb'
with open(local_path, mode) as local_file:
Initial download uses 'wb' mode to create the file, while resume uses 'ab' mode to append data. Regular flush() calls push Python's user-space buffer to the operating system, reducing the amount of data lost if the program terminates abnormally; note that the OS may still hold those bytes in its page cache until it writes them out.
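Where durability matters more than throughput, os.fsync can be added after flush() to ask the OS to commit its cache to the device. A sketch with a hypothetical helper (calling it on every 8 KB chunk would be slow; in practice one would fsync periodically or on close):

```python
import os

def durable_write(fileobj, chunk):
    """Write a chunk and force it to disk.

    write() fills Python's buffer, flush() empties that buffer into the
    OS page cache, and fsync() asks the OS to commit the cache to disk.
    """
    fileobj.write(chunk)
    fileobj.flush()
    os.fsync(fileobj.fileno())
```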
Alternative Solutions Comparison
Requests Library Solution
Although Answer 3 mentions the requests library, its simple implementation requests.get(url).content lacks resume download and timeout retry features. A complete requests implementation requires manual handling of Range headers and streaming download:
import os
import requests

def requests_resume_download(url, local_path, timeout=5):
    headers = {}
    start_byte = 0
    if os.path.exists(local_path):
        start_byte = os.path.getsize(local_path)
        headers['Range'] = f'bytes={start_byte}-'
    response = requests.get(url, headers=headers, stream=True, timeout=timeout)
    mode = 'ab' if start_byte > 0 else 'wb'
    with open(local_path, mode) as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            f.flush()
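One advantage of requests is that connection-level retries can be delegated to urllib3 via a transport adapter rather than hand-rolled. A sketch, assuming urllib3 >= 1.26 (older versions spell allowed_methods as method_whitelist); the helper name make_retrying_session is our own:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=5, backoff_factor=1.0):
    """Build a requests.Session that transparently retries failed
    connections and 5xx responses with exponential backoff."""
    retry = Retry(
        total=total,                      # overall retry budget
        backoff_factor=backoff_factor,    # sleeps roughly 1s, 2s, 4s, ...
        status_forcelist=(500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),  # retry idempotent requests only
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Note that Retry cannot express wget's --tries=0 infinite behavior; for that, the outer while loop from the urllib implementation is still needed.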
Third-party wget Module
As mentioned in Answer 1, Python's wget module provides simple download functionality:
import wget
filename = wget.download(url)
However, this module hasn't been updated since 2015, lacks advanced features like custom timeout and infinite retry, and has compatibility issues with certain file types.
Performance Optimization and Error Handling
Chunk Size Optimization
Appropriate chunk size significantly impacts download performance. Smaller chunks (e.g., 1KB) increase system call overhead, while larger chunks (e.g., 1MB) may perform poorly in high-latency networks. 8KB-64KB typically provides a good balance.
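The per-chunk loop overhead can be measured in isolation with an in-memory copy, which removes the network from the picture. A rough benchmark sketch (copy_with_chunks is a name of our own):

```python
import io
import time

def copy_with_chunks(src_bytes, chunk_size):
    """Copy a byte buffer chunk by chunk.

    Returns (elapsed_seconds, bytes_copied); smaller chunks mean more
    loop iterations and read/write calls for the same payload.
    """
    src, dst = io.BytesIO(src_bytes), io.BytesIO()
    start = time.perf_counter()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
    return time.perf_counter() - start, dst.tell()
```

Comparing, say, copy_with_chunks(data, 1024) against copy_with_chunks(data, 65536) on a few megabytes of data makes the fixed per-chunk cost visible, though real downloads are usually dominated by network latency rather than this loop.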
Refined Exception Handling
In practical applications, different types of network errors should be distinguished:
try:
    pass  # download logic
except TimeoutError:
    pass  # read timeout: safe to retry immediately
except ConnectionResetError:
    time.sleep(5)  # connection reset: back off before retrying
except urllib.error.HTTPError as e:
    if e.code == 416:  # Range Not Satisfiable
        # The local file is at least as large as the remote one: either
        # the download already completed, or the partial file is bad.
        # Removing it and restarting from scratch is a safe fallback.
        os.remove(local_path)
        retry_count = 0  # reset the retry counter
Practical Application Recommendations
Environment Configuration Considerations
As discussed in the reference article, compatibility across different Python versions and environments is crucial. Using virtual environments for dependency management and pyenv for multi-version Python management is recommended:
# Install specific Python version using pyenv
pyenv install 3.11.9
pyenv local 3.11.9
Production Environment Deployment
In production environments, consider adding the following features:
• Download progress display
• Download speed limiting
• Concurrent download support
• Download queue management
• Comprehensive logging
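Of these, progress display is the easiest to bolt onto the existing chunk loop, since the loop already knows downloaded and total byte counts. A minimal text-bar sketch (format_progress is a name of our own):

```python
def format_progress(downloaded, total, width=30):
    """Render a one-line text progress bar, e.g. '[=====     ]  50.0%'."""
    if total <= 0:
        return f"{downloaded} bytes"  # size unknown: show a raw byte count
    fraction = min(downloaded / total, 1.0)
    filled = int(width * fraction)
    bar = "=" * filled + " " * (width - filled)
    return f"[{bar}] {fraction * 100:5.1f}%"
```

Inside the chunk loop one would accumulate bytes written and emit the bar with `print(format_progress(done, total_size), end='\r')`, overwriting the same terminal line on each update.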
Conclusion
Through in-depth analysis of the urllib.request module's core mechanisms, we have implemented functionality equivalent to the wget command. Resume download based on HTTP Range headers, precise timeout control, and robust retry logic together form a complete download solution. While the requests library and third-party modules offer alternative approaches, the standard library solution provides the best compatibility and control. In practice, choosing the implementation that fits the specific requirements, while fully considering error handling and performance optimization, enables building truly fail-proof download functionality.