Keywords: Python | wget | resume download | urllib.request | HTTP Range header | network download
Abstract: This article provides an in-depth exploration of implementing wget-like features including resume download, timeout retry, and infinite retry mechanisms in Python. Through detailed analysis of the urllib.request module, it covers HTTP Range header implementation, timeout control strategies, and robust retry logic. The paper compares alternative approaches using requests library and third-party wget module, offering complete code implementations and performance optimization recommendations for building reliable file download functionality.
Introduction
In file download scenarios, the wget command with -c --read-timeout=5 --tries=0 parameters provides robust fault tolerance: supporting resume download, 5-second read timeout retry, and infinite retry capabilities. This article analyzes in depth how to implement this feature combination in Python, focusing on building a complete download solution based on the urllib.request module.
Core Functional Requirements Analysis
To achieve equivalent functionality to the wget command, three key problems must be solved:
Resume Download Mechanism: When download is interrupted, continue from the previously downloaded position to avoid re-downloading already acquired data.
Timeout Retry Strategy: Set 5-second read timeout, automatically retry when no data transmission occurs within the specified time.
Infinite Retry Logic: Continuously retry in unstable network environments until download completion.
urllib.request Implementation Solution
Based on Python's standard library urllib.request module, we can construct a complete downloader. Here's the core implementation logic:
import urllib.request
import urllib.error
import os
import socket
import time

def resilient_download(url, local_path, timeout=5, max_retries=0):
    """
    Implement wget-style resume download.

    Parameters:
        url: Download URL
        local_path: Local save path
        timeout: Read timeout in seconds
        max_retries: Maximum retry attempts (0 means infinite retry)
    """
    retry_count = 0
    while True:
        try:
            # Check if a partial local file exists to determine the resume position
            start_byte = 0
            if os.path.exists(local_path):
                start_byte = os.path.getsize(local_path)

            # Create the request object and set the Range header
            req = urllib.request.Request(url)
            if start_byte > 0:
                req.add_header('Range', f'bytes={start_byte}-')

            # Open the URL connection
            with urllib.request.urlopen(req, timeout=timeout) as response:
                # Get the total file size (for progress display)
                total_size = int(response.headers.get('Content-Length', 0))
                if start_byte > 0:
                    total_size += start_byte

                # Open the local file in append mode when resuming
                mode = 'ab' if start_byte > 0 else 'wb'
                with open(local_path, mode) as local_file:
                    while True:
                        chunk = response.read(8192)  # 8 KB chunk size
                        if not chunk:
                            break
                        local_file.write(chunk)
                        local_file.flush()  # Push data toward disk

            print(f"Download completed: {local_path}")
            break
        # socket.timeout covers read timeouts on Python < 3.10;
        # since 3.10 it is an alias of TimeoutError
        except (urllib.error.URLError, ConnectionError,
                TimeoutError, socket.timeout) as e:
            retry_count += 1
            print(f"Download failed (attempt {retry_count}): {e}")
            # Check whether the maximum retry count has been reached
            if max_retries > 0 and retry_count >= max_retries:
                print("Maximum retry count reached, download terminated")
                break
            time.sleep(2)  # Wait before retrying

# Usage example
if __name__ == "__main__":
    url = "https://example.com/large-file.zip"
    local_file = "downloaded_file.zip"
    resilient_download(url, local_file, timeout=5, max_retries=0)
Key Technical Details Analysis
HTTP Range Header for Resume Download
The core of resume download lies in the HTTP protocol's Range request header. When the server supports range requests, the client can specify the byte range to download using Range: bytes=start-end header. In our implementation:
if start_byte > 0:
    req.add_header('Range', f'bytes={start_byte}-')
When a partial local file is detected, this code asks the server for data starting at byte start_byte. A server that supports range requests responds with a 206 Partial Content status code and only the requested byte range; a server that ignores the Range header responds with 200 OK and resends the entire file, so the status code is worth checking before appending.
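That status-code check can be isolated into a small helper. The sketch below uses a name of our own, choose_write_mode (it is not part of urllib); with urllib.request the status is available as response.status:

```python
def choose_write_mode(status_code, start_byte):
    """Decide how to open the local file based on the server's reply.

    206 Partial Content -> the server honored our Range header: append.
    200 OK              -> Range was ignored; the full body follows: start over.
    """
    if start_byte > 0 and status_code == 206:
        return 'ab'  # append: resume from start_byte
    return 'wb'      # truncate: the server is sending the whole file again
```

Passing the result to open() instead of the bare `start_byte > 0` test prevents appending a full response body onto an existing partial file, which would silently corrupt the download.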
Timeout Control and Retry Mechanism
The timeout parameter in urllib.request.urlopen controls connection and read operation timeout:
with urllib.request.urlopen(req, timeout=timeout) as response:
When no data transmission occurs within the specified time, a timeout exception is raised, triggering the retry logic (on Python < 3.10 a read timeout surfaces as socket.timeout; since 3.10 that is an alias of TimeoutError). Infinite retry is achieved through the max_retries=0 parameter, continuously retrying in the while loop until success.
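A fixed 2-second pause between attempts works, but on a flaky link exponential backoff is gentler on the server. A minimal sketch (the helper name backoff_delay is our own):

```python
def backoff_delay(retry_count, base=2.0, cap=60.0):
    """Exponential backoff with a ceiling.

    Attempt 1 waits `base` seconds, attempt 2 waits 2*base, then 4*base,
    doubling until the delay hits `cap` (here: 2s, 4s, 8s, ... max 60s).
    """
    return min(base * (2 ** (retry_count - 1)), cap)
```

In the retry branch, `time.sleep(2)` would become `time.sleep(backoff_delay(retry_count))`; the cap keeps an infinite-retry loop from sleeping unboundedly long.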
File Operation Optimization
Using appropriate file opening modes ensures data integrity:
mode = 'ab' if start_byte > 0 else 'wb'
with open(local_path, mode) as local_file:
Initial download uses 'wb' mode to create the file, while resume uses 'ab' mode to append data. Regular flush() calls push Python's user-space buffer to the operating system, reducing the amount of data lost if the program terminates abnormally; note that the OS may still hold those bytes in its page cache until it writes them out.
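Where durability matters more than throughput, os.fsync can be added after flush() to ask the OS to commit its cache to the device. A sketch with a hypothetical helper (calling it on every 8 KB chunk would be slow; in practice one would fsync periodically or on close):

```python
import os

def durable_write(fileobj, chunk):
    """Write a chunk and force it to disk.

    write() fills Python's buffer, flush() empties that buffer into the
    OS page cache, and fsync() asks the OS to commit the cache to disk.
    """
    fileobj.write(chunk)
    fileobj.flush()
    os.fsync(fileobj.fileno())
```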
Alternative Solutions Comparison
Requests Library Solution
Although Answer 3 mentions the requests library, its simple implementation requests.get(url).content lacks resume download and timeout retry features. A complete requests implementation requires manual handling of Range headers and streaming download:
import os
import requests

def requests_resume_download(url, local_path, timeout=5):
    headers = {}
    start_byte = 0
    if os.path.exists(local_path):
        start_byte = os.path.getsize(local_path)
        headers['Range'] = f'bytes={start_byte}-'
    response = requests.get(url, headers=headers, stream=True, timeout=timeout)
    mode = 'ab' if start_byte > 0 else 'wb'
    with open(local_path, mode) as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            f.flush()
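One advantage of requests is that connection-level retries can be delegated to urllib3 via a transport adapter rather than hand-rolled. A sketch, assuming urllib3 >= 1.26 (older versions spell allowed_methods as method_whitelist); the helper name make_retrying_session is our own:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=5, backoff_factor=1.0):
    """Build a requests.Session that transparently retries failed
    connections and 5xx responses with exponential backoff."""
    retry = Retry(
        total=total,                      # overall retry budget
        backoff_factor=backoff_factor,    # sleeps roughly 1s, 2s, 4s, ...
        status_forcelist=(500, 502, 503, 504),
        allowed_methods=("GET", "HEAD"),  # retry idempotent requests only
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Note that Retry cannot express wget's --tries=0 infinite behavior; for that, the outer while loop from the urllib implementation is still needed.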
Third-party wget Module
As mentioned in Answer 1, Python's wget module provides simple download functionality:
import wget
filename = wget.download(url)
However, this module hasn't been updated since 2015, lacks advanced features like custom timeout and infinite retry, and has compatibility issues with certain file types.
Performance Optimization and Error Handling
Chunk Size Optimization
Appropriate chunk size significantly impacts download performance. Smaller chunks (e.g., 1KB) increase system call overhead, while larger chunks (e.g., 1MB) may perform poorly in high-latency networks. 8KB-64KB typically provides a good balance.
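The per-chunk loop overhead can be measured in isolation with an in-memory copy, which removes the network from the picture. A rough benchmark sketch (copy_with_chunks is a name of our own):

```python
import io
import time

def copy_with_chunks(src_bytes, chunk_size):
    """Copy a byte buffer chunk by chunk.

    Returns (elapsed_seconds, bytes_copied); smaller chunks mean more
    loop iterations and read/write calls for the same payload.
    """
    src, dst = io.BytesIO(src_bytes), io.BytesIO()
    start = time.perf_counter()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
    return time.perf_counter() - start, dst.tell()
```

Comparing, say, copy_with_chunks(data, 1024) against copy_with_chunks(data, 65536) on a few megabytes of data makes the fixed per-chunk cost visible, though real downloads are usually dominated by network latency rather than this loop.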
Refined Exception Handling
In practical applications, different types of network errors should be distinguished:
try:
    pass  # download logic
except TimeoutError:
    pass  # read timeout: safe to retry immediately
except ConnectionResetError:
    time.sleep(5)  # connection reset: back off before retrying
except urllib.error.HTTPError as e:
    if e.code == 416:  # Range Not Satisfiable
        # The local file is at least as large as the remote one: either
        # the download already completed, or the partial file is bad.
        # Removing it and restarting from scratch is a safe fallback.
        os.remove(local_path)
        retry_count = 0  # reset the retry counter
Practical Application Recommendations
Environment Configuration Considerations
As discussed in the reference article, compatibility across different Python versions and environments is crucial. Using virtual environments for dependency management and pyenv for multi-version Python management is recommended:
# Install specific Python version using pyenv
pyenv install 3.11.9
pyenv local 3.11.9
Production Environment Deployment
In production environments, consider adding the following features:
• Download progress display
• Download speed limiting
• Concurrent download support
• Download queue management
• Comprehensive logging
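Of these, progress display is the easiest to bolt onto the existing chunk loop, since the loop already knows downloaded and total byte counts. A minimal text-bar sketch (format_progress is a name of our own):

```python
def format_progress(downloaded, total, width=30):
    """Render a one-line text progress bar, e.g. '[=====     ]  50.0%'."""
    if total <= 0:
        return f"{downloaded} bytes"  # size unknown: show a raw byte count
    fraction = min(downloaded / total, 1.0)
    filled = int(width * fraction)
    bar = "=" * filled + " " * (width - filled)
    return f"[{bar}] {fraction * 100:5.1f}%"
```

Inside the chunk loop one would accumulate bytes written and emit the bar with `print(format_progress(done, total_size), end='\r')`, overwriting the same terminal line on each update.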
Conclusion
Through in-depth analysis of the urllib.request module's core mechanisms, we have implemented functionality equivalent to the wget command. Resume download based on HTTP Range headers, precise timeout control, and robust retry logic together form a complete download solution. While the requests library and third-party modules offer alternative approaches, the standard library solution provides the best compatibility and control. In practice, choosing the implementation that fits the specific requirements, while fully considering error handling and performance optimization, enables building truly fail-proof download functionality.