A Comprehensive Guide to HTTP File Downloading and Saving to Disk in Python

Nov 20, 2025 · Programming

Keywords: Python | file download | HTTP | urllib | requests

Abstract: This article provides an in-depth exploration of methods for downloading HTTP files and saving them to disk in Python, focusing on the urllib and requests libraries and covering basic downloads, streaming, error handling, and file extraction. It is suitable for beginners and advanced developers alike.

Introduction

Downloading files from HTTP servers and saving them to local disk is a common task in Python, especially for data scraping, automation scripts, and web development. Based on high-scoring Stack Overflow answers and authoritative references, this article systematically explains how to achieve this using Python's standard and third-party libraries, covering scenarios from basic to advanced.

Downloading Files with urllib

The urllib module in Python's standard library offers a straightforward way to download files. For beginners, the urlretrieve function is an excellent choice as it encapsulates the entire process of requesting and saving. Here is a basic example:

import urllib.request

url = "http://example.com/file.gz"
filename = "file.gz"
urllib.request.urlretrieve(url, filename)

This code downloads a file from the specified URL and saves it locally. The function returns a tuple containing the file path and HTTP response headers, which can be used for further analysis, such as verifying content type and file size.
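To see that return value concretely without touching a remote host, the sketch below is a runnable illustration that substitutes a local file:// URL for the example server (the file names here are invented for the demo); the headers object still exposes entries such as Content-Length:

```python
import pathlib
import tempfile
import urllib.request

# Create a small local file and address it with a file:// URL,
# so the example runs without network access.
src = pathlib.Path(tempfile.mkdtemp()) / "file.gz"
src.write_bytes(b"example payload")

# urlretrieve returns (local_path, headers); even for file:// URLs
# the headers include Content-Length and Content-Type.
path, headers = urllib.request.urlretrieve(src.as_uri(), src.with_name("copy.gz"))
print(path, headers["Content-Length"])
```

The same two return values can be used after a real HTTP download, for example to compare Content-Length against the size of the saved file.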

Downloading Files with requests

While urllib is simple, the requests library provides richer features and better error handling. First, install the library: pip install requests. Then, use the following code:

import requests

url = "http://example.com/file.gz"
response = requests.get(url)
with open("file.gz", "wb") as file:
    file.write(response.content)

Here, response.content contains the binary data of the file, saved using write mode. Compared to urllib, requests supports more complex HTTP operations, such as handling redirects and authentication.
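To make the redirect handling concrete, here is a self-contained sketch (the throwaway local server and its /old path are invented purely so the snippet runs offline) showing that requests follows a 302 automatically and records the hop in response.history:

```python
import http.server
import threading

import requests

class RedirectHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Redirect the legacy path to the real file location.
            self.send_response(302)
            self.send_header("Location", "/file.gz")
            self.end_headers()
        else:
            body = b"gz-bytes"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# requests follows the 302 transparently; the intermediate response
# is kept in response.history for inspection.
response = requests.get(f"http://127.0.0.1:{port}/old", timeout=10)
print(response.history[0].status_code, response.url)
server.shutdown()
```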

Handling Large Files and Streaming Downloads

For large files (e.g., over 500MB), loading the entire content into memory can cause performance issues. Using the streaming mode in requests allows downloading data in chunks:

import requests

url = "http://example.com/large_file.gz"
response = requests.get(url, stream=True)
with open("large_file.gz", "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            file.write(chunk)

This method reads data piece by piece, reducing memory usage and enabling real-time processing, such as computing hashes or extracting parts of the content during download.
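As an example of that real-time processing, a checksum can be updated chunk by chunk while the file is written. The sketch below spins up a throwaway local HTTP server so it runs without external network access (the server, file name, and 1 MiB payload are invented for the demo):

```python
import functools
import hashlib
import http.server
import os
import tempfile
import threading

import requests

# Serve a temporary directory so the download needs no external network.
tmpdir = tempfile.mkdtemp()
payload = os.urandom(1 << 20)  # 1 MiB stand-in for a large file
with open(os.path.join(tmpdir, "large_file.gz"), "wb") as f:
    f.write(payload)

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=tmpdir)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/large_file.gz"

# Stream the response and feed each chunk to the hash as it is saved,
# so the full file never sits in memory at once.
sha256 = hashlib.sha256()
with requests.get(url, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open("downloaded.gz", "wb") as file:
        for chunk in response.iter_content(chunk_size=8192):
            sha256.update(chunk)
            file.write(chunk)
server.shutdown()
print(sha256.hexdigest())
```

The resulting digest can be compared against a checksum published alongside the file to verify download integrity.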

Error Handling and Best Practices

In practical applications, network requests may fail, so adding error handling is essential. Use try-except blocks to catch exceptions:

import urllib.error
import urllib.request
import requests

try:
    # Example with urllib; urlretrieve raises URLError (or its
    # subclass HTTPError) on network and HTTP failures
    urllib.request.urlretrieve("http://example.com/file.gz", "file.gz")
except urllib.error.URLError as e:
    print(f"Download failed: {e}")

# Or with requests
try:
    # requests has no default timeout, so set one explicitly
    response = requests.get("http://example.com/file.gz", timeout=30)
    response.raise_for_status()  # Check for HTTP errors
    with open("file.gz", "wb") as file:
        file.write(response.content)
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")

Additionally, it is advisable to use absolute paths for saving files to avoid issues from changing working directories. On Windows, paths should use double backslashes or raw strings, e.g., r"D:\folder\file.gz".
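One portable way to build such paths is pathlib, which accepts forward slashes on every platform (sidestepping Windows backslash escaping) and makes it easy to resolve a relative path to an absolute one. A minimal sketch, with an illustrative directory name:

```python
import tempfile
from pathlib import Path

# Build the download directory with pathlib; forward slashes work
# even on Windows, so no backslash escaping is needed.
download_dir = Path(tempfile.gettempdir()) / "downloads_demo"
download_dir.mkdir(parents=True, exist_ok=True)

# resolve() turns the path into an absolute one, immune to later
# changes of the working directory.
target = (download_dir / "file.gz").resolve()
print(target.is_absolute())  # True
```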

Extracting and Processing Downloaded Files

Downloaded .gz files can be extracted using Python's gzip module. The following code demonstrates how to download and extract a file:

import urllib.request
import gzip
import shutil

# Download the file
url = "http://example.com/file.gz"
filename_gz = "file.gz"
urllib.request.urlretrieve(url, filename_gz)

# Extract the file
filename_extracted = "file.txt"
with gzip.open(filename_gz, "rb") as f_in:
    with open(filename_extracted, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

After extraction, the file content can be used for further processing, such as reading text or parsing data. For other formats like ZIP, the zipfile module can be used similarly.
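For illustration, the zipfile equivalent of the gzip workflow looks like this; the archive is created locally so the snippet runs standalone, and the file names are invented:

```python
import pathlib
import tempfile
import zipfile

tmp = pathlib.Path(tempfile.mkdtemp())
archive = tmp / "file.zip"

# Build a small archive locally (in practice this would be the downloaded file).
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.txt", "hello from the archive")

# Extract every member of the archive into a target directory.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(tmp / "extracted")

print((tmp / "extracted" / "data.txt").read_text())
```

Unlike gzip, which wraps a single compressed stream, a ZIP archive can hold many members, so extractall (or extract for a single member) replaces the copyfileobj step.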

Advanced Topics: Parallel Downloads and Asynchronous Processing

For downloading multiple files, parallel processing can improve efficiency. Using ThreadPoolExecutor from the concurrent.futures module enables multithreaded downloads:

from concurrent.futures import ThreadPoolExecutor
import requests

def download_file(url):
    response = requests.get(url)
    filename = url.split("/")[-1]
    with open(filename, "wb") as file:
        file.write(response.content)
    print(f"Downloaded: {filename}")

urls = ["http://example.com/file1.gz", "http://example.com/file2.gz"]
with ThreadPoolExecutor() as executor:
    executor.map(download_file, urls)

Because downloads are I/O-bound, multithreading can significantly reduce total download time. Asynchronous libraries such as aiohttp suit high-throughput scenarios as well, but they involve a steeper learning curve and are better suited to advanced users.

Conclusion

Python offers various tools for downloading HTTP files, from the simple urllib.request.urlretrieve to the feature-rich requests library. When choosing a method, consider file size, error handling needs, and performance. Beginners should start with urllib, while advanced users can explore streaming downloads and parallel processing. By applying these techniques, you can efficiently automate file downloading tasks.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.