Keywords: wget | HTTP directory download | recursive download
Abstract: This article is a practical guide to using the wget command to recursively download all files and subdirectories from an HTTP directory, addressing the common problem of receiving only index.html listing pages instead of the actual content. Through analysis of the key parameters -r, -np, -nH, --cut-dirs, and -R, it provides complete command-line solutions and practical examples that reproduce the remote directory structure locally, much like copying a folder.
Problem Background and Challenges
When downloading from an online HTTP directory, users frequently run into the same problem: a plain wget command retrieves only the index.html file containing the directory listing, not the actual files. The downloaded content cannot be used directly and requires additional processing steps.
Core Solution
By combining multiple parameters of wget, complete directory recursive downloading can be achieved:
wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/
Detailed Parameter Analysis
Recursive Download (-r): Enables recursive mode, ensuring wget traverses all subdirectories and downloads files within them.
No Parent Directory Access (-np): Prevents wget from traversing to parent directories, ensuring the download scope is limited to the specified directory and its subdirectories.
Ignore Hostname Directory (-nH): Downloads files without creating a top-level directory named after the hostname, resulting in a cleaner file organization structure.
Directory Level Trimming (--cut-dirs=3): Ignores the first 3 directory levels in the URL path when saving files. For example, for the path http://hostname/aaa/bbb/ccc/ddd/, downloaded files will be saved directly under the ddd directory instead of the complete path structure.
Exclude Specific Files (-R index.html): Rejects index.html files so that the useless directory-listing pages are not kept. Note that during a recursive run wget still fetches these pages temporarily in order to discover the links they contain, then deletes them once they have been parsed.
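The combined effect of -nH and --cut-dirs on where a file is saved can be sketched with plain shell string handling (the URL below is the hypothetical one from the command above):

```shell
#!/bin/sh
# Sketch: how wget maps a URL to a local path under -nH and --cut-dirs=3.
url="http://hostname/aaa/bbb/ccc/ddd/file.txt"

# Strip the scheme and hostname (what -nH does).
path="${url#http://}"      # hostname/aaa/bbb/ccc/ddd/file.txt
path="${path#*/}"          # aaa/bbb/ccc/ddd/file.txt

# Drop the first 3 path components (what --cut-dirs=3 does).
path=$(echo "$path" | cut -d/ -f4-)

echo "$path"               # ddd/file.txt
```

Without --cut-dirs, the same file would land in aaa/bbb/ccc/ddd/file.txt; without -nH, a hostname/ directory would be prepended as well.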
Practical Application Example
Assuming you need to download all content from http://example.com/data/project/docs/reference/, you can use the following command:
wget -r -np -nH --cut-dirs=4 -R index.html http://example.com/data/project/docs/reference/
Here --cut-dirs=4 strips all four leading path components (data, project, docs, and reference), so the downloaded files land directly in the current directory. If you instead want reference kept as the local top-level directory, use --cut-dirs=3.
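To avoid off-by-one mistakes with --cut-dirs, the path components can simply be counted; this sketch (using the example URL above) prints the value that keeps the last directory as the local top level:

```shell
#!/bin/sh
# Count the directory components of a URL path to choose --cut-dirs.
url="http://example.com/data/project/docs/reference/"

# Isolate the path, drop the trailing slash, count the components.
path="${url#*://*/}"                         # data/project/docs/reference/
path="${path%/}"                             # data/project/docs/reference
total=$(echo "$path" | tr '/' '\n' | wc -l)  # 4 components

# Keep the last directory ("reference") as the local top level:
echo "--cut-dirs=$((total - 1))"             # --cut-dirs=3
```

Using $total itself instead of $((total - 1)) would strip every component and drop all files into the current directory.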
Advanced Configuration Options
For scenarios where you want to download only files that have changed since a previous run, add the -N parameter:
wget -r -N -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/
The -N parameter enables timestamp comparison: a file is downloaded only when the remote copy is newer than the local one, saving bandwidth and time on repeated runs. (Note that -N does not resume interrupted transfers; to continue a partially downloaded file, use -c/--continue instead.)
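The decision -N makes is essentially the shell's own newer-than comparison; a local sketch (filenames are hypothetical stand-ins for the remote and local copies):

```shell
#!/bin/sh
# Sketch of the check -N performs: fetch only if remote is newer than local.
# Two local files stand in for the local copy and the remote file.
touch -t 202401010000 local.txt    # older "local copy"  (2024-01-01)
touch -t 202401020000 remote.txt   # newer "remote file" (2024-01-02)

if [ remote.txt -nt local.txt ]; then
    echo "would re-download"       # -N would fetch the file
else
    echo "would skip"              # -N would keep the local copy
fi
rm -f local.txt remote.txt
```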
Error Handling and Debugging
In complex network environments, it's recommended to add the -t parameter to specify retry attempts:
wget -r -np -nH --cut-dirs=3 -R index.html -t 5 http://hostname/aaa/bbb/ccc/ddd/
With this configuration, wget tries each download up to 5 times before giving up (wget's built-in default is 20 tries; -t 0 or -t inf retries indefinitely), improving success rates on unreliable connections.
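wget's -t covers per-file retries, but some failures (DNS errors, dropped connections before the transfer starts) can abort the whole run; a common pattern is an outer retry loop around the command. A sketch, with `false` as a placeholder for the real wget invocation (it always fails, so every attempt runs):

```shell
#!/bin/sh
# Outer retry loop around a download command; replace 'false' with
# the actual wget call. 'false' always fails, so all 5 attempts run.
max=5
attempt=1
while [ "$attempt" -le "$max" ]; do
    if false; then
        echo "succeeded on attempt $attempt"
        break
    fi
    attempt=$((attempt + 1))
done
[ "$attempt" -gt "$max" ] && echo "all $max attempts failed"
```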
Conclusion
By properly combining wget parameters, users can efficiently download all content from an HTTP directory, reproducing the remote directory structure locally much like copying a folder. This method is particularly suitable for website mirroring, data backup, and batch file download scenarios.