Keywords: wget | HTTP directory download | recursive download
Abstract: This article is a practical guide to using the wget command to recursively download all files and subdirectories from an HTTP directory, addressing the common problem of receiving only index.html listing pages instead of the actual content. Through analysis of the key parameters -r, -np, -nH, --cut-dirs, and -R, it provides complete command-line solutions and practical examples that reproduce the remote directory structure locally, much like copying a folder.
Problem Background and Challenges
When downloading from an online HTTP directory, users frequently run into the same problem: a plain wget command retrieves only the index.html file containing the directory listing, not the actual files. The downloaded content cannot be used directly and requires additional processing steps.
Core Solution
By combining multiple parameters of wget, complete directory recursive downloading can be achieved:
wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/
Detailed Parameter Analysis
Recursive Download (-r): Enables recursive mode, ensuring wget traverses all subdirectories and downloads files within them.
No Parent Directory Access (-np): Prevents wget from traversing to parent directories, ensuring the download scope is limited to the specified directory and its subdirectories.
Ignore Hostname Directory (-nH): Downloads files without creating a top-level directory named after the hostname, resulting in a cleaner file organization structure.
Directory Level Trimming (--cut-dirs=3): Ignores the first 3 directory levels in the URL path when saving files. For example, for the path http://hostname/aaa/bbb/ccc/ddd/, downloaded files will be saved directly under the ddd directory instead of the complete path structure.
Exclude Specific Files (-R index.html): Rejects index.html files so that the useless directory-listing pages are not kept. Note that during a recursive run wget still fetches these pages temporarily in order to discover the links they contain, then deletes them once they have been parsed.
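The combined effect of -nH and --cut-dirs on where a file is saved can be sketched with plain shell string handling (the URL below is the hypothetical one from the command above):

```shell
#!/bin/sh
# Sketch: how wget maps a URL to a local path under -nH and --cut-dirs=3.
url="http://hostname/aaa/bbb/ccc/ddd/file.txt"

# Strip the scheme and hostname (what -nH does).
path="${url#http://}"      # hostname/aaa/bbb/ccc/ddd/file.txt
path="${path#*/}"          # aaa/bbb/ccc/ddd/file.txt

# Drop the first 3 path components (what --cut-dirs=3 does).
path=$(echo "$path" | cut -d/ -f4-)

echo "$path"               # ddd/file.txt
```

Without --cut-dirs, the same file would land in aaa/bbb/ccc/ddd/file.txt; without -nH, a hostname/ directory would be prepended as well.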
Practical Application Example
Assuming you need to download all content from http://example.com/data/project/docs/reference/, you can use the following command:
wget -r -np -nH --cut-dirs=4 -R index.html http://example.com/data/project/docs/reference/
Here --cut-dirs=4 strips all four leading path components (data, project, docs, and reference), so the downloaded files land directly in the current directory. If you instead want reference kept as the local top-level directory, use --cut-dirs=3.
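To avoid off-by-one mistakes with --cut-dirs, the path components can simply be counted; this sketch (using the example URL above) prints the value that keeps the last directory as the local top level:

```shell
#!/bin/sh
# Count the directory components of a URL path to choose --cut-dirs.
url="http://example.com/data/project/docs/reference/"

# Isolate the path, drop the trailing slash, count the components.
path="${url#*://*/}"                         # data/project/docs/reference/
path="${path%/}"                             # data/project/docs/reference
total=$(echo "$path" | tr '/' '\n' | wc -l)  # 4 components

# Keep the last directory ("reference") as the local top level:
echo "--cut-dirs=$((total - 1))"             # --cut-dirs=3
```

Using $total itself instead of $((total - 1)) would strip every component and drop all files into the current directory.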
Advanced Configuration Options
For scenarios where you want to download only files that have changed since a previous run, add the -N parameter:
wget -r -N -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/
The -N parameter enables timestamp comparison: a file is downloaded only when the remote copy is newer than the local one, saving bandwidth and time on repeated runs. (Note that -N does not resume interrupted transfers; to continue a partially downloaded file, use -c/--continue instead.)
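The decision -N makes is essentially the shell's own newer-than comparison; a local sketch (filenames are hypothetical stand-ins for the remote and local copies):

```shell
#!/bin/sh
# Sketch of the check -N performs: fetch only if remote is newer than local.
# Two local files stand in for the local copy and the remote file.
touch -t 202401010000 local.txt    # older "local copy"  (2024-01-01)
touch -t 202401020000 remote.txt   # newer "remote file" (2024-01-02)

if [ remote.txt -nt local.txt ]; then
    echo "would re-download"       # -N would fetch the file
else
    echo "would skip"              # -N would keep the local copy
fi
rm -f local.txt remote.txt
```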
Error Handling and Debugging
In complex network environments, it's recommended to add the -t parameter to specify retry attempts:
wget -r -np -nH --cut-dirs=3 -R index.html -t 5 http://hostname/aaa/bbb/ccc/ddd/
With this configuration, wget tries each download up to 5 times before giving up (wget's built-in default is 20 tries; -t 0 or -t inf retries indefinitely), improving success rates on unreliable connections.
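wget's -t covers per-file retries, but some failures (DNS errors, dropped connections before the transfer starts) can abort the whole run; a common pattern is an outer retry loop around the command. A sketch, with `false` as a placeholder for the real wget invocation (it always fails, so every attempt runs):

```shell
#!/bin/sh
# Outer retry loop around a download command; replace 'false' with
# the actual wget call. 'false' always fails, so all 5 attempts run.
max=5
attempt=1
while [ "$attempt" -le "$max" ]; do
    if false; then
        echo "succeeded on attempt $attempt"
        break
    fi
    attempt=$((attempt + 1))
done
[ "$attempt" -gt "$max" ] && echo "all $max attempts failed"
```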
Conclusion
By properly combining wget parameters, users can efficiently download all content from an HTTP directory, reproducing the remote directory structure locally much like copying a folder. This method is particularly suitable for website mirroring, data backup, and batch file download scenarios.