Downloading Websites with WinHTTrack

An excellent open-source tool called WinHTTrack enables downloading entire websites for archiving, backups, and analysis.  When direct access to a website's server is not available, WinHTTrack can be very useful for creating an offline backup or archive before a replacement website goes live.

There are three primary use cases for WinHTTrack: backing up an old site before it is moved to a new host, archiving a useful website for offline use, and downloading a full offline copy of a website for semantic analysis.  With its plethora of options, WinHTTrack adapts easily to each situation.

The bulk of WinHTTrack’s options are available under the “Preferences and mirror options” button, on the same screen where the website URLs are entered.  Below are the recommended options for each scenario; a rough command-line equivalent follows each settings list.

Website Backup

It is ideal to create two archives of a website backup – one virgin download that preserves the original URLs, and one full download with links rewritten for local browsing.

  1. Scan Rules
    • Select all three file wildcards for a full site download
  2. Limits
    • Leave Mirroring depth empty
    • Increase the transfer rate to a proper rate for your Internet connection
  3. Links
    • Check “Get non-HTML files related to a link”
    • Check “Get HTML files first”
  4. Build
    • Check “No external pages”
  5. Spider
    • Set “no robots.txt rules”
  6. Experts Only
    • For the virgin download, select “Original URL” under “Rewrite links”; otherwise select “Relative URI”
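
For reference, WinHTTrack’s GUI settings map loosely onto flags of its command-line sibling, httrack.  The sketch below is an approximation rather than a verified recipe – the URL, output path, and wildcard list are placeholders, and exact flag behavior should be confirmed against httrack --help:

  httrack "http://example.com/" -O "C:\mirrors\example-backup" "+*.png" "+*.gif" "+*.jpg" "+*.zip" "+*.mov" -n -p7 -s0 -x -K

The +*.ext patterns stand in for the Scan Rules wildcards, -n fetches non-HTML files related to a link, -p7 gets HTML files first, -s0 ignores robots.txt, -x replaces external pages with local error pages, and -K keeps the original URLs (omit -K for the second, relative-link copy).  Leaving out -r keeps the mirroring depth at its default, and -A can cap the transfer rate in bytes per second.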

Archiving a Site for Offline Use

When creating an offline site, it’s important to optimize for the target content.  If only certain file types or URL patterns are needed, limit the crawl to those areas.  Use the same settings as Website Backup, except with the following changes:

  1. Scan Rules
    • Limit file type to target content
  2. Build
    • Uncheck “No external pages”
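
As before, a hedged httrack equivalent may help script the job; the URL, output path, and the PDF filter below are purely illustrative assumptions:

  httrack "http://example.com/docs/" -O "C:\mirrors\example-archive" "-*" "+*example.com/docs/*" "+*.pdf" -n -p7 -s0

Here a blanket "-*" exclusion followed by targeted "+" patterns limits the crawl to the content of interest, and dropping -x leaves external links pointing at their live locations instead of local error pages.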

Semantic / Content Analysis

Semantic analysis is mainly concerned with processing the original HTML content.  In this situation, non-HTML files can be ignored.  Use the same settings as Website Backup, except with the following changes:

  1. Scan Rules
    • Do not check any of the additional content types; remove the png, gif, and jpg wildcards
  2. Links
    • Do not check “Get non-HTML files related to a link”
  3. Build
    • Do not check “No external pages”
  4. Experts Only
    • Choose “Original URL” under “Rewrite links”
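
A corresponding httrack sketch for an HTML-only crawl (again, the URL and path are placeholders):

  httrack "http://example.com/" -O "C:\mirrors\example-semantic" "-*.png" "-*.gif" "-*.jpg" -p7 -s0 -K

Excluding the image wildcards and omitting -n restricts the download to HTML content, while -K preserves the original URLs so that links can be analyzed exactly as published.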

With its flexible crawling settings, WinHTTrack is an invaluable tool.  Its support for parsing both JavaScript and CSS allows access to hard-to-reach content that would be difficult to download with a basic web crawler.  For simple jobs where writing a custom crawler would be overkill, the tool can serve as an excellent gateway for downloading content before text or other analysis.  A word to the wise, however – keep an eye on the crawler the first few times you run it; it can sometimes fall into an infinite loop when processing malformed server URLs or content.  While a custom web crawler can offer more stability and performance, WinHTTrack remains an excellent solution for quick download jobs and rapid content analysis.

Written by Andrew Palczewski

About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience managing software development projects.


3 thoughts on “Downloading Websites with WinHTTrack”

  1. How do you run those three setups – do you download each stage separately and let it stop automatically, or do you download once with all the settings at one time?

    1. Tomasz, it depends on your use case. If you just want to take a backup of the website as part of an upgrade, use the first option. If you want a full offline archive of a site (for example, if you are taking a site down for good), use the second option. Finally, for programming semantic algorithms, use the third option.
