Downloading Websites with WinHTTrack

An excellent open-source tool called WinHTTrack enables downloading entire websites for archiving, backups, and analysis.  When direct access to a website's server is not available, WinHTTrack can be very useful for creating an offline backup or archive before a replacement website goes live.

There are three primary use cases for WinHTTrack: backing up an old site before it is moved to a new host, archiving a useful website for offline use, and downloading a full offline copy of a website for semantic analysis.  With its plethora of options, WinHTTrack adapts easily to each situation.

The bulk of WinHTTrack’s options are available under the “Preferences and mirror options” button, on the same screen where the website URLs are entered.  Below are the recommended options for each scenario; a rough command-line equivalent follows each settings list.

Website Backup

It is ideal to create two archives of a website backup – one virgin download that preserves the original URLs, and one full download with links rewritten for local browsing.

  1. Scan Rules
    • Select all three file wildcards for a full site download
  2. Limits
    • Leave Mirroring depth empty
    • Increase the transfer rate to a proper rate for your Internet connection
  3. Links
    • Check “Get non-HTML files related to a link”
    • Check “Get HTML files first”
  4. Build
    • Check “No external pages”
  5. Spider
    • Set “no robots.txt rules”
  6. Experts Only
    • For the virgin download, select “Original URL” under “Rewrite links”; otherwise select “Relative URI”
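
For reference, WinHTTrack’s GUI settings map loosely onto flags of its command-line sibling, httrack.  The sketch below is an approximation rather than a verified recipe – the URL, output path, and wildcard list are placeholders, and exact flag behavior should be confirmed against httrack --help:

  httrack "http://example.com/" -O "C:\mirrors\example-backup" "+*.png" "+*.gif" "+*.jpg" "+*.zip" "+*.mov" -n -p7 -s0 -x -K

The +*.ext patterns stand in for the Scan Rules wildcards, -n fetches non-HTML files related to a link, -p7 gets HTML files first, -s0 ignores robots.txt, -x replaces external pages with local error pages, and -K keeps the original URLs (omit -K for the second, relative-link copy).  Leaving out -r keeps the mirroring depth at its default, and -A can cap the transfer rate in bytes per second.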

Archiving a Site for Offline Use

When creating an offline site, it’s important to optimize for the target content.  If only certain file types or URL patterns are needed, limit the crawl to those areas.  Use the same settings as Website Backup, except with the following changes:

  1. Scan Rules
    • Limit file type to target content
  2. Build
    • Uncheck “No external pages”
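
As before, a hedged httrack equivalent may help script the job; the URL, output path, and the PDF filter below are purely illustrative assumptions:

  httrack "http://example.com/docs/" -O "C:\mirrors\example-archive" "-*" "+*example.com/docs/*" "+*.pdf" -n -p7 -s0

Here a blanket "-*" exclusion followed by targeted "+" patterns limits the crawl to the content of interest, and dropping -x leaves external links pointing at their live locations instead of local error pages.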

Semantic / Content Analysis

Semantic analysis is mainly concerned with processing the original HTML content.  In this situation, non-HTML files can be ignored.  Use the same settings as Website Backup, except with the following changes:

  1. Scan Rules
    • Do not check any of the additional content types; remove the png, gif, and jpg wildcards
  2. Links
    • Do not check “Get non-HTML files related to a link”
  3. Build
    • Do not check “No external pages”
  4. Experts Only
    • Choose “Original URL” under “Rewrite links”
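
A corresponding httrack sketch for an HTML-only crawl (again, the URL and path are placeholders):

  httrack "http://example.com/" -O "C:\mirrors\example-semantic" "-*.png" "-*.gif" "-*.jpg" -p7 -s0 -K

Excluding the image wildcards and omitting -n restricts the download to HTML content, while -K preserves the original URLs so that links can be analyzed exactly as published.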

With its flexible crawling settings, WinHTTrack is an invaluable tool.  Its support for parsing both JavaScript and CSS allows access to hard-to-reach content that would be difficult to download with a basic web crawler.  For simple jobs where writing a custom crawler would be overkill, the tool can serve as an excellent gateway for downloading content before text or other analysis.  A word to the wise, however – keep an eye on the crawler the first few times you run it; it can sometimes fall into an infinite loop when processing malformed server URLs or content.  While a custom web crawler can offer more stability and performance, WinHTTrack remains an excellent solution for quick download jobs and rapid content analysis.

Written by Andrew Palczewski

About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience managing software development projects.


3 thoughts on “Downloading Websites with WinHTTrack”

  1. How do you run those three setups – do you download each stage separately and let it stop automatically, or do you download once with all the settings at one time?

    1. Tomasz, it depends on your use case. If you just want to take a backup of the website as part of an upgrade, use the first option. If you want a full offline archive of a site (for example, if you are taking a site down for good), use the second option. Finally, for programming semantic algorithms, use the third option.
