Microsoft PowerShell is an incredibly flexible scripting engine, essentially the Swiss Army knife of Windows programming. Although it is the successor to Batch files for Windows automation, it gains significant capability through .NET integration. PowerShell scripts can access SQL databases, execute shell commands, parse file and directory structures, and perform I/O, mostly through one-line commands.
What Windows lacks, however, is the plethora of useful command-line executables that are available on Linux. Fortunately, the GnuWin32 and Cygwin projects have helped port the majority of those applications to Windows. Tools such as GREP, RSYNC, and WGET are so ubiquitous, useful, and easy to install that it would not be surprising to see an automated installer for them in the near future, similar to the YUM or APT-GET package managers on Linux.
The trick to converting HTML to PDF is to leverage one such command-line tool, specifically WKHTMLTOPDF. It can take any web page and render it to PDF using the Qt WebKit rendering engine. A sister command-line program, WKHTMLTOIMAGE, renders the page to an image instead of a PDF.
Although other libraries exist to perform a similar action, WKHTMLTOPDF is unique in that it also supports server authentication and a wide range of command-line arguments. This is useful for automated jobs that need to access data behind a secured server and could not do so otherwise.
The first step to generating PDFs from websites is to download WKHTMLTOPDF and install it on a Windows PC. Precompiled binaries are available for Windows, and more adventurous users can compile the software themselves.
With the program installed, the PowerShell script is relatively simple:
&"c:\Program Files (x86)\wkhtmltopdf\wkhtmltopdf.exe" --username "$USERNAME" --password "$PASSWORD" "$URL" "$OUTPUTFILE"
In this example, the four parameters are variables, so the command can be used within a loop. Since the Qt WebKit rendering engine is an advanced headless browser, it supports NTLM as well as Basic Authentication. Running this command logs in to the server with the username and password, and converts the content of the URL into a PDF saved to the path in $OUTPUTFILE.
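As a rough illustration of the loop idea, the sketch below runs the same command over a small list of pages. The install path is copied from the command above and may differ on your machine. The credentials, URLs, and output paths are hypothetical placeholders, not values from a real system.

```powershell
# Assumed install location (same path as the one-line command); adjust as needed.
$wkhtmltopdf = "C:\Program Files (x86)\wkhtmltopdf\wkhtmltopdf.exe"

# Hypothetical credentials and report list, for illustration only.
$USERNAME = "reportuser"
$PASSWORD = "secret"
$reports = @(
    @{ Url = "http://intranet.example.com/reports/sales";     File = "C:\Reports\sales.pdf" },
    @{ Url = "http://intranet.example.com/reports/inventory"; File = "C:\Reports\inventory.pdf" }
)

foreach ($report in $reports) {
    # Same four arguments as the one-line command, taken from the current list entry.
    & $wkhtmltopdf --username $USERNAME --password $PASSWORD $report.Url $report.File

    # $LASTEXITCODE holds the exit code of the most recent native program.
    if ($LASTEXITCODE -ne 0) {
        Write-Warning "wkhtmltopdf exited with code $LASTEXITCODE for $($report.Url)"
    }
}
```

Checking $LASTEXITCODE after each call is a design choice for unattended jobs: a page that fails to render is reported, and the loop continues with the remaining pages.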
This PDF can then be emailed or archived. Applications that use this technique can automatically generate PDF reports and email them to customers, or even generate and publish content directly to websites. Automated PDF generators can help replace costly software such as Adobe Distiller, and because WKHTMLTOPDF is distributed under the LGPL license, they finally bring PDF generation to the masses.
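One possible way to handle the email step from PowerShell is the built-in Send-MailMessage cmdlet, sketched below. Microsoft's documentation now marks this cmdlet as obsolete, so treat it as a starting point, not a recommendation. The SMTP server, addresses, and file path are hypothetical placeholders.

```powershell
# Hypothetical values for illustration; replace with your own.
$OUTPUTFILE = "C:\Reports\sales.pdf"
$smtpServer = "smtp.example.com"

# Only send the message if the PDF was actually written to disk.
if (Test-Path -LiteralPath $OUTPUTFILE) {
    Send-MailMessage -From "reports@example.com" -To "customer@example.com" `
        -Subject "Your report" -Body "The requested report is attached." `
        -Attachments $OUTPUTFILE -SmtpServer $smtpServer
} else {
    Write-Warning "Expected PDF not found: $OUTPUTFILE"
}
```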
Written by Andrew Palczewski
About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience in managing development of software projects.