Windows PowerShell offers advanced command-line scripting, bringing the .NET Framework to traditional batch files. One of its primary drawbacks, however, is the scarcity of good documentation available for the language. The regular expression engine, which PowerShell inherits from .NET, is powerful, but its syntax for advanced operations such as multi-line search and replace is rarely covered in PowerShell documentation.
Multi-line regular expressions are particularly useful when parsing HTML or XML documents. Since XML elements can contain multiple lines between their opening and closing tags, a more flexible engine is required when parsing and altering these documents.
To replace multi-line strings using PowerShell, the first step is to load the text of the target file into memory. This can be accomplished in one line of code, either by calling .NET Framework functions or by reading the file with Get-Content and its -Raw switch (available in PowerShell 3.0 and later). Since the .NET Framework method provides more flexibility, we will use it in this example:
[IO.Directory]::SetCurrentDirectory((Convert-Path (Get-Location -PSProvider FileSystem)))
$filetxt = [IO.File]::ReadAllText("[File Name]")
The first line, though not required, can provide better path referencing by setting the .NET current directory to match PowerShell's current location. Without this command, the .NET Framework may use a different local folder, such as the system root, and wreak havoc with relative paths. Next, the ReadAllText function takes a file path and stores the file’s entire contents in a string variable.
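The raw read mentioned earlier can be sketched as follows. This is a minimal example that assumes PowerShell 3.0 or later, where Get-Content supports the -Raw switch; the temporary file and its sample content are hypothetical stand-ins for the real target file.

```powershell
# Hypothetical sample file so the sketch is self-contained.
$path = Join-Path ([IO.Path]::GetTempPath()) "sample-settings.xml"
[IO.File]::WriteAllText($path, "<Settings>`r`n  <Setting Name=""A"" />`r`n</Settings>`r`n")

# Without -Raw, Get-Content returns an array with one string per line.
$lines = Get-Content -LiteralPath $path

# With -Raw (PowerShell 3.0+), the whole file comes back as a single string,
# line breaks included, which is what a multi-line regular expression needs.
$filetxt = Get-Content -LiteralPath $path -Raw

$lines.Count              # 3
$filetxt.GetType().Name   # String
```

Either approach leaves the complete file text in $filetxt, so the replacement commands that follow work the same way.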
With the file loaded into memory, the regular expression commands can begin. First, we will review an example of a standard, single-line regular expression:
$filetxt = ($filetxt -replace ".*<Setting Name=""ConnectionString"".*/>.*", "")
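This command will blank out every line containing the ConnectionString Setting in the target file; the line feed itself is not matched, so an empty line remains. The “.*” syntax is a wildcard: “.” matches any character except a line feed, and “*” repeats it zero or more times. Inside a double-quoted PowerShell string, each literal double-quote must be escaped by doubling it.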
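To see the single-line behavior in isolation, the same pattern can be run against a short in-memory string. The sample XML below is hypothetical and only stands in for the loaded file.

```powershell
# Hypothetical sample text standing in for the loaded file.
$sample = "<Settings>`n  <Setting Name=""ConnectionString"" Value=""x"" />`n  <Setting Name=""Other"" Value=""y"" />`n</Settings>"

# Same pattern as above. Without a mode modifier, "." never matches a line feed,
# so each match is confined to a single line.
$result = $sample -replace ".*<Setting Name=""ConnectionString"".*/>.*", ""

# The ConnectionString line is emptied, but its line feed is left behind,
# so the output shows a blank line where the setting used to be.
$result
```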
Multi-line regular expressions, on the other hand, require a regular expression mode modifier, and custom wildcard syntax:
$filetxt = ($filetxt -replace "(?ms)^\s+<Setting Name=""ConnectionString"".*?</Setting>", "")
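The “(?ms)” prefix is called a “Mode Modifier”. The “m” (multi-line) modifier makes the “^” and “$” anchors match at the start and end of every line, rather than only at the start and end of the whole string. The “s” (single-line) modifier makes the “.” wildcard match line breaks as well, so that “.*” can span multiple lines.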
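The effect of each modifier can be checked separately on a two-line string. This is a small sketch; the variable names are chosen for illustration only.

```powershell
$text = "first`nsecond"

# "m" changes the anchors: ^ matches at the start of every line,
# not just at the start of the string.
$withoutM = [regex]::Matches($text, "^\w+").Count       # 1
$withM    = [regex]::Matches($text, "(?m)^\w+").Count   # 2

# "s" changes the dot: "." also matches the line feed,
# so ".*" can cross from one line to the next.
$withoutS = $text -match "first.*second"        # False
$withS    = $text -match "(?s)first.*second"    # True
```

The two modifiers are independent, which is why the multi-line replacement combines them: “m” lets “^” anchor to the start of the Setting line, and “s” lets the wildcard run through to the closing tag.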
By default, a “.*” wildcard is greedy: it matches the largest possible expression, which in this case would extend to the last “</Setting>” in the file. To match the smallest possible expression instead, a question mark is added to the wildcard: “.*?”. The question mark at the end of the wildcard signifies that the parser should try to minimize the length of the wildcard match. It’s generally recommended to include only one such wildcard in a multi-line regular expression, since multiple multi-line wildcards can produce unexpected behavior.
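The difference between the two wildcard forms is easiest to see side by side. The following sketch runs both against a hypothetical settings document with two Setting elements.

```powershell
# Hypothetical sample document with two multi-line Setting elements.
$sample = @"
<Settings>
  <Setting Name="ConnectionString">
    <Value>one</Value>
  </Setting>
  <Setting Name="Timeout">
    <Value>30</Value>
  </Setting>
</Settings>
"@

# Greedy ".*" runs to the LAST </Setting>, so the Timeout element is removed too.
$greedy = $sample -replace "(?ms)^\s+<Setting Name=""ConnectionString"".*</Setting>", ""

# Lazy ".*?" stops at the FIRST </Setting>, removing only the ConnectionString element.
$lazy = $sample -replace "(?ms)^\s+<Setting Name=""ConnectionString"".*?</Setting>", ""

$greedy -match "Timeout"   # False
$lazy -match "Timeout"     # True
```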
Finally, with the regular expression operations complete, the in-memory string can be written back to disk:
Set-Content -Path "[File Name]" -Value $filetxt
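One detail worth knowing is that Set-Content appends a line break after the value, so a file read with ReadAllText and written back this way gains an extra trailing line break. The sketch below compares it with [IO.File]::WriteAllText, which writes the string as-is; the temporary file is a hypothetical stand-in for the real target.

```powershell
# Hypothetical temp file standing in for the real target.
$path = Join-Path ([IO.Path]::GetTempPath()) "writeback-sample.txt"
$filetxt = "line one`r`nline two`r`n"

# Set-Content appends a line break after the value by default.
Set-Content -LiteralPath $path -Value $filetxt
$afterSetContent = [IO.File]::ReadAllText($path)

# WriteAllText writes the string exactly as it is held in memory.
[IO.File]::WriteAllText($path, $filetxt)
$afterWriteAllText = [IO.File]::ReadAllText($path)

$afterSetContent.Length - $filetxt.Length   # extra characters added by Set-Content
$afterWriteAllText -ceq $filetxt            # True
```

In PowerShell 5.0 and later, Set-Content also accepts a -NoNewline switch that suppresses the extra line break. Default encodings differ between the two approaches and between PowerShell versions, so it may be worth specifying an encoding explicitly when the file contains non-ASCII characters.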
Since multi-line regular expression replacement is often even more volatile than single-line regular expression replacement, it is highly recommended to store a backup of any files that will be processed. Even with this volatility, however, a well-written PowerShell regex program can save significant development time, and streamline both deployment and data processing operations.
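A backup step can be as simple as copying the file before it is modified. This is a minimal sketch, assuming a “.bak” suffix as the naming convention; the temporary file and its contents are hypothetical.

```powershell
# Hypothetical temp file standing in for the real target.
$path = Join-Path ([IO.Path]::GetTempPath()) "backup-sample.txt"
[IO.File]::WriteAllText($path, "original content")

# Keep an untouched copy next to the file before any replacement runs.
# The ".bak" suffix is an arbitrary choice for this sketch.
$backup = "$path.bak"
Copy-Item -LiteralPath $path -Destination $backup -Force

# The regex replacement and write-back would happen here.
[IO.File]::WriteAllText($path, "modified content")

# If the result turns out to be wrong, restore the original from the backup.
Copy-Item -LiteralPath $backup -Destination $path -Force
$restored = [IO.File]::ReadAllText($path)
```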
Written by Andrew Palczewski
About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience in managing development of software projects.