Using Perl and Regular Expressions to Process HTML Files - Part 3


August 27, 2007

In Part 1 we had a quick look at what Perl and regular expressions are, and introduced the idea of using them to process HTML files. In Part 2 we developed a Perl script to process a single HTML file. In this part we'll look at how to process multiple files.

The script we looked at in Part 2 (script1.pl, repeated below for convenience) has one major drawback that makes it impractical for real use: the name of the web page (HTML file) that the script processes is hard-coded into the script itself. For the script to be useful, we need to be able to run it on any web page. Changing the script so that it can do this is fairly straightforward.

Below, I've given two scripts: script1.pl, which was our original script from Part 2, and script2.pl, which is a new script that will process a list of files.

Note: Due to display considerations, in the example code shown in this article, square brackets '[..]' are used in HTML/script tags instead of angle brackets '<..>'.

script1.pl

1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = [IN]) {
4   $line =~ s/[h1]/[h1 class="big"]/;
5   (print OUT $line);
6 }
7 close (IN);
8 close (OUT);

script2.pl

1 foreach $file (@ARGV) {
2   rename $file, "$file.bak";
3   open (IN, "<$file.bak");
4   open (OUT, ">$file");
5   while ($line = [IN]) {
6     $line =~ s/[h1]/[h1 class="big"]/;
7     (print OUT $line);
8   }
9   close IN;
10   close OUT;
11 }

Before looking at each line of the script in detail, let’s quickly establish what script2.pl does. It processes one or more files entered at the command line prompt (for example, the MS-DOS prompt). For each file entered, the script first makes a backup and then changes the first occurrence of [h1] on each line to [h1 class="big"].

A few quick definitions:

Variable: A temporary storage place for a value. In the above script, $file is a variable. The filename file1.htm, which will be entered at the command line prompt, is a value that will be temporarily stored in that variable when the script is run.

Array: A storage place for a list of values.

Let’s take a look at each line of script2.pl.

Line 1
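This line starts a ‘foreach’ loop over @ARGV, the built-in array that holds everything typed after the script name on the command line. Each filename in turn is stored in the $file variable, which is what enables one or more files to be entered at the command line and processed by the script. We only have one file, ‘file1.htm’, so when we run the script we’ll only enter one file to be processed.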
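To see what Line 1 does on its own, here is a minimal sketch that simply lists whatever was typed after the script name. The script name list_args.pl and the helper describe_args are hypothetical, used only for illustration; they are not part of script2.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one description line per argument.
sub describe_args {
    my @files = @_;
    my @lines;
    my $count = 0;
    foreach my $file (@files) {
        $count++;
        push @lines, "Argument $count: $file";
    }
    return @lines;
}

# @ARGV holds everything typed after the script name on the command line.
print "$_\n" foreach describe_args(@ARGV);
```

Saved as list_args.pl and run as `perl list_args.pl file1.htm file2.htm`, this prints "Argument 1: file1.htm" followed by "Argument 2: file2.htm". Run with no filenames, it prints nothing, because the loop body never executes.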

Line 2
This line makes a backup of each file before processing it. Strictly speaking, it renames the original file by adding ‘.bak’ to its name rather than copying it. So, for ‘file1.htm’, the backup file would be ‘file1.htm.bak’.

Line 3
This line opens a filehandle for reading the backup file, which now holds the original contents of the file being processed. Part 2 of this series of articles gives more information about filehandles.

Line 4
This line opens another filehandle, but this time for the output from the script.

Note: file1.htm.bak will contain the contents of the file from before the script is run. file1.htm will contain the updated contents, that's to say, the output from the script.
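Line 2 of script2.pl does not check whether the rename worked. Perl's rename returns a true value on success and a false value on failure, so one possible way to add a check is sketched below. The helper name make_backup is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: rename $file to "$file.bak" and return the backup name.
# Stops with the system error message ($!) if the rename fails.
sub make_backup {
    my ($file) = @_;
    my $backup = "$file.bak";
    rename $file, $backup
        or die "Cannot rename $file to $backup: $!\n";
    return $backup;
}

foreach my $file (@ARGV) {
    my $backup = make_backup($file);
    print "$file is now stored as $backup\n";
}
```

Note that this sketch only performs the rename step and does not recreate the original file, so it is best tried on a scratch copy.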

Line 5
This line sets up a loop in which each line in the input file (the file being processed) will be examined individually.

Line 6
This is the regular expression. It searches for one occurrence of [h1] on each line of the input file and, if it finds one, changes it to [h1 class="big"].

See Part 2 for a full description of the actual regular expression.
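Because the substitution has no modifiers, only the first match on each line is changed. The sketch below contrasts that with the /g modifier, which replaces every match on the line. It uses real angle brackets rather than this article's square-bracket display convention, and the helper name add_class is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the article's substitution to one line.
# When $global is true, the /g modifier replaces every match on the line.
sub add_class {
    my ($line, $global) = @_;
    if ($global) {
        $line =~ s/<h1>/<h1 class="big">/g;
    } else {
        $line =~ s/<h1>/<h1 class="big">/;
    }
    return $line;
}

my $sample = "<h1>One</h1><h1>Two</h1>\n";
print add_class($sample, 0);   # only the first <h1> changes
print add_class($sample, 1);   # both <h1> tags change
```

Most pages have at most one h1 tag per line, so the difference rarely shows, but it matters when a whole page is stored on a single line.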

Line 7
This line takes the contents of the $line variable and, via the OUT filehandle, writes the line to the output file.

Line 8
This line closes the ‘while’ loop. The loop is repeated until all the lines in the file currently being processed have been examined.

Lines 9 and 10
These two lines close the two filehandles that have been used in the script.

Line 11
This line closes the 'foreach' loop. The loop is repeated until all the files entered at the command line prompt have been processed.
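As written, script2.pl carries on silently if a file cannot be renamed or opened. The following is one possible hardened variant, not the script from this article: it adds 'use strict' and 'use warnings', uses lexical filehandles with the three-argument form of open, and stops with an error message when something fails. It uses real angle brackets, and the helper name process_file is an assumption made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: back up $file as "$file.bak", then rewrite $file
# with the first <h1> on each line changed to <h1 class="big">.
sub process_file {
    my ($file) = @_;
    my $backup = "$file.bak";

    rename $file, $backup       or die "Cannot rename $file: $!\n";
    open(my $in,  '<', $backup) or die "Cannot read $backup: $!\n";
    open(my $out, '>', $file)   or die "Cannot write $file: $!\n";

    # Read the backup one line at a time and write the result to $file.
    while (my $line = <$in>) {
        $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $file: $!\n";
    return $backup;
}

# Process every file named on the command line.
process_file($_) foreach @ARGV;
```

The behaviour on a successful run is the same as script2.pl; the difference is that a missing or read-only file now produces a message instead of an empty output file.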

Running the script

To run the script, at the command line type:

C:\>perl script2.pl file1.htm

If the script executes successfully, a new file called file1.htm.bak should be created, which is a backup of the original file (i.e. the file as it was before processing). A new version of file1.htm should also have been produced, containing the modified [h1] tag.
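One way to confirm the result is to compare the backup with the new file line by line. The sketch below is a hypothetical checking script (the names check_changes.pl and changed_lines are assumptions) that reports the numbers of the lines that differ. It relies on both files having the same number of lines, which holds here because script2.pl writes one output line for every input line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the line numbers at which two files differ.
sub changed_lines {
    my ($old_file, $new_file) = @_;
    open(my $old, '<', $old_file) or die "Cannot read $old_file: $!\n";
    open(my $new, '<', $new_file) or die "Cannot read $new_file: $!\n";

    my @changed;
    my $number = 0;
    while (defined(my $old_line = <$old>)) {
        my $new_line = <$new>;
        $number++;
        push @changed, $number
            if !defined $new_line || $old_line ne $new_line;
    }
    close $old;
    close $new;
    return @changed;
}

# Expects the name of a processed file, e.g. file1.htm.
if (@ARGV == 1) {
    my $file = $ARGV[0];
    my @changed = changed_lines("$file.bak", $file);
    print "Lines changed in $file: @changed\n";
}
```

Saved as check_changes.pl, it would be run as `perl check_changes.pl file1.htm` after script2.pl has finished.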

In Part 4 we'll look at an alternative way of inputting/selecting files for processing.

About the Author: John Dixon is a web developer and technical author. These days, John spends most of his time developing dynamic database-driven websites using PHP and MySQL.