Using Perl and Regular Expressions to Process HTML Files - Part 1
John Dixon August 16, 2007
Like many web content authors, over the past few years I've had many
occasions when I've needed to clean up a bunch of HTML files that have
been generated by a word processor or publishing package. At first, I
cleaned up the files manually, opening each one in turn and making the
same set of updates to each one. This works fine when you
only have a few files to fix, but when you have hundreds or even
thousands to do, you can very quickly be looking at weeks or even
months of work. A few years ago someone put me on to the idea of using
Perl and regular expressions to perform this 'cleaning up' process.
Why write an article about Perl and regular expressions, I hear you
say? Well, that’s a good point. After all, the web is full of tutorials
on Perl and regular expressions. When I was trying to find out how to
process HTML files, though, I had trouble finding tutorials that met my
criteria.
I'm not saying they don't exist; I just couldn't find them. Sure, I
could find tutorials that explained everything I needed to know about
regular expressions, and I could find plenty of tutorials about how to
program in Perl, and even how to use regular expressions within Perl
scripts. What I couldn’t find though, was a tutorial that explained how
to open one or more HTML or text files, make updates to those files
using regular expressions, and then save and close the files.
The Goal
When converting documents into HTML, the goal is always to achieve a
seamless conversion from the source document (for example, a word
processor document) to HTML. The last thing you need is for your
content authors to be spending hours, or even days, fixing untidy HTML
code after it has been converted.
Many applications offer excellent tools for converting documents to
HTML and, in combination with a well-designed cascading style sheet
(CSS), can often produce perfect results. Sometimes though, there are
little bits of HTML code that are a bit messy, normally caused by
authors not applying paragraph tags or styles correctly in the source
document.
Why Perl?
Perl is such a good language for this task because it is excellent at
processing text files, which, let's face it, is all that HTML files are.
Perl is also widely regarded as the de facto standard for regular
expressions, which you can use to search for, and replace or change,
bits of text or code in a file.
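As a small illustration of that kind of search and replace, the sketch
below removes empty paragraphs, a common leftover when a word processor
document is converted to HTML. The helper name strip_empty_paragraphs
and the sample markup are made up for this example, and the pattern
assumes simple, unnested paragraph tags; it is not a general HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: remove paragraphs that contain nothing but
# whitespace or &nbsp; entities, plus any whitespace that follows them.
sub strip_empty_paragraphs {
    my ($html) = @_;
    # <p\b[^>]*>       an opening <p> tag, with or without attributes
    # (?:\s|&nbsp;)*   only whitespace or non-breaking spaces inside
    # </p>\s*          the closing tag and trailing whitespace
    $html =~ s{<p\b[^>]*>(?:\s|&nbsp;)*</p>\s*}{}gi;
    return $html;
}

# Sample input, invented for illustration.
my $messy = <<'HTML';
<p>Intro</p>
<p>&nbsp;</p>
<p class="x"> </p>
<p>Body</p>
HTML

# Prints the two non-empty paragraphs, one per line.
print strip_empty_paragraphs($messy);
```

The g modifier applies the replacement to every match in the string,
and the i modifier makes the match case-insensitive, so <P> tags are
handled as well.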
What is Perl?
Perl (often expanded as Practical Extraction and Report Language,
although the name is not officially an acronym) is a general purpose
programming language, which means it is not tied to one kind of task.
Having said that, Perl is very good at doing certain things, and not so
good at others.
Although you could do it, you wouldn’t normally develop a user
interface in Perl, as it would be much easier to use a language like
Visual Basic for that. What Perl is really good at is processing text.
This makes it a great choice for manipulating HTML files.
What is a Regular Expression?
A regular expression is a string that describes or matches a set of
strings, according to certain syntax rules. Regular expressions are not
unique to Perl - many languages, including JavaScript and PHP, can use
them - but Perl builds them directly into the language syntax, which
makes them especially convenient to use.
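To make the idea of "a set of strings" concrete, here is a minimal
sketch in which one short pattern stands for the six HTML heading tags
from <h1> to <h6>. The names $heading_tag and is_heading_tag are
invented for this illustration.

```perl
use strict;
use warnings;

# One pattern that describes exactly six strings: <h1>, <h2>, ... <h6>.
# ^ and $ anchor the match to the start and end of the text, and
# [1-6] stands for any single digit from 1 to 6.
my $heading_tag = qr/^<h[1-6]>$/;

# Hypothetical helper: returns 1 if the text is one of those tags, else 0.
sub is_heading_tag {
    my ($text) = @_;
    return $text =~ $heading_tag ? 1 : 0;
}

# <h1> and <h3> belong to the set; <h7> and <p> do not.
for my $candidate ('<h1>', '<h3>', '<h7>', '<p>') {
    my $verdict = is_heading_tag($candidate) ? 'matches' : 'does not match';
    print "$candidate $verdict\n";
}
```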
In part 2, we'll look at our first example Perl script.