An Introduction to Regular Expressions

Posted by buildmeister on December 8th, 2005
Filed under:  miscellaneous 

Introduction

A decent knowledge of regular expressions (sometimes known as regexps or REs) is a fundamental skill that any buildmeister should acquire. Regular expressions are a small pattern language, supported in similar form by many different programming and scripting languages, that allows you to specify a set of rules for a particular text pattern that you want to match. For example, you might search through a source code file to find all the classes that begin with the string "admin". In this article we will introduce the different aspects of the regular expression language, how you can make use of it, and how different languages support it.

The basics

The simplest form of a regular expression is one that searches for a particular text string without any special characters, thus:

xxx

will find all lines of a file, or parts of a string, that contain three x characters in succession at any point. This is a good start, but it is rare that we know the exact sequence of characters that we wish to match; more often we want to match a sequence of characters only at specific points in a file. For example, if we wanted to match the above string but only if it was at the beginning of a line, we could use the special character ^ as follows:

^xxx

Likewise, you can use the $ character to match the string only if it appears at the end of a line, for example xxx$.
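As a rough illustration in Java, the sketch below tests each line of a small sample text against the patterns xxx, ^xxx and xxx$. The class name AnchorDemo, the helper linesMatching and the sample text are all made up for this example. Each line is checked separately, much as a line-oriented search tool would do, so ^ and $ here anchor to the start and end of each individual line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AnchorDemo {
    // Returns the lines of text in which the regex is found somewhere.
    // (Hypothetical helper written for this illustration.)
    static List<String> linesMatching(String regex, String text) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\n")) {
            // find() looks for the pattern anywhere in the line,
            // unless the pattern itself contains an anchor.
            if (p.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xxx first\nmiddle xxx here\nlast xxx";
        System.out.println(linesMatching("xxx", text).size()); // 3
        System.out.println(linesMatching("^xxx", text));       // [xxx first]
        System.out.println(linesMatching("xxx$", text));       // [last xxx]
    }
}
```

Only the unanchored pattern matches the middle line, because there the x's sit neither at the start nor at the end.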

Wild Cards

Regular expressions can include a number of wild card symbols. A period can be used to specify exactly one arbitrary character, for example:

x.

will find all strings containing the letter x followed by any character, i.e. xa, xb, xc and so on. Asterisks can be used to specify the repetition of an expression zero or more times, for example:

abx*

will find all strings containing the letters ab followed by zero or more x's, i.e. ab, abx, abxxxx and so on. Note that the asterisk applies only to the element immediately before it, here the x. Similarly, the + sign specifies repetition one or more times, for example:

abx+

will find all strings containing the letters ab followed by one or more x's, i.e. abx, abxx and so on.
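The three patterns above can be tried out with a short Java sketch. The class QuantifierDemo and its findAll helper are hypothetical names chosen for this example; the only library calls used are Pattern.compile, Matcher.find and Matcher.group from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group()); // the text matched by the whole pattern
        }
        return found;
    }

    public static void main(String[] args) {
        // A period stands for exactly one arbitrary character.
        System.out.println(findAll("x.", "xa xb xc"));        // [xa, xb, xc]
        // * allows zero or more x's after "ab".
        System.out.println(findAll("abx*", "ab abx abxxxx")); // [ab, abx, abxxxx]
        // + requires at least one x, so the bare "ab" is skipped.
        System.out.println(findAll("abx+", "ab abx abxxxx")); // [abx, abxxxx]
    }
}
```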

Sets of characters can be specified by enclosing the characters in square brackets, for example:

x[ab]

will find the strings xa or xb. Ranges of characters can be specified using the - character, for example:

x[a-c]

will find the strings xa, xb or xc. You can also use the ^ character to specify the characters that do not belong to the set, for example:

x[^a-b]

will match xc, xd and so on, but not xa or xb. Finally, you can use the | character to specify an OR relationship for regular expressions, for example:

xa|xb

will find the strings xa or xb. If you want to match any of these special characters literally, without its special meaning, you can escape it using the \ character; for example, \. matches a period.
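A minimal Java sketch of sets, ranges, negated sets, alternation and escaping follows. CharacterSetDemo and fullMatch are names invented for this example. Pattern.matches tests whether the entire candidate string matches, which keeps the results easy to read. Note that inside a Java string literal the backslash must itself be doubled.

```java
import java.util.regex.Pattern;

class CharacterSetDemo {
    // True only if the whole candidate string matches the regex.
    static boolean fullMatch(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("x[ab]", "xa"));   // true
        System.out.println(fullMatch("x[ab]", "xc"));   // false: c is not in the set
        System.out.println(fullMatch("x[a-c]", "xc"));  // true: c is inside the range
        System.out.println(fullMatch("x[^a-b]", "xa")); // false: a is excluded
        System.out.println(fullMatch("x[^a-b]", "xd")); // true
        System.out.println(fullMatch("xa|xb", "xb"));   // true: second alternative
        // The regex \. is written "\\." in Java source.
        System.out.println(fullMatch("a\\.b", "a.b"));  // true: literal period
        System.out.println(fullMatch("a\\.b", "axb"));  // false: escaped period is not a wild card
    }
}
```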

Pattern Symbols

The regular expression language has a number of pre-defined symbols that match sets or groupings of characters. For example, the \d symbol will match any digit and, conversely, the \D symbol will match any non-digit. The \w symbol will match any word character, that is a letter, a digit or the underscore character (equivalent to the set [a-zA-Z_0-9]), and \W will match any non-word character. Finally, the \s symbol will match any whitespace character (tabs, spaces, newlines and so on) whilst \S will match any non-whitespace character. A number of whitespace characters can also be matched explicitly: for example, to match a tab you can use \t, for a newline \n and for a carriage return \r.
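The following Java sketch exercises a few of these symbols. The class PredefinedClassDemo, its method names and the sample strings are assumptions made for illustration; String.replaceAll, String.split and Pattern.matches are standard library methods that accept regular expressions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Removes every non-digit (\D), leaving only the digits.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Replaces every non-word character (\W) with an underscore.
    static String sanitise(String s) {
        return s.replaceAll("\\W", "_");
    }

    // Splits on runs of one or more whitespace characters (\s+).
    static String[] words(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("release 42"));                     // 42
        System.out.println(sanitise("build-2005.log"));                   // build_2005_log
        System.out.println(Arrays.toString(words("one \t two\nthree")));  // [one, two, three]
        System.out.println(Pattern.matches("\\w+", "build_01"));          // true
        System.out.println(Pattern.matches("\\w+", "build-01"));          // false: - is not a word character
    }
}
```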

Expression   Meaning
x            Match one letter x
^            Match at beginning of line
$            Match at end of line
.            Match any character (except newline)
*            Match zero or more times
+            Match one or more times
?            Match zero or one time
[xxx]        Match a set of characters
[a-z]        Match a range of characters
[^xxx]       Match characters that do not belong to the set
a|b          Match either a or b
\d           Match any digit [0-9]
\D           Match any non-digit [^0-9]
\w           Match any word character [a-zA-Z_0-9]
\W           Match any non-word character
\s           Match any whitespace character
\S           Match any non-whitespace character
\x           Cancel the special meaning of the character x and match it exactly
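The ? entry in the table is not demonstrated elsewhere in this article, so here is a small Java sketch of it. The class OptionalDemo and the pattern colou?r are examples chosen for this illustration only; the ? makes the preceding u optional.

```java
import java.util.regex.Pattern;

class OptionalDemo {
    // True if the whole word matches colou?r, where the u may
    // appear once or not at all.
    static boolean isColour(String word) {
        return Pattern.matches("colou?r", word);
    }

    public static void main(String[] args) {
        System.out.println(isColour("color"));   // true: zero occurrences of u
        System.out.println(isColour("colour"));  // true: one occurrence of u
        System.out.println(isColour("colouur")); // false: two is too many
    }
}
```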

Some Examples

In Perl you can specify regular expressions between / delimiters (you can change the delimiter if you wish), and you add modifiers after the closing /. For example, to carry out a global search and replace of the pattern xxx with yyy you could use the expression:

s/xxx/yyy/g

To match the contents of a variable against a regular expression, you can use the =~ operator as in the following code:

my $str = "Hello World";
$str =~ s/Hello/Goodbye/;   # replace the first occurrence of Hello
print "$str\n";             # prints: Goodbye World

Java 2 version 1.4 and onwards includes the regular expression package java.util.regex. In Java you specify a regular expression as a String, which must first be compiled into an instance of the class Pattern. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. For example:

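import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("Hello");
Matcher m = p.matcher("Hello World");
String replaced = m.replaceAll("Goodbye"); // "Goodbye World"
boolean b = m.matches(); // false: matches() succeeds only if the entire input matches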
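To tie this back to the example from the introduction, the sketch below scans some source text for class declarations whose names begin with "admin". The class AdminClassFinder, the method findAdminClasses and the sample source are hypothetical, and the pattern is a deliberately simple one that assumes the keyword class is followed by whitespace and then the class name. Unlike matches(), the find() method looks for the pattern anywhere in the input and can be called repeatedly to step through successive matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdminClassFinder {
    // Compiled once and reused: the word "class", some whitespace,
    // then a name starting with "admin" and continuing with word characters.
    private static final Pattern ADMIN_CLASS =
            Pattern.compile("class\\s+admin\\w*");

    // Returns each matching declaration found in the source text.
    static List<String> findAdminClasses(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = ADMIN_CLASS.matcher(source);
        while (m.find()) {          // advance to the next match, if any
            found.add(m.group());   // text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "class adminUser {}\nclass Report {}\nclass adminTools {}";
        System.out.println(findAdminClasses(source));
        // [class adminUser, class adminTools]
    }
}
```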

Visual Basic, C++ and C# amongst other languages can make use of the shared regular expression library provided under Microsoft's .NET framework.

Summary

This article was intended as an introduction to using regular expressions. The table above summarises the language. There is more to the regular expression language, and in fact complete books have been written on the subject, so I encourage you to view the references listed below in more detail.

References
