Regular Expressions
Introduction
Basic Syntax
Four Basic Symbols
Brace Yourself
OR Operator
The Usefulness of Brackets
The Backlash in PHP
Regular Expressions Examples
Introduction
Regular Expressions might as well be called Nightmare Expressions, since mastering them is anything but easy. However, programmers should keep in mind that avoiding regular expressions may lead to chunks of code that are more tedious to write than any regular expression.
Basic Syntax
If you are using Linux, the egrep command is a blessing for anyone who wants to practice regular expressions. If you aren't using Linux, you can still follow what's going on. Open a Linux console and use touch to create a file called band_names. Then add these names to the file:
Brocas Helm
Judas Priest
Cauldron Born
Iron Maiden
Black Sabbath
Dokken
Blue Oyster Cult
Accept
Deep Purple
Nightwish
UDO
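One way to create the file from the console is sketched below. It rests on two assumptions that are not part of the steps above: the file is written into a scratch directory made with mktemp (held in a variable named workdir here), so nothing in your current directory is touched, and printf is used to add the names instead of a text editor.

```shell
# Scratch directory so the example leaves the current directory alone
# (an assumption of this sketch, not something the article requires).
workdir=$(mktemp -d)

# touch creates the empty file, as described above.
touch "$workdir/band_names"

# Write one band name per line.
printf '%s\n' \
  'Brocas Helm' 'Judas Priest' 'Cauldron Born' 'Iron Maiden' \
  'Black Sabbath' 'Dokken' 'Blue Oyster Cult' 'Accept' \
  'Deep Purple' 'Nightwish' 'UDO' > "$workdir/band_names"

# Count the lines to confirm that every name was written.
# tr strips the padding that some versions of wc add.
line_count=$(wc -l < "$workdir/band_names" | tr -d ' ')
echo "$line_count names written to $workdir/band_names"
```

If you would rather follow the article literally, drop the workdir prefix and the file will be created in the directory you are in.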
If you like hard rock, some of these names will sound familiar to you. But the purpose of this article is not testing your knowledge of rock music. Suppose you want to know if the heavy metal band Cauldron Born is in the list. This is what you type:
The output will be: Cauldron Born
Here -i is an option that tells egrep not to be case sensitive. If we typed egrep -i 'born' instead, the output would be the same. Sometimes, however, we want to know if a line begins with a certain word. Typing cat band_names | egrep -i '^born' yields no output, since the line begins with the word Cauldron, not Born. Anything that follows the special symbol '^' must be at the start of the line or string. Therefore, cat band_names | egrep -i '^cauldron' will have Cauldron Born as output.
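The searches described above can be tried with the following minimal sketch. It assumes the list is held in a shell variable (called bands here purely for illustration) instead of being read from band_names, and it uses grep -E, which is equivalent to egrep.

```shell
# The band list from the article, one name per line.
bands=$(printf '%s\n' 'Brocas Helm' 'Judas Priest' 'Cauldron Born' 'Iron Maiden' \
  'Black Sabbath' 'Dokken' 'Blue Oyster Cult' 'Accept' 'Deep Purple' 'Nightwish' 'UDO')

# -i ignores case, so 'born' matches the capitalised "Born".
anywhere=$(printf '%s\n' "$bands" | grep -E -i 'born')

# ^ ties the match to the start of the line; no line starts with "born".
# "|| true" is there because grep exits with status 1 when nothing matches.
at_start=$(printf '%s\n' "$bands" | grep -E -i '^born' || true)

# "cauldron" does appear at the start of a line.
cauldron=$(printf '%s\n' "$bands" | grep -E -i '^cauldron')

echo "born:      $anywhere"
echo "^born:     ${at_start:-(no output)}"
echo "^cauldron: $cauldron"
```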
But what if we want to find a word or phrase at the end of a string or expression? The symbol '^' would be useless. What we now need is the symbol '$', like this: cat band_names | egrep -i 'priest$'
And the output will be the name of a legendary heavy metal band: Judas Priest.
Now we know that ^ and $ mark the starts and ends of strings. Using cat band_names | egrep -i '^deep purple$' will produce the name Deep Purple.
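As a rough sketch of how the two anchors behave, alone and together, the snippet below runs the patterns against the same list. The variable names are made up for this example, and grep -E stands in for egrep.

```shell
# The band list from the article, one name per line.
bands=$(printf '%s\n' 'Brocas Helm' 'Judas Priest' 'Cauldron Born' 'Iron Maiden' \
  'Black Sabbath' 'Dokken' 'Blue Oyster Cult' 'Accept' 'Deep Purple' 'Nightwish' 'UDO')

# $ ties the match to the end of the line.
at_end=$(printf '%s\n' "$bands" | grep -E -i 'priest$')

# ^ and $ together: the whole line must be exactly "deep purple" (any case).
whole_line=$(printf '%s\n' "$bands" | grep -E -i '^deep purple$')

# With both anchors a partial name is not enough, so nothing is printed.
# "|| true" is there because grep exits with status 1 when nothing matches.
partial=$(printf '%s\n' "$bands" | grep -E -i '^deep$' || true)

echo "priest\$:       $at_end"
echo "^deep purple\$: $whole_line"
echo "^deep\$:        ${partial:-(no output)}"
```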
Four Basic Symbols
The symbols '.', '*', '+', and '?' are extremely useful in regular expressions. We begin with the simplest one: the dot. It stands for 'any single character'. Let's see a few examples using egrep:
egrep -i '^i.o' band_names --> outputs: Iron Maiden
egrep -i 'e.m$' band_names --> outputs: Brocas Helm
egrep -i 'o..e' band_names --> outputs: Dokken
The symbol '*' works differently from '.': it means 'zero or more' and applies to the single character that precedes it. Hence the regular expression 'a*t' matches zero or more a's followed by a 't' (t, at, aat, aaat, aaaat...). With the expression 'ab*', anything from 'a' to 'abb' to 'abbbbbbbbbbbbbbbbb' (or even more b's) can be a match. It is worth mentioning that '.*' stands for zero or more characters of any kind.
The plus sign '+' is a little less forgiving than '*' since it means 'one or more'. That being the case 'ab+' can mean 'ab', and it can mean 'abbb(even more b's)', but it never means 'a'.
Finally we have '?'. The nature of this symbol in everyday writing should tell you, more or less, what it stands for in regular expressions: the preceding character is optional. The expression 'ab?', for example, means there's an a, and there might be a b.
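The difference between '*', '+', and '?' is easiest to see side by side. The sketch below is only an illustration: the sample strings are invented for it, grep -E is used in place of egrep, and each pattern is wrapped in ^ and $ so that the whole line has to match (without the anchors, grep would print any line that merely contains a match).

```shell
# Sample strings, one per line (invented for this illustration).
samples=$(printf '%s\n' 'a' 'ab' 'abbb' 'b')

# * : an a followed by zero or more b's.
star=$(printf '%s\n' "$samples" | grep -E '^ab*$')

# + : an a followed by at least one b, so a bare "a" no longer matches.
plus=$(printf '%s\n' "$samples" | grep -E '^ab+$')

# ? : an a followed by at most one b.
optional=$(printf '%s\n' "$samples" | grep -E '^ab?$')

echo "ab* matches:"; echo "$star"
echo "ab+ matches:"; echo "$plus"
echo "ab? matches:"; echo "$optional"
```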
Brace Yourself
Symbols like '*', '+', '.', and '?' can be very useful, but sometimes we need more precision in what we are searching for. We already know that 'ab+' can mean 'ab' as well as 'abbbbbbbb', but what if we are searching for a string that has exactly two b's? In cases like this, braces {} are very useful. Braces are more precise than '*' and '+' because they tell egrep how many times a character may repeat. Therefore, abb can be replaced by ab{2}, and ab* can be replaced by ab{0,}. Notice that there's no number after the comma, which means there is no upper limit: the a may be followed by any number of b's, or by none. You can be more specific than this and type ab{0,2} (a, ab, or abb) or ab{1,2} (ab or abb).
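echo 'Abbba is great' | egrep -i 'b{2}a' --> outputs: Abbba is great. The pattern is not anchored, so it matches the last two b's and the a that follows them. With an anchor, as in '^ab{2}a', there's no output, since three b's follow the first a instead of two.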
echo 'Abba is great' | egrep -i 'b{3}a' --> There's no output; there are two b's, and we need three.
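The brace forms can be compared with a short sketch. The sample strings and variable names are assumptions made for this example, grep -E replaces egrep, and the patterns are wrapped in ^ and $ so that the repetition count applies to the whole line.

```shell
# Sample strings, one per line (invented for this illustration).
samples=$(printf '%s\n' 'a' 'ab' 'abb' 'abbb')

# {2} : exactly two b's.
exactly_two=$(printf '%s\n' "$samples" | grep -E '^ab{2}$')

# {0,2} : between zero and two b's.
up_to_two=$(printf '%s\n' "$samples" | grep -E '^ab{0,2}$')

# {1,2} : one or two b's.
one_or_two=$(printf '%s\n' "$samples" | grep -E '^ab{1,2}$')

# {0,} : zero or more b's, the same as ab*.
open_ended=$(printf '%s\n' "$samples" | grep -E '^ab{0,}$')

echo "ab{2}:";   echo "$exactly_two"
echo "ab{0,2}:"; echo "$up_to_two"
echo "ab{1,2}:"; echo "$one_or_two"
echo "ab{0,}:";  echo "$open_ended"
```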
We will encounter situations where we want to find a combination of characters repeating themselves. We already know that if we use something like ab+ or ab* only the letter b is affected. We can solve this problem using parentheses. Having (ab)+ means there's one or more repetitions of 'ab' in the string.
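To see what the parentheses change, here is a small sketch that runs the grouped and the ungrouped pattern over the same strings. The strings are invented for the example, grep -E is used in place of egrep, and ^ and $ force the whole line to match.

```shell
# Sample strings, one per line (invented for this illustration).
samples=$(printf '%s\n' 'ab' 'abab' 'abb' 'a')

# (ab)+ : the pair "ab" repeated one or more times.
grouped=$(printf '%s\n' "$samples" | grep -E '^(ab)+$')

# ab+ : one a followed by one or more b's; only the b is repeated.
ungrouped=$(printf '%s\n' "$samples" | grep -E '^ab+$')

echo "(ab)+ matches:"; echo "$grouped"
echo "ab+ matches:";   echo "$ungrouped"
```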
OR Operator
When it comes to PHP, || stands for the OR operator. In regular expressions, the symbol '|' performs the same function. If you want to find Judas Priest and Iron Maiden from the file called band_names, this is what you type:
egrep -i '^(ju){1}|(en)+$' band_names --> outputs: Judas Priest, Iron Maiden, Dokken. Dokken shows up because it also ends in 'en'. If you can't understand why, it'd be a good idea to re-read this tutorial before you proceed.
egrep -i 'b(r|l)' band_names --> outputs: Brocas Helm, Black Sabbath, Blue Oyster Cult
The Usefulness of Brackets
If you can understand everything so far, and you can understand the role of brackets in regular expressions, everything else will be easy for you. Brackets specify which characters are allowed in a single position of a string. Typing '[ok]' is no different than typing (o|k). Likewise typing [abc] is a replacement for (a|b|c). But brackets aren't just a replacement for '|'. The expression [a-c] is the same as [abc], and typing [a-z] will save you from the hassle of having to type (a|b|c|d|e|f...etc). A single pair of brackets can also hold several ranges. For example, [a-zA-Z0-9] covers all the alphanumeric characters.
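A brief sketch of brackets in action follows. The variable names are made up for the example and grep -E is used in place of egrep. The second search sets LC_ALL=C, an assumption added here so that the ranges a-z and A-Z mean plain ASCII letters regardless of your system's locale.

```shell
# The band list from the article, one name per line.
bands=$(printf '%s\n' 'Brocas Helm' 'Judas Priest' 'Cauldron Born' 'Iron Maiden' \
  'Black Sabbath' 'Dokken' 'Blue Oyster Cult' 'Accept' 'Deep Purple' 'Nightwish' 'UDO')

# [rl] and (r|l) both mean "an r or an l" in that position.
with_brackets=$(printf '%s\n' "$bands" | grep -E -i 'b[rl]')
with_pipe=$(printf '%s\n' "$bands" | grep -E -i 'b(r|l)')

# Lines made up only of letters and digits, so names with spaces drop out.
alnum_only=$(printf '%s\n' "$bands" | LC_ALL=C grep -E '^[a-zA-Z0-9]+$')

echo "b[rl] matches:"; echo "$with_brackets"
echo "b(r|l) matches:"; echo "$with_pipe"
echo "^[a-zA-Z0-9]+\$ matches:"; echo "$alnum_only"
```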