Mac OS X Command Line 101
by Richard Burton
Advanced Command Line Editing In Mac OS X
Part XIV of this series...
September 6th, 2002
When you tell an Iowan a joke, you can see a kind of race going on between his brain and his expression.
- Bill Bryson, Lost Continent
This series is designed to help you learn more about the Mac OS X command line. If you have any questions about what you read here, check out the earlier columns, write back in the comments below, or join us in the Hardcore X! forum.
In the previous column (vi II: Electric Boogaloo) we left our hero vi (Ben Affleck) trying to create a new command (George C. Scott) that would allow the user (Ozzy Osbourne) to search for plain text (Robert Ludlum) in a way that was both slim and elegant (Calista Flockhart and Diana Rigg) instead of fat and bloated (Steve Ballmer). How will he do it? Will Professor Tinkle discover the secret of the alien invasion? Will the cavalry (The Kiss Army) reach the beleaguered fort on time? Can Teddy the Wonder Ferret steer The HMS Wigglesworth away from the reef and save the princess from the clutches of The Dread Pirate Snodgrass (Eric Idle)? Tune in ... now!
Okay, now that I've taken my medicine, perhaps you should as well. Today we will cover a topic that may give you the cold sweats and curl your nose hair: regular expressions. On the other hand, once you see what you can do with regular expressions, you might get a geekgasm due to the unutterable coolness.
Having several friends who do not speak American English as their native language, I often find myself facing a befuddled look or a guffaw because I will use a phrase that is not taught in English classes. (I would offer an example, but this is a family Web site. What can I say, I grew up with preachers' kids.) In truth, English can be a dangerous language in this regard; it is chock full of expressions. If you hear someone say "That let the cat out of the bag", you know that there was no actual cat in an actual bag; it's just an expression that is not to be taken literally, but within the context of what is happening at that time. Regular expressions are Unix' equivalent to such phrases.
Many Unix tools use regular expressions: ed, ex, vi, emacs, sed, awk, grep, egrep, and so on. Regular expressions were first written for Unix by a man named Ken Thompson, if I remember correctly, with only minimal features. As each tool was written, some authors would enrich the regular expression syntax for that particular tool; unfortunately, that enrichment was not always seen in the other tools that existed. (Perl is probably the king of this; I sometimes see features in its regular expression syntax that I cannot imagine anyone ever needing ... but apparently someone did.) We will cover the basic syntax that is common across all utilities that use regular expressions, then move on from there.
Before we start, open our previous test file with the vi testfile command; once again, we will use it for practice. In command mode, type /the and hit RETURN. This should move the cursor to the next occurrence of the word 'the'. What happens is that vi saw the search command, "/", and noted that it was followed by 'the'. vi then began to search forward for the pattern 'the': a 't', followed by an 'h', followed by an 'e'. Finding this, vi then moved the cursor to the place where this pattern began. While a cool (and time-saving) feature, just matching a plain old string is a bit limited. For example, the pattern 'the' will not only match the word 'the', it will also match the beginning of 'then', the middle of 'apothecary', and the end of 'breathe'. Worse, it would miss the word at the beginning of a sentence, because 'The' is not the same as 'the'. The computer does not store a capital 'T' as the same value as a lowercase 't'. What we really want is something more clever and more flexible, a way to say "look for the next occurrence of the word 'the' and only the word 'the', whether it is at the beginning of a sentence or not". That's where regular expressions come in.
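If you would like to see that limitation outside of vi, here is a minimal sketch in Python. It assumes Python's standard re module, which this column does not otherwise cover, and the word list is made up for illustration rather than taken from testfile.

```python
import re

# Hypothetical sample words; these are not the contents of testfile.
words = ["the", "then", "apothecary", "breathe", "The"]

# re.search looks for the pattern anywhere in the string, much like vi's / search.
matched = [w for w in words if re.search("the", w)]

print(matched)  # ['the', 'then', 'apothecary', 'breathe']
```

Note that 'The' is left out, while three words that merely contain 'the' come along for the ride.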
A regular expression uses symbols to represent text in an abstract way that is not to be taken literally. They are a bit like the metacharacters that the shell uses, but please don't confuse them. For one thing, the syntax is quite different. An asterisk, for example, means something different as a metacharacter than in a regular expression. Also, a metacharacter can be used pretty much anywhere on the command line, while a regular expression can be used only when and where a particular tool/utility is expecting one. (In the case of vi, this is when you hit the "/" command to search forward or "?" command to search backward.)
As you could probably guess from our example above, a letter will match itself, as will a number, and a space matches a space. Obvious, really, but we've all missed obvious things in the past.
What if you want your expression to be more flexible? Suppose you are looking for the next occurrence of any of 'the', 'tee', 'tie', or 'toe'. What you want is a wildcard to say "a 't' followed by any character followed by an 'e'". The period, '.', does this. It represents any character. [*] In vi, you would type /t.e and the cursor would move to the next occurrence of 'the', 'tee', 'tie', or 'toe'.
Or would it? Remember that the '.' can represent any character. What if the word "intrepid" were in the file? That contains a 't', followed by some character (an 'r'), followed by an 'e'. Suddenly that '.' is looking a little too flexible. What you'd really like to say is "a 't'; followed by one of 'h', 'e', 'i', or 'o'; followed by an 'e'". And yes, kiddies, this can be done. We can enclose our list of choices in square brackets thus: /t[heio]e and vi will move the cursor to the next occurrence of 'the', 'tee', 'tie', or 'toe'.
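The difference between the period and the bracketed list can be sketched in Python, assuming its standard re module (an assumption on my part; in vi you would simply type the searches shown above). The word list is hypothetical.

```python
import re

# Hypothetical sample words for illustration.
words = ["the", "tee", "tie", "toe", "intrepid"]

# 't', then any single character, then 'e'.
dot_hits = [w for w in words if re.search("t.e", w)]

# 't', then one of h, e, i, or o, then 'e'.
list_hits = [w for w in words if re.search("t[heio]e", w)]

print(dot_hits)   # ['the', 'tee', 'tie', 'toe', 'intrepid']
print(list_hits)  # ['the', 'tee', 'tie', 'toe']
```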
Two characters have special meaning in a list. Suppose that we want to move to "intrepid", the next occurrence of 't.e' except for 'the', 'tee', 'tie', and 'toe'. It's quite simple; use the square brackets again, but precede the list of letters with a caret, '^'. This tells the tool/utility to match any character that is not in this list. Thus /t[^heio]e will move the cursor to the next occurrence of a 't'; followed by a character that is not an 'h', nor an 'e', nor an 'i', nor an 'o'; followed by an 'e'.
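As a rough sketch of the negated list, here is the same pattern tried in Python, assuming its standard re module rather than vi. The words are made up for the example.

```python
import re

# Hypothetical sample words for illustration.
words = ["the", "tee", "tie", "toe", "intrepid"]

# 't', then any character that is NOT h, e, i, or o, then 'e'.
negated_hits = [w for w in words if re.search("t[^heio]e", w)]

print(negated_hits)  # ['intrepid']
```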
Now, if you wanted to match 'any digit', or 'any letter', you could type [0123456789] or [ABC ... XYZabc ... xyz]. But why? That is a whole lotta typing; there should be an easier way, and there is. A range can be denoted by a dash, '-'. For those who like their geeky goodness, a dash denotes all characters from the first to last, inclusive, in ASCII order. If this means nothing to you, don't worry. Just remember that [0-9] will match any digit (and remember that computers -- well, Unix computers -- start counting at 0). [A-Z] will match any uppercase letter, [a-z] any lowercase letter, and any 'word character' can be matched by [A-Za-z0-9_]. (The underscore, '_', is considered a valid word character in Unix for reasons that are just far too boring to go into.)
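To watch the ranges pick characters out of a string, here is a small sketch in Python, assuming its standard re module. The sample string is invented; re.findall returns every match it finds, one character at a time in this case.

```python
import re

# Hypothetical sample text for illustration.
sample = "Unix_101 rocks!"

digits = re.findall("[0-9]", sample)
uppers = re.findall("[A-Z]", sample)
lowers = re.findall("[a-z]", sample)
word_chars = re.findall("[A-Za-z0-9_]", sample)

print(digits)               # ['1', '0', '1']
print(uppers)               # ['U']
print("".join(lowers))      # nixrocks
print("".join(word_chars))  # Unix_101rocks
```

The space and the exclamation point never show up, since neither falls inside any of the ranges.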
And what if you want to match a list of characters that includes the caret and/or the dash? Just make sure that the dash goes first and the caret isn't first in the list of characters between the square brackets. So, to find either a caret or a dash in vi, type /[-^].
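The same placement trick appears to carry over to other tools. As a sketch, assuming Python's standard re module and a made-up sample string:

```python
import re

# Hypothetical sample text containing both special characters.
sample = "a^b-c"

# The dash comes first and the caret does not, so both are taken literally.
hits = re.findall("[-^]", sample)

print(hits)  # ['^', '-']
```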
As previously mentioned, the asterisk has a particular meaning in a regular expression. It indicates that one must match zero or more of the preceding character. So, in vi, the command /te* will move the cursor to the next occurrence of 't', or 'te', or 'tee', or 'teee', etc. That seems silly, but regular expression symbols can be combined to suit your purpose. Therefore, in vi, typing /[Tt]he .* lawyer will move the cursor to the next place where 'the' or 'The' is followed by a space, followed by any number of any characters (except a newline character), followed by a space, followed by 'lawyer'. This will match 'the slimy lawyer', 'the greedy lawyer', 'The evil, scum-sucking lawyer', and even 'The snowball was thrown at a deserving lawyer but hit another who probably deserved it, being a no-good, thieving, corrupt government lawyer'.
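Here is a minimal sketch of the asterisk at work, written in Python on the assumption that its standard re module behaves like vi for these simple patterns. The sample strings are hypothetical.

```python
import re

# 'e*' means zero or more 'e' characters after the 't'.
no_e = re.search("te*", "stop").group()
many_e = re.search("te*", "a teee party").group()

# Combining symbols: 'the' or 'The', a space, anything, a space, 'lawyer'.
phrase = re.search("[Tt]he .* lawyer", "I met the slimy lawyer today").group()

print(no_e)    # t
print(many_e)  # teee
print(phrase)  # the slimy lawyer
```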
Notice one thing about that last phrase. The word 'lawyer' appears twice in the sentence. Regular expressions tend to be greedy, as the saying goes. They will match as large a chunk of text as they can. This doesn't mean anything when you are just searching for text. However, when we get to more advanced commands that can replace a regular expression with a string, this is worth knowing; you wouldn't want to take out too much text.
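Greediness is easier to believe once you see it. The sketch below uses Python's standard re module (an assumption; vi's own replace commands are not shown here), a made-up sentence, and a placeholder replacement string 'X'.

```python
import re

# Hypothetical sentence with 'lawyer' appearing twice.
sentence = "The snowball hit a lawyer, not the other lawyer, sadly"
pattern = "[Tt]he .* lawyer"

# The match runs all the way to the LAST ' lawyer', not the first one.
greedy = re.search(pattern, sentence).group()

# Replacing the match therefore removes more than you might expect.
replaced = re.sub(pattern, "X", sentence)

print(greedy)    # The snowball hit a lawyer, not the other lawyer
print(replaced)  # X, sadly
```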
So far so good. But wait, there's more. Suppose you want to go to the next place that is matched by /[Tt]he .* lawyer, but only if it appears at the beginning of the line? This is done with the caret, '^'. Yes, yes, I know, we've already used that in a list of characters. I'm sorry, but I didn't make these rules. I only use them, and this is another case of playing the hand you're dealt. Anyway, to force the search to start at the beginning of the line (or 'anchor it'), use the command /^[Tt]he .* lawyer instead. If you want to anchor the search to the end of the line, the dollar sign is used: /[Tt]he .* lawyer$.
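To see the two anchors sort lines differently, here is a sketch in Python, assuming its standard re module. Each string in the hypothetical list stands in for one line of a file.

```python
import re

# Hypothetical lines for illustration.
lines = [
    "The greedy lawyer",
    "I saw the greedy lawyer",
    "The greedy lawyer left",
]

# '^' anchors the match to the beginning of the line.
at_start = [l for l in lines if re.search("^[Tt]he .* lawyer", l)]

# '$' anchors the match to the end of the line.
at_end = [l for l in lines if re.search("[Tt]he .* lawyer$", l)]

print(at_start)  # ['The greedy lawyer', 'The greedy lawyer left']
print(at_end)    # ['The greedy lawyer', 'I saw the greedy lawyer']
```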
At this point, you may be wondering (then again, you may not) about matching one of those special characters. After all, there will come a time when you want to match an actual asterisk or dollar sign, or even a '/'. The "\" acts as an escape character; it tells the regular expression interpreter "The following character may have a special meaning to you. However, ignore that, and use it literally." So, in vi, to advance the cursor to the next occurrence of 'St. Paul', just type /St\. Paul, and of course, it escapes itself, so to match a '\', you enter /\\.
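A quick sketch of why the escape matters, in Python with its standard re module (an assumption, as is the sample text). The r"..." form is a Python raw string, which keeps Python itself from interpreting the backslash before the regular expression sees it.

```python
import re

# Hypothetical sample text for illustration.
text = "Stop by St. Paul, not Stu Paul"

# Unescaped, the period matches any character, so 'Stu Paul' sneaks in.
unescaped = re.findall("St. Paul", text)

# Escaped, the period matches only a literal period.
escaped = re.findall(r"St\. Paul", text)

# The backslash escapes itself; "a\\b" is the three characters a, \, b.
backslashes = re.findall(r"\\", "a\\b")

print(unescaped)         # ['St. Paul', 'Stu Paul']
print(escaped)           # ['St. Paul']
print(len(backslashes))  # 1
```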
These symbols, /, ., *, ^, $, \, [, and ], are the core of Unix regular expressions. The authors of some tools that use regular expressions added features that they found useful. However, they also tried to build on what was done before, so if a symbol has a special meaning in a regular expression for one Unix tool, it will have that special meaning in other tools, or no special meaning at all. For example, ex and vi, which you'll recall is based on ex, allow you to anchor searches at word boundaries. This is done with the \< and \> pairs of symbols, which anchor a match at the beginning and end of a word, respectively. It will be remembered (how could it be forgotten?) that we initially wanted to match the word 'the'. These word-boundary anchors allow us to do that. By typing /\<the, we will match the word 'the' but not the word 'breathe'. /the\> will match 'the' and 'breathe', but not 'theater'. To match the word 'the' and none other, we use /\<the\>. To include matches where 'the' is the first word in a sentence, this becomes /\<[Tt]he\>. As someone who has worked with regular expressions longer than you've had hot meals, I suggest this as the method to build them up, using simple steps to piece together a regular expression that would normally daunt you. In fact, it's a pretty good method to build anything, come to think of it.
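The build-it-up-in-steps approach can be sketched in Python, with one loud caveat: Python's standard re module does not use \< and \>. It uses a single \b for a word boundary at either end, so \b stands in for both vi symbols below. That substitution, and the word list, are my assumptions for the sake of the example.

```python
import re

# Hypothetical sample words for illustration.
words = ["the", "The", "then", "breathe", "theater"]

def hits(pattern):
    """Return the sample words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

step1 = hits(r"\bthe")         # like /\<the     : anchored at the start of a word
step2 = hits(r"the\b")         # like /the\>     : anchored at the end of a word
step3 = hits(r"\bthe\b")       # like /\<the\>   : the whole word only
step4 = hits(r"\b[Tt]he\b")    # like /\<[Tt]he\>: either capitalization

print(step1)  # ['the', 'then', 'theater']
print(step2)  # ['the', 'breathe']
print(step3)  # ['the']
print(step4)  # ['the', 'The']
```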
(On a side note, it is inconsistent that the '\' normally prevents a character from having special meaning, while with anchoring a word, it provides a special meaning for < and >. I don't know why it was done this way, either.)
Regular expressions put a lot of power in your hands. Some of the command line's most powerful tools use them. Beginning with the next column, we will examine one of the most popular ones, grep.
[*] Actually, it represents any character except a newline character, but how often is a word split across different lines like that? That is a nitpicky little detail that only us pathological types who program in Perl worry about. We're funny that way.
You are encouraged to send Richard your comments, or to post them below.
Richard Burton is a longtime Unix programmer and a handsome brute. He spends his spare time yelling at the television during Colts and Pacers games, writing politically incorrect short stories, and trying to shoot the neighbor's cat (not really) nesting in his garage. He can be seen running roughshod over the TMO forums under the alias tbone1.