Introduction to regular expressions in R

Note: These are lecture notes from my undergraduate economic forecasting class. It might not be a self-contained post.

Regular expressions are not really something you can learn from a short example. The purpose of this post is to motivate the use of regular expressions (regex) and give you an idea of how they can be used. The best way to learn regex is probably by doing a bunch of text processing in Perl (the library has some good Perl books if you really want to learn a new programming language). Since most of you aren’t looking to learn a new language, I’ll use R instead.

What are regex?

You can think of regular expressions as a (very peculiar) language for specifying patterns in text. If you’ve ever used Twitter, you’ve seen the notorious “hashtag”. Suppose someone tweets this:

It’s so much fun to watch #football with #friends

The hashtag has a special status. It’s not just part of the text. It has to be turned into a clickable link that brings up all other recent tweets with that hashtag. How can we identify #football and #friends inside a tweet?

At first glance, this doesn’t seem very difficult. It’s not hard to search a tweet for #football. The problem is that it would take a long time to search for every possible hashtag anyone could use, in any language, every time they tweet. Not only that, but searching for #friends would also return tweets with #friendship. That wouldn’t be the end of the world given their similar meanings. On the other hand, #eat and #eats**t have somewhat different meanings, so you can’t settle for finding text that merely starts with a particular hashtag. You need an exact match.
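To make the problem concrete, here is a small sketch in R of a plain text search (no regex yet). The two tweets are made up for illustration, and grepl with fixed = TRUE just asks whether the literal text appears anywhere in each string.

```r
# Two made-up tweets, one per hashtag
tweets <- c("Watching the game with #friends",
            "A story about #friendship")

# fixed = TRUE searches for the literal text, not a pattern
found <- grepl("#friends", tweets, fixed = TRUE)
print(found)  # TRUE TRUE, even though only the first tweet uses #friends
```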

The solution is to specify a pattern to look for rather than a piece of raw text. We could use this pattern (without actually writing the regex itself, which would scare you):

  1. # followed by
  2. a group of letters

Let’s test whether it would detect both hashtags in the tweet above:

  • It detects #football.
  • It detects #friends.

Now ask whether it would only identify what you want:

  • kstate#football meets both criteria.

If you’re a Twitter user, you know that is not what we’re after. We really need this:

  1. Space followed by
  2. # followed by
  3. a group of letters

That continues to detect both of our hashtags, and only our hashtags. We’ve now ruled out kstate#football. Upon further thought, our pattern was a bit limited. Not every hashtag will be preceded by a space. Consider this tweet:

#lovelyday It’s so much fun to watch #football with #friends

At this point you probably think tweets with three hashtags and one of them at the start of the tweet suck. You might be right, but that doesn’t mean it’s an invalid use of the hashtag. We need to modify our pattern to this:

  1. Start of line or space followed by
  2. # followed by
  3. a group of letters

This pattern (which is nothing but precise thought about the definition of a hashtag) can be represented like this:

  1. (^|\s)
  2. #
  3. [a-z]+

Here’s the interpretation of each piece:

  • (^|\s). ^ identifies the start of the line. \s identifies a whitespace character. | means one or the other. () means to consider the stuff inside as a group.
  • # is self-explanatory.
  • [a-z] means a lowercase letter. + means to match the previous item (a lowercase letter) one or more times, as many times as possible. This is a greedy match.
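A quick way to see what the last piece does is to ask R for the text it matches. This is only a sketch: first_match is a helper name I made up, built on regexpr (which finds the first match) and regmatches (which pulls out the matched text), and the sample strings are mine.

```r
# regexpr() locates the first match; regmatches() returns the matched text
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text))
}

# Greedy: + takes every letter it can, not just the first one
greedy <- first_match("#[a-z]+", "#football")
print(greedy)        # "#football"

# The match stops at the first character that is not a lowercase letter
stopped <- first_match("#[a-z]+", "#football2024")
print(stopped)       # "#football"

# An uppercase letter right after # leaves nothing to match
none <- first_match("#[a-z]+", "#Football")
print(length(none))  # 0
```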

If you want to allow lower or upper case in hashtags, you can replace 3. with

  3. [a-zA-Z]+

The [a-zA-Z] specifies any lower or upper case letter. Putting it all together, we have this regex, which is actually one of the better looking ones:

(^|\s)#[a-zA-Z]+

Rather than running away screaming when you encounter a regex, you should remember the first rule of reading one:

Break it down into individual components rather than looking at the entire thing.

It typically isn’t so bad when you do that. It’s kind of like the first time you play quarterback. It throws you off when the players are chasing you, but after a while you get used to it.
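The rule can be tried out directly in R. The sketch below builds the hashtag regex from its three components and checks it against a few made-up strings. The variable names are mine, and grepl simply reports whether a pattern matches anywhere in the text.

```r
# The three components, written as R strings
start_or_space <- "(^|\\s)"
hash           <- "#"
letters_part   <- "[a-zA-Z]+"

# Gluing the components together gives the full regex
hashtag <- paste0(start_or_space, hash, letters_part)
cat(hashtag, "\n")  # shows (^|\s)#[a-zA-Z]+

texts <- c("watch #football", "kstate#football", "#lovelyday is here")

# Without the first component, the pattern is too permissive
loose <- grepl(paste0(hash, letters_part), texts)
# With it, the # must sit at the start of the string or after whitespace
strict <- grepl(hashtag, texts)

print(loose)   # TRUE TRUE TRUE
print(strict)  # TRUE FALSE TRUE
```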

Just to see how you’d use a regex in R code, let’s replace all hashtags in a tweet with the literal phrase %hashtag%. Here’s the tweet:

#happy I passed my macro #exam thanks to #studying

And here’s the code to do the replacement:

# "\\1" puts back the space (if any) matched by (^|\\s), so the words stay separated
gsub("(^|\\s)#[a-zA-Z]+", "\\1%hashtag%",
  "#happy I passed my macro #exam thanks to #studying")

gsub is used for “global substitution”. The first argument is the regex pattern to look for, the second is the replacement, and the third is the original string needing the replacement. Two options are commonly used: perl=TRUE says to use the same regex engine as Perl, and fixed=TRUE means to treat the first argument as a literal string rather than a regex pattern.

Note that I used a double backslash for whitespace: \\s. That’s a technical issue with strings in R (and other languages). Inside a quoted string, a backslash starts an escape sequence (\n for a newline, for example), and R rejects a plain \s with an error because it is not a valid escape. \\s is treated as a literal \s inside the string, which is what the regex engine needs to see. Confusing as that might be, just double your backslashes inside your regexes and you’ll have what you want.
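If you want to convince yourself, the sketch below checks what "\\s" really contains and then tries the two gsub options mentioned above. The sample strings are mine, and the lookbehind (?<!\\S) is just one example of a feature that, as far as I know, needs perl=TRUE. It is not something used elsewhere in these notes.

```r
pattern <- "\\s"

# The string holds two characters: a backslash and the letter s
print(nchar(pattern))  # 2

# So it behaves as the regex \s and matches whitespace
has_space <- grepl(pattern, c("two words", "oneword"))
print(has_space)       # TRUE FALSE

# By default the pattern is a regex, where . means "any character"
as_regex <- gsub(".", "-", "3.14")
print(as_regex)        # "----"

# fixed = TRUE treats the pattern as plain text, so only the period changes
as_text <- gsub(".", "-", "3.14", fixed = TRUE)
print(as_text)         # "3-14"

# perl = TRUE switches to the Perl-compatible engine, which supports
# lookbehind: (?<!\\S) means "not preceded by a non-space character"
tweet <- "#happy I passed my macro #exam thanks to #studying"
with_perl <- gsub("(?<!\\S)#[a-zA-Z]+", "%hashtag%", tweet, perl = TRUE)
print(with_perl)       # "%hashtag% I passed my macro %hashtag% thanks to %hashtag%"
```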

How are they used for data analysis?

Hopefully you now understand what regexes are and what they’re good for, even if you’re not an expert on their use. Three common tasks in data analysis are:

  • Searching: Does a string contain a certain text pattern?
  • Replacing: Our example above demonstrates replacement.
  • Capturing: This is a more advanced use. You can take the string you match, transform it in some way, and even insert it back into the original string.
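Here is a minimal sketch of the three tasks using the hashtag pattern from earlier. The tweets and the [tag:...] output format are made up for illustration.

```r
tweets <- c("#happy I passed my macro #exam", "No tags in this one")
pattern <- "(^|\\s)#[a-zA-Z]+"

# Searching: does each string contain the pattern?
has_tag <- grepl(pattern, tweets)
print(has_tag)   # TRUE FALSE

# Replacing: swap every # plus letters for fixed text
replaced <- gsub("#[a-zA-Z]+", "%hashtag%", tweets[1])
print(replaced)  # "%hashtag% I passed my macro %hashtag%"

# Capturing: each pair of parentheses remembers what it matched.
# \\1 is the start-of-line or space, \\2 is the letters after the #.
captured <- gsub("(^|\\s)#([a-zA-Z]+)", "\\1[tag:\\2]", tweets[1])
print(captured)  # "[tag:happy] I passed my macro [tag:exam]"
```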

A very common task is to prepare data for reading into R. Suppose you download the following data in a csv file:

7.4,3.8 (R)
2.9,4.2
3.8,6.7
8.1,1.4 (1)
9.2,1.7 (2)
5.9,6.2 (P)

The provider of the data has kindly added (R) to denote revised data, (1) to point you to footnote 1, (2) to point you to footnote 2, and (P) to denote that the last observation is preliminary.

Just to be clear, it’s good to have that information, and you need to know it. It just doesn’t help you much when you want to do the data analysis. You could apply this gsub call to remove that information from the first line:

gsub('\\s\\([[:alnum:]]\\)', '', '7.4,3.8 (R)')

For the entire file:

rawdata <- "7.4,3.8 (R)
2.9,4.2
3.8,6.7
8.1,1.4 (1)
9.2,1.7 (2)
5.9,6.2 (P)"
newdata <- gsub('\\s\\([[:alnum:]]\\)', '', rawdata)
cat(newdata)
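Once the markers are gone, the text can go straight into a data frame. The sketch below is one way to do it, assuming the same six lines as above: read.csv accepts a string through its text argument, and the note column is my own addition to show that the markers can be kept rather than thrown away.

```r
rawdata <- paste(c("7.4,3.8 (R)", "2.9,4.2", "3.8,6.7",
                   "8.1,1.4 (1)", "9.2,1.7 (2)", "5.9,6.2 (P)"),
                 collapse = "\n")
newdata <- gsub("\\s\\([[:alnum:]]\\)", "", rawdata)

# read.csv() can parse a string directly through its text argument
dat <- read.csv(text = newdata, header = FALSE)

# Optional: keep the markers in their own column.
# Split the original text into lines and capture the character in parentheses.
lines <- strsplit(rawdata, "\n")[[1]]
marked <- grepl("\\([[:alnum:]]\\)$", lines)
dat$note <- ifelse(marked,
                   sub(".*\\(([[:alnum:]])\\)$", "\\1", lines),
                   NA_character_)
print(dat)
```

The columns get the default names V1 and V2 because the file has no header row.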

Some functions that might be of interest

  • readLines if you want to read a data file in without attempting to parse it. To read in a file data.csv without treating it as data, you can call readLines("data.csv") and it will return a vector where each line is a string. You can then transform each line as needed to remove the unwanted information.

    In the rare case where this doesn’t work, because you need the entire file as one big string, you can do paste(readLines("data.csv"), collapse="\n"). The call to paste converts the strings in the vector back into a single string, separated by a line feed. I’d add a simpler function that does this to the tstools package but I’ve never encountered a case where I’ve needed to do it. readLines has always worked.

  • sub replaces the first match to your pattern. gsub replaces all matches.

  • grep, grepl, regexpr, gregexpr and regexec offer different ways to do searches.

  • regmatches for working with the text that matches your regex.

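Help for these functions is available from within R: ?readLines, ?grep (which also covers sub, gsub, grepl, regexpr, gregexpr and regexec) and ?regmatches. Basic information about regex in R is in ?regex.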
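The sketch below runs several of these functions on a few lines of the data from earlier. To keep it self-contained it writes the lines to a temporary file first; in practice you would call readLines on your own file.

```r
# Write a small sample file so readLines() has something to read
path <- tempfile(fileext = ".csv")
writeLines(c("7.4,3.8 (R)", "2.9,4.2", "8.1,1.4 (1)"), path)

# readLines() returns one string per line, with no parsing
x <- readLines(path)
unlink(path)  # remove the temporary file

pat <- "\\([[:alnum:]]\\)"

idx   <- grep(pat, x)    # which lines match: 1 3
flags <- grepl(pat, x)   # TRUE/FALSE for every line: TRUE FALSE TRUE
pos   <- regexpr(pat, x) # where the first match starts, -1 if none

# regmatches() pulls out the matched text for the lines that matched
found <- regmatches(x, pos)
print(found)             # "(R)" "(1)"

# Clean every line, then collapse to one string if you ever need that
clean <- gsub("\\s\\([[:alnum:]]\\)", "", x)
one_string <- paste(clean, collapse = "\n")
cat(one_string, "\n")
```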

How can you get better at regex?

I use the following two-step strategy:

  • Realize I need a regex.
  • Search for the parts I don’t know how to specify.

The complexity of advanced regex is such that almost any pattern you need will come up if you search correctly. This is one of the many good regex sites out there.