Three common Regular Expression gotchas


Background / problem statement

Recently I had the pleasure of debugging a gnarly issue that came up during a routine version bump of Alpine Linux’s Elixir package.

And since I’m quite fond of regular expressions[1] (aka regexps), it made me reflect a little on a few common regular expression gotchas[2].

I easily came up with three. I’ll discuss them and some possible mitigations.

Trap 1: Accidental character range

This is one of the stars of the aforementioned Elixir issue[3].

When creating a “bracket expression” ([]), any two characters within the brackets separated by - are considered a (character) range.

Hardly surprising stuff… and usually “bread and butter” of regexps: [a-zA-Z0-9].

But things get slightly more complicated with named character classes (e.g. [:alnum:]) and wilder animals (other edge cases).

And while the regex(7) man page explicitly points out:

To include a literal ‘]’ in the list, make it the first character (following a possible ‘^’). To include a literal ‘-’, make it the first or last character, or the second endpoint of a range.

and:

With the exception of these and some combinations using ‘[’ (see next paragraphs), all other special characters, including ‘\’, lose their special significance within a bracket expression.

It’s still easy to make a mistake.

For example this one: [^/-_].

The intent is to match all characters but [/, -, _], right?

Can you smell the rat?

Well, oops[4]:

# busybox sed v1.31.1 & v1.36.0
$ ruby -e 'puts (32..126).map(&:chr).join' | sed 's,[^/-_],,g'
/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

# sed (GNU sed) 4.7
$ ruby -e 'puts (32..126).map(&:chr).join' | sed 's,[^/-_],,g'
/:;<=>?@[\]^_

# Ruby 2.5.1p57
$ ruby -e 'puts (32..126).map(&:chr).join.gsub(%r"[^/-_]", "")'
/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

# Perl5
$ ruby -e 'puts (32..126).map(&:chr).join' | perl -pe 's,[^/-_],,g'; echo
/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

That’s quite an additional zoo that slipped through the accidental range. Plus, GNU sed behaves differently[5] from the rest of the bunch.

Because what [^/-_] really says is: anything not between / (ascii 47) and _ (ascii 95), which is quite a lot:

# ruby
p (?/..?_).to_a.join # -> "/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"

Which explains the output above for all but GNU sed. GNU sed just chose to be a special snowflake. ;)

Anyway, fixing the accidental range (moving the - to the end), we get the expected result, consistent across the board:

$ ruby -e 'puts (32..126).map(&:chr).join' | sed 's,[^/_-],,g'
-/_
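To convince yourself that the hyphen placement really is the whole story, here is a minimal Ruby sketch. The constant and method names are made up for illustration, and it only exercises Ruby's own regexp engine, where (unlike the POSIX rules quoted from regex(7)) a backslash inside the brackets does escape the hyphen.

```ruby
# All printable ascii characters, space (32) through tilde (126).
PRINTABLE = (32..126).map(&:chr).join

# Three spellings that keep the hyphen literal in Ruby.
HYPHEN_LAST    = %r{[^/_-]}   # hyphen as the last character
HYPHEN_FIRST   = %r{[^-/_]}   # hyphen right after the negating ^
HYPHEN_ESCAPED = %r{[^/\-_]}  # Ruby-specific: \- is a literal hyphen

# The accidental range from the text, for comparison.
ACCIDENTAL = %r{[^/-_]}

# Delete everything the regexp matches and return what is left.
def survivors(regexp)
  PRINTABLE.gsub(regexp, "")
end

[HYPHEN_LAST, HYPHEN_FIRST, HYPHEN_ESCAPED].each do |re|
  puts "#{re.source.ljust(8)} keeps #{survivors(re).inspect}"
end
puts "#{ACCIDENTAL.source.ljust(8)} keeps #{survivors(ACCIDENTAL).length} characters"
```

All three safe spellings should leave just the three intended characters, while the accidental range keeps everything from / through _.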

Trap 2: Improper escaping

Improper escaping is closely related to the previous issue. Remember the regex(7) snippet mentioning that \ should lose special significance?

Yes.

Let’s explain this, then:

# GNU sed
$ echo "abc/-\\~%" | sed 's/[^a-zA-Z0-9_\-\/]/\\&/g'
abc/\-\\~\%

# busybox 1.31.1
$ echo "abc/-\\~%" | sed 's/[^a-zA-Z0-9_\-\/]/\\&/g'
abc/\-\\~\%

# busybox 1.36.0
$ echo "abc/-\\~%" | sed 's/[^a-zA-Z0-9_\-\/]/\\&/g'
sed: bad regex '[^a-zA-Z0-9_\-/]': Invalid character range

The problem here is that the inside of the bracket expression can be read as containing either \-\ (a range containing only \) or _-/ (an invalid range between ascii 95 and ascii 47).

On top of that, the environment around the regexp (shell quoting, string literals) often influences the escaping rules in non-trivial fashion.

A famous example is:

$ ruby -e 'puts "a\\b\\\\c"'
a\b\\c
$ ruby -e 'puts "a\\b\\\\c"' | sed 's@\\@\\\\@g'
a\\b\\\\c
$ ruby -e 'puts "a\\b\\\\c".gsub(/\\/, "\\")'
a\b\\c
$ ruby -e 'puts "a\\b\\\\c".gsub(/\\/, "\\\\")'
a\b\\c
$ ruby -e 'puts "a\\b\\\\c".gsub(/\\/, "\\\\\\")'
a\\b\\\\c
$ ruby -e 'puts "a\\b\\\\c".gsub(/\\/, "\\\\\\\\")'
a\\b\\\\c

The lack of any difference within each of the two pairs of Ruby invocations is particularly delicious.
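One way to sidestep the double interpretation in Ruby is the block form of gsub: the block's return value is inserted verbatim, while a replacement string is scanned for backslash sequences first. A minimal sketch, with method names invented for illustration:

```ruby
INPUT = "a\\b\\\\c" # six characters: a \ b \ \ c

def double_with_string(str)
  # The replacement holds two backslash characters, which gsub
  # reads as one escaped backslash, so nothing changes.
  str.gsub(/\\/, "\\\\")
end

def double_with_block(str)
  # The block result is used as-is: two backslash characters per match.
  str.gsub(/\\/) { "\\\\" }
end

puts double_with_string(INPUT) # prints a\b\\c
puts double_with_block(INPUT)  # prints a\\b\\\\c
```

The string literal still needs its own level of escaping, but at least there is only one layer left to count.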

Anyway, for the seds of the world, my advice would be to use a different separator than the notorious slash. Especially if all your sed usage is for the s/this/that/ bit.

That’s why I’m fond of @ as a separator (as shown above); but any other printable 7-bit ascii character will do[6]. Bonus points when it doesn’t clash with regex syntax. Say, [/,@#-], etc.

But for the rest of them? Automated tests are pretty much the only way to be sure.

Regardless of how skilled one is, regular expressions have sharp edges. And when inserted into production without accompanying tests, one of them will eventually bite you in the ass[7].
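Such a test does not have to be fancy. Below is a rough, table-driven sketch in plain Ruby. Both escape_unsafe and the set of "safe" characters (letters, digits, underscore, slash, hyphen) are assumptions for the sake of illustration, loosely modelled on the sed one-liner above; they are not taken from the actual Elixir packaging.

```ruby
# Assumed safe set: letters, digits, underscore, slash and hyphen.
# The hyphen goes last so it cannot form a range by accident.
UNSAFE = %r{[^a-zA-Z0-9_/-]}

# Hypothetical helper: prefix every unsafe character with a backslash.
def escape_unsafe(str)
  # Block form, so gsub does not reinterpret the backslash in the result.
  str.gsub(UNSAFE) { |ch| "\\#{ch}" }
end

# Input => expected output, including the awkward characters.
CASES = {
  "abc"     => "abc",
  "a/b-c_d" => "a/b-c_d",
  "a~b"     => "a\\~b",
  "50%"     => "50\\%",
  "a\\b"    => "a\\\\b",
}

failures = CASES.reject { |input, expected| escape_unsafe(input) == expected }

if failures.empty?
  puts "all #{CASES.size} cases pass"
else
  failures.each do |input, expected|
    puts "FAIL #{input.inspect}: expected #{expected.inspect}, got #{escape_unsafe(input).inspect}"
  end
end
```

The point is less the helper itself and more that the hyphen, the slash and the backslash each get an explicit case, so a regression like the accidental range shows up as a failing line rather than a page.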

Trap 3: Improper anchoring

This trap should be obvious, and yet… not really.

If I had to speculate, it afflicts people using grep more than others.

^You are so used to the anchors being beginning and end$ that you forget that there’s a whole world of edge cases in between.

For example:

Both examples assume that ^ is the start of input while it (based on the context) might be a start of a line instead. Ditto for $ being the end.

And it’s hard to prescribe a universal solution. Every regexp engine under the sun is slightly different. For Ruby it’s \A and \z as the proper anchors. For Java it depends on the method[9]. For shell utilities it varies still.
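For the Ruby case, the difference is easy to demonstrate. A minimal sketch, with made-up inputs, comparing line anchors against whole-string anchors:

```ruby
LINE_ANCHORED   = /^[a-z]+$/    # ^ and $ match at every line boundary in Ruby
STRING_ANCHORED = /\A[a-z]+\z/  # \A and \z match only at the very start and end

INPUTS = ["ok", "ok\n", "ok\n<script>", "<script>\nok"]

INPUTS.each do |input|
  puts format("%-16s ^...$: %-5s \\A...\\z: %s",
              input.inspect,
              LINE_ANCHORED.match?(input),
              STRING_ANCHORED.match?(input))
end
```

Only the plain "ok" passes both checks. The line-anchored version accepts all four inputs, because a single well-behaved line anywhere in the string is enough to satisfy it.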

Automated testing is – again – the only durable solution.

For the record – for quick checks in Ruby, I’m fond of the Rubular web service[10].

Closing words

Those were my top three regular expression gotchas.

I feel a little silly for having nothing better to offer as a mitigation than tests and RTFM.

But it is fitting: with powerful tools (and regexes certainly fit the bill) come many dangers. Not all obvious.

So: engage safety squints!

  1. Some evil tongues even say I’m too fond of them.

  2. Or, as Dave Jones of eevblog says: traps for the young players.

  3. Accidentally assigned #12345 id. I could not complain about this happy little accident. :)

  4. The Ruby snippet generates all ascii chars between 32 (space) and 126 (tilde).

  5. One could say it’s even more magical than others.

  6. Half expecting some zealot to jump at me that any single byte would do. And they’d be right (ruby -e 'puts "a\\b\\\\c"' | sed "$(ruby -e 'print "s\x1\\\\\x1\\\\\\\\\x1g"')"). But is that really wise?

  7. I meant… page you at 4:30am on Sunday, when you’re trying to sleep off one too many margaritas.

  8. Perpetrator of which shall remain nameless. But you know who you are. ;)

  9. And even normal methods like Matcher#replaceAll have additional gotchas.

  10. Because it allows quick prototyping (and then even permalinking your regexp and test input).