Regular expressions and special characters
Special characters make text processing more complicated because you have to pay close attention to context. If you're looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.
This post goes through an example in detail that shows how to manage special characters in several different contexts.
Escaping special TeX characters
I recently needed to write a regular expression [1] to escape TeX special characters. I'm reading in text like ICD9_CODE and need to make that ICD9\_CODE so that TeX will understand the underscore to be a literal underscore and not a subscript instruction.
Underscore isn't the only special character in TeX. It has ten special characters:
\ { } $ & # ^ _ % ~
The two that people most commonly stumble over are probably $ and % because these are fairly common in ordinary prose. Since % begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.
So whenever my script sees a TeX special character that isn't already escaped, I'd like it to escape it.
Raw strings
First I need to tell Python what the special characters are for TeX:
special = r"\\{}$&#^_%~"
There's something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The r prefix in front of the quotes tells Python this is a "raw" string and that it should not interpret backslashes as special. It's saying "I literally want a string that begins with two backslashes."
Why two backslashes? Wouldn't one do? We're about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.
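A quick way to check what Python actually stores is to measure the strings. This is only a small sketch; the lengths in the comments follow from counting the characters between the quotes.

```python
special = r"\\{}$&#^_%~"

# In a raw string, each backslash typed is a backslash stored.
print(len(r"\\"))    # 2
# In an ordinary string, two typed backslashes collapse to one.
print(len("\\"))     # 1
# Two backslashes plus the nine other TeX special characters.
print(len(special))  # 11
```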
Lookbehind
Here's my regular expression:
re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)
I want special characters that have not already been escaped, so I'm using a negative lookbehind pattern. Negative lookbehind expressions begin with (?<! and end with ). So if, for example, I wanted to look for the string "ball" but only if it's not preceded by "charity" I could use the regular expression
(?<!charity )ball
This expression would match "foot ball" or "foosball" but not "charity ball".
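Here's a short sketch that runs the "ball" example through Python's re module, using the three strings mentioned above.

```python
import re

# Match "ball" only when it is not immediately preceded by "charity ".
pattern = re.compile(r"(?<!charity )ball")

for text in ["foot ball", "foosball", "charity ball"]:
    print(text, "->", bool(pattern.search(text)))
# foot ball -> True
# foosball -> True
# charity ball -> False
```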
Our lookbehind expression is complicated by the fact that the thing we're looking back for is a special character. We're looking for a backslash, which is a special character for regular expressions [2].
After looking behind for a backslash and making sure there isn't one, we look for our special characters. The reason we used two backslashes in defining the variable special is so the regular expression engine would see two backslashes and interpret that as one literal backslash.
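To see why the second backslash matters, here's a sketch comparing the character class built from one backslash with the class built from two. The variable names one and two are just labels for this illustration. With a single backslash, the regular expression engine reads \{ as an escaped brace, so the class contains no backslash at all.

```python
import re

one = r"\{}$&#^_%~"    # one leading backslash (hypothetical variant)
two = r"\\{}$&#^_%~"   # two leading backslashes, as in the post

backslash = "\\"       # a string holding a single backslash character

# With one backslash the class does not contain a backslash.
print(re.search("[" + one + "]", backslash))              # None
# With two backslashes the class matches a literal backslash.
print(re.search("[" + two + "]", backslash) is not None)  # True
```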
Captures
The second argument to re.sub tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.
We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it \\ to represent a literal backslash. When we follow this with \1 for the first capture, the result is \\\1 as above.
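To look at the replacement string in isolation, here's a simplified sketch whose pattern captures only an underscore. That narrower pattern is an assumption made for illustration; the replacement string is the one used above.

```python
import re

replacement = r"\\\1"
# Four characters: three backslashes and the digit 1.
print(len(replacement))  # 4

# \\ becomes one literal backslash, \1 becomes the captured underscore.
result = re.sub(r"(_)", replacement, "ICD9_CODE")
print(result)  # ICD9\_CODE
```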
Testing
We can test our code above with the following.
line = r"a_b $200 {x} %5 x\y"
and get
a\_b \$200 \{x\} \%5 x\\y
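which would cause TeX to produce output that looks like
a_b $200 {x} %5 x\y
with one caveat: LaTeX reads \\ as a line break rather than a literal backslash, so the backslash case needs separate handling.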
Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the r in front of the opening quote.
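Putting the pieces together, here's a minimal self-contained sketch of the test. The function name escape_tex is hypothetical, chosen only for this example; the pattern and replacement are the ones from the post.

```python
import re

special = r"\\{}$&#^_%~"
pattern = re.compile(r"(?<!\\)([" + special + "])")

def escape_tex(line):
    """Hypothetical helper: put a backslash in front of each TeX
    special character not immediately preceded by a backslash."""
    return pattern.sub(r"\\\1", line)

line = r"a_b $200 {x} %5 x\y"
print(escape_tex(line))     # a\_b \$200 \{x\} \%5 x\\y

# The backslash is in the character class too, so it gets escaped
# even when it is already acting as an escape in the input.
print(escape_tex(r"a\_b"))  # a\\_b
```

The second call shows a behavior worth knowing about: the lookbehind leaves the underscore alone, but the backslash in front of it is itself matched and doubled.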
P.S. on raw strings
Note that you don't have to use raw strings. You could just escape your special characters with backslashes. But we've already got a lot of backslashes here, and without raw strings we'd need even more. We'd have to say
special = "\\\\{}$&#^_%~"
starting with four backslashes: Python reduces them to two, and the regular expression engine reads those two as one literal backslash.
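For comparison, here's a sketch of the whole substitution written both ways. The variable names are made up for this example. Every backslash in the raw version is doubled in the ordinary-string version, and the two give the same result.

```python
import re

raw_special = r"\\{}$&#^_%~"
plain_special = "\\\\{}$&#^_%~"
print(raw_special == plain_special)  # True

line = r"a_b $200 {x} %5 x\y"

# Raw strings: what you type is what the regex engine sees.
raw_result = re.sub(r"(?<!\\)([" + raw_special + "])", r"\\\1", line)

# Ordinary strings: each backslash has to be typed twice.
plain_result = re.sub("(?<!\\\\)([" + plain_special + "])", "\\\\\\1", line)

print(raw_result == plain_result)  # True
```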
[1] Whenever I write about using regular expressions someone will complain that my solution isn't completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I'm just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.
[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself "Special to whom?". It's especially confusing here because backslash is special to all three languages.