Wednesday, June 18, 2008

Handling Greediness In Regular Expression Matching Using Python re

Regular expressions matches are by default greedy . I've been trying to brush up regular expressions and have been playing around with the Python re package. While attempting some exercises from Core Python Programming - Wesley Chun , I ran into the greedy behavior myself and had some fun trying to debug it. Here is my story :

1. Suppose you want to match an address that starts with the string 'F-277' , where 'F' denotes some category and 277 represents the house number. This format is not cast in stone however, someone may choose to write down the address as "Apt No 277" .

2. My first regular expression was r'(\w+-?)\s?(\w*)\s?(\d+)' . In plain English, this represents the following "Match "any number of alphanumeric characters (\w+), followed by an optional hyphen (-), followed by zero or 1 white spaces (\s?) followed by another optional set of alphanumeric characters (\w*) followed by zero or 1 whitespace (\s?), followed by any number of numeric characters (\d+) .

3. This expression matches both the above addresses fine :
import re
re_exp = r'\w+-?\s?\w*\s?\d+'

>>> re.match(re_exp, 'F-277').group() 
F-277 
>>> re.match(re_exp, 'Apt No 277').group() 

'Apt No 277' 

 4. All is fine and dandy until we decide we want to extract the house number. To do this, we use grouping . The modified regular expression is : r'\w+-?\s?\w*\s?(\d+)

>> re.match(r'\w+-?\s?\w*\s?(\d+)', 'F-277').groups()  
('7',)

oops - what just happened here ? Even though we made the entire numerical part into a single group, the \d+ portion of the regular expression simply matches 7 as opposed to 277.

5. A little brain racking and experimenting reveals that the earlier part of the regular expression (\w*) was greedily matching part of the number.  

>>> re.match(r'\w+-?\s?(\w*)(\d+)', 'F-277').groups()
('27', '7').

As you can see \w* greedily matched '27' in '277' leaving the '7' to be matched by '\d+' . Why did it not match the whole number ? - one may wonder. This is because of backtracking, explained beautifully here (Section - Watch out for the Greediness) .

6. One way to fix this is to force the offending '*' operator in \w* to go lazy. We do this by putting a '?' after the '*' operator. Here is the result of our match :

>>> re.match(r'\w+-?\s?\w*?\s?(\d+)', 'F-277').groups() [0]  
'277'

There is a better way - using negated character classes. This avoids backtracking as explained here.
>> re_exp = r'\w+-?\s?([^\d]\w)*\s?(\d+)
>> re.match(re_exp, 'F-277').groups()[1]
 '277'

7. We verify that the original expressions still match .  
>>> re.match(re_exp, 'F-277').group() 
'F-277'

>>> re.match(re_exp, 'Flat No 277').group() 
'Flat No 277'

0 comments: