Wednesday, June 18, 2008

Handling Greediness In Regular Expression Matching Using Python re

Regular expression matches are greedy by default. I've been brushing up on regular expressions and playing around with the Python re package. While attempting some exercises from Core Python Programming by Wesley Chun, I ran into the greedy behavior myself and had some fun debugging it. Here is my story:

1. Suppose you want to match an address such as 'F-277', where 'F' denotes some category and 277 is the house number. This format is not cast in stone, however; someone may choose to write the address as 'Apt No 277'.

2. My first regular expression was r'(\w+-?)\s?(\w*)\s?(\d+)'. In plain English: match one or more alphanumeric characters (\w+), followed by an optional hyphen (-?), followed by zero or one whitespace character (\s?), followed by an optional run of alphanumeric characters (\w*), followed by zero or one whitespace character (\s?), followed by one or more digits (\d+).
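
To make the plain-English reading easier to check, here is a small sketch that spells out the same pattern with re.VERBOSE, so each piece carries its own comment. The name ADDRESS_RE is only an illustrative choice of mine.

```python
import re

# Same pattern as in step 2, written with re.VERBOSE so every piece is annotated.
# ADDRESS_RE is a hypothetical name used only for this illustration.
ADDRESS_RE = re.compile(r"""
    (\w+-?)   # one or more word characters, then an optional hyphen
    \s?       # zero or one whitespace character
    (\w*)     # zero or more word characters
    \s?       # zero or one whitespace character
    (\d+)     # one or more digits
""", re.VERBOSE)

for address in ("F-277", "Apt No 277"):
    match = ADDRESS_RE.match(address)
    print(address, "->", match.group())
```

In verbose mode, whitespace and anything after a # inside the pattern are ignored, so the regular expression itself is unchanged.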

3. This expression (written here without the groups) matches both of the above addresses fine:
>>> import re
>>> re_exp = r'\w+-?\s?\w*\s?\d+'

>>> re.match(re_exp, 'F-277').group()
'F-277'
>>> re.match(re_exp, 'Apt No 277').group()
'Apt No 277'

4. All is fine and dandy until we decide we want to extract the house number. To do this, we use grouping. The modified regular expression is r'\w+-?\s?\w*\s?(\d+)':

>>> re.match(r'\w+-?\s?\w*\s?(\d+)', 'F-277').groups()
('7',)

Oops - what just happened here? Even though we made the entire numerical part into a single group, the \d+ portion of the regular expression matches only '7' as opposed to '277'.

5. A little brain racking and experimenting reveals that the earlier part of the regular expression (\w*) was greedily matching part of the number.  

>>> re.match(r'\w+-?\s?(\w*)(\d+)', 'F-277').groups()
('27', '7')

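As you can see, \w* greedily matched '27' in '277', leaving the '7' to be matched by \d+. Why did it not match the whole number, one may wonder? This is because of backtracking: \w* first grabs all of '277', but \d+ still needs at least one digit, so the engine makes \w* give characters back one at a time until the rest of the pattern can match. This is explained beautifully here (Section - Watch out for the Greediness).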
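
A quick sketch of that give-back behavior, using a cut-down pattern of my own (just the two groups, nothing else): whatever the input, \w* ends up keeping everything except the single trailing digit that \d+ needs.

```python
import re

# \w* starts by taking every character, then hands back just enough
# (here, one digit) for \d+ to succeed.
results = {}
for text in ('277', '12345', 'ab9'):
    results[text] = re.match(r'(\w*)(\d+)', text).groups()
    print(text, results[text])
```

For '277' this gives the same ('27', '7') split seen above.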

6. One way to fix this is to force the offending '*' operator in \w* to go lazy. We do this by putting a '?' after the '*' operator. Here is the result of our match :

>>> re.match(r'\w+-?\s?\w*?\s?(\d+)', 'F-277').groups()[0]
'277'
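
As a sanity check (a sketch, not part of the original exercise), the lazy version can be run against both address formats from step 1 to confirm that the house number comes out whole in each case. The helper name house_number is my own.

```python
import re

LAZY_RE = r'\w+-?\s?\w*?\s?(\d+)'

def house_number(address):
    """Return the digits captured by the lazy pattern, or None if there is no match."""
    match = re.match(LAZY_RE, address)
    return match.group(1) if match else None

for address in ('F-277', 'Apt No 277'):
    print(address, '->', house_number(address))
```

With \w*? the engine tries the shortest possible match first and only extends it when the rest of the pattern fails, which is why 'No' is still consumed in the second address.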

Another option is to use negated character classes, which avoids backtracking as explained here.
>>> re_exp = r'\w+-?\s?([^\d]\w)*\s?(\d+)'
>>> re.match(re_exp, 'F-277').groups()[1]
'277'

7. We verify that addresses in both formats still match:
>>> re.match(re_exp, 'F-277').group() 
'F-277'

>>> re.match(re_exp, 'Flat No 277').group() 
'Flat No 277'
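
One caveat worth checking: the whole match is right, but the captured house number may not be. The group ([^\d]\w)* consumes characters two at a time, and [^\d] also matches a space, so in 'Flat No 277' it can swallow ' 2' and leave only '77' for \d+. The sketch below compares it with an alternative of my own, [^\W\d]* (word characters that are not digits), which is an assumption on my part and not taken from the references above.

```python
import re

# Pattern from step 6 (negated character class version).
original = r'\w+-?\s?([^\d]\w)*\s?(\d+)'
# Assumed alternative (not from the original post): word characters that are not digits.
alternative = r'\w+-?\s?[^\W\d]*\s?(\d+)'

numbers = {}
for address in ('F-277', 'Apt No 277', 'Flat No 277'):
    old_number = re.match(original, address).groups()[1]
    new_number = re.match(alternative, address).group(1)
    numbers[address] = (old_number, new_number)
    print(address, '| original:', old_number, '| alternative:', new_number)
```

The alternative still starts with a greedy \w+, so an address like 'F277' with no separator would need more thought.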

Tuesday, June 10, 2008

Automating MiKTeX using Python, win32com on Windows

I downloaded MiKTeX, a LaTeX implementation, yesterday with the goal of starting to learn to write LaTeX code. Unfortunately my roving eye caught the word "SDK" and off I went figuring out how to externally connect to and tinker with MiKTeX. When I found it used COM, I was even more excited, having tinkered with COM and VirtualBox recently.

This was extra fun since I had wanted to see how to do all this using Python for some time now. Mark Hammond's excellent win32com Python extension makes it super easy to write COM clients using Python.

After some experimenting around, I was able to write the Python equivalent of a couple of the C++ samples provided by the MiKTeX SDK.

STEPS :

1. Install Python Win32 Extensions from https://sourceforge.net/projects/pywin32/

2. Build COM library specific to MiKTeX.Session

   a. Run C:\Python25\Lib\site-packages\win32com\client\makepy.py

   b. Select "MiKTeX.Session" in the list presented.

makepy.py writes out the relevant files to the Python installation.


CODE : (Save as a Python file / just type in at the IDLE prompt)

import win32com.client as w

# Instantiate the COM object
c = w.Dispatch("MiKTeX.Session")
path = ''

# Calling FindFile
print(c.FindFile("xunicode.sty", path))
# (True, u'C:\\Program Files\\MiKTeX 2.7\\tex\\xelatex\\xunicode\\xunicode.sty')

# Getting installation information
print(c.GetMiKTeXSetupInfo().installRoot)
# u'C:\\Program Files\\MiKTeX 2.7'

print(c.GetMiKTeXSetupInfo().commonConfigRoot)
# u'C:\\Documents and Settings\\All Users\\Application Data\\MiKTeX\\2.7'


# Calling FindPkFile
print(c.FindPkFile('cmr10', 'ljfour', '600'))
# (True, u'C:\\Documents and Settings\\All Users\\Application Data\\MiKTeX\\2.7\\fonts\\pk\\ljfour\\public\\cm\\dpi600\\cmr10.pk')


Now back to learning TeX!

UPDATE : I fixed several typos that made the above code almost unusable. Should work now.

Tuesday, June 3, 2008

Building and Running the VirtualBox SDK sample using COM, VC++

VirtualBox is Sun Microsystems' desktop virtualization software, which allows you to do cool stuff like running Linux right inside Windows without having to repartition your Windows machine or anything like that. The recently released version 1.6 even allows you to run Solaris (not OpenSolaris yet, at the time of this writing) inside Windows. (It also supports a bunch of other operating systems as hosts and guests.)

Check it out - it's a breeze to download and get a Linux distro (e.g. Knoppix) working within minutes. With USB support, you can even mount a pen drive and save your files externally.

I was poring over the developer docs to see what they exposed in terms of an API. It's pretty cool - VirtualBox allows access to its runtime using both web services (Apache Axis et al.) and the now almost defunct COM (Component Object Model) technology. It also seems to support the cross-platform Mozilla XPCOM, but I haven't looked at it yet.

Having worked with COM a lot prior to 2000, I was eager to try out the COM approach. The docs, however, disappoint immensely: they seem to assume that anyone truly interested in taking that route already knows what to do. It turned out my COM was rusty, and after some digging around for almost non-existent COM documentation and articles, I figured out that I wasn't linking in the interfaces file generated by VirtualBox.

Install VirtualBox if you haven't already. The SDK is present under \Sun\xVM VirtualBox\sdk\

Here is the gist of the steps I took:

Visual C++

1. Fire up your favorite C++ compiler (I use the free VC++ 9.0 Express edition) and create a Win32 console project.
2. Add sdk\samples\API\tstVBoxAPIWin.cpp to the project.
3. Add sdk\include to the INCLUDE path.
4. Don't forget to also add sdk\lib\VirtualBox_i.c to your sources since this file contains all the generated COM interfaces for VirtualBox.
5. Add sdk\lib to your linker path.
6. Run a build.


CYGWIN (using gcc, mingw32)

1. Copy all source files (i.e. tstVBoxAPIWin.cpp, VirtualBox_i.c, VirtualBox.h) to a working directory.

2. gcc -c tstVBoxAPIWin.cpp VirtualBox_i.c

3. export LD_LIBRARY_PATH='c:\WINDOWS\system32'

4. gcc -o vbox VirtualBox_i.o tstVBoxAPIWin.o -mno-cygwin -lole32 -loleaut32


The resulting executable, when run, prints out all the virtual machines for the various guest OSes I run on my Windows box.

Output :

Name: knoppix-on-xp
Name: LKB_ON_KNOPPIX
Name: open-solaris

Not much achieved, but for me this is a cool way to get back into my COM programming days while tinkering with my favorite virtualization software. The install has an SDK reference that gives all the gory details about the various interfaces exposed through COM. On to more tinkering...


References :

  • http://forums.virtualbox.org/viewtopic.php?t=6948