Just Tinkering: 2008

Friday, July 11, 2008

Grad School by xkcd

Wednesday, July 9, 2008

Netbeans for Python !!

Just what I was waiting for !! I was looking for a way to brush up my Java skills while continuing my year-long affair with Python when news arrived that Jython was being reincarnated with Sun's hiring of Ted Leung and Frank Wierzbicki. Seeing what Sun did for JRuby, I was all excited when another piece of great news arrived: Netbeans was going to support Python/Jython in an upcoming release.

Go Sun ! Go Netbeans !

After reading some excellent blogs (see References), I decided to download the Milestone 4.1 standalone nbpython installer (Build 200807071204) and kick some tires.

Platform : Windows XP SP3, pre-installed with Python 2.5

First impressions (based on 10 minutes of tinkering) :

1. I created a New Project and wanted to import some existing Python files into it. Somehow this does not seem possible yet - totally sucks since the first thing that a Python developer is likely to do is to suck in a bunch of existing code and see how Netbeans works with it.

2. nbpython seems to use jython as the default python engine to run python code. On my machine, this failed to run my code, throwing the following stack trace :

java.io.IOException: CreateProcess error=193, %1 is not a valid Win32 application
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
    at java.lang.ProcessImpl.start(ProcessImpl.java:30)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
Caused: java.io.IOException: Cannot run program "c:\netbeans\NetBeans\nbpython\jython-2.5\bin\jython"

3. I then decided to try and point nbpython to my existing windows python install (Tools-> Options ->Python).

4. This seemed to do the trick and I could run my first Python program.

Summary :

1. Despite being an early access build, the module does work and can be used for fairly simple testing of python code.

2. Lack of "import" support makes this a non-starter if one wants to create projects with existing python code/libraries.

3. If the tool developers want to see this replace IDLE at some point, there has gotta be a way of running a Python interpreter/shell. This is critical to folks like myself who constantly Alt-tab into IDLE to quickly try out a list-comprehension/regex code snippet before integrating it into the main Python file.

4. This stuff needs Netbeans 6.5 and does NOT work on Netbeans 6.1. That probably means repeated downloads of nightly builds instead of simply updating the NBM module each time a change happens. Hope this can be fixed.

References :

Tuesday, July 1, 2008

BloggerBuster

I recently updated the blog template to a 3-column format from BloggerBuster. Pretty cool site for more powerful Blogger templates, tips and tricks on changing blogger templates, blogging tutorials and similar stuff. There is also a free ebook on customizing blogger templates - nice reading.

Wednesday, June 18, 2008

Handling Greediness In Regular Expression Matching Using Python re

Regular expression matches are greedy by default. I've been trying to brush up on regular expressions and have been playing around with the Python re package. While attempting some exercises from Core Python Programming by Wesley Chun, I ran into the greedy behavior myself and had some fun trying to debug it. Here is my story :

1. Suppose you want to match an address that starts with the string 'F-277' , where 'F' denotes some category and 277 represents the house number. This format is not cast in stone, however; someone may choose to write down the address as "Apt No 277" .

2. My first regular expression was r'(\w+-?)\s?(\w*)\s?(\d+)' . In plain English: match one or more alphanumeric characters (\w+), followed by an optional hyphen (-?), followed by zero or one whitespace character (\s?), followed by an optional run of alphanumeric characters (\w*), followed by zero or one whitespace character (\s?), followed by one or more digits (\d+) .

3. This expression (shown here without the groups) matches both the above addresses fine :
>>> import re
>>> re_exp = r'\w+-?\s?\w*\s?\d+'
>>> re.match(re_exp, 'F-277').group()
'F-277'
>>> re.match(re_exp, 'Apt No 277').group()
'Apt No 277'

4. All is fine and dandy until we decide we want to extract the house number. To do this, we use grouping . The modified regular expression is : r'\w+-?\s?\w*\s?(\d+)'

>>> re.match(r'\w+-?\s?\w*\s?(\d+)', 'F-277').groups()
('7',)

oops - what just happened here ? Even though we made the entire numerical part into a single group, the \d+ portion of the regular expression simply matches 7 as opposed to 277.

5. A little brain racking and experimenting reveals that the earlier part of the regular expression (\w*) was greedily matching part of the number.

>>> re.match(r'\w+-?\s?(\w*)(\d+)', 'F-277').groups()
('27', '7')

As you can see, \w* greedily matched '27' in '277', leaving the '7' to be matched by \d+ . Why did it not match the whole number ? - one may wonder. This is because of backtracking: \w* first grabs all of '277', but \d+ needs at least one digit, so the engine gives back exactly one character. This is explained beautifully here (Section - Watch out for the Greediness) .

6. One way to fix this is to force the offending '*' operator in \w* to go lazy. We do this by putting a '?' after the '*' operator. Here is the result of our match :

>>> re.match(r'\w+-?\s?\w*?\s?(\d+)', 'F-277').groups()[0]
'277'

There is a better way - using a negated character class that cannot consume digits at all. This avoids backtracking as explained here.
>>> re_exp = r'\w+-?\s?([^\W\d]*)\s?(\d+)'
>>> re.match(re_exp, 'F-277').groups()[1]
'277'

7. We verify that both address formats still match .
>>> re.match(re_exp, 'F-277').group()
'F-277'
>>> re.match(re_exp, 'Flat No 277').group()
'Flat No 277'
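The three behaviours discussed above can be compared side by side. This is only a sketch, written for Python 3; the names GREEDY, LAZY, NO_DIGITS and house_number are made up for illustration, and the third pattern ([^\W\d]*, i.e. word characters that are not digits) is just one possible way of spelling the negated-class idea.

```python
import re

# Three variants of the address pattern; group 2 is the house number.
GREEDY = r'\w+-?\s?(\w*)\s?(\d+)'         # \w* may swallow digits
LAZY = r'\w+-?\s?(\w*?)\s?(\d+)'          # \w*? takes as little as possible
NO_DIGITS = r'\w+-?\s?([^\W\d]*)\s?(\d+)' # middle token cannot match digits


def house_number(pattern, address):
    """Return the text captured by the second group, or None if no match."""
    match = re.match(pattern, address)
    return match.group(2) if match else None


for address in ('F-277', 'Apt No 277'):
    for name, pattern in (('greedy', GREEDY), ('lazy', LAZY),
                          ('no-digits', NO_DIGITS)):
        print(address, name, house_number(pattern, address))
```

Only the greedy pattern on 'F-277' loses digits; when a space separates the middle word from the number, \w* stops at the space on its own.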

Tuesday, June 10, 2008

Automating Miktex using Python, win32com on Windows

I downloaded the MiKTeX implementation of LaTeX yesterday with the goal of starting to learn to write LaTeX code. Unfortunately my roving eye caught the words "SDK" and off I went figuring out how to externally connect to and tinker with MiKTeX. When I found it used COM, I was even more excited, having tinkered with COM and VirtualBox recently.

This was extra fun since I had wanted to see how to do all this using Python for some time now. Mark Hammond's excellent win32com Python extension makes it super easy to write COM clients using Python.

After some experimenting around, I was able to write the Python equivalent of a couple of the C++ samples provided by the MiKTeX SDK.

STEPS :

# Install Python Win32 Extensions from https://sourceforge.net/projects/pywin32/

# Build COM library specific to MiKTeX.Session

1. Run C:\Python25\Lib\site-packages\win32com\client\makepy.py

2. Select "MiKTeX.Session" in the list presented.

makepy.py writes out the relevant files to the Python installation.

CODE : (Save as a Python file / just type in at the IDLE prompt)

import win32com.client as w

# instantiate COM library
c = w.Dispatch("MiKTeX.Session")
path = ''

# Calling FindFile
print c.FindFile("xunicode.sty", path)
# prints: (True, u'C:\\Program Files\\MiKTeX 2.7\\tex\\xelatex\\xunicode\\xunicode.sty')

# Getting installation information
print c.GetMiKTeXSetupInfo().installRoot
# prints: C:\Program Files\MiKTeX 2.7

print c.GetMiKTeXSetupInfo().commonConfigRoot
# prints: C:\Documents and Settings\All Users\Application Data\MiKTeX\2.7

# Calling FindPkFile
print c.FindPkFile('cmr10', 'ljfour', '600')
# prints: (True, u'C:\\Documents and Settings\\All Users\\Application Data\\MiKTeX\\2.7\\fonts\\pk\\ljfour\\public\\cm\\dpi600\\cmr10.pk')

Now back to learning TeX !

UPDATE : I fixed several typos that made the above code almost unusable. Should work now.

Tuesday, June 3, 2008

Building and Running the VirtualBox SDK sample using COM, VC++

VirtualBox is Sun Microsystems' desktop virtualization software, which lets you do cool stuff like running Linux right inside Windows without having to repartition Windows or anything like that. The recently released version 1.6 even lets you run Solaris (not OpenSolaris yet at the time of this writing) inside Windows. (They also support a bunch of other operating systems as hosts and guests.)

Check it out - it's a breeze to download and get a Linux distro (e.g. Knoppix) working within minutes. With USB support, you can even mount a pen-drive and save your files externally.

I was poring over the developer docs to see what they exposed in terms of an API. It's pretty cool - VirtualBox allows access to its runtime using both web services (Apache Axis et al.) and the now almost defunct COM (Component Object Model) technology. It also seems to support the cross-platform Mozilla XPCOM but I haven't looked at it yet.

Having worked with COM a lot prior to 2000, I was eager to try out the COM approach. The docs, however, disappoint immensely, as they seem to assume that anyone truly interested in taking that route would already know what to do. It turned out my COM was rusty, and after some digging around for almost non-existent COM documentation/articles, I figured out that I wasn't linking in the interfaces file generated by VirtualBox.

Install VirtualBox if you haven't already. The SDK is present under \Sun\xVM VirtualBox\sdk\

Here is the gist of the steps I took :

Visual C++

1. Fire up your favorite C++ compiler (I use the free VC++ 9.0 Express edition) and create a Win32 console project.
2. Add sdk\samples\API\tstVBoxAPIWin.cpp to the project.
3. Add sdk\include to the INCLUDE path.
4. Don't forget to also add sdk\lib\VirtualBox_i.c to your sources, since this file contains all the generated COM interfaces for VirtualBox.
5. Add sdk\lib to your linker path.
6. Run a build.

CYGWIN (using gcc, mingw32)

1. Copy all source files i.e. (tstVBoxAPIWin.cpp, VirtualBox_i.c, VirtualBox.h) to a working directory.

2. gcc -c tstVBoxAPIWin.cpp VirtualBox_i.c

3. export LD_LIBRARY_PATH='c:\WINDOWS\system32'

4. gcc -o vbox VirtualBox_i.o tstVBoxAPIWin.o -mno-cygwin -lole32 -loleaut32

The resultant executable when run, prints out all my virtual machines for the various guest OSes I run on my Windows Box.

Output :

Name: knoppix-on-xp
Name: LKB_ON_KNOPPIX
Name: open-solaris

Not much achieved, but for me this is a cool way to get back into my COM programming days while tinkering with my favorite virtualization software. The install has an SDK reference that gives all the gory details about the various interfaces exposed through COM. On to more tinkering ...

References :

http://forums.virtualbox.org/viewtopic.php?t=6948

Friday, May 30, 2008

Detecting cpus/cores in Python

I found this interesting function in a post on Twisted and distributed programming by Bruce Eckel. It uses the Python os module to detect the number of CPUs/cores on a machine. Archiving it here for future reference.

import os

def detectCPUs():
    """
    Detects the number of CPUs on a system. Cribbed from pp.
    """
    # Linux, Unix and MacOS:
    if hasattr(os, "sysconf"):
        if "SC_NPROCESSORS_ONLN" in os.sysconf_names:
            # Linux & Unix:
            ncpus = os.sysconf("SC_NPROCESSORS_ONLN")
            if isinstance(ncpus, int) and ncpus > 0:
                return ncpus
        else:  # OSX:
            return int(os.popen("sysctl -n hw.ncpu").read())
    # Windows:
    if "NUMBER_OF_PROCESSORS" in os.environ:
        ncpus = int(os.environ["NUMBER_OF_PROCESSORS"])
        if ncpus > 0:
            return ncpus
    return 1  # Default
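
For comparison, on Python 3 (an assumption; the function above targets the Python 2 era) the standard library already exposes this through os.cpu_count(). A minimal sketch, where the wrapper name cpu_count_or_one is made up here to keep the same "default to 1" behaviour:

```python
import os


def cpu_count_or_one():
    """Return the number of logical CPUs, falling back to 1 if unknown."""
    # os.cpu_count() returns None when the count cannot be determined.
    count = os.cpu_count()
    return count if count else 1


print(cpu_count_or_one())
```

Note that this reports logical CPUs on the machine, which is not necessarily the number the current process is allowed to use.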

Wednesday, May 28, 2008

Screen Scraping using Python and BeautifulSoup

Screen scraping is an old art - made a lot easier using XHTML parsers that allow HTML to be represented as a navigable tree. BeautifulSoup is one such powerful and popular Python library that allows even malformed HTML documents to be parsed and navigated for content.

Here is a code snippet that screen scrapes the "EBook Deal of The Day" from the Apress website.

Disclaimer : I am not sure if this is allowed by the Apress webmasters. I could not find any such information on their website explicitly disallowing this. In general screen-scraping is frowned upon by webmasters, since such programs could be used as bots to grab large amounts of potentially copyrighted information from their websites and contribute to unnecessary traffic. This may even lead to them blocking an offending IP address.
The purpose of this example is merely to illustrate how BeautifulSoup and Python can be used to quickly put together a screen-scraping example. This allows me to quickly check one of my favorite links daily without having to fire up the browser each time. The use of this script is your own responsibility.

1. This post assumes a working Python installation on your machine. If you don't have one, you can download it at http://www.python.org.

2. Download BeautifulSoup from here and save the resultant file in your local Python location. (e.g. C:\Python25\Lib\site-packages)

3. Check that BeautifulSoup works fine.
>>> from BeautifulSoup import BeautifulSoup

(Watch out for a common mistake - import BeautifulSoup. This merely imports the module, leading to later errors when you try to call the constructor BeautifulSoup() ) .

4. Copy the following code to a local folder as, say, apress.py and type
> python apress.py

#apress.py

import urllib
from BeautifulSoup import BeautifulSoup

# open local/remote url

def get_data(url, local):
    if local:
        return open(url)
    else:
        return urllib.urlopen(url)


local = False
base_url = "http://www.apress.com"
deal_url = base_url + '/info/dailydeal'

# local testing
#deal_url = "c:\\mycode\\dailydeal2.htm"   
#data = open(deal_url)

# remote url
data = get_data(deal_url, local)

bs = BeautifulSoup(data)

# screen-scrape the following  HTML fragment
# to get book name, book description
'''<div class="bookdetails">

 <h3><a href="http://www.apress.com/book/view/1590592778">The Definitive Guide to Samba 3</a></h3>

 <div class="cover"><a href="http://www.apress.com/book/view/1590592778"><img src="dailydeal2_files/9781590592779.gif" alt="The Definitive Guide to Samba 3" align="left" border="0" width="125"></a></div>

 <p>Samba
is an efficient file and print server that enables you to get the most
out of your computer hardware. If you're familiar with Unix
administration, TCP/IP networking, and other common Unix servers, and
you want to learn how to deploy the forthcoming revision of Samba, this
book is ideal for you. </p><div class="footer">$49.99 | Published Apr 2004 | Roderick W. Smith</div>

</div>
'''

book = bs.findAll(attrs= {"class" : "bookdetails"})
a = book[0].h3.find('a')

# grab URL to get book details later
rel_url = a.attrs[0][1]
abs_url_book_det = base_url + rel_url

# extract book name
book_name = a.contents[0]   # just 1 name
print "Today's Apress $10 Ebook Deal:"
print book_name.encode('utf-8')

# extract book description
desc = book[0].p
print desc.contents[0] +  '\n'

#extract book details

# local testing
#abs_url_book_det = "c:\\mycode\\bookdetails.htm"
#details  = open(abs_url_book_det)

# remote url
details = get_data(abs_url_book_det, local)
bs      = BeautifulSoup(details)

# screen-scrape the following  HTML fragment
# to get book details
'''<div class="content" style="padding: 10px 0px; font-size: 11px;">

   <a href="http://www.apress.com/book/view/9781590599419"><img src="bookdetails_files/9781590599419.gif" class="centeredImage" alt="Practical DWR 2 Projects book cover" border="0"></a>
   <ul class="bulletoff">

   <li>By Frank  Zammetti </li>
   <li>ISBN13: 978-1-59059-941-9</li>
   <li>ISBN10: 1-59059-941-1</li>
   <li>540 pp.</li>

   <li>Published Jan 2008</li>
      <li>eBook Price: $32.89</li>
   <li>Price: $46.99</li>
'''

det     = bs.find(attrs={"class" : "content"})

ul      = det.find('li')
while ul.nextSibling is not None:
    # skip the whitespace text nodes between the <li> elements
    if ul == '\n':
        ul = ul.nextSibling
        continue
    line = ul.contents[0]
    if line.startswith('eBook'):
        print line + str(ul.contents[2])
    else:
        print line.encode('utf-8')
    ul = ul.nextSibling

5. You should see a listing of the current $10 Ebook Deal of the Day .

Notes / Suggested Improvements :

1. In order to avoid unnecessary traffic on the Apress website, it would be a good idea to download the files locally and test locally before hitting their URLs. In order to do that :

make the variable local = True
uncomment the lines that say "local testing" and comment out the corresponding lines that say "remote url". This will call the get_data function with local = True causing it to load the local pages. Of course you'll need to change the lines in the code to point to correct HTML pages in your local disk.

2. The following Python tutorial has an excellent explanation of how to use urllib2 to submit a "polite" HTTP request to a website, as opposed to the quick and dirty approach illustrated above.

http://www.diveintopython.org/http_web_services/summary.html

3. A Book class to wrap and return book details.

4. The code above is only valid as long as the Apress folks do not change their HTML structure. If they did, one would need to rework the navigation to get at the correct data. This is one of the known perils of screen-scraping.

5. If you copy-paste the code, double-check the indentation of the function body, the 'if' statements and the 'while' loop, since Python is whitespace-sensitive.
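
Suggestions 2 and 3 above could look roughly like the sketch below. It is written for Python 3 (so urllib.request rather than urllib2) and only builds the request without sending it. The Book fields, the polite_request and format_book helpers, and the User-Agent string are all hypothetical names chosen for illustration; the sample values are taken from the HTML fragment quoted in the script.

```python
import urllib.request
from collections import namedtuple

# Hypothetical container for the scraped fields (names are illustrative).
Book = namedtuple("Book", ["name", "description", "details"])

# Hypothetical User-Agent so the webmaster can tell what is hitting the site.
USER_AGENT = "apress-deal-checker/0.1 (personal daily check)"


def polite_request(url, user_agent=USER_AGENT):
    """Build, but do not send, a request that identifies itself."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def format_book(book):
    """Render a Book the way the script prints its results."""
    lines = ["Today's Apress $10 Ebook Deal:", book.name, book.description, ""]
    lines.extend(book.details)
    return "\n".join(lines)


sample = Book(
    name="The Definitive Guide to Samba 3",
    description="Samba is an efficient file and print server that enables "
                "you to get the most out of your computer hardware.",
    details=["$49.99", "Published Apr 2004", "Roderick W. Smith"],
)
print(format_book(sample))
```

Passing the request object to urllib.request.urlopen() would perform the actual fetch; keeping the scraping code separate from the Book container also makes it easier to adapt when the page structure changes.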

References :

Writing a Python Extension with SWIG, GCC on Cygwin

UPDATE : Please see comments below this post - the latest version of SWIG appears to have made the "setup.py" step described below - redundant.

It seemed easy at first. Many books document it, several folks blog about it, articles have been written about it, but I didn't quite seem to get it right. Here are my steps and missteps in getting the python sample from the SWIG tutorial to work right on my machine with cygwin 1.5.25, gcc 3.4.4, swig 1.3.35, python 2.5 already installed.

1. Download the sample example.c from the SWIG Tutorial.

2. Get the interface file from the same location and save a local copy in your working directory as, say, example.i.

3. Run the swig compiler on it using 'swig -python example.i' to get the machine-generated wrapper for the python extension - example_wrap.c

If you are past this step, you have swig properly installed on your computer. We will now use the python distutils package to do the rest of the magic for us.

4. Create a setup.py file that contains all dependencies required to create the Python extension. This is the critical step ! Here is what mine looks like .

from distutils.core import setup, Extension

module1 = Extension('_example',
                    sources=['example_wrap.c', 'example.c'])

setup(name='example',
      version='1.0',
      description='Simple example from SWIG tutorial',
      ext_modules=[module1])

Note the underscore (_) before the module name. Leaving it out is a little mistake that could trip you up completely and lead to errors from the build process such as :

"Cannot export initexample: symbol not defined"

Apparently SWIG requires that you use an underscore ("_") before your module name. Also note that the sources attribute lists both the C program and the swig-generated C wrapper. It is also possible to list only the swig interface file, as in sources=['example.i'], and allow distutils to complete Step 3 from above. This did not work for me; it trips up the linking process at a later point. Why ? - I have no clue.

5. Run the Python distutils build to generate the extension. Note the use of "--inplace". This is convenient when developing as it generates the .pyd file (Python extension) in the current working directory. You may remove this if you want it in a standard "build" directory.

python setup.py build_ext --compiler=mingw32 --inplace

NOTE: The '--compiler=mingw32' option is to force compilation using the mingw32 compiler. This can be automated as described here . The mingw compiler should have been installed if you kept the defaults while installing GCC on cygwin. You may also install it separately from here or running the cygwin installer again and choose the mingw option along with its dependencies.

If all goes well, you should see no errors and upon completion, find a file named '_example.pyd' in your working directory.

6. Install the extension by running "python setup.py install" . Alternately, simply copy the _example.pyd file to your Python install (e.g. c:\Python25\Lib\site-packages) .

7. If you fire up IDLE and run '>>> import example' , you'll probably see an import error . However, if you type "import _example", it'll succeed. This is how SWIG generates your module. You probably don't want to use your module this way. The way to fix it is to look for a python file named "example.py" in your current working directory. Copy this file to the same location as above (e.g. c:\Python25\Lib\site-packages) .

8. >>> import example as e

9. >>> e.fact(5)
120
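
Steps 7 to 9 can be wrapped into a small smoke test. This is only a sketch, assuming the extension was built and installed as described above; check_fact is a made-up helper name, and it relies on nothing beyond the fact() function already called in step 9.

```python
import importlib


def check_fact(module_name="example"):
    """Try to import the SWIG-generated module and call fact(5)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Typical causes: example.py or _example.pyd is not on sys.path.
        return "import failed: %s" % exc
    return "fact(5) = %d" % module.fact(5)


if __name__ == "__main__":
    print(check_fact())
```

If the import fails, the message names the missing module, which helps tell a missing example.py apart from a missing _example.pyd.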

References :

Here are several links I read up to get this example working - my thanks to all those authors. I'll caution though that a lot of the hacks about using pexports etc. that you'll find on many of these web sites are unnecessary if you are using Python 2.5. The Python 2.5 distribution comes with the import library. On my machine this is C:\Python25\libs\libpython25.a.

http://sebsauvage.net/python/mingw.html (Ignore Steps 2,3 if you are using Python 2.5)

http://www.swig.org/tutorial.html (The official SWIG tutorial - simple, but deceptively so, supposed to work like a charm, but I wouldn't be writing this if it did !)

http://docs.python.org/inst/tweak-flags.html (Section 6.22 talks about use of distutils and some information on building extensions with older versions of Python) .

http://www.swig.org/Doc1.3/Python.html#Python_nn6 (The BEST document PERIOD I found on this subject- unfortunately the last one too !)

http://www.dabeaz.com/cgi-bin/wiki.pl?SwigFaq/SharedLibraries (SWIG wiki entry on building shared libraries on different platforms)

http://www.swig.org/Doc1.1/HTML/Contents.html (SWIG Users Manual if you want to do any heavy lifting using SWIG)

http://boodebr.org/main/python/build-windows-extensions#CFG_DISTUTILS (Cleared the clutter about Python 2.5 Vs the older versions. Also has an interesting discussion about using Mingw32 directly Vs gcc on cygwin).

Cool Python way to beat email harvesting bots

This was so cool - I really had to share it. I found this here while reading up about creating Python extensions using swig. The author uses Python list comprehensions along with chr() and ord() to generate contact email addresses.

Suppose your email address was user@somerandomwebsite.com (SPAM THAT !) . Here is what you would do to generate the obfuscated string for this email address.

>>> ''.join([chr(ord(c)+2) for c in 'user@somerandomwebsite.com'])
'wugtBuqogtcpfqoygdukvg0eqo'

To recover the address, feed the result back into the same list comprehension with the shift reversed (-2 instead of +2) :

>>> ''.join([chr(ord(c)-2) for c in 'wugtBuqogtcpfqoygdukvg0eqo'])
'user@somerandomwebsite.com'

I guess such a link on your web page would prevent or at least dissuade anyone except an intermediate Python programmer from trying to contact you - which may or may not be what you want !
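
The same trick as a pair of helper functions, as a minimal sketch (the names shift, obfuscate and reveal are made up here; the offset of 2 is the one used above):

```python
def shift(text, offset):
    """Shift every character's code point by the given offset."""
    return ''.join(chr(ord(c) + offset) for c in text)


def obfuscate(address):
    """Scramble an address so that it no longer looks like an email."""
    return shift(address, 2)


def reveal(scrambled):
    """Undo obfuscate() by shifting back."""
    return shift(scrambled, -2)


print(obfuscate('user@somerandomwebsite.com'))
print(reveal('wugtBuqogtcpfqoygdukvg0eqo'))
```

This is obfuscation, not encryption: anyone who knows the offset can reverse it, which is the whole point for a human reader.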
