Friday, May 30, 2008

Detecting cpus/cores in Python

I found this interesting function in a post on Twisted and distributed programming by Bruce Eckel. It uses the Python os module to detect the number of CPUs/cores on a machine. Archiving it here for future reference.

import os

def detectCPUs():
    """
    Detects the number of CPUs on a system. Cribbed from pp.
    """
    # Linux, Unix and MacOS:
    if hasattr(os, "sysconf"):
        if os.sysconf_names.has_key("SC_NPROCESSORS_ONLN"):
            # Linux & Unix:
            ncpus = os.sysconf("SC_NPROCESSORS_ONLN")
            if isinstance(ncpus, int) and ncpus > 0:
                return ncpus
        else:  # OSX:
            return int(os.popen2("sysctl -n hw.ncpu")[1].read())
    # Windows:
    if os.environ.has_key("NUMBER_OF_PROCESSORS"):
        ncpus = int(os.environ["NUMBER_OF_PROCESSORS"])
        if ncpus > 0:
            return ncpus
    return 1  # Default
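As a quick sanity check, the result can be cross-checked against the standard library - from Python 2.6 onward, multiprocessing exposes the same information directly (a minimal sketch, not part of the original post):

```python
import multiprocessing

# multiprocessing.cpu_count() (Python 2.6+) reports the same CPU
# count that the detectCPUs() function above derives by hand.
ncpus = multiprocessing.cpu_count()
print("Detected %d CPU(s)/core(s)" % ncpus)
```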

Wednesday, May 28, 2008

Screen Scraping using Python and BeautifulSoup

Screen scraping is an old art - made a lot easier by HTML parsers that represent a document as a navigable tree. BeautifulSoup is one such powerful and popular Python library; it allows even malformed HTML documents to be parsed and navigated for content.

Here is a code snippet that screen scrapes the "EBook Deal of The Day" from the Apress website.

Disclaimer: I am not sure whether this is allowed by the Apress webmasters; I could not find anything on their website explicitly disallowing it. In general, screen-scraping is frowned upon by webmasters, since such programs can be used as bots to grab large amounts of potentially copyrighted information from their websites and contribute to unnecessary traffic. This may even lead to them blocking an offending IP address.
The purpose of this example is merely to illustrate how BeautifulSoup and Python can be used to quickly put together a screen-scraper. It lets me check one of my favorite links daily without having to fire up the browser each time. Use of this script is your own responsibility.


1. This post assumes a working Python installation on your machine. If you don't have one, you can download it at http://www.python.org.

2. Download BeautifulSoup from here and save the resultant file in your local Python location (e.g. C:\Python25\Lib\site-packages).

3. Check that BeautifulSoup works fine.
>>> from BeautifulSoup import BeautifulSoup

(Watch out for a common mistake - typing just "import BeautifulSoup". That binds only the module, leading to errors later when you try to call the constructor as BeautifulSoup() instead of BeautifulSoup.BeautifulSoup().)
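The distinction is the usual module-versus-name one in Python; the same behavior can be seen with any standard library module (json is used here purely for illustration, not BeautifulSoup itself):

```python
import json                 # binds the module object 'json'
from json import loads      # binds the name 'loads' directly

# Both forms reach the same function, but via different names:
print(json.loads('[1, 2]'))  # module attribute access
print(loads('[1, 2]'))       # direct name, as with 'from BeautifulSoup import BeautifulSoup'
```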

4. Copy the following code to a local folder as, say, apress.py and type
> python apress.py

# apress.py

import urllib
from BeautifulSoup import BeautifulSoup

# open a local file or a remote URL, depending on the 'local' flag
def get_data(url, local):
    if local:
        return open(url)
    else:
        return urllib.urlopen(url)


local = False
base_url = "http://www.apress.com"
deal_url = base_url + '/info/dailydeal'

# local testing
#deal_url = "c:\\mycode\\dailydeal2.htm"
#data = open(deal_url)

# remote url
data = get_data(deal_url, local)

bs = BeautifulSoup(data)

# screen-scrape the following HTML fragment
# to get book name, book description
'''<div class="bookdetails">

<h3><a href="http://www.apress.com/book/view/1590592778">The Definitive Guide to Samba 3</a></h3>

<div class="cover"><a href="http://www.apress.com/book/view/1590592778"><img src="dailydeal2_files/9781590592779.gif" alt="The Definitive Guide to Samba 3" align="left" border="0" width="125"></a></div>

<p>Samba
is an efficient file and print server that enables you to get the most
out of your computer hardware. If you're familiar with Unix
administration, TCP/IP networking, and other common Unix servers, and
you want to learn how to deploy the forthcoming revision of Samba, this
book is ideal for you. </p><div class="footer">$49.99 | Published Apr 2004 | Roderick W. Smith</div>

</div>
'''

book = bs.findAll(attrs={"class": "bookdetails"})
a = book[0].h3.find('a')

# grab URL to get book details later
rel_url = a.attrs[0][1]
abs_url_book_det = base_url + rel_url

# extract book name
book_name = a.contents[0]  # just 1 name
print "Today's Apress $10 Ebook Deal:"
print book_name.encode('utf-8')

# extract book description
desc = book[0].p
print desc.contents[0] + '\n'

# extract book details

# local testing
#abs_url_book_det = "c:\\mycode\\bookdetails.htm"
#details = open(abs_url_book_det)

# remote url
details = get_data(abs_url_book_det, local)
bs = BeautifulSoup(details)

# screen-scrape the following HTML fragment
# to get book details
'''<div class="content" style="padding: 10px 0px; font-size: 11px;">

<a href="http://www.apress.com/book/view/9781590599419"><img src="bookdetails_files/9781590599419.gif" class="centeredImage" alt="Practical DWR 2 Projects book cover" border="0"></a>
<ul class="bulletoff">

<li>By Frank Zammetti </li>
<li>ISBN13: 978-1-59059-941-9</li>
<li>ISBN10: 1-59059-941-1</li>
<li>540 pp.</li>

<li>Published Jan 2008</li>
<li>eBook Price: $32.89</li>
<li>Price: $46.99</li>
'''

det = bs.find(attrs={"class": "content"})

ul = det.find('li')
while ul.nextSibling is not None:
    if ul == '\n':
        ul = ul.nextSibling
        continue
    line = ul.contents[0]
    if line.startswith('eBook'):
        print line + str(ul.contents[2])
    else:
        print line.encode('utf-8')
    ul = ul.nextSibling

5. You should see a listing of the current $10 Ebook Deal of the Day.

Notes / Suggested Improvements:

1. To avoid unnecessary traffic on the Apress website, it is a good idea to download the pages once and test against local copies before hitting their URLs. To do that:
  • set the variable local = True
  • uncomment the lines marked "local testing" and comment out the corresponding lines marked "remote url". This calls the get_data function with local = True, causing it to load the local pages. Of course, you'll need to change those lines to point to the correct HTML pages on your local disk.
2. The following Python tutorial has an excellent explanation of using urllib2 to submit a "polite" HTTP request to a website, as opposed to the quick and dirty approach illustrated above.

http://www.diveintopython.org/http_web_services/summary.html
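A minimal sketch of the "polite" idea - sending an explicit User-Agent header that identifies your script to the server. This is shown with Python 3's urllib.request (the successor to urllib2), and the agent string is my own made-up example:

```python
from urllib.request import Request, urlopen  # urllib2 in Python 2

# Identify the script politely instead of sending the default agent.
req = Request("http://www.apress.com/info/dailydeal",
              headers={"User-Agent": "apress-deal-checker/0.1 (personal use)"})
print(req.get_header("User-agent"))
# urlopen(req) would then fetch the page; omitted here to avoid traffic.
```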

3. Add a Book class to wrap and return the book details.
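A sketch of what such a class might look like (the names are my own invention, not from the original code):

```python
class Book(object):
    """Simple container for the details scraped from a book page."""

    def __init__(self, name, description, url):
        self.name = name
        self.description = description
        self.url = url

    def __repr__(self):
        return "Book(%r, %r)" % (self.name, self.url)
```

The scraping code could then build and return Book instances instead of printing each field as it goes.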

4. The code above is only valid as long as the Apress folks do not change their HTML structure. If they do, one would need to rework the navigation to get at the correct data. This is one of the known perils of screen-scraping.

5. If you copy-paste the code, be sure to fix the indentation for the function and 'if' statements.

Writing a Python Extension with SWIG, GCC on Cygwin

UPDATE: Please see the comments below this post - the latest version of SWIG appears to have made the "setup.py" step described below redundant.

It seemed easy at first. Many books document it, several folks blog about it, articles have been written about it, but I didn't quite seem to get it right. Here are my steps and missteps in getting the Python sample from the SWIG tutorial to work right on my machine, with cygwin 1.5.25, gcc 3.4.4, swig 1.3.35, and python 2.5 already installed.

1. Download the sample example.c from the SWIG Tutorial.

2. Get the interface file from the same location and save a local copy in your working directory as example.i.

3. Run the SWIG compiler on it using 'swig -python example.i' to get the machine-generated wrapper code - example_wrap.c
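For reference, the interface file from the SWIG tutorial looks roughly like this (reproduced from memory - check the tutorial itself for the canonical version):

```
/* example.i */
%module example
%{
/* Declarations made available to the generated wrapper code */
extern int fact(int n);
%}

extern int fact(int n);
```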

If you are past this step, you have SWIG properly installed on your computer. We will now use the Python distutils package to do the rest of the magic for us.

4. Create a setup.py file that contains all dependencies required to create the Python extension. This is the critical step! Here is what mine looks like.

from distutils.core import setup, Extension

module1 = Extension('_example',
                    sources=['example_wrap.c',
                             'example.c'])

setup(name='example',
      version='1.0',
      description='Simple example from SWIG tutorial',
      ext_modules=[module1])

Note the underscore (_) before the module name. This little mistake could trip you up completely and lead to errors from the build process such as :

"Cannot export initexample: symbol not defined"

Apparently SWIG requires that you use an underscore ("_") before your module name. Also note that the sources attribute lists both the C program and the SWIG-generated wrapper. It is also possible to list only the SWIG interface file, as in sources=['example.i'], and allow distutils to complete Step 3 from above. This did not work for me - it trips up the linking process at a later point. Why? I have no clue.


5. Run the Python distutils tool to generate the extension. Note the use of "--inplace". This is convenient when developing, as it generates the .pyd file (the Python extension) in the current working directory. You may remove this option if you want it in a standard "build" directory.

python setup.py build_ext --compiler=mingw32 --inplace


NOTE: The '--compiler=mingw32' option forces compilation with the mingw32 compiler. This can be automated as described here. The mingw compiler should already be installed if you kept the defaults while installing GCC on cygwin. You may also install it separately from here, or by running the cygwin installer again and choosing the mingw option along with its dependencies.

If all goes well, you should see no errors and upon completion, find a file named '_example.pyd' in your working directory.

6. Install the extension by running "python setup.py install". Alternately, simply copy the _example.pyd file to your Python install (e.g. c:\Python25\Lib\site-packages).

7. If you fire up IDLE and run '>>> import example', you'll probably see an import error. However, if you type "import _example", it will succeed. This is how SWIG generates your module, but you probably don't want to use it this way. The fix is to look for a Python file named "example.py" in your current working directory. Copy this file to the same location as above (e.g. c:\Python25\Lib\site-packages).

8. >>> import example as e

9. >>> e.fact(5)
120




References :


Here are several links I read to get this example working - my thanks to all those authors. I'll caution, though, that a lot of the hacks involving pexports etc. that you'll find on many of these websites are unnecessary if you are using Python 2.5. The Python 2.5 distribution comes with the import library; on my machine this is C:\Python25\libs\libpython25.a.

Cool Python way to beat email harvesting bots

This was so cool - I really had to share it. I found it here while reading up on creating Python extensions using SWIG. The author uses Python list comprehensions along with chr() and ord() to generate contact email addresses.

Suppose your email address is user@somerandomwebsite.com (SPAM THAT!). Here is what you would do to generate the obfuscated string for this address.

>>> ''.join([chr(ord(c)+2) for c in 'user@somerandomwebsite.com'])
'wugtBuqogtcpfqoygdukvg0eqo'


Now feed the result to the inverse list comprehension (shifting down by 2 instead of up) as follows

>>> ''.join([chr(ord(c)-2) for c in 'wugtBuqogtcpfqoygdukvg0eqo'])
'user@somerandomwebsite.com'

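The two one-liners can be wrapped into a matching pair of helpers (the function names are mine), which makes the round-trip property easy to verify:

```python
def obscure(s, shift=2):
    """Shift every character's code point up by 'shift'."""
    return ''.join([chr(ord(c) + shift) for c in s])

def reveal(s, shift=2):
    """Undo obscure() by shifting code points back down."""
    return ''.join([chr(ord(c) - shift) for c in s])

addr = 'user@somerandomwebsite.com'
print(obscure(addr))           # the string you publish on the page
print(reveal(obscure(addr)))   # recovers the original address
```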
I guess such a snippet on your web page would prevent, or at least dissuade, anyone short of an intermediate Python programmer from harvesting your address - which may or may not be what you want!