Wednesday, May 28, 2008

Screen Scraping using Python and BeautifulSoup

Screen scraping is an old art, made much easier by HTML parsers that represent a document as a navigable tree. BeautifulSoup is one such powerful and popular Python library: it can parse even malformed HTML documents and navigate them for content.
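As a quick illustration of that tolerance (a throwaway snippet, not part of the script below), BeautifulSoup builds a usable tree even from misnested, unclosed tags:

from BeautifulSoup import BeautifulSoup

# a strict XML parser would reject this fragment outright
soup = BeautifulSoup("<b>bold <i>bold italic</b> leftover italic")
print soup.prettify()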

Here is a code snippet that screen scrapes the "EBook Deal of The Day" from the Apress website.

Disclaimer: I am not sure whether this is allowed by the Apress webmasters; I could not find anything on their website explicitly disallowing it. In general, screen scraping is frowned upon by webmasters, since such programs can be used as bots to grab large amounts of potentially copyrighted information from their websites and contribute unnecessary traffic. This may even lead to them blocking an offending IP address.
The purpose of this example is merely to illustrate how BeautifulSoup and Python can be used to quickly put together a screen scraper. It lets me check one of my favorite links daily without having to fire up the browser each time. Use of this script is your own responsibility.


1. This post assumes a working Python installation on your machine. If you don't have one, you can download it at http://www.python.org.

2. Download BeautifulSoup from http://www.crummy.com/software/BeautifulSoup/ and save the resulting file in your local Python site-packages directory (e.g. C:\Python25\Lib\site-packages).

3. Check that BeautifulSoup imports cleanly.
>>> from BeautifulSoup import BeautifulSoup

(Watch out for a common mistake: import BeautifulSoup. That merely imports the module, so a later call to the BeautifulSoup() constructor fails because the name refers to the module, not the class.)
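For example, in an interactive session:

>>> import BeautifulSoup
>>> soup = BeautifulSoup("<p>hi</p>")   # TypeError: 'module' object is not callable
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>hi</p>")   # fine - soup is now a parse tree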

4. Copy the following code to a local folder as, say, apress.py and run:
> python apress.py

#apress.py

import urllib
from BeautifulSoup import BeautifulSoup

# open local/remote url

def get_data(url, local):
    if local:
        return open(url)
    else:
        return urllib.urlopen(url)


local = False
base_url = "http://www.apress.com"
deal_url = base_url + '/info/dailydeal'

# local testing
#deal_url = "c:\\mycode\\dailydeal2.htm"
#data = open(deal_url)

# remote url
data = get_data(deal_url, local)

bs = BeautifulSoup(data)

# screen-scrape the following HTML fragment
# to get book name, book description
'''<div class="bookdetails">

<h3><a href="http://www.apress.com/book/view/1590592778">The Definitive Guide to Samba 3</a></h3>

<div class="cover"><a href="http://www.apress.com/book/view/1590592778"><img src="dailydeal2_files/9781590592779.gif" alt="The Definitive Guide to Samba 3" align="left" border="0" width="125"></a></div>

<p>Samba
is an efficient file and print server that enables you to get the most
out of your computer hardware. If you're familiar with Unix
administration, TCP/IP networking, and other common Unix servers, and
you want to learn how to deploy the forthcoming revision of Samba, this
book is ideal for you. </p><div class="footer">$49.99 | Published Apr 2004 | Roderick W. Smith</div>

</div>
'''

book = bs.findAll(attrs={"class": "bookdetails"})
a = book[0].h3.find('a')

# grab the URL so we can fetch the book details later
rel_url = a['href']
abs_url_book_det = base_url + rel_url

# extract the book name
book_name = a.contents[0]  # just one name
print "Today's Apress $10 Ebook Deal:"
print book_name.encode('utf-8')

# extract the book description
desc = book[0].p
print desc.contents[0].encode('utf-8') + '\n'

#extract book details

# local testing
#abs_url_book_det = "c:\\mycode\\bookdetails.htm"
#details = open(abs_url_book_det)

# remote url
details = get_data(abs_url_book_det, local)
bs = BeautifulSoup(details)

# screen-scrape the following HTML fragment
# to get book details
'''<div class="content" style="padding: 10px 0px; font-size: 11px;">

<a href="http://www.apress.com/book/view/9781590599419"><img src="bookdetails_files/9781590599419.gif" class="centeredImage" alt="Practical DWR 2 Projects book cover" border="0"></a>
<ul class="bulletoff">

<li>By Frank Zammetti </li>
<li>ISBN13: 978-1-59059-941-9</li>
<li>ISBN10: 1-59059-941-1</li>
<li>540 pp.</li>

<li>Published Jan 2008</li>
<li>eBook Price: $32.89</li>
<li>Price: $46.99</li>
'''

det = bs.find(attrs={"class": "content"})

# walk the <li> siblings; whitespace-only text nodes sit between them
li = det.find('li')
while li is not None:
    if li == '\n':
        li = li.nextSibling
        continue
    line = li.contents[0]
    if line.startswith('eBook'):
        # on the live page the eBook price line spans several child nodes
        print line + str(li.contents[2])
    else:
        print line.encode('utf-8')
    li = li.nextSibling

5. You should see a listing of the current $10 Ebook Deal of the Day.

Notes / Suggested Improvements:

1. To avoid unnecessary traffic on the Apress website, it is a good idea to download the pages once and test against local copies before hitting their URLs (a one-time download, as sketched below, does the trick). To do that:
  • set the variable local = True
  • uncomment the lines marked "local testing" and comment out the corresponding lines marked "remote url". This calls get_data with local = True, causing it to load the local pages. Of course, you'll need to change those lines to point to the correct HTML pages on your local disk.
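For the one-time download itself, something along these lines works (the local paths are just the ones from the commented-out lines in the script):

# run once to grab local copies for testing
import urllib
urllib.urlretrieve("http://www.apress.com/info/dailydeal", "c:\\mycode\\dailydeal2.htm")
# save the book-details page the same way, using the URL extracted by apress.py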
2. The following Python tutorial has an excellent explanation of using urllib2 to submit a "polite" HTTP request to a website, as opposed to the quick-and-dirty approach illustrated above:

http://www.diveintopython.org/http_web_services/summary.html
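A minimal sketch of such a request, assuming you pick a User-Agent string that honestly identifies your script (the one below is made up):

import urllib2

def get_data_polite(url):
    # identify ourselves to the server instead of using the default User-Agent
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'apress-deal-checker/0.1')
    return urllib2.urlopen(request)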

3. A Book class to wrap and return the book details, rather than printing them as we go.
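A possible shape, as a sketch only (the attribute names are my own invention):

class Book:
    """Value object holding the scraped details of one book."""
    def __init__(self, name, url, description, details):
        self.name = name                # book title
        self.url = url                  # absolute URL of the details page
        self.description = description  # blurb from the daily-deal page
        self.details = details          # list of detail strings (author, ISBN, price, ...)

    def __str__(self):
        return '%s (%s)' % (self.name.encode('utf-8'), self.url)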

4. The code above is only valid as long as the Apress folks do not change their HTML structure. If they do, the navigation logic will need to be reworked to get at the correct data. This is one of the known perils of screen scraping.
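One cheap safeguard, not in the script above, is to fail loudly as soon as an expected marker disappears, rather than crash somewhere downstream:

book = bs.findAll(attrs={"class": "bookdetails"})
if not book:
    raise SystemExit("Apress page layout changed: no 'bookdetails' div found")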

5. If you copy-paste the code, make sure the indentation of the function body and the 'if'/'while' blocks survives the paste - Python is whitespace-sensitive.
