Tuesday, May 5, 2009

Python: extract URLs in web pages

Given the URL of a web page, extract all the URLs from that web page. Do this in one line. Assume that URLs in the web page are of the form <a href="...">...</a>. Bonus Question: also return the link text (the text between <a ...> and </a>) for each extracted URL.

4 comments:

Kundan Singh said...

Without giving it much thought, I came up with the following to extract the URLs. It uses the urllib and re modules. Assuming the URL of the web page is in the variable url, the following returns a list of all the URLs in that web page.

>>> import re, urllib
>>> re.findall('href=[\"\']([^\"\']+)[\"\']', urllib.urlopen(url).read(), re.I)
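
For readers on Python 3, where urlopen lives in urllib.request, a rough equivalent might look like the sketch below. The helper names extract_urls and extract_urls_from_page, the sample HTML, and the UTF-8 decoding are assumptions made for illustration. The pattern here captures only non-quote characters between the quotes, so single-character and empty href values are handled predictably.

```python
import re
from urllib.request import urlopen

# Capture whatever sits between the quotes that follow href= (case-insensitive).
HREF_PATTERN = re.compile(r'''href=["']([^"']+)["']''', re.I)

def extract_urls(html):
    """Return every quoted, non-empty href value found in an HTML string."""
    return HREF_PATTERN.findall(html)

def extract_urls_from_page(url):
    """Fetch a page and return its href values.

    Decoding as UTF-8 is an assumption; real pages may declare another charset.
    """
    with urlopen(url) as response:
        return extract_urls(response.read().decode('utf-8', errors='replace'))

# Hypothetical sample input, so the sketch can be tried without network access.
sample = '<a href="http://example.com/a">A</a> <A HREF=\'/b\'>B</a>'
print(extract_urls(sample))
```

Like the one-liner, this is a regular-expression shortcut rather than a real HTML parser, so it will also pick up href attributes on tags other than <a>, such as <link>.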

Unknown said...

Is there any way to store the URLs if we extract them in this manner? I mean in a string or a list.

Kundan Singh said...

You can store the return value (which is a list) in a variable. For example, the following commands store the URLs in links, which is then printed out. Note that the pattern is wrapped in triple quotes ''' so that the double quote and single quote "' inside it do not need to be escaped.

>>> import re, urllib
>>> links = re.findall('''href=["']([^"']+)["']''', urllib.urlopen('http://kundansingh.com').read(), re.I)
>>> print links
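
The same approach can probably be stretched to the bonus question from the post, returning the link text along with each URL. The following is a minimal Python 3 sketch, not a definitive answer: the name extract_links and the sample HTML are hypothetical, and it assumes anchors are not nested and that attribute values do not contain a > character.

```python
import re

# Group 1 is the href value, group 2 is everything between <a ...> and </a>.
# re.S lets the link text span several lines; re.I accepts <A HREF=...> too.
ANCHOR_PATTERN = re.compile(
    r'''<a\s[^>]*?href=["']([^"']+)["'][^>]*>(.*?)</a>''',
    re.I | re.S,
)

def extract_links(html):
    """Return a list of (url, text) tuples, one per <a href="...">...</a>."""
    return ANCHOR_PATTERN.findall(html)

# Hypothetical sample input, so the sketch can be tried without network access.
sample = (
    '<p><a href="http://example.com/a">First</a> and '
    '<a class="x" href=\'/b\'>Second link</a></p>'
)
links = extract_links(sample)
for url, text in links:
    print(url, text)
```

Because the result is an ordinary list of tuples, it can be stored, sliced, or turned into a dictionary with dict(links). Any markup inside the link text (for example <b>...</b>) is returned as-is.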

Leon said...

Ah, I understood quicker than I thought I would!

links = re.findall('''href=["](.[^"]+)["]''', urllib2.urlopen(u).read(), re.I)

fixes the problem I think :)