Programming Tricks: Python: extract URLs in web pages

Tuesday, May 5, 2009

Python: extract URLs in web pages

Given a URL of a web page, extract all the URLs from that web page. Do this in one line. Assume that URLs in the web page are of the form <a href="...">...</a>. Bonus Question: also return the text in the <a>text</a> tag for the given URL.

4 comments:

Kundan Singh said...: Without giving it much thought, I came up with the following to extract the URLs. It uses the urllib and re packages, so import both first. Assuming the web page URL is in the variable 'url', the following returns all the URLs in that web page.

>>> re.findall('href=[\"\'](.[^\"\']+)[\"\']', urllib.urlopen(url).read(), re.I)

May 5, 2009 at 3:46 PM
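The snippet above is written for Python 2, where urllib.urlopen exists. What follows is a minimal sketch of the same idea for Python 3, not part of the original comment. It assumes the page decodes as UTF-8, and the names HREF_PATTERN, extract_urls and fetch_urls are hypothetical.

```python
import re
import urllib.request

# Same pattern as in the comment above: href= followed by a quoted value.
HREF_PATTERN = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def extract_urls(html):
    """Return every quoted href value found in an HTML string."""
    return HREF_PATTERN.findall(html)


def fetch_urls(url):
    """Download the page at url and return its href values.

    Assumption: the page decodes as UTF-8; undecodable bytes are replaced.
    """
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_urls(html)
```

Splitting the work into two functions keeps the regular expression testable on a plain string without network access. Because of the leading `.` in the captured group, the pattern needs an href value of at least two characters, so a bare href="#" would be skipped.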
Unknown said...: Is there any way to store the URLs if we extract them in this manner? I mean in a string or a list.; November 3, 2009 at 5:40 AM
Kundan Singh said...: You can store the return value (which is a list) in a variable. For example, the following commands store the URLs in links, which is then printed out. Note that the pattern below is delimited by triple quotes ''' so that the double quote and single quote inside it ("') need no escaping.

>>> import re, urllib
>>> links = re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen('http://kundansingh.com').read(), re.I)
>>> print links

November 8, 2009 at 10:20 AM
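To address the "string or a list" part of the question, here is a small sketch in Python 3 syntax. It is an illustration rather than the commenter's code: the sample html string is hypothetical and stands in for the downloaded page, and links_text is an assumed name.

```python
import re

# Hypothetical sample page, standing in for the downloaded page contents.
html = '<a href="http://example.com/">Home</a> <a href=\'/about\'>About</a>'

# re.findall returns a list with one entry per match of the captured group.
links = re.findall(r'''href=["'](.[^"']+)["']''', html, re.I)

# If a single string is wanted, the list can be joined, one URL per line.
links_text = "\n".join(links)

# The list itself can be iterated like any other Python list.
for link in links:
    print(link)
```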
Leon said...: Ah, I understood quicker than I thought I would!

links = re.findall('''href=["](.[^"]+)["]''', urllib2.urlopen(u).read(), re.I)

fixes the problem I think :); January 12, 2011 at 10:45 AM
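None of the comments above tackles the bonus question from the post, returning the link text along with each URL. One possible sketch, in Python 3 and under the post's stated assumption that links have the form <a href="...">...</a>, uses a second capture group. The names ANCHOR_PATTERN and extract_links are hypothetical.

```python
import re

# Assumption, taken from the post: links look like <a href="...">...</a>.
# Group 1 captures the URL, group 2 captures everything up to the next </a>.
ANCHOR_PATTERN = re.compile(
    r'''<a\s[^>]*?href=["']([^"']+)["'][^>]*>(.*?)</a>''',
    re.I | re.S,
)


def extract_links(html):
    """Return a list of (url, text) tuples, one per anchor found in html."""
    return ANCHOR_PATTERN.findall(html)
```

With two groups in the pattern, re.findall returns a list of tuples instead of a list of strings. Any markup nested inside the anchor, such as an <img> or <b> tag, is returned as raw text in the second element. For pages that stray from the simple form assumed here, an HTML parser such as the standard library's html.parser is likely to be more robust than a regular expression.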
