Programming tricks related to ActionScript, Python, C/C++ and Java. See the comments for answers.
Tuesday, May 5, 2009
Python: extract URLs in web pages
Given a URL of a web page, extract all the URLs from that web page. Do this in one line. Assume that URLs in the web page are of the form <a href="...">...</a>. Bonus Question: also return the text in the <a>text</a> tag for the given URL.
4 comments:
Without giving it much thought, I came up with the following to extract the URLs. It uses the urllib and re modules. Assuming the web page URL is in the 'url' variable, the following returns all the URLs in that web page.
>>> import re, urllib
>>> re.findall('href=[\"\']([^\"\']+)[\"\']', urllib.urlopen(url).read(), re.I)
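For readers on Python 3, where urllib.urlopen moved to urllib.request.urlopen, a minimal sketch of the same one-liner; the sample html string below is an assumption that stands in for the downloaded page, so the example runs without a network connection:

```python
import re

# Stand-in for the downloaded page; in Python 3 you would obtain it with
# urllib.request.urlopen(url).read().decode()
html = '<a href="http://example.com">Example</a> <a href=\'/about\'>About</a>'

# Same idea as the one-liner above: capture whatever sits between
# the quotes after href=, case-insensitively
links = re.findall(r'''href=["']([^"']+)["']''', html, re.I)
print(links)  # ['http://example.com', '/about']
```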
Is there any way to store the URLs if we extract them in this manner? I mean, in a string or a list?
You can store the return value (which is a list) in a variable. For example, the following stores the URLs in links, which is then printed out. Note that the pattern is wrapped in triple quotes (''') so that the double quote and single quote inside the character class ["'] need no escaping.
>>> import re, urllib
>>> links = re.findall('''href=["']([^"']+)["']''', urllib.urlopen('http://kundansingh.com').read(), re.I)
>>> print links
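The bonus question, returning the text inside the <a> tag along with the URL, goes unanswered in the thread. A sketch along the same regex lines, assuming the simple <a href="...">text</a> form the post describes (the html string here is sample input, not a real page):

```python
import re

# Sample page content in the simple form the post assumes
html = '<a href="http://example.com">Example</a> <a href="/about">About us</a>'

# Two capture groups: the quoted URL and the link text before </a>
pairs = re.findall(r'''<a\s+href=["']([^"']+)["']\s*>([^<]*)</a>''', html, re.I)
print(pairs)  # [('http://example.com', 'Example'), ('/about', 'About us')]
```

As with the original one-liner, this only handles the plain form described in the question; real-world tags with extra attributes would need an HTML parser.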
Ah, I understood it quicker than I thought I would!
>>> links = re.findall('''href=["]([^"]+)["]''', urllib2.urlopen(u).read(), re.I)
fixes the problem, I think :)