Writing Web Clients
Web Clients -- The Tutorial
- Welcome
- Gimmick -- Buffy quotes
Anya (Family, season 5) -- Thank you for coming. We value your patronage.
What Are Web Clients?
- Clarification: non-interactive web clients
- Special purpose
- Often, quick and dirty hacks
- Make a web page into an API
Giles (Family, season 5) -- Could we please be a little less effusive, Anya?
What Are Web Clients Useful For?
- Mass download
- Periodic checking
- Automating tasks
- Make a web page more friendly
Harmony (Family, season 5) -- Aww. You're my little lamb.
Review of Modules
- htmllib
- sgmllib
- httplib
- urllib
- urllib2
- urlparse
Buffy (Family, season 5) -- Your definition of narrow is impressively wide.
Modules -- htmllib
- Most useful for easy filtering of images
- ...or links
- Other things often easier with sgmllib
- Or with re
- Or with string manipulation
Xander (Family, season 5) -- The answer is somewhere here.
Modules -- htmllib -- idiomatic usage
# For links
import htmllib, formatter
h = htmllib.HTMLParser(formatter.NullFormatter())
h.feed(htmlString)
print h.anchorlist
Xander (Family, season 5) -- I'm helping, I'm reading, I'm quiet.
Modules -- htmllib -- idiomatic usage (cont'd)
# For images
import htmllib, formatter
class IMGFinder(htmllib.HTMLParser):
    def __init__(self, *args, **kw):
        htmllib.HTMLParser.__init__(self, *args, **kw)
        self.ims = []
    # Called once per <img> tag; keep only the src attribute
    def handle_image(self, src, *args): self.ims.append(src)
h = IMGFinder(formatter.NullFormatter())
h.feed(htmlString)
print h.ims
Donny (Family, season 5) -- Look what I found!
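The two htmllib idioms above can be sketched for Python 3, where htmllib no longer exists and html.parser is the closest standard replacement. This is only an illustration: the class name LinkAndImageFinder and the sample page are hypothetical, not part of the original slides.

```python
from html.parser import HTMLParser


class LinkAndImageFinder(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.anchors.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


# Hypothetical page, used only for illustration
html_string = (
    '<html><body><a href="/a.html">A</a>'
    '<img src="logo.gif" alt="logo">'
    '<a href="b.html">B</a></body></html>'
)

finder = LinkAndImageFinder()
finder.feed(html_string)
finder.close()
print(finder.anchors)
print(finder.images)
```

Unlike htmllib, html.parser needs no formatter object, but it also keeps no anchorlist for you, so the subclass has to collect the links itself.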
Modules -- htmllib -- base
- Some sites set a 'base' URL, which changes how relative links resolve
- For example, Zope does
- In the examples above, 'h.base' holds the base
Dawn (Family, season 5) -- This is the source of my gladness.
Modules -- htmllib -- base (example)
- If the page at http://example.com/foo/bar.html has a link to '../baz.html'
- It means http://example.com/baz.html
- If the original page has base='/foo/quux/' (note the trailing slash)
- It means http://example.com/foo/baz.html
Riley (Family, season 5) -- Every time I think I'm getting close to you...
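The base example above can be checked with urljoin. This is a small sketch using Python 3's urllib.parse (the successor of urlparse); the variable names are only for illustration. Note that the trailing slash on the base matters: with base '/foo/quux/' the link resolves under /foo/, while with '/foo/quux' it resolves to the same place as without any base.

```python
from urllib.parse import urljoin

page_url = "http://example.com/foo/bar.html"
link = "../baz.html"

# No <base>: the link resolves against the page's own URL
without_base = urljoin(page_url, link)

# With a <base>: resolve the base against the page URL first,
# then resolve the link against the result
with_slash = urljoin(urljoin(page_url, "/foo/quux/"), link)
without_slash = urljoin(urljoin(page_url, "/foo/quux"), link)

print(without_base)   # http://example.com/baz.html
print(with_slash)     # http://example.com/foo/baz.html
print(without_slash)  # http://example.com/baz.html
```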
Modules -- urllib/urllib2
- High-level interface
- Treat URLs as file-like objects
- ...but still allows low-level operations
- urlopen() interface largely compatible between the two
Glory (Family, season 5) -- I am great and I am beautiful.
Modules -- urllib/urllib2 (cont'd)
- urllib2 can also work through an object interface
- More flexible
- That interface is no longer compatible with urllib
- urllib2 is usually the better choice
Joyce (Ted, season 2) -- He redid my entire system.
Modules -- urllib/urllib2 (examples)
- urllib.urlopen("http://www.yahoo.com/").read() -> contents
- urllib.urlopen("http://www.yahoo.com/").info() -> headers
- Same works with urllib2
- Automatically uses environment variables for proxies
- urllib2 supports proxies with authentication
Xander (Ted, season 2) -- Yum-my!
Digression -- HTTP Overview
- Request/Response
- Request is a command line followed by headers followed by an optional body
- Response is a status line (with a status code) followed by headers followed by body
- No welcome message
Tara (Family, season 5) -- ...in terms of the karmic cycle.
Example HTTP Sessions
GET /foo/bar.html HTTP/1.0
Host: www.example.org
<blank line>
HTTP/1.0 200 OK
Content-Type: text/html
<blank line>
<html><body>lalalala</body></html>
Giles (Family, season 5) -- And you are talking about what on earth?
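To make the message layout in the session above concrete, here is a minimal Python 3 sketch that builds the request bytes and pulls a canned response apart. The helpers build_request and parse_response are hypothetical names invented for this illustration. They do no networking and skip many details a real client needs, such as repeated headers or a status line without a reason phrase.

```python
def build_request(method, path, headers, body=b""):
    """Command line, then headers, then a blank line, then the body."""
    lines = ["%s %s HTTP/1.0" % (method, path)]
    lines += ["%s: %s" % (name, value) for name, value in headers]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


def parse_response(raw):
    """Split a raw response into (status code, reason, headers, body)."""
    # The first blank line separates the head from the body
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


request = build_request("GET", "/foo/bar.html", [("Host", "www.example.org")])

# Canned response matching the example session
raw_response = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>lalalala</body></html>"
)
status, reason, headers, body = parse_response(raw_response)
print(status, reason, headers["content-type"])
```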
Modules -- httplib
- Low-level interface to innards of HTTP
- Absolute control
- No abstractions
Mr. MacLay (Family, season 5) -- We know how to control her...problem.
Modules -- httplib -- example
- Note: usually, the Host header is important
>>> import httplib
>>> h=httplib.HTTP("moshez.org")
>>> h.putrequest('GET', '/')
>>> h.putheader('Host', 'moshez.org')
>>> h.endheaders()
>>> h.getreply()
(200, 'OK', <mimetools.Message instance at 0x81220dc>)
>>> h.getfile().read(10)
"<HTML>\n<HE"
Anya (Family, season 5) -- ...and it was fun!
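The same low-level sequence (putrequest, putheader, endheaders, read the reply) can be sketched in Python 3, where httplib became http.client. To keep the example self-contained it talks to a throwaway server on localhost instead of a real site. HelloHandler and fetch_root are hypothetical names introduced for this sketch.

```python
import http.client
import http.server
import threading


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Tiny stand-in server so the client below has something to talk to."""

    def do_GET(self):
        body = b"<html><body>lalalala</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch_root(host, port):
    """Issue GET / by hand and return (status, reason, first 10 bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        # skip_host=True so the Host header is written explicitly below
        conn.putrequest("GET", "/", skip_host=True)
        conn.putheader("Host", "%s:%d" % (host, port))
        conn.endheaders()
        response = conn.getresponse()
        return response.status, response.reason, response.read(10)
    finally:
        conn.close()


# Port 0 asks the operating system for any free port
server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, reason, first_bytes = fetch_root("127.0.0.1", server.server_address[1])
finally:
    server.shutdown()
    server.server_close()

print(status, reason, first_bytes)
```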
Modules -- urlparse
- urlparse.urljoin -- like os.path.join for URLs
- For path manipulation
- urlparse.urlsplit
- urlparse.urlunsplit
Buffy (Family, season 5) -- You know what, you guys, just leave it here.
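A short sketch of the three functions listed above, written against Python 3's urllib.parse (the successor of the urlparse module). The URL is made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# urlsplit breaks a URL into scheme, netloc, path, query and fragment
parts = urlsplit("http://www.example.com/episode/index.html?season=1#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urlunsplit is the inverse: swap the path, drop query and fragment, reassemble
rebuilt = urlunsplit(
    (parts.scheme, parts.netloc, "/episode/trans_01.shtml", "", "")
)

# urljoin resolves a relative reference against a base URL
joined = urljoin("http://www.example.com/episode/", "trans_01.shtml")

print(rebuilt)
print(joined)
```

Splitting and reassembling is handy when only one component should change; urljoin is the right tool when following links found in a page.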
Downloading Dilbert
import re, urllib, urllib2, urlparse
URL = 'http://www.dilbert.com/'
f = urllib2.urlopen(URL)
s = f.read()
href = re.compile('<a href="(/comics/.*?/dilbert.*?gif)">')
m = href.search(s)
# urlretrieve lives in urllib, not urllib2
urllib.urlretrieve(urlparse.urljoin(URL, m.group(1)),
                   "dilbert.gif")
Tara (Family, season 5) -- That was funny if you [...] are a complete dork.
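The scraping step of the Dilbert example (regex search, then urljoin) can be tried offline. The sketch below uses Python 3 module names and a made-up snippet of HTML. find_comic_url and the sample markup are hypothetical, and the real page layout may differ.

```python
import re
from urllib.parse import urljoin

URL = "http://www.dilbert.com/"
HREF_RE = re.compile(r'<a href="(/comics/.*?/dilbert.*?gif)">')


def find_comic_url(page_html, page_url=URL):
    """Return the absolute URL of the first matching comic link, or None."""
    match = HREF_RE.search(page_html)
    if match is None:
        return None
    return urljoin(page_url, match.group(1))


# Hypothetical markup, only for illustration
sample = (
    '<p><a href="/about">About</a> '
    '<a href="/comics/dilbert/archive/images/dilbert2001.gif">Today</a></p>'
)
comic_url = find_comic_url(sample)
print(comic_url)
```

Checking for a failed match matters in practice: quick hacks like this break whenever the site changes its markup, and a clear None is easier to handle than an AttributeError.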
Downloading Dark Angel Transcripts
- Common situation of mass download
import re, urllib, urllib2, urlparse, htmllib, formatter, posixpath
URL = "http://www.darkangelfan.com/episode/"
LINK_RE = re.compile(r'/trans_[0-9]+\.shtml$')
s = urllib2.urlopen(URL).read()
h = htmllib.HTMLParser(formatter.NullFormatter())
h.feed(s)
links = [urlparse.urljoin(URL, link)
         for link in h.anchorlist if LINK_RE.search(link)]
### -- really download --
for link in links:
    # urlretrieve lives in urllib, not urllib2
    urllib.urlretrieve(link, posixpath.basename(link))
Intern (Family, season 5) -- Yeah. That makes like five this month.
Downloading Dark Angel Transcripts (select)
class Downloader:
    def __init__(self, fin, fout):
        self.fin, self.fout, self.fileno = fin, fout, fin.fileno
    def read(self):
        buf = self.fin.read(4096)
        if not buf:
            for f in [self.fout, self.fin]: f.close()
            return 1
        self.fout.write(buf)
Joyce (Ted, season 2) -- I've been looking for the right moment.
Downloading Dark Angel Transcripts (select, cont'd)
- Same code up to 'really download'
import select
downloaders = [Downloader(urllib2.urlopen(link),
                          open(posixpath.basename(link), 'wb'))
               for link in links]
while downloaders:
    # select() returns (readable, writable, exceptional); only readable matters here
    toRead = select.select(downloaders, [], [])[0]
    for downloader in toRead:
        if downloader.read():
            downloaders.remove(downloader)
Buffy (Family, season 5) -- Tara's damn birthday is just one too many things for me to worry about.
Downloading Dark Angel Transcripts (threads)
import threading, urllib
for link in links:
    # One thread per link; start() is what actually runs the download
    threading.Thread(target=urllib.urlretrieve,
                     args=(link, posixpath.basename(link))).start()
Buffy (Ted, season 2) -- Sounds like fun.
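A Python 3 sketch of the one-thread-per-link approach above, with a join so the caller knows when everything has finished. download_all is a hypothetical helper. Its fetch parameter defaults to urllib.request.urlretrieve but can be swapped for a stub, which is how the usage below avoids touching the network.

```python
import posixpath
import threading
from urllib.request import urlretrieve


def download_all(links, fetch=urlretrieve):
    """Run fetch(link, local_name) for every link, one thread per link."""
    threads = [
        threading.Thread(target=fetch, args=(link, posixpath.basename(link)))
        for link in links
    ]
    for thread in threads:
        thread.start()
    # Wait for every download before returning
    for thread in threads:
        thread.join()


# Usage with a stub instead of a real download (URLs are made up)
fetched = []
download_all(
    ["http://www.example.com/episode/trans_01.shtml",
     "http://www.example.com/episode/trans_02.shtml"],
    fetch=lambda url, filename: fetched.append((url, filename)),
)
print(sorted(fetched))
```

One thread per link is fine for a handful of files; for hundreds of links a bounded pool would be kinder to both ends.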
Digression -- twisted.web.client
- Part of the Twisted networking framework
- High level interface to HTTP client
- Completely asynchronous
- Reports results via callbacks
- client.getPage("http://www.yahoo.com").addCallbacks(gotResult, gotError)
Buffy (Ted, season 2) -- You're supposed to use your powers for good!
Downloading Dark Angel Transcripts (web.client)
from twisted.web import client
from twisted.internet import reactor, defer
defer.DeferredList(
    [client.downloadPage(link, posixpath.basename(link))
     for link in links]).addBoth(lambda _: reactor.stop())
reactor.run()
Ted (Ted, season 2) -- You don't have to worry about anything.
HTTP Authentication
- Client attempts to connect
- Server sends back a 401 (please authenticate)
- Client sends same request back -- with auth tokens
- Only HTTP Basic authentication widely supported
- Client can send auth tokens on more requests automatically
Buffy (Ted, season 2) -- Ummm... Who are these people?
HTTP Authentication -- manually
- In HTTP, authentication is a header
- Basic authentication is sending username and password, base64-encoded
user = 'moshez'
password = 's3kr1t'
import base64, httplib
h = httplib.HTTP("localhost")
h.putrequest('GET', '/protected/stuff.html')
# The header value is the scheme name followed by base64("user:password")
h.putheader('Authorization',
            'Basic ' + base64.encodestring(user+":"+password).strip())
h.endheaders()
h.getreply()
print h.getfile().read()
Tara (Family, season 5) -- And, uh, these are my-my friends.
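The Authorization header value for Basic authentication is the word "Basic", a space, and the base64 encoding of "user:password". A minimal Python 3 sketch follows. basic_auth_header is a hypothetical helper name, and encoding the credentials as UTF-8 is an assumption of this sketch.

```python
import base64


def basic_auth_header(user, password):
    """Return the value for an 'Authorization' header using the Basic scheme."""
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


header_value = basic_auth_header("moshez", "s3kr1t")
print("Authorization:", header_value)
```

Base64 is an encoding, not encryption: anyone who sees the header can recover the password, so Basic authentication is only as private as the connection carrying it.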
HTTP Authentication -- urllib2
- Can read username/password from URL
- urllib2.urlopen("http://moshez:s3krit@example.com"
"/protected/stuff.html")
Xander (Ted, season 2) -- I am really jinxing the hell out of us.
Further Reading
Willow (Ted, season 2) -- 'Book-cracker Buffy', it's kind of her nickname.
Questions?
Buffy (Family, season 5) -- I let you come, now sit down and look studious.
Bonus Slides
Tara (Family, season 5) -- You always make me feel special.
Cookies
- Carry state from one page to another
- Server sends header: Set-Cookie
- Client sends on later requests header: Cookie
Ted (Ted, season 2) -- Who's up for dessert? I made chocolate-chip cookies!
urllib2 cookies
- Unfortunately, no automatic cookie jar support
- Can manually use .info() to read cookies...
- ...and the Request() API to send them to the server
Joyce (Ted, season 2) -- Mm! Buffy, you've got to try one of these!
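The manual cookie handling described above amounts to turning a Set-Cookie value from the response into a Cookie value for the next request. One way to sketch that in Python 3 is with http.cookies.SimpleCookie, which strips attributes such as Path. cookie_header_from_set_cookie and the sample cookie are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Turn a Set-Cookie value into the value of a Cookie request header."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Keep only name=value pairs; attributes (Path, Expires, ...) stay behind
    return "; ".join(
        "%s=%s" % (name, morsel.coded_value) for name, morsel in jar.items()
    )


# Made-up cookie, only for illustration
set_cookie = "session=abc123; Path=/; HttpOnly"
cookie_header = cookie_header_from_set_cookie(set_cookie)
print("Cookie:", cookie_header)
```

This ignores everything a real cookie jar tracks (domain, path, and expiry matching), which is acceptable for a quick script that talks to a single site.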
Logging Into Advogato
import urllib, urllib2
# urlencode lives in urllib, not urllib2
u = urllib2.urlopen("http://advogato.org/acct/loginsub.html",
                    urllib.urlencode({'u': 'moshez',
                                      'pass': 'not my real pass'}))
cookie = u.info()['set-cookie']
# Keep only the name=value part, dropping attributes after the first ';'
cookie = cookie[:cookie.find(';')]
r = urllib2.Request('http://advogato.org/diary/post.html',
                    urllib.urlencode(
                        {'entry': open('entry').read(), 'post': 'Post'}),
                    {'Cookie': cookie})
urllib2.urlopen(r).read()
Anya (Family, season 5) -- I have a place in the world now.
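The second half of the login example, a form POST that carries a cookie, can be built and inspected without sending anything. This Python 3 sketch uses urllib.parse.urlencode and urllib.request.Request. The cookie value and diary text are placeholders, not real Advogato data.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form fields are URL-encoded and must be bytes when used as a request body
form = urlencode({"entry": "Hello, diary", "post": "Post"}).encode("ascii")

request = Request(
    "http://advogato.org/diary/post.html",
    data=form,
    headers={"Cookie": "id=hypothetical"},  # placeholder cookie value
)

# Supplying data switches the method from GET to POST
print(request.get_method())
print(request.get_header("Cookie"))
print(request.data)
```

Passing the request to urllib.request.urlopen would actually send it; stopping here makes it easy to check what is about to go over the wire.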
On Being Nice -- Robots
- Some sites don't want automatic crawlers
- It is up to you whether to play nice
- But you should know the rules before you break them
- Robots file -- at /robots.txt
Willow (Ted, season 2) -- There were design features in that robot that pre-date...
Using robotparser
import sys, robotparser
rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()
# Give up if the site's robots.txt disallows fetching the front page
if not rp.can_fetch('', 'http://www.example.com/'):
    sys.exit(1)
Buffy (Ted, season 2) -- Tell me you didn't keep any parts.
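The robots check can be tried without a network connection by feeding the parser the file's lines directly. This sketch uses Python 3's urllib.robotparser (the successor of the robotparser module). The robots.txt content and the "my-client" user agent are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, only for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
# parse() takes the lines of the file, so no HTTP request is made
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("my-client", "http://www.example.com/index.html")
blocked = rp.can_fetch("my-client", "http://www.example.com/private/data.html")
print(allowed, blocked)
```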
webchecker
- In the source distribution, in Tools/
- Understands robots.txt
- Can override which links get chased
Willow (Ted, season 2) -- What do you mean, check him out?
websucker
- In the source distribution, in Tools/
- Uses webchecker as a module
- Saves the pages it downloads
Buffy (Ted, season 2) -- Find out his secrets, hack into his life.