feedextractor - a quick and dirty python script to grab lots of feeds from web pages
While looking for new feeds to add to my RSS reader (NetNewsWire), I thought it might be nice to have a utility that would let me grab a web page, spider all of the outbound links, check to see which pages had feeds, and then create an opml file of new feeds I didn’t have already.
How’s that for a run-on sentence?
Alright, so in addition to being too lazy to click on every link, I’m also too lazy to write fancy code for this project. What I wanted was something quick and dirty. Something that got me 80% of the way there.
feedextractor.py is where I ended up.
This little python script uses the wonderful BeautifulSoup xml/html parsing library from Crummy Software. I highly recommend the soup and Lewis Carroll’s Alice in Wonderland.
Using feedextractor.py
1. Export your current list of feeds in opml format.
2. Rename the export file to “mysubscriptions.opml”
3. Place the export file in the same directory as feedextractor.py
4. Change the baseurl variable in feedextractor.py to the url of the page you would like to start from
5. Run feedextractor.py (i.e. [python feedextractor.py])
This will create a file called newfeeds.opml with all of the spidered feeds that do not appear to be in your current list of feeds.
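A note on "do not appear to be": the script compares host names, so a link counts as new only when its host shows up in none of your subscribed URLs. Here is a rough sketch of that check, written for Python 3 (the script itself is Python 2). The helper name is_new_site and the example URLs are made up for illustration and are not part of feedextractor.py.

```python
from urllib.parse import urlparse


def is_new_site(url, subscribed_urls):
    """Return True if the host of `url` appears in none of the subscribed URLs.

    Hypothetical helper: it mirrors the substring-on-host test the script uses.
    """
    host = urlparse(url).netloc
    if not host:
        # Relative links have no host to compare, so treat them as not new.
        return False
    return not any(host in known for known in subscribed_urls)


if __name__ == '__main__':
    # Made-up subscription list for the demo.
    subscribed = ['http://blog.example.com/',
                  'http://news.example.org/index.html']
    print(is_new_site('http://blog.example.com/2008/01/post.html', subscribed))
    print(is_new_site('http://other.example.net/', subscribed))
```

Because the match is a plain substring test on the host, it is deliberately loose: it errs on the side of skipping sites you might already have.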
Known “problems”
1. The script only takes the first feed from a found site. If there is an RSS and an Atom feed, the script will grab whichever one is at the top of the file. This means that multiple feeds are ignored. You might think this is a bad thing. If so, feel free to change it. I set it up this way because I didn’t want to comb through the found feeds and delete what amounts to duplicates.
2. The script has just one exception block. If something happens while trying to pull back a page, the script skips that site. Could be more elegant.
3. The script does not bother with parameters. Would be nice if you could just pass in a url… I know this is simple, but again I am in a hurry. I just wanted it to work. I’m not making a project here.
4. urllib2 gets rejected by some sites. True enough. Some web servers will reject a request from urllib2. If you want to go to the trouble of adding a user agent header, be my guest.
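If you do want to tackle items 3 and 4, here is a minimal sketch of one way to go about it. It is written for Python 3, where urllib2's role is filled by urllib.request, so it is not a drop-in patch for the script. DEFAULT_URL, USER_AGENT, seed_url and build_request are hypothetical names chosen for the example.

```python
import sys
import urllib.request

# Hypothetical values for illustration; not part of feedextractor.py.
DEFAULT_URL = 'http://www.example.com/links.html'
USER_AGENT = 'feedextractor-sketch/0.1'


def seed_url(argv):
    """Use the first command-line argument as the seed URL, if one was given."""
    return argv[1] if len(argv) > 1 else DEFAULT_URL


def build_request(url):
    """Build a request object that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def gethtml(url, timeout=15):
    """Fetch a page with the custom header; network errors go to the caller."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


if __name__ == '__main__':
    # Only reports what would be fetched, so the demo needs no network access.
    print('seed url: %s' % seed_url(sys.argv))
    print('user agent: %s' % USER_AGENT)
```

Whether a given server accepts the request still depends on the server; a custom User-Agent only removes the most common reason for a rejection.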
It works
This script comes as-is. Use it to your heart’s content. I’m not planning updates or anything else. Just a fun bit o’ code I whipped up to suit a need.
But it does work, and quite efficiently too (even for some sloppy-quick hacking).
Raw source after the jump…
#!/usr/bin/env python
# encoding: utf-8
"""
feedextractor.py
Created by Jamie Grove on 2008-01-30.
"""
from BeautifulSoup import BeautifulSoup
import urllib2
from xml.dom import minidom
from urlparse import urlparse
import socket
timeout = 15
socket.setdefaulttimeout(timeout)
subscribedsites = []
subscribedfeeds = []
# put your seed url here
baseurl = 'http://www.please-change-to-some-site.com/or-full/url.html'
# loadsubscriptions - imports your current opml list
def loadsubscriptions():
    global subscribedsites,subscribedfeeds
    dom = minidom.parse('mysubscriptions.opml')
    for node in dom.getElementsByTagName('outline'):
        # folder outlines carry no xmlUrl, so skip them
        if node.hasAttribute('xmlUrl'):
            subscribedsites.append(node.getAttribute('htmlUrl'))
            subscribedfeeds.append(node.getAttribute('xmlUrl'))
# gethtml(url) - fairly obvious, right?
def gethtml(url):
    html = urllib2.urlopen(url).read()
    return html
# extractlinks(html) - pulls out all the anchor tags from the html, skips sites you already have
# uses beautifulsoup to extract links
# 1) checks to see if the netloc of the anchor is in the list of subscribed sites
# 2) checks to see if the netloc of the anchor is in the list of links (keeps out the dupes)
# 3) checks to see if the netloc of the anchor is in the seed url
def extractlinks(html):
    global subscribedsites,subscribedfeeds,baseurl
    soup = BeautifulSoup(html)
    # href=True skips anchors that have no href attribute
    anchors = soup.findAll('a', href=True)
    links = []
    for a in anchors:
        o = urlparse(a['href'])
        if len([s for s in subscribedsites if o.netloc in s]) == 0 and len([s for s in links if o.netloc in s]) == 0 and o.netloc not in baseurl:
            links.append(a['href'])
    return links
# getfeed(html) - looks for feed URLs in the html you pass in
# uses beautifulsoup to extract links
# same basic logic as extract links to make sure you only get feeds you don't have
def getfeed(html):
    global subscribedsites,subscribedfeeds,baseurl
    soup = BeautifulSoup(html)
    linkedfiles = soup.findAll('link')
    feed = []
    for l in linkedfiles:
        # need both rel and href; fall back to the href when there is no title
        if l.has_key('rel') and l.has_key('href'):
            if l['rel'] == 'alternate':
                o = urlparse(l['href'])
                if len([s for s in subscribedfeeds if o.netloc in s]) == 0 and len([s for s in feed if o.netloc in s['href']]) == 0:
                    feed.append({'href':l['href'],'title':l.get('title', l['href'])})
    return feed
# main - unimaginative? yes, but it works
# 1) creates a opml stub
# 2) grabs the seed page and parses it for new links
# 3) goes out and gets feeds (if they exist)
# 4) adds feeds to the stub opml
# 5) writes the opml file out for import elsewhere
def main():
    global subscribedsites,subscribedfeeds,baseurl
    loadsubscriptions()
    html = gethtml(baseurl)
    links = extractlinks(html)
    xml = minidom.Document()
    opml = xml.createElement('opml')
    opml.appendChild(xml.createElement('head'))
    body = xml.createElement('body')
    print '%d links' % len(links)
    counter = 0
    for l in links:
        counter = counter + 1
        # utf-8 can encode any link; latin-1 raises on characters outside its range
        print 'processing link %d - %s' % (counter,l.encode('utf-8'))
        try:
            html = gethtml(l)
            feed = getfeed(html)
            if len(feed) > 0:
                for f in feed:
                    outline = xml.createElement('outline')
                    outline.setAttribute('title',f['title'])
                    outline.setAttribute('htmlUrl',l)
                    outline.setAttribute('xmlUrl',f['href'])
                    body.appendChild(outline)
        except Exception:
            # skip the site on any error, but still let Ctrl-C stop the run
            print 'Could not get %s'% l.encode('utf-8')
    opml.appendChild(body)
    xml.appendChild(opml)
    fp = open("newfeeds.opml","w")
    # writexml(self, writer, indent='', addindent='', newl='', encoding=None)
    xml.writexml(fp, " ", "", "\n", "UTF-8")
    fp.close()
if __name__ == '__main__':
    main()
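One thing worth knowing about getfeed: it accepts any link tag with rel="alternate", which can also match things that are not feeds, such as language variants of a page. A stricter version would also look at the type attribute. The following is a sketch of that idea only, using Python 3 and the standard-library html.parser in place of BeautifulSoup. FeedLinkParser, find_feeds and the sample page are invented for the example.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = ('application/rss+xml', 'application/atom+xml')


class FeedLinkParser(HTMLParser):
    """Collect <link rel="alternate"> tags whose type marks them as feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attr = dict(attrs)
        # Attribute values can be None (valueless attributes), hence the `or ''`.
        rel = (attr.get('rel') or '').lower().split()
        mime = (attr.get('type') or '').lower()
        href = attr.get('href')
        if 'alternate' in rel and mime in FEED_TYPES and href:
            # Fall back to the href when the tag has no title.
            self.feeds.append({'href': href,
                               'title': attr.get('title') or href})


def find_feeds(html):
    """Return every feed advertised in the page, in document order."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds


if __name__ == '__main__':
    # Made-up page: one real feed, one language variant, one stylesheet.
    page = '''<html><head>
    <link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/rss.xml">
    <link rel="alternate" hreflang="de" href="http://example.com/de/">
    <link rel="stylesheet" href="/style.css">
    </head></html>'''
    print(find_feeds(page))
```

Unlike the script, this returns every matching feed rather than only the first one, so keep just the first entry if you want the original "one feed per site" behavior.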
P.S. Looking for a good book on Python Network Programming? I highly recommend John Goerzen’s Foundations of Python Network Programming. John’s style is engaging and easy to read. His examples are practical and clear (way better than the shoddy code I wrote above).
This book will have you spinning ideas and code so fast you’ll wonder how you got along without it.

Awesome, thanks!
I’ve been looking for one of these off and on for a couple weeks now and tonight decided to (finally) learn python and write it in that, mostly because I thought there would probably be someone out there who has either done it and posted it or done a lot of the little pieces that I could take and piece together. Thanks for your generosity!!
-Brian