<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Field Guide to Programmers &#187; Open Source</title>
	<atom:link href="http://www.fieldguidetoprogrammers.com/category/open-source/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.fieldguidetoprogrammers.com</link>
	<description>Code, Toys, Bits of Odd Fluff</description>
	<lastBuildDate>Fri, 19 Jun 2009 16:05:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>feedextractor &#8211; a quick and dirty python script to grab lots of feeds from web pages</title>
		<link>http://www.fieldguidetoprogrammers.com/python/feedextractor-a-quick-and-dirty-python-script-to-grab-lots-of-feeds-from-web-pages/</link>
		<comments>http://www.fieldguidetoprogrammers.com/python/feedextractor-a-quick-and-dirty-python-script-to-grab-lots-of-feeds-from-web-pages/#comments</comments>
		<pubDate>Tue, 05 Feb 2008 02:18:13 +0000</pubDate>
		<dc:creator>jamiegrove</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.fieldguidetoprogrammers.com/blog/python/feedextractor-a-quick-and-dirty-python-script-to-grab-lots-of-feeds-from-web-pages/</guid>
		<description><![CDATA[While looking for new feeds to add to my RSS reader (NetNewsWire), I thought it might be nice to have a utility that would let me grab a web page, spider all of the outbound links, check to see which pages had feeds, and then create an opml file of new feeds I didn&#8217;t have [...]]]></description>
			<content:encoded><![CDATA[<p>While looking for new feeds to add to my RSS reader (NetNewsWire), I thought it might be nice to have a utility that would let me grab a web page, spider all of the outbound links, check to see which pages had feeds, and then create an OPML file of new feeds I didn&#8217;t have already.</p>
<p>How&#8217;s that for a run-on sentence? <img src='http://www.fieldguidetoprogrammers.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Alright, so in addition to being too lazy to click on every link, I&#8217;m also too lazy to write fancy code for this project.  What I wanted was something quick and dirty.  Something that got me 80% of the way there.</p>
<p>feedextractor.py is where I ended up.</p>
<p>This little Python script uses the <a href="http://www.crummy.com/software/BeautifulSoup/">wonderful BeautifulSoup XML/HTML parsing library</a> from Crummy Software.  I highly recommend both the soup and Lewis Carroll&#8217;s Alice in Wonderland.</p>
<h2>Using feedextractor.py</h2>
<p>1. Export your current list of feeds in OPML format.<br />
2. Rename the export file to &#8220;mysubscriptions.opml&#8221;.<br />
3. Place the export file in the same directory as feedextractor.py.<br />
4. Change the baseurl variable in feedextractor.py to the URL of the page you would like to start from.<br />
5. Run feedextractor.py (i.e. [python feedextractor.py]).</p>
<p>This will create a file called newfeeds.opml with all of the spidered feeds that do not appear to be in your current list of feeds.</p>
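<p>For the curious, here is roughly what happens to that export file.  This is a minimal sketch in Python 3 using only the standard library, not the script itself: load_subscriptions and build_sample_opml are names made up for this example, and the sample site and feed URLs are placeholders.</p>
<pre class="codebox">
```python
from xml.dom import minidom


def load_subscriptions(opml_text):
    """Return (site_urls, feed_urls) found in an OPML document given as text."""
    dom = minidom.parseString(opml_text)
    sites, feeds = [], []
    for node in dom.getElementsByTagName('outline'):
        # Folder outlines only group feeds and carry no URLs, so check first.
        if node.hasAttribute('htmlUrl'):
            sites.append(node.getAttribute('htmlUrl'))
        if node.hasAttribute('xmlUrl'):
            feeds.append(node.getAttribute('xmlUrl'))
    return sites, feeds


def build_sample_opml():
    """Build a tiny, made-up OPML export: one folder holding one feed."""
    doc = minidom.Document()
    opml = doc.createElement('opml')
    opml.setAttribute('version', '1.1')
    opml.appendChild(doc.createElement('head'))
    body = doc.createElement('body')
    folder = doc.createElement('outline')
    folder.setAttribute('text', 'Python')
    feed = doc.createElement('outline')
    feed.setAttribute('text', 'Example Blog')
    feed.setAttribute('htmlUrl', 'http://blog.example.com/')
    feed.setAttribute('xmlUrl', 'http://blog.example.com/feed/')
    folder.appendChild(feed)
    body.appendChild(folder)
    opml.appendChild(body)
    doc.appendChild(opml)
    return doc.toxml()


if __name__ == '__main__':
    sites, feeds = load_subscriptions(build_sample_opml())
    print(sites)
    print(feeds)
```
</pre>
<p>Building the sample with minidom keeps the example self-contained.  In practice you would read the text of mysubscriptions.opml and hand it to load_subscriptions.</p>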
<h2>Known &#8220;problems&#8221;</h2>
<p>1. The script only takes the first feed from a found site.  If there is an RSS and an Atom feed, the script will grab whichever one appears first in the page.  This means that multiple feeds are ignored.  You might think this is a bad thing.  If so, feel free to change it.  I set it up this way because I didn&#8217;t want to comb through the found feeds and delete what amounts to duplicates.</p>
<p>2. The script has just one exception block.  If anything goes wrong while fetching or parsing a page, the script skips that site and moves on.  Could be more elegant.</p>
<p>3. The script does not bother with parameters.  Would be nice if you could just pass in a url&#8230;  I know this is simple, but again I am in a hurry.  I just wanted it to work.  I&#8217;m not making a project here.</p>
<p>4. urllib2 gets rejected by some sites.  Some web servers refuse requests that arrive with urllib2&#8217;s default user agent.  If you want to go to the trouble of adding a user agent header, be my guest.</p>
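<p>If you do want to tackle problems 3 and 4, something along these lines might do.  It is a sketch under assumptions rather than part of feedextractor.py: it targets Python 3, where urllib.request takes the place of urllib2, and the names parse_args, build_request, main, and USER_AGENT (along with the user agent string itself) are invented for this example.</p>
<pre class="codebox">
```python
import argparse
import urllib.request

# Hypothetical identifier; any descriptive string should work.
USER_AGENT = 'feedextractor-sketch/0.1'


def parse_args(argv=None):
    """Read the seed URL from the command line instead of editing baseurl."""
    parser = argparse.ArgumentParser(description='Spider a page for feeds.')
    parser.add_argument('baseurl', help='URL of the page to start from')
    parser.add_argument('--timeout', type=float, default=15.0,
                        help='seconds to wait for each page')
    return parser.parse_args(argv)


def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def gethtml(url, timeout=15.0):
    """Fetch a page and return its raw bytes."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def main(argv=None):
    """Fetch the seed page named on the command line and report its size."""
    args = parse_args(argv)
    page = gethtml(args.baseurl, timeout=args.timeout)
    print('%d bytes from %s' % (len(page), args.baseurl))
```
</pre>
<p>Calling main() at the bottom of the file would let you run it as [python sketch.py http://www.some-site.com/page.html], where sketch.py is whatever you name the file.</p>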
<h2>It works</h2>
<p>This script comes as-is.  Use it to your heart&#8217;s content.  I&#8217;m not planning updates or anything else.  Just a fun bit o&#8217; code I whipped up to suit a need.</p>
<p>But it does work, and quite efficiently too (even for some sloppy-quick hacking).</p>
<p><a href="http://www.fieldguidetoprogrammers.com/downloads/feedextractor.py">Download feedextractor.py</a></p>
<p>Raw source after the jump&#8230;<br />
<span id="more-69"></span></p>
<hr/>
<code>
<pre class="codebox" style="width:900px;">
#!/usr/bin/env python
# encoding: utf-8
"""
feedextractor.py

Created by Jamie Grove on 2008-01-30.

"""

from BeautifulSoup import BeautifulSoup
import urllib2
from xml.dom import minidom
from urlparse import urlparse
import socket

timeout = 15
socket.setdefaulttimeout(timeout)

subscribedsites = []
subscribedfeeds = []

# put your seed url here
baseurl = 'http://www.please-change-to-some-site.com/or-full/url.html'

# loadsubscriptions - imports your current opml list
def loadsubscriptions():
	global subscribedsites,subscribedfeeds
	dom = minidom.parse('mysubscriptions.opml')
	for node in dom.getElementsByTagName('outline'):
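		# folder outlines have no feed attributes, so check before reading them
		if node.hasAttribute('htmlUrl'):
			subscribedsites.append(node.getAttribute('htmlUrl'))
		if node.hasAttribute('xmlUrl'):
			subscribedfeeds.append(node.getAttribute('xmlUrl'))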

# gethtml(url) - fairly obvious, right?
def gethtml(url):
	html = urllib2.urlopen(url).read()
	return html

# extractlinks(html)  - pulls out all the anchor tags from the html, skips sites you already have
#   uses beautifulsoup to extract links
#   1) checks to see if the netloc of the anchor is in the list of subscribed sites
#   2) checks to see if the netloc of the anchor is in the list of links (keeps out the dupes)
#   3) checks to see if the netloc of the anchor is in the seed url
def extractlinks(html):
	global subscribedsites,subscribedfeeds,baseurl
	soup = BeautifulSoup(html)
	# href=True skips anchors with no href attribute (a['href'] would raise KeyError)
	anchors = soup.findAll('a', href=True)
	links = []
	for a in anchors:
		o = urlparse(a['href'])
		if len([s for s in subscribedsites if o.netloc in s]) == 0 and len([s for s in links if o.netloc in s]) == 0 and o.netloc not in baseurl:
			links.append(a['href'])
	return links

# getfeed(html) - looks for feed URLs in the html you pass in
#   uses beautifulsoup to extract links
#   same basic logic as extract links to make sure you only get feeds you don't have
def getfeed(html):
	global subscribedsites,subscribedfeeds,baseurl
	soup = BeautifulSoup(html)
	linkedfiles = soup.findAll('link')
	feed = []
	for l in linkedfiles:
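		# only link tags with rel="alternate" and an href are treated as feeds
		if l.has_key('rel') and l.has_key('href'):
			if l['rel'] == 'alternate':
				o = urlparse(l['href'])
				if len([s for s in subscribedfeeds if o.netloc in s]) == 0 and len([s for s in feed if o.netloc in s['href']]) == 0:
					# not every link tag has a title, so fall back to the href
					feed.append({'href':l['href'],'title':l.get('title',l['href'])})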
	return feed

# main - unimaginative?  yes, but it works
#   1) creates a opml stub
#   2) grabs the seed page and parses it for new links
#   3) goes out and gets feeds (if they exist)
#   4) adds feeds to the stub opml
#   5) writes the opml file out for import elsewhere
def main():
	global subscribedsites,subscribedfeeds,baseurl
	loadsubscriptions()
	html = gethtml(baseurl)
	links = extractlinks(html)
	xml = minidom.Document()
	opml = xml.createElement('opml')
	opml.appendChild(xml.createElement('head'))
	body = xml.createElement('body')
	print '%d links' % len(links)
	counter = 0
	for l in links:
		counter = counter + 1
		# 'replace' keeps an unusual character in a url from crashing the script
		print 'processing link %d - %s' % (counter,l.encode('latin-1','replace'))
		try:
			html = gethtml(l)
			feed = getfeed(html)
			if len(feed) > 0:
				for f in feed:
					outline = xml.createElement('outline')
					outline.setAttribute('title',f['title'])
					outline.setAttribute('htmlUrl',l)
					outline.setAttribute('xmlUrl',f['href'])
					body.appendChild(outline)
		except Exception:
			# any fetch or parse failure skips this site; Ctrl-C still stops the script
			print 'Could not get %s'% l.encode('latin-1','replace')
	opml.appendChild(body)
	xml.appendChild(opml)
	fp = open("newfeeds.opml","w")
	# writexml(self, writer, indent='', addindent='', newl='', encoding=None)
	xml.writexml(fp, "    ", "", "\n", "UTF-8")
	# close the file so everything is flushed to disk
	fp.close()

if __name__ == '__main__':
	main()
</pre>
</code>
<hr/>
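<p>Here is one possible variation on getfeed, for anyone who wants every feed rather than just the first one per site.  This is a hedged sketch, not the script&#8217;s code: it assumes Python 3 and swaps BeautifulSoup for the standard library&#8217;s html.parser, it only accepts link tags whose type is an RSS or Atom MIME type, and FeedLinkParser, getfeeds, and FEED_TYPES are names made up here.</p>
<pre class="codebox">
```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in a link tag.
FEED_TYPES = ('application/rss+xml', 'application/atom+xml')


class FeedLinkParser(HTMLParser):
    """Collect every link tag that advertises an RSS or Atom feed."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attributes = dict(attrs)
        # Attributes without a value come back as None, hence the "or ''".
        rel = (attributes.get('rel') or '').lower().split()
        feed_type = (attributes.get('type') or '').lower()
        href = attributes.get('href')
        if 'alternate' in rel and feed_type in FEED_TYPES and href:
            self.feeds.append({
                # Resolve relative hrefs against the page they came from.
                'href': urljoin(self.page_url, href),
                # Fall back to the page URL when the tag has no title.
                'title': attributes.get('title') or self.page_url,
            })


def getfeeds(html, page_url):
    """Return a list of {'href', 'title'} dicts, one per advertised feed."""
    parser = FeedLinkParser(page_url)
    parser.feed(html)
    return parser.feeds
```
</pre>
<p>Passing the page URL in lets relative feed links such as /feed/ resolve to full URLs, which is what an OPML file wants in xmlUrl.</p>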
<p><b>P.S. Looking for a good book on Python Network Programming?</b>  I highly recommend John Goerzen&#8217;s Foundations of Python Network Programming.  John&#8217;s style is engaging and easy to read.  His examples are practical and clear (way better than the shoddy code I wrote above).</p>
<p>This book will have you spinning ideas and code so fast you&#8217;ll wonder how you got along without it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.fieldguidetoprogrammers.com/python/feedextractor-a-quick-and-dirty-python-script-to-grab-lots-of-feeds-from-web-pages/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Corporate Finance and Open Source</title>
		<link>http://www.fieldguidetoprogrammers.com/it/corporate-finance-and-open-source/</link>
		<comments>http://www.fieldguidetoprogrammers.com/it/corporate-finance-and-open-source/#comments</comments>
		<pubDate>Wed, 09 Jan 2008 15:46:42 +0000</pubDate>
		<dc:creator>jamiegrove</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[Open Source]]></category>

		<guid isPermaLink="false">http://www.fieldguidetoprogrammers.com/blog/it/corporate-finance-and-open-source/</guid>
		<description><![CDATA[For those of you banging your head on why it is difficult to get needed IT expenditures approved, Paul Keeble offers the following, highly-accurate, explanation: In corporations Open source is thriving, and not because its the strategy: In large corporations if you need to buy a tool for development, especially one no one has used [...]]]></description>
			<content:encoded><![CDATA[<p>For those of you banging your heads over why it is so difficult to get needed IT expenditures approved, Paul Keeble offers the following highly accurate explanation:</p>
<p><a href="http://www.jroller.com/BrightCandle/entry/in_corporations_open_source_is">In corporations Open source is thriving, and not because its the strategy</a>:</p>
<blockquote><p>In large corporations if you need to buy a tool for development, especially one no one has used before good luck in completing the order. The developer wanting the tool will have to justify it over and over and the odds of getting it aren&#8217;t dependent on the need but whether the department (and hence what time of the year it is) has any money. Big purchases tend to be much easier because they have to be approved further up in the organisation. Little purchases just drag on and on. I don&#8217;t know why it is the finance mechanism in companies causes so much grief, but the truth is they are not designed for quick purchases for strategic reasons. Any CFO that tells you otherwise is lying or has no idea. The real reason for finance control is to stop money leaking out and the company loosing money and not knowing why. The fact that redeveloping the library will take you 6 months is unfortunately lost on them as your an expected cost.</p></blockquote>
<p>[Via: <a href="http://www.jroller.com">JRoller</a>]</p>
<p>Paul goes on to expand this discussion to show how Open Source fits into the financing process, which is a key reason for its continuing adoption by corporations.</p>
<p>Of course, this doesn&#8217;t explain why some IT shops are afraid of open source, but my gut tells me it&#8217;s about responsibility.  IT shops unwilling to take responsibility for their own decisions and solutions tend to rely on vendors who are ready to serve as punching bags during a crisis.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.fieldguidetoprogrammers.com/it/corporate-finance-and-open-source/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
