Using Yahoo Pipes to create a "River of News" aggregate news source

published Mar 22, 2009 08:08   by admin ( last modified Mar 22, 2009 08:08 )

Recently, this blog was moved from one ISP to Amazon EC2. At the old location I used a Perl script to periodically download 60-odd news sources in Atom and RSS formats, and then used some custom Python code, in combination with the feedparser module, to parse and present them in Plone. I used this for years, and since the results were publicly available I know of at least one more person who used them (hi Johan!).

Yahoo Pipes

With the new location I decided to ditch those old scripts and go for an external solution. The reason? It was so darn simple to do. Yahoo has an Internet control language called Yahoo Pipes. It works by visually stringing together modules that each do a specific task, such as collecting, filtering or aggregating data. Data flows through connectors between the modules. The way it's done reminds me of the old visual programming language ProGraph.

With the old system I presented the five latest entries for each news source, grouped by source. One effect of this was that I had to visually scan past a number of "stale" news sources that hadn't been updated for a while. Another way to present news feeds is to mash them all together and sort them at the item level. This is called a "River of news". This is how I did it with Yahoo Pipes:

That is just three modules: a fetch feed module with a looong list of feed URLs (the blue connector is coming up from the bottom of that module in the screen shot), a sorter, and the output. The sorter auto-detects which criteria can be sorted on. You can see the output here. Click on the list tab to get away from the image flow.

To the left in the screen shot above you can see the modules available: Sources, user inputs, operators and more.
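
For comparison, here is roughly what those three modules do, sketched in plain Python. This is only an illustration and not what Yahoo Pipes runs internally: the feed names, titles and dates are made up, the items are assumed to be already fetched and parsed into dictionaries, and every pub date is assumed to be in RSS (RFC 822) format.

```python
from email.utils import parsedate_to_datetime

# Hypothetical, already-fetched items standing in for the "fetch feed" module.
FEEDS = {
    "Feed A": [
        {"title": "A1", "published": "Sat, 21 Mar 2009 10:00:00 +0000"},
        {"title": "A2", "published": "Thu, 19 Mar 2009 08:30:00 +0000"},
    ],
    "Feed B": [
        {"title": "B1", "published": "Sun, 22 Mar 2009 07:15:00 +0000"},
        {"title": "B2", "published": "Fri, 20 Mar 2009 12:00:00 +0000"},
    ],
}


def river_of_news(feeds):
    """Flatten all feeds into one list and sort it newest first."""
    items = []
    for source, entries in feeds.items():
        for entry in entries:
            items.append({
                "source": source,
                "title": entry["title"],
                # Parse the date string so that sorting is chronological.
                "published": parsedate_to_datetime(entry["published"]),
            })
    # The "sorter" module: order by publication date, most recent on top.
    items.sort(key=lambda item: item["published"], reverse=True)
    return items


# The "output" module: one combined list instead of one list per source.
RIVER = river_of_news(FEEDS)
for item in RIVER:
    print(item["published"].isoformat(), item["source"], item["title"])
```

Grouping by source, as in my old setup, would keep the outer dictionary and sort each list separately; the river simply throws away the grouping before sorting.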

When you get what you wished for

Now that I have a river of news feed, I realise that it is not always what you want. One of my sources lists popular keywords on Swedish blogs. That source is updated so frequently that the keywords often make it to near the top of my river of news. Some sources seem to give the same pub date to all their items, which means you will have to wade through them as you are, ehrm, I suppose "wading" in the river of news. Some news sources seem to give an item a new pub date on any update of the item; I notice some news items persistently turning up among the top twenty items, when I know for a fact that I read that piece twelve hours ago. I also do not know right now how good Yahoo Pipes is at mashing together different feed formats, such as Atom, RSS and versions of these.
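
If I were still running my own scripts, one possible workaround for items that keep getting new pub dates would be to ignore the feed's date and sort on the time each link was first seen. The sketch below is only that, a sketch: the function name, the `link` and `sort_date` keys, the example URLs and the poll times are all made up for illustration, and I have not tried to build this in Yahoo Pipes.

```python
from datetime import datetime, timezone


def stamp_first_seen(items, first_seen, now):
    """Give each item a sort date equal to the time its link was first seen.

    `first_seen` is a dict (link -> datetime) that is kept between polls.
    """
    stamped = []
    for item in items:
        # setdefault only stores `now` if the link has never been seen before.
        seen = first_seen.setdefault(item["link"], now)
        stamped.append({**item, "sort_date": seen})
    return stamped


first_seen = {}
run1 = datetime(2009, 3, 22, 8, 0, tzinfo=timezone.utc)
run2 = datetime(2009, 3, 22, 20, 0, tzinfo=timezone.utc)

# First poll: one story. Second poll: the same story, updated, plus a new one.
poll1 = [{"link": "http://example.com/a", "title": "Old story"}]
poll2 = [
    {"link": "http://example.com/a", "title": "Old story (updated)"},
    {"link": "http://example.com/b", "title": "New story"},
]

stamp_first_seen(poll1, first_seen, run1)
river = sorted(
    stamp_first_seen(poll2, first_seen, run2),
    key=lambda item: item["sort_date"],
    reverse=True,
)
for item in river:
    print(item["sort_date"].isoformat(), item["title"])
```

The updated story keeps its morning timestamp and sinks below the new one. The price is that the dictionary of seen links has to be stored somewhere between runs.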

Update 2009-03-23

There are a number of pipe constructions on pipes.yahoo.com that deal with news feeds, and some are a lot more complicated than what I show above. This indicates that there is still a lot of complexity to deal with. As for my complaint that some news items turn up on the front page long after they've been published, this seems to be a side effect of a subscription feed that shows the top x blogged-about news items. So, the same news items "sticking" near the top of the combined feed is expected behaviour.