Saturday, May 01, 2010

Retrieving Data from Blogger Export Files

Ruby is a powerful, dynamic, object-oriented programming language. I use it professionally and to automate tasks. Collecting the comments behind the pie charts in my previous article was trivial using Ruby, Nokogiri and FasterCSV.

Blogger offers export and import features in its Dashboard (Settings > Basic). The export is one enormous XML file that can be used to move your blog to another one or to save your data for later. There is also a "Developer's Guide: Protocol" available online.

The following code snippet lets you inspect the data contained in the XML file. The parsed document is displayed on screen with pretty print (pp), which combines inspect with formatting.

require("nokogiri")
require("pp")

# Parse the whole export, then pretty-print the resulting document tree.
File.open("blog-04-29-2010.xml", "r") { |file|
  xml = Nokogiri::XML(file)
  pp xml
}

Inspecting results is one of the most powerful techniques in the dynamic world, whether you use Ruby, Squeak Smalltalk or another dynamic programming language. In this case, it shows me which objects Nokogiri creates for each part of the XML file. A single XML element is shown in this partial output:

#(Element:0x230a6e0 {
  name = "id",
  namespace = #(Namespace:0x230adc0 {
    href = "http://www.w3.org/2005/Atom"
  }),
  children = [ #(Text "tag:blogger.com,1999:blog-29346655.archive")]
})

The next code snippet inspects every entry in the feed. On Blogger, entries hold comments among other things. With the data structure becoming clear, the data can then be retrieved.

require("nokogiri")
require("pp")

File.open("blog-04-29-2010.xml", "r") { |file|
  xml = Nokogiri::XML(file)
  # "xmlns" is the prefix Nokogiri gives the feed's default (Atom) namespace.
  xml.xpath("/xmlns:feed/xmlns:entry").each { |e|
    pp e
  }
}

Notice the XML attributes shown in this partial output. The term attribute of the category element is what marks the entry as a comment:

#(Element:0x23601d0 {
  name = "category",
  namespace = #(Namespace:0x23538c2 {
    href = "http://www.w3.org/2005/Atom"
  }),
  attributes = [
    #(Attr:0x235e970 {
      name = "scheme",
      value = "http://schemas.google.com/g/2005#kind"
    }),
    #(Attr:0x235e934 {
      name = "term",
      value = "http://schemas.google.com/blogger/2008/kind#comment"
    })]
})

The following code snippet uses what the inspection revealed: it retrieves each comment and displays it on screen.

require("nokogiri")

File.open("blog-04-29-2010.xml", "r") { |file|
  xml = Nokogiri::XML(file)
  xml.xpath("/xmlns:feed/xmlns:entry").each { |e|
    # In this export the first category carries the entry kind.
    category = e.xpath("xmlns:category")[0]
    next if category.nil? # skip entries without any category
    term = category.attribute("term").value
    if (term =~ /comment$/)
      # The author's first child is its name element in this export.
      puts "Date: #{e.xpath("xmlns:published")[0].text}\n" +
           "Author: #{e.xpath("xmlns:author")[0].children[0].text}\n" +
           "Content: #{e.xpath("xmlns:content")[0].children[0].text}\n\n"
    end
  }
}

A processed entry is shown in this partial output:

Date: 2010-04-26T00:59:59.982-04:00
Author: BackOrder
Content: A side note for the requirements sheet. We could come up with some catchy phrase to go along these screenshots, something like: "Look For Yourself, We Need Your Help!"

Ian.

Finally, I used FasterCSV to create a CSV file and imported it into OpenOffice.org (OO.o) to generate the charts.
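
The CSV step might look like the sketch below. It rests on several assumptions: it uses Ruby's standard `csv` library (FasterCSV became the standard CSV library in Ruby 1.9, and `CSV.generate` works the same way), the rows are invented sample data, and counting comments per author is only a guess at one grouping a pie chart could show.

```ruby
require "csv"

# Invented rows standing in for the comments collected from the export.
comments = [
  { date: "2010-04-26T00:59:59.982-04:00", author: "BackOrder" },
  { date: "2010-04-26T08:15:00.000-04:00", author: "Example Reader" },
  { date: "2010-04-27T12:30:00.000-04:00", author: "BackOrder" }
]

# Count comments per author; each count would become one slice of the pie.
counts = Hash.new(0)
comments.each { |c| counts[c[:author]] += 1 }

# Build the CSV text: a header row, then authors sorted by descending count.
csv_text = CSV.generate do |csv|
  csv << ["Author", "Comments"]
  counts.sort_by { |author, n| [-n, author] }.each { |row| csv << row }
end

puts csv_text
```

Writing `csv_text` to disk, for example with `File.write("comments.csv", csv_text)`, gives a file a spreadsheet can open and chart.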

Easy as pie. Pie chart.

2 comments:

  1. Work in progress. I am trying to get SyntaxHighlighter running. :)

  2. SyntaxHighlighter now works in Firefox. IE does not render the highlights at the moment and it may not work with other browsers. Feedback?
