Nicholas' Adventures

programming, politics, food, life

Nokogiri - Cut With Precision

Many times we as developers have to deal with complex data, be it an ActiveResource result set or an HTML/XML document.  Pulling data out of these with each loops nested inside other loops can be cumbersome.  A more elegant solution is to use Nokogiri and XPath.

Nokogiri is a type of Japanese saw; it is also a Ruby gem that makes it easy to work with XML or HTML documents.  (Hint: ActiveRecord and ActiveResource objects both have to_xml methods.)  Installing Nokogiri is simple, but make sure you have the libxml2 development packages installed first, as the gem needs them to build properly.

$ sudo gem install nokogiri

Now consider the following XML document: foods.xml

Before we can work with our data we need to read XML into Nokogiri. This is easy to accomplish:

> require 'rubygems'
> require 'nokogiri'
> doc = Nokogiri::XML.parse(File.read('foods.xml'))
=> #<Nokogiri::XML::Document:0x3f930c9db884 ...

What we get back is a Nokogiri document, which is a tree of Nokogiri element and text nodes. The document supports searching (selecting a subset of nodes) with either CSS selectors or XPath notation. Results come back as a Nokogiri NodeSet, which behaves much like an array of element and text nodes.

So for example if we wanted to know all the names of the food items in our document we simply say:

> doc.xpath("//name").collect(&:text)
=> ["carrot", "tomato", "corn", "grapes", "orange", "pear", "apple"]

If we were interested in the entire node we could leave off the .collect(&:text). What if we wanted to select the names of all food items that are best baked?  This requires what’s called an axis: we first find the tag whose text is “baked”, then walk back up through its ancestors to find the food item that contains it.

> doc.xpath("//tag[text()='baked']/ancestor::node()/name").collect(&:text)
=> ["pear", "apple"]

What if we were only interested in vegetables that were good for roasting?  Just start the expression with //veggies:

> doc.xpath("//veggies//tag[text()='roasted']/ancestor::node()/name").collect(&:text)
=> ["carrot", "tomato"]

What if we wanted to know all the tags that ‘corn’ has?  Again this is very easy:

> doc.xpath("//name[text()='corn']/../tags/tag").collect(&:text)
=> ["raw", "boiled", "grilled"]

We can even match on a prefix.  Let’s say we wanted to know all the food items that start with the letter ‘c’:

> doc.xpath("//name[starts-with(text(),'c')]").collect(&:text)
=> ["carrot", "corn"]

You have to admit this is pretty cool stuff.  You could also use [contains(text(),'rot')] and get back just carrot, which is useful when you want a partial match.  Axes combined with predicates give you a wide variety of options for querying your dataset.  You can also match using operators such as and, or and =.  See the links below for resources on the variety of options available.

XPath is Powerful

XPath lets us select XML elements, attributes and text without having to write cumbersome nested loops. Below are links to online resources and tutorials.  The next time you have to dig through an XML document or ActiveResource result, don’t reach for nested loops; instead, consider a Japanese saw - nokogiri.

Learn More
