Extracting data from websites with Ruby and Nokogiri

At some point you may have faced this problem: you want to show some data in your application, but there is no web service or other API that provides it. You know the information is available on some website, and you ask yourself: how can I get it?

Waves is the most complete Brazilian website for the surf community. It has a huge amount of information about the waves at many surf spots around the country, and that is my target. My idea is to collect this information and distribute it as a REST web service.

Show me the code

The first thing I did was create a class that represents the report itself.

# report.rb
require 'wapi/parser'
require 'nokogiri'
require 'open-uri'

module Wapi
  class Report
    attr_reader :html

    WAVES_URL = 'http://waves.terra.com.br/surf/ondas'

    def initialize(beach, url = WAVES_URL)
      beach_url = "#{url}#{beach}"
      # URI.open comes from open-uri; a bare Kernel#open no longer fetches URLs on Ruby 3+
      @html = Nokogiri::HTML(URI.open(beach_url))
    end
  end
end

The constructor receives the beach path and the base URL of the Waves website, which defaults to the WAVES_URL constant defined above. As you can see, I've used the awesome Nokogiri to parse the HTML.

Now let's create a method that returns a hash with all the information we need.

def check
  conditions = {}

  # Every constant in ConditionParser is a parser class that responds to extract
  ConditionParser.constants.each do |constant|
    conditions[constant.downcase] = ConditionParser.const_get(constant).extract(@html)
  end

  conditions
end

In this method I rely on duck typing to keep the parsing simple: every parser responds to the same extract method, so the loop can call it on each one without knowing what it parses. Each call returns the extracted information, which is stored in conditions under the parser's downcased name.
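To see the loop in isolation, here is a minimal sketch that runs without any network access or HTML. DemoParser, its two classes, the collect_conditions helper and the sample hash are all made up for illustration; a plain hash stands in for the parsed page.

```ruby
# Hypothetical stand-ins for the real parsers. The loop only requires
# that each class responds to extract, which is the duck typing at work.
module DemoParser
  class Name
    def self.extract(page)
      page[:title]
    end
  end

  class Height
    def self.extract(page)
      page[:height]
    end
  end
end

# Hypothetical helper mirroring the check method: walk every constant
# defined in the module and store each parser's result under its
# downcased name (a symbol such as :name).
def collect_conditions(page)
  DemoParser.constants.each_with_object({}) do |constant, conditions|
    conditions[constant.downcase] = DemoParser.const_get(constant).extract(page)
  end
end

# Made-up sample data standing in for the parsed HTML document.
sample_page = { title: 'Praia Mole', height: '1.0m' }
p collect_conditions(sample_page)
```

Adding a new condition then only means adding another class with an extract method to the module; the loop itself does not change.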

The ConditionParser module holds one class per piece of information that should be extracted from the page, and every parser must implement the class method self.extract(html). So, imagine you want to extract the full name of the beach from the page:

# name.rb
module Wapi
  module ConditionParser
    class Name
      def self.extract(html)
        html.css("#content h1")[0].content
      end
    end
  end
end

In every parser, we can navigate the HTML using a CSS selector to find the content we want. In this case, I search for the #content h1 selector, take its first match, and extract its inner content.

In the end, extracting data from websites is a pretty simple task when you are using Ruby, but keep in mind that any change to the page's HTML may force you to adapt your parsers to the new markup.

You can find this code on my GitHub as a Rails plugin, and you can access the REST web service here.