Hpricot Case Sensitivity

24 April 2008

Recently we were trying to use the awesome Hpricot to do some HTML parsing. The problem is that Hpricot doesn't offer an easy way to do case-insensitive searches for elements.

This means we miss any element where someone has put an uppercase character in the tag or attribute we're explicitly looking for.

>> h.at("meta[@name=description]")
=> {emptyelem <meta name="description" content="David Smalley, Ruby hacker based in Leeds, UK">}
>> h.at("meta[@name=Description]")
=> nil

This problem has been discussed on the Mofo Mailing List and even on the Hpricot challenge page, but I didn't really like the look of those solutions, as downcasing everything would distort the original HTML.
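
To make that concern concrete, here is a minimal sketch, assuming "downcasing everything" means lowercasing the raw markup before it is parsed. The HTML fragment is made up for illustration and nothing here touches Hpricot.

```ruby
# Hypothetical page fragment, invented for this example.
html = '<META NAME="Description" CONTENT="Ruby hacker"><H1>David Smalley</H1>'

# Lowercasing the raw string normalizes the tag and attribute names...
lowered = html.downcase
puts lowered

# ...but the heading text is lowercased too, so the original casing
# of the page content is lost.
puts lowered.include?("David Smalley")  # => false
```

The tag names come out the way we want, but so does everything else, which is exactly the distortion to avoid.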

So I devised this (horrible and hacky) bit of code that loops through all elements, downcasing their names, attribute names and attribute values without touching any innerHTML.

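  # Recursively downcase element names, attribute names and attribute
  # values, leaving text nodes (the innerHTML) untouched.
  def normalize(element)
    element.children.each do |child|
      if child.respond_to?(:name=)
        child.name = child.name.downcase if child.name
      end
      # Skip nodes that have no attributes hash.
      if child.respond_to?(:raw_attributes=) && child.raw_attributes
        attribs = {}
        child.raw_attributes.each_pair do |key, value|
          # Keep valueless attributes instead of silently dropping them.
          attribs[key.downcase] = value && value.downcase
        end
        child.raw_attributes = attribs
      end
      normalize(child) if child.respond_to?(:children)
    end
    element
  end

>> h = Hpricot(open("http://davidsmalley.com"))
=> Hpricot.....<snip>
>> normalize(h)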
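
Because the method only relies on duck typing (respond_to?), it can be tried out without fetching a page. The sketch below is an assumption-laden illustration: FakeElem and FakeText are hypothetical stand-ins, not Hpricot classes, and the normalize shown is a lightly hardened copy that also tolerates missing attribute hashes and valueless attributes.

```ruby
# Hypothetical stand-ins for parsed nodes; these are NOT Hpricot classes.
FakeElem = Struct.new(:name, :raw_attributes, :children)
FakeText = Struct.new(:content)

# Downcase names, attribute names and attribute values, recursively.
def normalize(element)
  element.children.each do |child|
    child.name = child.name.downcase if child.respond_to?(:name=) && child.name
    if child.respond_to?(:raw_attributes=) && child.raw_attributes
      attribs = {}
      child.raw_attributes.each_pair do |key, value|
        attribs[key.downcase] = value && value.downcase
      end
      child.raw_attributes = attribs
    end
    normalize(child) if child.respond_to?(:children)
  end
  element
end

meta = FakeElem.new("META", { "NAME" => "Description" }, [])
para = FakeElem.new("P", nil, [FakeText.new("Hello World")])
root = FakeElem.new("html", {}, [meta, para])

normalize(root)

puts meta.name                                         # => meta
puts meta.raw_attributes == { "name" => "description" } # => true
puts para.name                                         # => p
puts para.children.first.content                       # => Hello World
```

The text node keeps its casing because it responds to none of name=, raw_attributes= or children, so the method never touches it.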

Now all your element names, attribute names and attribute values will be downcased, and all innerHTML will be left alone. Comments, feedback and suggestions very much welcome!

