
Coffee Powered

code and content

Don’t use strip_tags.

I ran into a rather surprising little problem earlier this week that I felt bore documenting. It turns out that the “simple” Rails strip_tags helper is massive overkill when you just want to strip markup out of a document. It offers a lot of functionality, but it comes at a pretty ugly performance cost.

Here’s the call graph for #strip_tags (as profiled in an application I’m working on). As you can see, it tokenizes the entire string, and then iterates the tokens, likely pushing and popping sections onto and off of a stack as tags are opened and closed.

This is a lot more than a quick little regex to strip out tags; it’s actually parsing the full HTML document. Fortunately, there are already tools to do that, and they have their slow parts written as C extensions. Nokogiri is my weapon of choice in this regard – it’s battle-tested and generally rocks at parsing markup, even when it’s poorly-formed.

So, let’s benchmark a “strip all the markup out of a string” use case with #strip_tags and nokogiri.

require 'rubygems'
require 'benchmark'
require 'action_view'
require 'nokogiri'

include ActionView::Helpers::SanitizeHelper

# Local snapshot of the page being stripped
f = File.read("news")

LOOPS = 1000
Benchmark.bmbm do |x|
  x.report("strip_tags") { LOOPS.times { strip_tags f }}
  x.report("nokogiri") { LOOPS.times { Nokogiri::HTML(f).text }}
end

The data file in this case is a snapshot of the current page of Hacker News. It’s a 23kb HTML file. Nothing too huge, but certainly not small, either. Let’s run it through the benchmark:

[chris@luna projects]$ ruby strip.rb
Rehearsal ----------------------------------------------
strip_tags  33.070000   0.010000  33.080000 ( 33.092638)
nokogiri     3.220000   0.020000   3.240000 (  3.241090)
------------------------------------ total: 36.320000sec

                 user     system      total        real
strip_tags  33.010000   0.010000  33.020000 ( 33.056551)
nokogiri     3.190000   0.000000   3.190000 (  3.200680)

Yikes. It’s not just slower, it’s ~10x slower.
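If you want to run the same kind of comparison against your own input, a small harness along these lines might help. This is only a sketch: `time_strategies` and `NAIVE_STRIP` are names I made up for illustration, and the regex stripper is a deliberately naive stand-in so the example runs without any gems. Swap in `strip_tags` or `Nokogiri::HTML(html).text` to reproduce the numbers above.

```ruby
require 'benchmark'

# Hypothetical stand-in stripper so the sketch runs without any gems.
# Deliberately naive: it does not cope with malformed markup.
NAIVE_STRIP = ->(html) { html.gsub(/<[^>]*>/, '') }

# Runs each named strategy `loops` times against `input` and returns
# a hash of { name => elapsed seconds }.
def time_strategies(input, strategies, loops = 100)
  strategies.each_with_object({}) do |(name, fn), results|
    results[name] = Benchmark.realtime { loops.times { fn.call(input) } }
  end
end

sample  = '<p>Hello, <b>world</b>!</p>' * 50
timings = time_strategies(sample, { 'naive regex' => NAIVE_STRIP })
timings.each { |name, secs| puts format('%-12s %.4fs', name, secs) }
```

Passing the strategies in as lambdas keeps the timing loop identical for every candidate, so the only thing that varies is the stripping code itself.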

Don’t use strip_tags. Also, profile your code: just because a helper is convenient doesn’t mean you should use it.

Enabling brightness controls on an HP Envy 17 under Fedora 16

I’ve recently set up Fedora 16 on my laptop, and all has been smooth, save for the brightness switches. The on-screen display would show up when I used the fn-F2/fn-F3 key combinations, but the brightness just wouldn’t change. Additionally, the brightness was stuck at the lowest level.

Turns out there’s a pretty easy fix in the form of a couple of module parameters:

In /etc/default/grub, append the following kernel parameters to the GRUB_CMDLINE_LINUX line:

video.brightness_switch_enabled=1 video.use_bios_initial_backlight=0

(You may also want to add radeon.modeset=1 and acpi_osi=Linux for this particular machine, but they aren’t related to the brightness fix.)

Then update your grub2 config:

grub2-mkconfig -o /boot/grub2/grub.cfg

Reboot, and your brightness controls should work as expected. The brightness slider in GNOME still doesn’t work, but I’m content with hardware brightness controls over no brightness controls.

Comps – design vs reality comparisons in Chrome

For a long time, I used the PixelPerfect Firefox add-on to compare rendered comps with my finished web work. This was a fast and effective way to make sure that I got the spacings, font sizes, and other such things done properly.

However, PixelPerfect doesn’t work all that well (well, at all) anymore, and Firefox is no longer my browser of choice. There weren’t any good options for Chrome, so I wrote one.

You can grab it straight from the Chrome store if you want (it’s free!), or you might be interested in perusing the source code.

Comps is very lightweight, and gets straight to the point – click the button, drop your comp into the webpage, and position it to make direct comparisons. Mousewheel over the comp to change its opacity, or you can toggle it on and off using the thumbnail drawer or the Comps button. It’ll remember your settings between sessions (and even pageloads!) so that you can tweak and refresh to your heart’s content, and instantly have your comp right there for comparison.

Give it a shot, let me know if/how you like it.

Rails Cookie Sessions and PHP

I recently found myself needing to share session data from my Rails app with a PHP app on the same domain. We use cookie sessions for a number of reasons, and while they work great, the data stored in them is stored in Ruby’s native Marshal format, which is not trivial to reimplement in PHP. After trying to get the data unmarshaled for a bit, I had another idea – why not just change the storage format?

Fortunately, Ruby is deeply entangled with another more portable serialization format: YAML.

Rails manages its session cookies through the MessageVerifier. Easy enough – we can just write our own MessageVerifier that uses YAML rather than Marshal.


module ActiveSupport
  class YamlMessageVerifier < MessageVerifier
    def verify(signed_message)
      raise InvalidSignature if signed_message.blank?

      data, digest = signed_message.split("--")
      if data.present? && digest.present? && secure_compare(digest, generate_digest(data))
        str = ActiveSupport::Base64.decode64(data)
        if str[0..2] == '---'
          YAML::load str
        else # Handle old Marshal.dump'd session
          Marshal.load(str)
        end
      else
        raise InvalidSignature
      end
    end

    def generate(value)
      data = ActiveSupport::Base64.encode64s(YAML::dump value)
      "#{data}--#{generate_digest(data)}"
    end
  end
end

You’ll notice that verify() can accept a Marshaled session as well; this lets you transparently transition existing cookies to the new format without any kind of session breakage. Nice.
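The signed cookie format itself is simple enough to sketch outside of Rails. The following is a stand-alone illustration of the "data--digest" layout using only the standard library. It is not the Rails implementation: the `YamlCookie` module and its method names are hypothetical, and it assumes string keys and plain values so that `YAML.safe_load` can be used.

```ruby
require 'openssl'
require 'yaml'

# Hypothetical stand-alone version of the "<base64 YAML>--<hex HMAC>" format.
module YamlCookie
  module_function

  def digest(data, secret)
    OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA1'), secret, data)
  end

  def generate(value, secret)
    # pack('m0') is strict Base64 with no line breaks
    data = [YAML.dump(value)].pack('m0')
    "#{data}--#{digest(data, secret)}"
  end

  def verify(signed, secret)
    data, sig = signed.to_s.split('--', 2)
    unless data && sig && secure_compare(sig, digest(data, secret))
      raise ArgumentError, 'invalid signature'
    end
    # Only reached once the signature checks out
    YAML.safe_load(data.unpack('m0').first)
  end

  # Constant-time string comparison, so the check does not leak
  # how many leading characters matched.
  def secure_compare(a, b)
    return false unless a.bytesize == b.bytesize
    diff = 0
    a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
    diff.zero?
  end
end

cookie = YamlCookie.generate({ 'user_id' => 42 }, 'not-a-real-secret')
puts YamlCookie.verify(cookie, 'not-a-real-secret').inspect
```

The important ordering is the same as in the verifier above: check the signature first, and only then decode and parse the payload.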

Now, to use the verifier, we monkeypatch CookieStore:

module ActionController
  module Session
    class CookieStore
      def verifier_for(secret, digest)
        key = secret.respond_to?(:call) ? secret.call : secret
        ActiveSupport::YamlMessageVerifier.new(key, digest)
      end
    end
  end
end

Now, this will work… at least at first glance, until you try to use the flash. This is a particularly nasty little problem. Ruby’s YAML implementation serializes Hash objects without their instance variables, and FlashHash inherits from Hash, so it inherits that serialization/deserialization strategy. I worked for a while to monkeypatch those strategies, but the result felt hacky. Instead, I took advantage of the YAML load lifecycle to make sure the FlashHash initializes properly:

module ActionController
  module Flash
    class FlashHash
      def update_with_initializer(h)
        @used ||= {}
        update_without_initializer(h)
      end
      alias_method_chain :update, :initializer
    end
  end
end

The core problem is that YAML::load calls Hash#update, and FlashHash presumes that the @used instance variable is present and initialized to an empty hash. To fix that, I just aliased in an initializer to make sure that variable is set.
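The same guard can be shown without Rails. In this sketch, `TrackedHash` is a hypothetical stand-in for FlashHash, and `allocate` simulates a deserializer that builds the object without ever calling `initialize`. It illustrates the pattern only; it says nothing about how any particular YAML library restores objects.

```ruby
# Hypothetical stand-in for FlashHash: a Hash subclass that relies on
# an instance variable normally set up in the constructor.
class TrackedHash < Hash
  def initialize
    super
    @used = {}
  end

  # Guard: make sure @used exists even if initialize never ran.
  def update(other)
    @used ||= {}
    super
  end

  def use!(key)
    @used[key] = true
  end

  def used?(key)
    @used.key?(key)
  end
end

# allocate creates the object without running initialize, which is
# the situation a deserializer can leave us in.
restored = TrackedHash.allocate
restored.update({ 'notice' => 'Saved!' })
restored.use!('notice')
puts restored.used?('notice')
```

Without the `@used ||= {}` line in `update`, the call to `use!` would blow up on a nil `@used`.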

Note that if you are storing other Hash subclasses with instance variables that rely on those variables being persisted across sessions, they will break. However, you should only be storing primitive/array/hash data in the session if possible. FlashHash is sort of a nasty violation of this principle.

At this point, your session should be serializing to and from YAML. We’ll want to read it from PHP, naturally. I’m using SPYC in the PHP project, which gets us Close Enough(TM). It doesn’t handle symbol keys, but we’ll handle those in the PHP itself.

Reading from PHP

Reading the data back out is surprisingly simple. We have to verify the authenticity of the data, of course, by checking the hash, but then you just base64 decode the data, load it with spyc, and perform some simple transformation to turn symbols into strings. If you wanted to make it even easier, you could monkeypatch the cookie store to call #stringify_keys! on your session hash before serializing it (and then call #with_indifferent_access on the hash when you deserialize it). Be aware of the speed impact of such a decision before you do it.

function explode_symbols($arr) {
  $result = array();
  foreach($arr as $key => $val) {
    if(is_numeric($key) && $val[0] == ":") {
      $bits = explode(":", $val, 3);
      $result[trim($bits[1])] = trim($bits[2]);
    } elseif (is_array($val)) {
      $result[$key] = explode_symbols($val);
    } else {
      $result[$key] = $val;
    }
  }
  return $result;
}

function deserialize_session($session_key, $secret) {
  list($session64, $hash) = explode("--", $_COOKIE[$session_key], 2);
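  // hash_equals (PHP 5.6+) compares in constant time and avoids loose == comparison of hex strings
  if(hash_equals(hash_hmac("sha1", $session64, $secret), (string) $hash)) {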
    $session = base64_decode($session64);
    return explode_symbols(spyc_load($session));
  } else {
    throw new Exception("Invalid session signature");
  }
}

$rails_session = deserialize_session("your_session_cookie_name", $your_session_cookie_secret);
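The alternative mentioned earlier, stringifying keys on the Rails side so the PHP code never sees a symbol, could look roughly like this. It is a plain-Ruby sketch: `deep_stringify_keys` is a hypothetical helper (the Rails `#stringify_keys!` of that era only handled the top level), and the sample session data is made up.

```ruby
require 'yaml'

# Hypothetical helper: recursively convert hash keys to strings so the
# dumped YAML contains no Ruby symbol syntax.
def deep_stringify_keys(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, val), out|
      out[key.to_s] = deep_stringify_keys(val)
    end
  when Array
    value.map { |item| deep_stringify_keys(item) }
  else
    value
  end
end

session = { :user_id => 42, :prefs => { :theme => 'dark' }, :tags => [{ :name => 'a' }] }
puts YAML.dump(deep_stringify_keys(session))
```

With string keys throughout, the `explode_symbols` pass on the PHP side becomes unnecessary, at the cost of walking the session hash on every write.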

Caveats

  • Be aware that YAML is slower than Marshal
  • Be aware that storing Hash subclasses in the session is likely going to Not Work.

And that’s all there is to it. You can now share data between the two apps via the session cookie.

Restarting Resque workers (or anything, really) with Monit, Passenger-style.

Here’s an easy way to trigger a restart of a Monit-managed service without having to become root. In my case, I’ve got a Monit service called resque-worker, and I can restart it by just touching tmp/resque-restart.txt.

check file resque-restart.txt with path /path/to/your/app/tmp/resque-restart.txt
  if changed timestamp then
    exec "/usr/bin/monit restart resque-worker"

Ties in nicely with deploy tasks, and you don’t have to end up leaving root access SSH keypairs lying around.
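On the deploy side, all that's needed is something that bumps the trigger file's timestamp. A minimal sketch, assuming a Ruby deploy script; `request_resque_restart` is a hypothetical helper name, and the path layout matches the Monit check above.

```ruby
require 'fileutils'

# Hypothetical deploy helper: touch the trigger file that Monit watches.
# Creates tmp/ first in case a fresh checkout doesn't have it yet.
def request_resque_restart(app_root)
  trigger = File.join(app_root, 'tmp', 'resque-restart.txt')
  FileUtils.mkdir_p(File.dirname(trigger))
  FileUtils.touch(trigger)
  trigger
end
```

Calling `request_resque_restart('/path/to/your/app')` at the end of a deploy changes the file's timestamp, and Monit picks that up on its next polling cycle.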

MongoDB, count() and the big O

MongoDB, as I’ve mentioned before, is not without its warts. I’ve run into another, and it’s a nasty one. It turns out that if you perform count() on a query cursor that includes any conditions, even if those conditions are indexed, the operation takes O(n) time to run.

In practice, I’ve found that this costs about 1ms per 1000 records in your counted result set. This is really bad in concert with will_paginate-style pagination, which Plucky (the query layer used by MongoMapper) exposes an interface for. It naively takes your query, performs a count() on it, and then performs the query again with limiters to get the records for the current page. This is a standard, widely accepted way to do this sort of thing.
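To see why the count is needed at all, here is the arithmetic a paginator does with it, plus the rule of thumb from above. This is a sketch only: `PageWindow` and `estimated_count_ms` are hypothetical names, and the 1ms per 1000 records figure is just my own observation, not a guarantee.

```ruby
# Hypothetical paginator math: the total is what makes "page X of Y" possible.
PageWindow = Struct.new(:total, :page, :per_page) do
  def total_pages
    (total.to_f / per_page).ceil
  end

  # Number of records to skip to reach the requested page
  def skip
    (page - 1) * per_page
  end
end

# Rule of thumb observed above: roughly 1ms per 1000 counted records.
def estimated_count_ms(matching_records)
  matching_records / 1000.0
end

window = PageWindow.new(25_000, 3, 20)
puts "pages=#{window.total_pages} skip=#{window.skip} count~#{estimated_count_ms(window.total)}ms"
```

The skip and limit only touch one page of records, but the total has to be computed over every matching record, which is where the time goes.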

NewRelic is a great tool to help profile your applications, and in this case, it’s making the problem abundantly clear:

You see that purple? That’s how long it takes to run those count() operations. What a big fat pile of suck.

I don’t have a good solution to this yet, but in the meantime, I’ve monkeypatched Plucky to cache counts for large result sets. This means that my total counts for a large collection might desync over the course of an hour, but in my use cases, I only need ballpark numbers, so it works out well. This has a very noticeable effect on page times, effectively halving the amount of time I spend in the database for a given index page. Additionally, I can manually specify a count. So, for example, if I know a collection will have over 10k results, I can just pass 10k, and stop paginating after 10k results, drastically reducing my DB load at the expense of exposing older or long-tail content (which may be perfectly appropriate, depending on the application context).

What I’m doing is caching any counts over some arbitrary limit (I chose 10k, at which point the counts would take ~10ms) for an hour via the Rails cache (memcached, in my case, leveraging the expires_in parameter). I brought the issue up in the #mongodb IRC channel, and the advice I was given was basically “cache your counts”, which is all well and good for simple data sets, but when I’m building pages per-user based on their preferences and myriad inputs (all indexed, mind you), it just doesn’t work, so I’ve resorted to this. It’s a hack, but it’s gotten my page times down substantially.

module Plucky
  class Query
    BIG_RESULT_SET = 10000

    def paginate(opts={})
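      page          = opts.delete(:page)
      limit         = opts.delete(:per_page) || per_page
      # Pull :total out before merging opts, so it isn't treated as a query condition
      manual_total  = opts.delete(:total)
      query         = clone.update(opts)
      # Key on the merged query's criteria so different conditions get different counts
      cache_key     = "count-cache-#{query.criteria.source.hash}"
      total         = manual_total || Rails.cache.read(cache_key)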
      if total.nil?
        total       = query.count
        if total > BIG_RESULT_SET
          Rails.cache.write(cache_key, total, :expires_in => 1.hour)
        end
      end
      paginator     = Pagination::Paginator.new(total, page, limit)
      query[:limit] = paginator.limit
      query[:skip]  = paginator.skip
      query.all.tap do |docs|
        docs.extend(Pagination::Decorator)
        docs.paginator(paginator)
      end
    end
  end
end
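The caching policy is easier to reason about in isolation. Below is a sketch with an in-memory stand-in for Rails.cache so it runs anywhere; `TtlCache` and `cached_count` are hypothetical names, and the injectable clock exists only to make expiry easy to exercise.

```ruby
# Hypothetical in-memory stand-in for Rails.cache with per-key expiry.
class TtlCache
  def initialize(clock = -> { Time.now.to_f })
    @clock = clock
    @store = {}
  end

  def read(key)
    value, expires_at = @store[key]
    return nil if expires_at.nil? || @clock.call >= expires_at
    value
  end

  def write(key, value, ttl)
    @store[key] = [value, @clock.call + ttl]
  end
end

BIG_RESULT_SET = 10_000

# Returns a cached count if one is still fresh; otherwise runs the block
# (the expensive count) and caches the result only when it is large.
def cached_count(cache, key, ttl = 3600)
  total = cache.read(key)
  return total unless total.nil?
  total = yield
  cache.write(key, total, ttl) if total > BIG_RESULT_SET
  total
end

cache = TtlCache.new
puts cached_count(cache, 'count-cache-example') { 25_000 }
```

Small result sets are always counted live, since their counts are cheap and the exact number is more likely to matter there.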

I’m not entirely happy with this solution, and would love input on ways to fix it.

Resque and Tests

Resque is a bucket of awesome slathered in a delicious candy coating. It makes background job work really, *really* easy. I recently switched to it, and found that in the process of testing it, I was generating an awful lot of extra unfulfilled jobs in my queue, when the job was a side-effect of some other test (rather than what was being tested explicitly).

I couldn’t find a quick and easy answer to this with some Googling, but it turns out that the answer is fortunately rather simple.

MongoDB: Warts and wobbles

I’m a huge fan of MongoDB – after years in MySQL, Interbase, and Postgres SQL databases, it was quite a breath of fresh air to get to try a document database on for size. I’ve more or less adopted it as my default data store for web applications, due to a number of awesome features that many people have enumerated elsewhere. Rather than yet-another post about why MongoDB is great, I figured I’d talk about the things I don’t like in it, the places I’ve had difficulty with it, and the things I’d like to see improve. Knowing the sticky parts of a piece of technology is often as valuable – if not more so – than knowing what it does really well. I absolutely still recommend it as a data store, but it’s not a magical panacea, and I want to take a realistic view of it.

Tarot for easier Rails configurations

Once upon a time, I wrote a quick-and-dirty Rails plugin for site configuration. Since then, I’ve continued to use variants on this pattern, and it’s evolved to the point that it deserved a revisit.

After continually slimming down the code, I realized that even though it’s tiny, it’s danged useful to be able to just drop this into a Rails app and go. Thus, I’d like to present Tarot, my Rails configuration solution.

Tarot’s current form is heavily inspired by the Rails I18n usage, and is very quick and easy to use in your app. The generator installs a sample yaml file at config/tarot.yml, as well as an initializer to bootstrap your configuration and provide a handy helper method for quick access to those config values.

Assuming you have a config file like so:

---
base: &base
  foo: bar
  nested:
    tree: value
  array:
    - value 1
    - value 2

development: &development
  <<: *base

test: &test
  <<: *base

production: &production
  <<: *base
  foo: baz

You’ll notice that all the environments inherit from your base environment; this gives you an easy way to define common settings once, then override them per environment. Handy!

You can access values by key, or by dot-delimited path:

config('foo') => 'bar'
config('nested.tree') => 'value'

Default values are similarly easy.

config('foo.missing', 42) => 42

Finally, while Tarot will read your current application environment’s config, if you want to reach into another environment, that’s likewise easy:

config('foo', nil, 'production') => 'baz'
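If you're curious how a dot-delimited lookup like this can work, here's a rough sketch. It is not Tarot's actual implementation: `config_lookup` and `SAMPLE_CONFIG` are hypothetical, and the hash is a hand-written stand-in for what the YAML file above would load to.

```ruby
# Hand-written stand-in for the loaded YAML (after the base merge).
SAMPLE_CONFIG = {
  'development' => { 'foo' => 'bar', 'nested' => { 'tree' => 'value' } },
  'production'  => { 'foo' => 'baz', 'nested' => { 'tree' => 'value' } }
}.freeze

# Walks the tree one path segment at a time, bailing out with the
# default as soon as a segment is missing or we hit a non-hash.
def config_lookup(path, default = nil, env = 'development')
  path.split('.').reduce(SAMPLE_CONFIG.fetch(env)) do |node, key|
    return default unless node.is_a?(Hash) && node.key?(key)
    node[key]
  end
end

puts config_lookup('nested.tree')
puts config_lookup('foo.missing', 42)
puts config_lookup('foo', nil, 'production')
```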

As of 0.1.2, Tarot also supports method_missing invocation:

Config = Tarot::Config.new('settings.yml', Rails.env)
Config.foo.bar.baz => "bin"

It also supports default values:

# Assuming foo has no subkey bar
Config.foo.bar("default") => "default"

But it’ll fail if you try to call through a tree that doesn’t exist:

# Assuming that there is no `blaze` tree
Config.blaze.blarg => NameError
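For the curious, this style of access can be approximated in a few lines. Again, this is a sketch and not Tarot's code: `ConfigNode` is a hypothetical class, and its behavior for missing keys (return the argument if one was given, otherwise raise NameError) is my approximation of the examples above.

```ruby
# Hypothetical method_missing-based wrapper around a nested hash.
class ConfigNode
  def initialize(tree)
    @tree = tree
  end

  def method_missing(name, *args)
    key = name.to_s
    if @tree.key?(key)
      value = @tree[key]
      # Wrap subtrees so calls can keep chaining; return leaves as-is
      value.is_a?(Hash) ? ConfigNode.new(value) : value
    elsif !args.empty?
      args.first # the caller supplied a default
    else
      raise NameError, "no config key `#{key}`"
    end
  end

  def respond_to_missing?(name, include_private = false)
    @tree.key?(name.to_s) || super
  end
end

settings = ConfigNode.new({ 'foo' => { 'bar' => { 'baz' => 'bin' } } })
puts settings.foo.bar.baz
puts settings.foo.missing('default')
```

Because a missing key with no default raises right away, a chain like `settings.blaze.blarg` fails at `blaze` and never gets as far as `blarg`.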

That’s about all there is to it. Config isn’t (or shouldn’t be) a hard problem, so Tarot stays small, but it should get you up and running with easily-configured Rails apps in seconds.

Sexy CSS Scrollbars in Chrome

It’s like it’s 1996 all over again, except with less suck. Webkit now supports styleable scrollbars, and you get to use all the Webkit CSS3 goodies, like gradients and rounded corners and the like. If you’re using Chrome or Safari, you might notice that I have the blog theme rocking super sexy grey scrollbars now, which really ties the whole theme together. It’s pretty easy, too.
