
Coffee Powered

code and content

How to sign your Ruby gems

In light of the recent RubyGems security issues, I’ve been adding signatures to my own gems, and encouraging other gem authors to do the same by opening issues on various GitHub projects. Gem signing coupled with publication of a pubkey allows people to verify the authenticity of your published gems against your repository, so that they can be certain that the gems they are downloading from RubyGems (or wherever) are authentic and were actually released by you, the gem author (as opposed to, say, backdoored and uploaded to RubyGems by a malicious entity in the event of another security breach).

The how-to is here: http://docs.rubygems.org/read/chapter/21

TL;DR:

  1. gem cert --build your@email.com
  2. Copy the private key somewhere safe (I use ~/.gemcert)
  3. Add the public key to the repo (git add gem-public_cert.pem)
  4. Update the gemspec with something like:
    s.signing_key = '/home/chris/.gemcert/gem-private_key.pem'
    s.cert_chain  = ['gem-public_cert.pem']
    
  5. Push and rake release
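
Hard-coding an absolute key path as in step 4 ties the gemspec to one machine. Here is a minimal sketch of one way around that: only configure signing when the private key is actually present, so contributors without the key can still build an unsigned gem. The helper name apply_signing and the gem name are hypothetical, not something from the RubyGems docs.

```ruby
require 'rubygems'

# Hypothetical helper: enable signing only when the private key exists locally.
# Returns true if the spec was configured for signing, false otherwise.
def apply_signing(spec, key_path, cert_path = 'gem-public_cert.pem')
  return false unless File.exist?(key_path)
  spec.signing_key = key_path
  spec.cert_chain  = [cert_path]
  true
end

# Same location as step 2 above, resolved per user instead of hard-coded.
private_key = File.join(ENV.fetch('HOME', '.'), '.gemcert', 'gem-private_key.pem')

spec = Gem::Specification.new do |s|
  s.name    = 'example_gem'   # hypothetical gem name
  s.version = '0.1.0'
  s.summary = 'Example of an optionally signed gem'
  s.authors = ['Your Name']
  s.files   = []

  apply_signing(s, private_key)
end
```

On the consuming side, the signature is only checked when a RubyGems trust policy is in effect, for example gem install -P HighSecurity after the author's public cert has been added with gem cert --add.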

The certificate is self-signed, so there is no chain of trust: if your GitHub credentials or the machine housing your private key were compromised, your gems could no longer be verified. Even so, signing provides a layer of verification between the source repository and the package publication platform, and would allow for much speedier community recovery in the event of a future breach.

This is quick, easy, and has no downside. I encourage all gem authors to immediately add signatures to their gems, and all gem users to open or support issues on their favorite gem projects to encourage the maintainers to do the same.

Profiling RSpec 2 Examples

Tests can be slow. This is how to find out why they’re slow.

Toss this bad boy into spec/support/profile.rb and tag any example with :profile => true and it’ll spit out callgrind dumps for your consumption in KCachegrind or similar.

If you specify PROFILE=all on your command line, it’ll profile *all* examples, regardless of tagging. If you pass PROFILE=true (or any other non-nil, non-ALL value) then it’ll profile tagged examples.
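
A minimal sketch of what such a support file could look like, assuming the ruby-prof gem (the post does not name the profiler) and an output directory of tmp/profiles (also an assumption). The helper names are hypothetical, and the printer call follows recent ruby-prof releases; older versions take an IO object instead of a :path option.

```ruby
# spec/support/profile.rb (sketch, under the assumptions stated above)

# Decide whether an example should be profiled.
# mode is the value of ENV['PROFILE']; metadata is the example's metadata hash.
def profile_example?(mode, metadata)
  return false if mode.nil?
  return true if mode.to_s.downcase == 'all'
  metadata[:profile] == true
end

# Turn an example description into a safe directory name.
def profile_file_name(description)
  description.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-|-\z/, '')
end

# Only hook into RSpec when it is loaded and profiling was requested.
if defined?(RSpec) && ENV['PROFILE']
  require 'ruby-prof'
  require 'fileutils'

  RSpec.configure do |config|
    config.around(:each) do |example|
      if profile_example?(ENV['PROFILE'], example.metadata)
        name = profile_file_name(example.metadata[:full_description].to_s)
        dir  = File.join('tmp', 'profiles', name)
        FileUtils.mkdir_p(dir)
        result = RubyProf.profile { example.run }
        # Writes callgrind-format files into dir (assumed ruby-prof API).
        RubyProf::CallTreePrinter.new(result).print(:path => dir)
      else
        example.run
      end
    end
  end
end
```

Run it with something like PROFILE=true rspec spec/models to profile only tagged examples, or PROFILE=all to profile everything.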

Bam.

Don’t use strip_tags.

I ran into a rather surprising little problem earlier this week that I felt bore documenting. It turns out that the “simple” Rails strip_tags helper is massive overkill when you just want to strip markup out of a document. It offers a lot of functionality, but it comes at a pretty ugly performance cost.

Here’s the call graph for #strip_tags (as profiled in an application I’m working on). As you can see, it tokenizes the entire string, and then iterates the tokens, likely pushing and popping sections onto and off of a stack as tags are opened and closed.

This is a lot more than a quick little regex to strip out tags; it’s actually parsing the full HTML document. Fortunately, there are already tools to do that, and they have their slow parts written as C extensions. Nokogiri is my weapon of choice in this regard – it’s battle-tested and generally rocks at parsing markup, even when it’s poorly-formed.

So, let’s benchmark a “strip all the markup out of a string” use case with #strip_tags and nokogiri.

require 'rubygems'
require 'benchmark'
require 'action_view'
require 'nokogiri'

include ActionView::Helpers::SanitizeHelper

# Read the HTML snapshot once, up front, so file I/O is not part of the timing
f = File.read("news")

LOOPS = 1000
Benchmark.bmbm do |x|
  x.report("strip_tags") { LOOPS.times { strip_tags f }}
  x.report("nokogiri") { LOOPS.times { Nokogiri::HTML(f).text }}
end

The data file in this case is a snapshot of the current page of Hacker News. It’s a 23kb HTML file. Nothing too huge, but certainly not small, either. Let’s run it through the benchmark:

[chris@luna projects]$ ruby strip.rb
Rehearsal ----------------------------------------------
strip_tags  33.070000   0.010000  33.080000 ( 33.092638)
nokogiri     3.220000   0.020000   3.240000 (  3.241090)
------------------------------------ total: 36.320000sec

                 user     system      total        real
strip_tags  33.010000   0.010000  33.020000 ( 33.056551)
nokogiri     3.190000   0.000000   3.190000 (  3.200680)

Yikes. It’s not just slower, it’s ~10x slower.

Don’t use strip_tags. Also, profile your code: just because a helper is convenient doesn’t mean you should use it.

Enabling brightness controls on an HP Envy 17 under Fedora 16

I’ve recently set up Fedora 16 on my laptop, and all has been smooth, save for the brightness switches. The on-screen display would show up when I used the fn-F2/fn-F3 key combinations, but the brightness just wouldn’t change. Additionally, the brightness was stuck at the lowest level.

Turns out there’s a pretty easy fix in the form of a couple of module parameters:

In /etc/default/grub, add the following kernel parameters to the GRUB_CMDLINE_LINUX line:

video.brightness_switch_enabled=1 video.use_bios_initial_backlight=0

(You may also want to add radeon.modeset=1 and acpi_osi=Linux for this particular machine, but they aren’t related to the brightness fix.)

Then update your grub2 config:

 grub2-mkconfig > /boot/grub2/grub.cfg 

Reboot, and your brightness controls should work as expected. The brightness slider in GNOME still doesn’t work, but I’m content with hardware brightness controls over no brightness controls.

Comps – design vs reality comparisons in Chrome

For a long time, I used the PixelPerfect Firefox add-on to compare rendered comps with my finished web work. This was a fast and effective way to make sure that I got the spacings, font sizes, and other such things done properly.

However, PixelPerfect doesn’t work all that well (well, at all) anymore, and Firefox is no longer my browser of choice. There weren’t any good options for Chrome, so I wrote one.

You can grab it straight from the Chrome store if you want (it’s free!), or you might be interested in perusing the source code.

Comps is very lightweight, and gets straight to the point – click the button, drop your comp into the webpage, and position it to make direct comparisons. Mousewheel over the comp to change its opacity, or you can toggle it on and off using the thumbnail drawer or the Comps button. It’ll remember your settings between sessions (and even pageloads!) so that you can tweak and refresh to your heart’s content, and instantly have your comp right there for comparison.

Give it a shot, let me know if/how you like it.

Rails Cookie Sessions and PHP

I recently found myself needing to share session data from my Rails app with a PHP app on the same domain. We use cookie sessions for a number of reasons, and while they work great, the data stored in them is stored in Ruby’s native Marshal format, which is not trivial to reimplement in PHP. After trying to get the data unmarshaled for a bit, I had another idea – why not just change the storage format?

Fortunately, Ruby is deeply entangled with another more portable serialization format: YAML.

Rails manages its session cookies through the MessageVerifier. Easy enough – we can just write our own MessageVerifier that uses YAML rather than Marshal.


module ActiveSupport
  # Same signing scheme as MessageVerifier, but the payload is YAML instead of Marshal
  class YamlMessageVerifier < MessageVerifier
    def verify(signed_message)
      raise InvalidSignature if signed_message.blank?

      # The cookie value is "<base64 data>--<hex digest>"
      data, digest = signed_message.split("--")
      if data.present? && digest.present? && secure_compare(digest, generate_digest(data))
        str = ActiveSupport::Base64.decode64(data)
        # YAML documents start with "---"; anything else is an old Marshal payload
        if str[0..2] == '---'
          YAML::load str
        else # Handle old Marshal.dump'd session
          Marshal.load(str)
        end
      else
        raise InvalidSignature
      end
    end

    def generate(value)
      # Always write the new format: YAML, base64-encoded, then signed
      data = ActiveSupport::Base64.encode64s(YAML::dump value)
      "#{data}--#{generate_digest(data)}"
    end
  end
end

You’ll notice that verify() can accept a Marshaled session as well; this lets you transparently transition existing cookies to the new format without any kind of session breakage. Nice.
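
To make the cookie layout concrete, here is a standalone sketch of the same "data--digest" round trip using only Ruby's standard library. The helper names are hypothetical, and it differs from the verifier above in two labeled ways: it uses YAML.safe_load with string keys, and it drops the Marshal fallback. Treat it as an illustration of the format, not a replacement for the Rails class.

```ruby
require 'openssl'
require 'yaml'

# Hex HMAC-SHA1 of the encoded payload (the same value PHP's hash_hmac returns).
def cookie_digest(data, secret)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA1'), secret, data)
end

# Compare two strings without returning early on the first differing byte.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize
  result = 0
  a.bytes.zip(b.bytes) { |x, y| result |= x ^ y }
  result.zero?
end

# Serialize to YAML, Base64-encode without newlines ('m0'), then append the digest.
def generate_cookie(value, secret)
  data = [YAML.dump(value)].pack('m0')
  "#{data}--#{cookie_digest(data, secret)}"
end

# Check the digest first; only decode and parse once the signature matches.
def verify_cookie(cookie, secret)
  data, digest = cookie.to_s.split('--', 2)
  valid = data && digest && secure_compare(digest, cookie_digest(data, secret))
  raise ArgumentError, 'invalid session signature' unless valid
  YAML.safe_load(data.unpack('m0').first)
end

secret  = 'example secret, not a real one'
session = { 'user_id' => 42, 'flash' => { 'notice' => 'Saved' } }
cookie  = generate_cookie(session, secret)
restored = verify_cookie(cookie, secret)
```

Because the Base64 alphabet never contains "-", splitting on "--" is unambiguous, which is what lets any other language pull the two halves apart with a plain string split.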

Now, to use the verifier, we monkeypatch CookieStore:

module ActionController
  module Session
    class CookieStore
      def verifier_for(secret, digest)
        key = secret.respond_to?(:call) ? secret.call : secret
        ActiveSupport::YamlMessageVerifier.new(key, digest)
      end
    end
  end
end

Now, this will work…at least at first glance, until you try to use the flash. This is a particularly nasty little problem, and it stems from the fact that Ruby’s YAML implementation serializes Hash objects without their instance variables, and FlashHash inherits from Hash, and thus inherits its serialization/deserialization strategy. I worked for a while to monkeypatch those strategies, but I didn’t like the result, and it felt a little hacky. Instead, I just took advantage of the YAML load lifecycle to make sure the FlashHash initializes properly:

module ActionController
  module Flash
    class FlashHash
      def update_with_initializer(h)
        @used ||= {}
        update_without_initializer(h)
      end
      alias_method_chain :update, :initializer
    end
  end
end

The core problem is that YAML::load calls Hash#update, and FlashHash presumes that the @used instance variable is present and initialized to an empty hash. To fix that, I just aliased in an initializer to make sure that variable is set.

Note that if you are storing other Hash subclasses with instance variables that rely on those variables being persisted across sessions, they will break. However, you should only be storing primitive/array/hash data in the session if possible. FlashHash is sort of a nasty violation of this principle.

At this point, your session should be serializing to and from YAML. We’ll want to read it from PHP, naturally. I’m using SPYC in the PHP project, which gets us Close Enough(TM). It doesn’t handle symbol keys, but we’ll handle those in the PHP itself.

Reading from PHP

Reading the data back out is surprisingly simple. We have to verify the authenticity of the data, of course, by checking the hash, but then you just base64 decode the data, load it with spyc, and perform some simple transformation to turn symbols into strings. If you wanted to make it even easier, you could monkeypatch the cookie store to call #stringify_keys! on your session hash before serializing it, and then call #with_indifferent_access on the hash when you deserialize it. Be aware of the speed impact of such a decision before you do it.

function explode_symbols($arr) {
  $result = array();
  foreach($arr as $key => $val) {
    if(is_numeric($key) && $val[0] == ":") {
      $bits = explode(":", $val, 3);
      $result[trim($bits[1])] = trim($bits[2]);
    } elseif (is_array($val)) {
      $result[$key] = explode_symbols($val);
    } else {
      $result[$key] = $val;
    }
  }
  return $result;
}

function deserialize_session($session_key, $secret) {
  list($session64, $hash) = explode("--", $_COOKIE[$session_key], 2);
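  // hash_equals() (PHP 5.6+) compares in constant time, like secure_compare on the Rails side
  if(hash_equals(hash_hmac("SHA1", $session64, $secret), (string) $hash)) {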
    $session = base64_decode($session64);
    return explode_symbols(spyc_load($session));
  } else {
    throw new Exception("Invalid session signature");
  }
}

$rails_session = deserialize_session("your_session_cookie_name", $your_session_cookie_secret);

Caveats

  • Be aware that YAML is slower than Marshal
  • Be aware that storing Hash subclasses in the session is likely going to Not Work.

And that’s all there is to it. You can now share data between the two apps via the session cookie.

Restarting Resque workers (or anything, really) with Monit, Passenger-style.

Here’s an easy way to trigger a restart of a service managed by Monit without having to become root. In my case, I’ve got a Monit service called resque-worker, and I can restart it by just touching tmp/resque-restart.txt.

check file resque-restart.txt with path /path/to/your/app/tmp/resque-restart.txt
  if changed timestamp then
    exec "/usr/bin/monit restart resque-worker"

This ties in nicely with deploy tasks, and you don’t have to end up leaving root-access SSH keypairs lying around.
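
As a sketch of the deploy-side half, assuming a hypothetical helper name and the tmp/resque-restart.txt path used above, the "touch" can be done from Ruby so a Rake or deploy task can call it:

```ruby
require 'fileutils'

# Hypothetical helper: bump the timestamp on the file Monit is watching.
# app_root is the application's root directory; returns the marker path.
def request_resque_restart(app_root)
  marker = File.join(app_root, 'tmp', 'resque-restart.txt')
  FileUtils.mkdir_p(File.dirname(marker))  # make sure tmp/ exists
  FileUtils.touch(marker)                  # creates the file or updates its mtime
  marker
end
```

Calling request_resque_restart('/path/to/your/app') from a deploy task changes the file's timestamp, and Monit performs the restart on its next check cycle, so the restart is not instantaneous.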

MongoDB, count() and the big O

MongoDB, as I’ve mentioned before, is not without its warts. I’ve run into another, and it’s a nasty one. It turns out that if you perform count() on a query cursor that includes any conditions, even if those conditions are indexed, the operation takes O(n) time to run.

In practice, I’ve found that this costs about 1ms per 1000 records in your counted result set. This is really bad in concert with will_paginate-style pagination, which Plucky (which is used by MongoMapper) exposes an interface to. Pagination naively takes your query, performs a count() on it, and then performs the query again with limiters to get the records for the current page. This is a standard and widely accepted way to do this sort of thing.

NewRelic is a great tool to help profile your applications, and in this case, it’s making the problem abundantly clear:

You see that purple? That’s how long it takes to run those count() operations. What a big fat pile of suck.

I don’t have a good solution to this yet, but in the meantime, I’ve monkeypatched Plucky to cache counts for large result sets. This means that my total counts for a large collection might desync over the course of an hour, but in my use cases, I only need ballpark numbers, so it works out well. This has a very noticeable effect on page times, effectively halving the amount of time I spend in the database for a given index page. Additionally, I can manually specify a count. So, for example, if I know a collection will have over 10k results, I can just pass 10k, and stop paginating after 10k results, drastically reducing my DB load at the expense of no longer exposing older or long-tail content (which may be perfectly appropriate, depending on the application context).

What I’m doing is caching any counts over some arbitrary limit (I chose 10k, at which point the counts would take ~10ms) for an hour via the Rails cache (memcached, in my case, leveraging the expires_in parameter). I brought the issue up in the #mongodb IRC channel, and the advice I was given was basically “cache your counts”, which is all well and good for simple data sets, but when I’m building pages per-user based on their preferences and myriad inputs (all indexed, mind you), it just doesn’t work, so I’ve resorted to this. It’s a hack, but it’s gotten my page times down substantially.

module Plucky
  class Query
    BIG_RESULT_SET = 10000

    def paginate(opts={})
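      page          = opts.delete(:page)
      limit         = opts.delete(:per_page) || per_page
      # Pull :total out before merging opts, so it is not treated as a query condition
      total         = opts.delete(:total)
      query         = clone.update(opts)
      # Key the cache on the merged query so conditions passed in opts are included
      cache_key     = "count-cache-#{query.criteria.source.hash}"
      total       ||= Rails.cache.read(cache_key)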
      if total.nil?
        total       = query.count
        if total > BIG_RESULT_SET
          Rails.cache.write(cache_key, total, :expires_in => 1.hour)
        end
      end
      paginator     = Pagination::Paginator.new(total, page, limit)
      query[:limit] = paginator.limit
      query[:skip]  = paginator.skip
      query.all.tap do |docs|
        docs.extend(Pagination::Decorator)
        docs.paginator(paginator)
      end
    end
  end
end

I’m not entirely happy with this solution, and would love input on ways to fix it.

Resque and Tests

Resque is a bucket of awesome slathered in a delicious candy coating. It makes background job work really, *really* easy. I recently switched to it, and found that in the process of testing it, I was generating an awful lot of extra unfulfilled jobs in my queue, when the job was a side-effect of some other test (rather than what was being tested explicitly).

I couldn’t find a quick and easy answer to this with some Googling, but it turns out that the answer is fortunately rather simple.
Read More »

MongoDB: Warts and wobbles

I’m a huge fan of MongoDB – after years in MySQL, Interbase, and Postgres SQL databases, it was quite a breath of fresh air to get to try a document database on for size. I’ve more or less adopted it as my default data store for web applications, due to a number of awesome features that many people have enumerated elsewhere. Rather than yet another post about why MongoDB is great, I figured I’d talk about the things I don’t like in it, the places I’ve had difficulty with it, and the things I’d like to see improve. Knowing the sticky parts of a piece of technology is often as valuable – if not more so – than knowing what it does really well. I absolutely still recommend it as a data store, but it’s not a magical panacea, and I want to take a realistic view of it.
Read More »