.

Coffee Powered

code and content

MongoDB, count() and the big O

MongoDB, as I’ve mentioned before, is not without its warts. I’ve run into another, and it’s a nasty one. It turns out that if you perform count() on a query cursor that includes any conditions, even if those conditions are indexed, the operation takes O(n) time to run.

In practice, I’ve found that this costs about 1ms per 1000 records in your counted result set. This is really bad in concert with will_paginate, which Plucky (which is used by MongoMapper) exposes an interface to. It naively takes your query, performs a count() on it, and then performs the query again with limiters to get the records for the current page. This is a standard and quickly-accepted way to do this sort of thing.

NewRelic is a great tool to help profile your applications, and in this case, it’s making the problem abundantly clear:

You see that purple? That’s how long it takes to run those count() operations. What a big fat pile of suck.

I don’t have a good solution to this yet, but in the meantime, I’ve mokneypatched Plucky to cache counts for large result sets. This means that my total counts for a large collection might desync over the course of an hour, but in my use cases, I only need ballpark numbers, so it works out well. This has a very noticeable effect on page times, effectively halving the amount of time I spend in the database for a given index page. Additionally, I can manually specify a count. So, for example, if I know a collection will have over 10k results, I can just pass 10k, and stop paginating after 10k results, drastically reducing my DB load at the expense of exposing older or long-tail content (which may be perfectly, appropriate, depending on the application context).

What I’m doing is caching any counts over some arbitrary limit (I chose 10k, at which point the counts would take ~10ms) for an hour via the Rails cache (memcached, in my case, leveraging the expires_in parameter). I brought the issue up in the #mongodb IRC channel, and the advice I was given was basically “cache your counts”, which is all well and good for simple data sets, but when I’m building pages per-user based on their preferences and myriad inputs (all indexed, mind you), it just doesn’t work, so I’ve resorted to this. It’s a hack, but it’s gotten my page times down substantially.

module Plucky
  class Query
    BIG_RESULT_SET = 10000

    def paginate(opts={})
      page          = opts.delete(:page)
      limit         = opts.delete(:per_page) || per_page
      query         = clone.update(opts)
      cache_key     = "count-cache-#{criteria.source.hash}"
      total         = opts.delete(:total) || Rails.cache.read(cache_key)
      if total.nil?
        total       = query.count
        if total > BIG_RESULT_SET
          Rails.cache.write(cache_key, total, :expires_in => 1.hour)
        end
      end
      paginator     = Pagination::Paginator.new(total, page, limit)
      query[:limit] = paginator.limit
      query[:skip]  = paginator.skip
      query.all.tap do |docs|
        docs.extend(Pagination::Decorator)
        docs.paginator(paginator)
      end
    end
  end
end

I’m not entirely happy with this solution, and would love input on ways to fix it.

  • http://rendion.myopenid.com/ render

    calling count() and caching counts is an idiots way of showing how they have no clue about transactions.  Doing either of these things is for idiots.

  • Foljssk

    Thank you, Einstein.

    For a “super smart person” you sure have very low communication skills. 

  • Mason

    Pretty much. In addition, good programming style would be to implement a simple reusable application level cache and use it where you need it, rather than hiding a very tricky number-fudger deep in the code base.

  • Adam

    I had the same problem with a dataset of 1m+ records in mongo and really slow pagination with will_paginate. I have since swapped it out with the ‘kaminari’ gem (https://github.com/amatsuda/kaminari). This should solve your problem and is *almost* as simple as will_paginate to drop in.

  • Jack

    Wat? How are you going to display the number of pages without knowing the size of the dataset? What on earth do transactions have to do with this?  

  • Catalin

    You can fix the “count” by doing map reduce. Map reduce uses the indexes. Not an easy solution, but might work :)

  • http://wixel.net Sean Nieuwoudt

    How would you do it then? Please enlighten us. 

  • http://coffeepowered.net Chris Heald

    map/reduce will still scan over everything during the map phase. It’s a solid strategy for things that just need their counts updated periodically, but it’s not viable for dynamic queries based on user input – think things like partial text searches.

  • http://addictedtonew.com jnunemaker

    I actually don’t use the paginate without count either and I wrote Plucky. :) 

    I do something like this in one app:
    https://gist.github.com/4541827

    I also just merged a pull request that allows you to inject the total entries for plucky.
    https://github.com/jnunemaker/plucky/pull/31

  • http://coffeepowered.net Chris Heald

    Excellent! Onwards and upwards. :D

  • http://twitter.com/kcantsay K Can’t Say

    [Removed due to wrong account]

  • Kyle VanderBeek

    I resemble that pull request. Thanks all!