I recently ran across a rather bare post espousing some generic “optimization” techniques for Rails apps. It offered no education, no explanation, no benchmarks. So, I thought, why not put those claims to the test?
find_by_sql versus find_by_x
First, Konstantin claims that Model#find_by_field is slower than Model#find_by_sql. This one is hard to dispute; the first will invoke method_missing and spend time generating SQL, while the latter simply executes a statement. Is cutting the knees out from under your ORM worth the time saved? Let’s see!
require 'benchmark'
def measure_find_by_sql_vs_orm(num = 1000)
puts "find_by_sql (#{num}x)"
puts Benchmark.measure {
num.times { User.find_by_sql "select * from users where id = 123" }
}
puts "find_by_id (#{num}x)"
puts Benchmark.measure {
num.times { User.find_by_id 123 }
}
end
measure_find_by_sql_vs_orm(10000)
Let’s run this a few times.
[chris@polaris benchmarks]$ script/runner benchmark.rb
find_by_sql (10000x)
2.290000 0.540000 2.830000 ( 4.452150)
find_by_id (10000x)
4.660000 0.400000 5.060000 ( 6.766629)
[chris@polaris benchmarks]$ script/runner benchmark.rb
find_by_sql (10000x)
2.300000 0.480000 2.780000 ( 4.473950)
find_by_id (10000x)
4.520000 0.560000 5.080000 ( 6.837272)
[chris@polaris benchmarks]$ script/runner benchmark.rb
find_by_sql (10000x)
2.170000 0.540000 2.710000 ( 4.419207)
find_by_id (10000x)
4.580000 0.540000 5.120000 ( 6.881676)
find_by_sql: Averages 4.44 sec for 10,000 queries
find_by_id: Averages 6.83 sec for 10,000 queries
Conclusion the first: Using the ORM to build SQL adds some overhead; in my tests, 2.47 sec/10,000 queries, or 0.000247 seconds per query. Is this worth optimizing out? Yeah, probably not. In fact, the productivity lost by using find_by_sql is likely going to end up costing the project more.
IDs and numbers in quotes
Second, they claim that quoting values in your SQL statements slows down your queries. This one struck me as just a little out there. Let’s see what the benchmarks say.
require 'benchmark'
def measure_select_with_quotes(num = 1000)
puts "Without quotes (#{num}x):"
db = ActiveRecord::Base.connection.instance_variable_get :@connection
puts Benchmark.measure {
num.times { db.query("select * from users where id = 123") {} }
}
puts "With quotes (#{num}x):"
puts Benchmark.measure {
num.times { db.query("select * from users where id = \"123\"") {} }
}
end
measure_select_with_quotes(10000)
And the results:
[chris@polaris benchmarks]$ script/runner benchmark.rb
Without quotes (10000x):
0.690000 0.340000 1.030000 ( 2.639554)
With quotes (10000x):
0.670000 0.290000 0.960000 ( 2.655049)
[chris@polaris benchmarks]$ script/runner benchmark.rb
Without quotes (10000x):
0.570000 0.320000 0.890000 ( 2.654003)
With quotes (10000x):
0.550000 0.400000 0.950000 ( 2.617369)
Well, that’s certainly interesting. In 10,000 queries, an average difference of about 3/100ths of a second. Certainly not worth combing through your codebase as an optimization point.
Conclusion the second: The performance gain from quoted versus non-quoted field values is so small to be inconsequential.
On a side note, there is a very interesting subtlety here. Observe the difference between
num.times { db.query("select * from users where id = 123") {} }
and
num.times { db.query("select * from users where id = 123") }
The former passes the Mysql::Result object to a block, and frees it after the block terminates. The latter does not, and the returned Mysql::Result object remains in scope for the entire pass of the benchmark. This subtlety makes a massive difference.
def measure_select_with_free(num = 1000)
db = ActiveRecord::Base.connection.instance_variable_get :@connection
puts "Query with block, result immediately freed"
puts Benchmark.measure {
num.times { db.query("select * from users where id = 123") {} }
}
puts "Query without block, result remains in scope"
puts Benchmark.measure {
num.times { db.query("select * from users where id = 123") }
}
end
[chris@polaris benchmarks]$ script/runner benchmark.rb
Query with block, result immediately freed
0.060000 0.040000 0.100000 ( 0.267983)
Query without block, result remains in scope
5.040000 0.050000 5.090000 ( 5.266476)
Whoa damn. Ruby’s GC is slaughtering performance there. Just adding a pair of curly braces makes the benchmark run 20 times faster.
It’s better to request only specific column
Finally, Konstantin mentions that selecting only specific fields from a table is faster. This is a truth in both MySQL and in the ActiveRecord ORM, for a number of reasons. However, he says that
Person.find_by_name(“Name”).phone_number. It would be much faster if you use: Person.find_by_sql(“SELECT persons.phone_number WHERE persons.name = ‘Name’”)
Why not just use the :select option that ActiveRecord provides?
Person.find_by_name("Name", :select => "phone_number")
Let’s test those assumptions.
def measure_single_field_select(num = 1000)
puts "Find with all fields"
puts Benchmark.measure {
num.times { User.find_by_id(123)}
}
puts "Find with one field, with :select"
puts Benchmark.measure {
num.times { User.find_by_id(123, :select => "email")}
}
end
[chris@polaris benchmarks]$ script/runner benchmark.rb
Find with all fields
0.720000 0.060000 0.780000 ( 0.963273)
Find with one field, with :select
0.310000 0.010000 0.320000 ( 0.364554)
[chris@polaris benchmarks]$ script/runner benchmark.rb
Find with all fields
0.710000 0.110000 0.820000 ( 1.014548)
Find with one field, with :select
0.260000 0.020000 0.280000 ( 0.351761)
Very significant difference there…and we didn’t have to bypass the ORM to get it, either.
10 Comments
Hi there,
In the quotes vs. non-quotes your experimental setup is exactly wrong. You print "without quotes" next to the one that has quotes and vice versa. :) Also, unless you've specifically disabled it, you're likely running all your queries out of MySQL's query cache. To disable, put /* SQL_NO_CACHE */ somewhere in the query (after restarting your db to clear any existing cache) or just set system variable query_cache_type to OFF.
One more comment: the quotes/non-quotes issue is VERY real, and works both ways, strings as ints, and ints as strings. But to really see it, you need to be working on a largish table with an index on the column in question, because what's going on is that when you pass the wrong type of data, it screws with the index utilization. If you pass a string to an indexed int column, rather than convert the value you passed to an int, it converts EVERY VALUE IN THAT COLUMN to a string, which triggers a table scan.
There are also subtle bugs that come in, such as 64-bit overflow when converting strings on 32-bit hardware, etc. In any case, it's just a good idea to have your data types matching.
That's an interesting observation; I would have expected the database to perform a cast of the passed value to the expected data type of the field before passing it to the index. I've been doing development against SQL databases for nearly a decade, and have never run into that one.
I'll generate and run some tests against some tables. Is there a good reason that the database wouldn't cast the value to the column's data type?
Actually, I should caveat that. Passing strings to ints isn't as bad as passing ints to strings, depending on how MySQL is set up. But we've definitely been bitten many many times by weird, difficult-to-reproduce bugs when passing strings as ids.
Well, the more I think about it, the more I'm on crack wrt the performance implications of passing strings into int columns. But the other way around is definitely true, even though the query will work just fine. We saw this with zipcodes: to tell if SELECT * FROM zipcodes WHERE code = 500 will match a row with code = '00500' it has to actually convert the string, according to its match semantics.
Generated a table with 5,000,000 rows, consisting of an id field and a value field; id field is a primary key, type INT.
The benchmarking script (with corrected labels!) then produces the following:
Run #1:
<pre><code>Without quotes (10000x):
0.090000 0.210000 0.300000 ( 0.929264)
With quotes (10000x):
0.120000 0.220000 0.340000 ( 0.916973)</code></pre>
Run #2:
<pre><code>Without quotes (10000x):
0.120000 0.140000 0.260000 ( 0.903709)
With quotes (10000x):
0.140000 0.170000 0.310000 ( 0.951744)</code></pre>
The zip code thing makes perfect sense, though – I absolutely agree that when you're querying against a string column type, make sure that you're passing a string! The example the original author had, though, was passing an ID in a query as as a string, which observationally just doesn't decrease performance to any significant degree.
OK, that's better. :) Thanks for the benchmarks…
Doh! Stupid me on the quote labels there. I’ve corrected it – thanks. :)
These were run against my development box, though, which does have the MySQL query cache turned off so I can more easily track down errant queries and the like.
I was wondering about what the difference would be for #1 if you replaced User.find_by_sql "select * from users where id = 123" or User.find_by_id 123 with
User.find(:all, :conditions => ["id = ?", 123])
It would seem that the find_by_id adds missing_method overhead, while the find(:all) should bypass that, and give you something a bit more similar to the find_by_sql.
Thanks.
JGeiger:
Here's what I came up with. Not too surprising. :)
<pre><code>
[chris@polaris benchmark]$ script/runner benchmark.rb
find_by_sql (10000x)
2.430000 0.580000 3.010000 ( 4.457347)
find_by_id (10000x)
4.520000 0.490000 5.010000 ( 6.620790)
find with conditions (10000x)
3.740000 0.530000 4.270000 ( 5.881615)
</code></pre>