Comparing the performance of three different approaches to a dynamic bulk update ran on ~150.000 records.
The following example is a replication of a bulk action problem solved in the Active Record way and the raw SQL way. I'm going to go through three different solutions and compare their performance. For each solution, I'll share the code, run the code on my machine (development environment) and register how long each one takes to successfully update ~150.000 records.
Let's then consider the following db schema and models:
class PlayerGameStatistic < ApplicationRecord
belongs_to :player
belongs_to :team
belongs_to :game
end
class Player < ApplicationRecord
belongs_to :team
has_many :player_game_statistics
enum status: %i[active inactive retired]
end
class Team < ApplicationRecord
belongs_to :team
has_many :player_game_statistics
has_many :games
enum status: %i[inactive active]
end
We have 3 tables: players
, teams
and player_game_statistics
. I'm purposefully excluding the games table here (to which the player_game_statistics
belongs too), since it is not relevant for this exercise.
They are all long-existing tables; they all have thousands of rows but player_game_statistics
is, by its nature, the one that carries more rows.
Now, let's say that I have added just added a new column player_name
to player_game_statistics
. For the existing records that have this column empty (defaults to null
), I want to populate it with data that is available on the name
column of the players
table.
To reduce the number of rows a bit, I'm actually only interested in updating the statistics of active and injured players (not including retired players) and active teams (these are values of the status attribute that both players and teams have).
Version 1: The Active Record way
A first approach could be to iterate each player_game_statistic
, get the associated player
, and update the statistics' player_name
column based on the player.name
value.
This will result in the classic n + 1
query. We will load the player_game_statistics_to_update
and then do n
additional queries to grab each player's name.
This is where :includes
usually comes in. So let's add that.
class UpdatePlayerGameStatisticsV1
def execute!
player_game_statistics_to_update.each do |stat|
stat.update!(player_name: stat.player.name)
end
end
private
def user_statistics_to_update
PlayerGameStatistic
.includes(:player)
.joins(:team)
.references(:player)
.where('team.status = 1')
.where('player.status IN (0, 1)')
end
end
:includes
will either use :preload
or :eager_load
depending on whether or not you reference the association being loaded on a subquery (such as a where clause). In our case, we want to do exactly that so that we can filter results by player status. This will result in a single query on the player_game_statistics
with a left outer join on players
and an inner join on teams
. :eager_load
is slower than :preload
but it's not like we have an option here, and it's still better than not using :includes
at all. For more details on how :includes
work, I recommend this great article by Julianna Roen on the gusto blog.
Result: It took about 00:04:33 to update ~150.000 rows.
Version 2: Active Record with a memoized dictionary
A second approach I was interested in measuring is a cached dictionary approach where you query all the players, pluck the player_id
and the player_name
, transform that result into a hash that will then be memoized.
We still iterate over each statistic but instead of getting the player through the statistic we query the players once, and the following iterations will just consult the dictionary. This approach results in an extra query compared to Version 1 but I'm assuming it should not be that heavy since I'm using :pluck
.
class UpdatePlayerGameStatisticsV2
def execute!
player_game_statistics_to_update.each do |stat|
stat.update!(player_name: player_name_idx[stat.player_id])
end
end
private
def player_game_statistics_to_update
PlayerGameStatistic.where(player_id: player_name_idx.keys)
end
def player_name_idx
@player_name_idx ||=
Player
.joins(:team)
.where('teams.status = 1')
.where("players.status IN (0,1)")
.pluck('players.id, players.name')
.to_h
end
end
Result: It took about 00:04:58 to update ~150.000 rows, not a relevant difference from Version 1.
Version 3: raw SQL
A final approach is to use raw SQL. We will use the Postgres UPDATE statement to update the statistics in a single query.
class UpdatePlayerGameStatisticsV3
def execute!
ActiveRecord::Base.connection.execute(Arel.sql(update_sql))
end
private
def update_sql
<<-SQL
UPDATE player_game_statistics AS stats
SET player_name = players.name,
updated_at = LOCALTIMESTAMP
FROM (#{players_sql}) AS players
WHERE stats.player_id = players.id
SQL
end
def players_sql
<<-SQL
SELECT players.id, players.name
FROM players
LEFT OUTER JOIN teams ON teams.id = players.team_id
WHERE teams.status = 1 AND players.status IN (0,1)
SQL
end
end
Result: It took about 00:00:11 to update ~150.000 rows.
Benchmarking all 3 versions:
Again looking at the real-time, v1 seems to do a bit better than v2, although the difference might not be that relevant. But v3 is a clear winner here as it's about 28 times faster than the others.
[#<Benchmark::Tms:0x00007fe48268deb8
@cstime=0.0,
@cutime=0.0,
@label="v1",
@real=309.4179189999704,
@stime=15.808668,
@total=241.84425000000002,
@utime=226.035582>,
#<Benchmark::Tms:0x00007fe47ee8eda0
@cstime=0.0,
@cutime=0.0,
@label="v2",
@real=341.4523780000163,
@stime=14.207616000000002,
@total=231.90784299999999,
@utime=217.70022699999998>,
#<Benchmark::Tms:0x00007fe48073ef08
@cstime=0.0,
@cutime=0.0,
@label="v3",
@real=12.004358999955002,
@stime=0.001827999999996166,
@total=0.003549999999968634,
@utime=0.0017219999999724678>]
Conclusion:
I could have probably explored other Active Record solutions and it would be really interesting to see how other developers are solving these problems the Rails way. But in my experience, so far, Active Record shows some performance bottlenecks in complex and data-heavy bulk updates/inserts. That is where SQL comes in handy.
Though the memoized dictionary (v2) did not do better than the full Active Record way (v1), it does not mean that in other cases it won't help improve performance. I often experience cases where it does help a lot.
Finally, I find it important to run this kind of analysis on different problems/queries. Each case is a case. Again, from my experience, SQL ends up winning for high amounts of data and/or complex updates.
Performance is one of my favorite topics and I'll be writing more about it here, on my blog and on twitter. I'd be happy to discuss performance issues and solutions with other people so feel free to visit or DM me.
Top comments (1)
Is there a reason you aren't considering
update_all
?eg. something like
PlayerGameStatistic
.joins(:player)
.joins(:team)
.where('team.status = 1')
.where('player.status IN (0, 1)')
.update_all('name = players.name FROM players')
You can do
joins(:player)
instead ofincludes
&references
because you havebelongs_to :player
not set to optional (eg. you would gain nothing by doing an outer/right join because ActiveRecord is already presuming they will always both exist). (eager loading via includes may not be necessary here since you're using it as an inner query to feed an outer update_all query). Verify with appending.to_sql
but it should work.If you have a lot of records, you can also do it via
in_batches
:PlayerGameStatistic
.joins(:player)
.joins(:team)
.where('team.status = 1')
.where('player.status IN (0, 1)')
.in_batch(of: 1000) do |batch|
batch.update_all('name = players.name FROM players')
end
cheers!