
Make sure your rewrite/migration is fixing the problem, not simply moving it.

Perception

One of the teams I worked with was pouring a great deal of effort into moving slow background processing from Ruby/Sidekiq to JavaScript/Lambda, and Ruby was considered “legacy” by much of the team. The prevailing perception was that JavaScript was inherently faster than Ruby, Lambda was far superior to Sidekiq, and anything ported between them would automatically be faster.

Soon after joining the team, I made a database change that caused a Ruby unit test to fail. It was an easy fix, and I included it in the changes I submitted for code review. One of the reviewers commented that the code I fixed was obsolete and should be removed. Out of curiosity, I took another look at the Ruby code—and the JavaScript that replaced it.

I didn’t like what I found. Here’s the Ruby process in a nutshell:

  • Query ~100,000 rows from table A in chunks of 1,000 using LIMIT and OFFSET
  • Break each chunk of 1,000 into smaller chunks of 10 for insertion into table B

This didn’t happen in nested loops. Instead:

  • The top-level routine read each chunk of 1,000 and queued it for level 2 using Sidekiq
  • The level 2 routine broke it into chunks of 10 and queued those for level 3 using Sidekiq
  • The level 3 (and final) routine inserted each chunk of 10 into table B using a single INSERT statement

The data was modified during the process, so it couldn’t be a direct table-to-table copy. But if you follow the convoluted process above and do the math, you’ll see:

  • Table A is queried 100 times
  • Table B is written to 10,000 times
  • Sidekiq handles 10,101 invocations across 3 routines
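The arithmetic can be checked with a plain-Ruby simulation of the three-level fan-out (no Sidekiq, no database; counters stand in for the real queries, inserts, and job invocations):

```ruby
# Simulate the chunk sizes described above; counters replace the real work.
ROWS  = 100_000
CHUNK = 1_000
SUB   = 10

queries = inserts = jobs = 0

jobs += 1                    # level 1: the top-level routine
(ROWS / CHUNK).times do      # 100 chunks of 1,000
  queries += 1               # one SELECT ... LIMIT 1000 OFFSET n
  jobs += 1                  # one level-2 job queued per chunk
  (CHUNK / SUB).times do     # 100 sub-chunks of 10
    jobs += 1                # one level-3 job queued per sub-chunk
    inserts += 1             # one INSERT of 10 rows
  end
end

puts queries  # 100
puts inserts  # 10000
puts jobs     # 10101
```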

As I was going to St Ives,
I met a man with seven wives,
Each wife had seven sacks,
Each sack had seven cats,
Each cat had seven kits:
Kits, cats, sacks, and wives,
How many were there going to St Ives?

The new JavaScript code did much the same thing, except:

  • It substituted Lambda for Sidekiq
  • It introduced a data error because the developer misunderstood a bit of Ruby logic

Reality

Ruby vs. JavaScript is out of scope for this post, and Lambda definitely had an edge here simply due to its higher concurrency.

HOWEVER, the real performance limit was the database access pattern: paginating with LIMIT and OFFSET forces each successive chunk to scan past all the rows before it, and 10,000 ten-row INSERTs pay per-statement and per-round-trip overhead every single time.

Here’s a better approach:

  • Create a temporary table matching table B
  • Copy the 100,000 rows from table A to the temp table using INSERT ... SELECT
  • Modify the data using UPDATE
  • Insert the rows into table B using another INSERT ... SELECT
  • Drop the temporary table
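The steps above, sketched as SQL. The table and column names are illustrative, the transformation is a stand-in (the post doesn't show the real one), and the CREATE TABLE ... LIKE syntax is PostgreSQL-flavored:

```sql
BEGIN;

-- Temp table shaped like the destination (names here are hypothetical)
CREATE TEMPORARY TABLE staging (LIKE table_b INCLUDING DEFAULTS);

-- One set-based copy instead of 100 paginated SELECTs
INSERT INTO staging (id, payload)
SELECT id, payload FROM table_a;

-- Modify the data in place (stand-in for the real transformation)
UPDATE staging SET payload = upper(payload);

-- One set-based write instead of 10,000 ten-row INSERTs
INSERT INTO table_b SELECT * FROM staging;

DROP TABLE staging;

COMMIT;
```

Wrapping the statements in a transaction is what makes the whole task atomic: either every row lands in table B, or none do.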

The entire process only needed 5–7 queries in a single invocation. And even though the runtime of the single job was significant, it still ran ~3× faster than the JavaScript/Lambda implementation—largely because the data never left the database. As a bonus, the entire task was now atomic.

Oh, and we moved the job back to legacy Ruby/Sidekiq—because consolidating 10,101 invocations into one caused timeouts in Lambda.
