We've shipped Rails apps where the database goes from 5 GB to 500 GB and from 50 requests per second to 5,000. The migrations that worked fine in development at 5 GB will lock the table for 8 minutes at 500 GB. Every Rails team learns this once, usually at 2am, usually during a launch week.
This post is the playbook we use on every production Rails project. It assumes Postgres 11+ (the rules are different on older versions) and Rails 6.1+, where the migration DSL has the helpers we need.
What is a zero-downtime Rails migration? (the short version)
A zero-downtime Rails migration is a schema change that runs against a live production database without locking tables long enough to time out incoming requests. On Postgres, this usually means avoiding operations that take an ACCESS EXCLUSIVE lock for more than a few hundred milliseconds. The strong_migrations gem enforces this at the migration-file level.
Watch first: how Postgres locks actually work
Before the rules, you need to understand the lock model. Andrew Atkinson's RailsConf 2022 talk walks through what happens in Postgres when Rails issues ALTER TABLE — the lock acquisition, the queue, and the cascade that takes down read traffic. The 30 minutes you spend on this video saves you from the next incident.
Why most Rails migrations cause downtime
Three categories. Almost every downtime incident we've debugged falls into one:
- Long table rewrites. Adding a column with a non-static default, changing a column type, or removing a column with constraints can trigger Postgres to rewrite the entire table. On a 500M-row table this is measured in minutes, not seconds.
- Lock contention with running queries. ALTER TABLE needs an ACCESS EXCLUSIVE lock. If even one long-running query is reading the table, the migration queues. Every subsequent query also queues behind the migration. Five seconds of waiting becomes five minutes of cascading timeouts.
- App/DB schema mismatch during deploy. A column rename means the old app version is reading column X, the new app version is reading column Y, and during the rolling deploy both are live simultaneously. The half-deployed app crashes on every request that hits the wrong instance.
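The queueing cascade in the second category can be sketched with a toy model. This is a simplified illustration in plain Ruby, not how Postgres is implemented: the TableLockQueue class and session names are hypothetical, and only three lock modes are modelled. The rule it encodes is that a request waits if it conflicts with a held lock or with a request already waiting ahead of it.

```ruby
# Toy model of table-lock queueing. Only three lock modes are modelled;
# the conflict pairs between them follow the Postgres lock-conflict table.
CONFLICTS = {
  access_share:     [:access_exclusive],
  row_exclusive:    [:access_exclusive],
  access_exclusive: [:access_share, :row_exclusive, :access_exclusive]
}.freeze

# Hypothetical class for illustration only.
class TableLockQueue
  attr_reader :granted, :waiting

  def initialize
    @granted = [] # [session, mode] pairs currently holding a lock
    @waiting = [] # [session, mode] pairs queued in arrival order
  end

  # A request is granted only if it conflicts with no held lock and with
  # no request already waiting ahead of it.
  def request(session, mode)
    blocked = (@granted + @waiting).any? { |_, other| CONFLICTS[mode].include?(other) }
    (blocked ? @waiting : @granted) << [session, mode]
    blocked ? :waiting : :granted
  end
end

queue = TableLockQueue.new
p queue.request("report query", :access_share)    # long-running SELECT
p queue.request("migration", :access_exclusive)   # ALTER TABLE
p queue.request("web request", :access_share)     # ordinary SELECT
```

The third request is an ordinary read that would normally coexist with the report query, yet it waits because the migration is queued ahead of it.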
Install strong_migrations first, not later
Andrew Kane's strong_migrations gem is the single highest-leverage thing you can add to a Rails app. It detects unsafe operations when a migration runs, so the check fires the first time you run the migration in development, and it tells you what to do instead. Install it on the first migration of every project, before anyone has had a chance to ship a bad pattern.
# Gemfile
gem "strong_migrations"
# config/initializers/strong_migrations.rb
StrongMigrations.start_after = 20260101000000 # don't lint historical migrations
StrongMigrations.statement_timeout = 1.hour # safety net
StrongMigrations.lock_timeout = 10.seconds # bail rather than queue
StrongMigrations.target_postgresql_version = "15"
The lock_timeout is the most important setting. Without it, an ALTER TABLE behind a long-running query will queue indefinitely and take down read traffic. With it, the migration fails fast and the deploy aborts, which is a much better outcome.
The seven unsafe operations (and their safe alternatives)
1. Adding a NOT NULL column with a default to a large table
The classic killer. add_column :users, :country, :string, null: false, default: "US" looks innocent. Before Postgres 11, on a 50M-row users table, it rewrites every row, holds an ACCESS EXCLUSIVE lock for the duration, and blocks all reads and writes.
On Postgres 11+, static defaults are metadata-only, so Postgres doesn't rewrite the table. The statement still has to acquire the ACCESS EXCLUSIVE lock, however briefly, and Rails runs it inside a transaction that holds that lock until commit, so it can queue behind a long-running query like any other ALTER TABLE. The safe pattern is to split it across multiple deploys (more on that below).
2. Creating an index without CONCURRENTLY
add_index :users, :email takes a SHARE lock that blocks writes for the duration of the build. On a large table this is minutes. The fix:
class AddIndexToUsersEmail < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :users, :email, algorithm: :concurrently
  end
end
Two changes from the default: disable_ddl_transaction! because CONCURRENTLY can't run in a transaction, and algorithm: :concurrently which maps to Postgres's CREATE INDEX CONCURRENTLY. The trade-off: a concurrent build that fails partway (for example on a uniqueness violation) leaves an invalid index behind. After the migration, always verify the index exists and is valid. The indisvalid flag lives in the pg_index catalog, not in the pg_indexes view: SELECT c.relname, i.indisvalid FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid WHERE i.indrelid = 'users'::regclass;
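A post-migration validity check could be scripted roughly as follows. This is a minimal sketch: index_validity_sql and invalid_indexes are hypothetical helper names, the table name is interpolated directly and must come from trusted input, and the result rows are assumed to be hashes with real booleans in indisvalid.

```ruby
# SQL that lists each index on a table with its validity flag.
# indisvalid is a column of the pg_index catalog.
# The table name is interpolated as-is: pass trusted values only.
def index_validity_sql(table)
  <<~SQL
    SELECT c.relname AS index_name, i.indisvalid
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    WHERE i.indrelid = '#{table}'::regclass
  SQL
end

# Given result rows (hashes), return the names of the invalid indexes.
def invalid_indexes(rows)
  rows.reject { |row| row["indisvalid"] }.map { |row| row["index_name"] }
end

# Example rows, shaped like the query result (made-up data).
rows = [
  { "index_name" => "users_pkey", "indisvalid" => true },
  { "index_name" => "index_users_on_email", "indisvalid" => false }
]
puts index_validity_sql("users")
p invalid_indexes(rows)
```

An index reported here has to be dropped and rebuilt before the planner will use it.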
3. Removing a column the app still references
A rolling deploy means the old app version is live alongside the new one for a few minutes. If the new migration drops a column the old code still selects, every request to the old instance throws ActiveRecord::StatementInvalid.
The safe pattern is two deploys. Deploy 1: ship code that doesn't reference the column, and add self.ignored_columns = [:old_column] to the model so Active Record drops the column from its cached schema and stops referencing it in queries. Deploy 2: drop the column.
4. Renaming a column
Same problem as removal but worse: at no point during the deploy do both names exist. The safe pattern is a five-step dance: add the new column, deploy code that writes to both and reads from the old one, backfill the new column, deploy code that reads from the new one, then drop the old column. That is several deploys and a backfill. This is why "just rename it" is the wrong instinct on a production table.
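The dual-write phase is the part that tends to get skipped, so here is a minimal sketch of it. It uses a plain Ruby object as a stand-in for an Active Record model; the class, the name and full_name columns, and the read_from_new flag are all hypothetical.

```ruby
# Plain-Ruby stand-in for a model mid-rename from `name` to `full_name`.
class UserRecord
  attr_reader :name, :full_name

  def initialize(name: nil, read_from_new: false)
    @name = name                   # value already stored in the old column
    @full_name = nil               # the new column starts out empty
    @read_from_new = read_from_new # flipped in the deploy that switches reads
  end

  # Dual-write: every assignment lands in both columns.
  def display_name=(value)
    @name = value
    @full_name = value
  end

  # Reads come from exactly one column, chosen by the flag.
  def display_name
    @read_from_new ? @full_name : @name
  end

  # Backfill: copy the old value for rows written before dual-writing began.
  def backfill!
    @full_name = @name if @full_name.nil?
  end
end

legacy = UserRecord.new(name: "Grace", read_from_new: true)
p legacy.display_name # nil until the backfill runs
legacy.backfill!
p legacy.display_name
```

The nil on the first read is the reason the read switch has to wait until the backfill has finished.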
5. Changing a column type
Changing integer to bigint on a large table rewrites the table and locks it. The safe pattern: add a new column with the target type, dual-write, backfill, swap the read side, drop the old column. Postgres skips the rewrite for a few specific type changes (e.g., increasing a varchar length limit), but the rule of thumb is: assume any type change rewrites.
6. Adding a foreign key without validating separately
add_foreign_key :posts, :users takes a SHARE ROW EXCLUSIVE lock on both tables while it validates every existing row, which blocks writes to both. On large tables this is minutes. Split it:
# Migration 1: add the FK as NOT VALID (fast; existing rows are not checked)
add_foreign_key :posts, :users, validate: false
# Migration 2: validate (acquires only SHARE UPDATE EXCLUSIVE on posts, so reads and writes continue)
validate_foreign_key :posts, :users
7. Setting NOT NULL on an existing column
change_column_null :users, :email, false scans every row to validate the constraint, holding an ACCESS EXCLUSIVE lock the whole time. The safe pattern on Postgres 12+:
# Migration 1 — add a CHECK constraint as NOT VALID (instant)
execute "ALTER TABLE users ADD CONSTRAINT users_email_null CHECK (email IS NOT NULL) NOT VALID"
# Migration 2 — validate the constraint (no exclusive lock)
execute "ALTER TABLE users VALIDATE CONSTRAINT users_email_null"
# Migration 3 — flip the column NOT NULL (Postgres uses the validated check, instant)
change_column_null :users, :email, false
execute "ALTER TABLE users DROP CONSTRAINT users_email_null"
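To reuse the same three migrations for other columns, the SQL can be generated. This is a sketch in plain Ruby: not_null_plan is a hypothetical helper, the table_column_null constraint name is just the naming used in the example above, and the identifiers are interpolated unquoted, so pass trusted values only.

```ruby
# Builds the SQL for each migration of the NOT NULL pattern.
# Returns one array of statements per migration, in order.
def not_null_plan(table, column)
  constraint = "#{table}_#{column}_null"
  [
    # Migration 1: add the check without scanning existing rows.
    ["ALTER TABLE #{table} ADD CONSTRAINT #{constraint} CHECK (#{column} IS NOT NULL) NOT VALID"],
    # Migration 2: scan existing rows under a weaker lock.
    ["ALTER TABLE #{table} VALIDATE CONSTRAINT #{constraint}"],
    # Migration 3: set NOT NULL, then drop the now-redundant check.
    ["ALTER TABLE #{table} ALTER COLUMN #{column} SET NOT NULL",
     "ALTER TABLE #{table} DROP CONSTRAINT #{constraint}"]
  ]
end

not_null_plan("users", "email").each_with_index do |statements, i|
  puts "-- Migration #{i + 1}"
  statements.each { |sql| puts "#{sql};" }
end
```

Rails 6.1+ also provides add_check_constraint (with validate: false) and validate_check_constraint, which cover the first two migrations without raw SQL.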
The 6-step pattern for adding a NOT NULL column with a default
The canonical example, end to end. Adding country (NOT NULL, default "US") to a 50M-row users table:
| Step | What | Why |
|---|---|---|
| 1 | Add column nullable, no default: add_column :users, :country, :string | Metadata-only, instant |
| 2 | Deploy app code that writes country on create/update | New rows are populated going forward |
| 3 | Backfill existing rows in batches (rake task, not migration) | Batching avoids long locks; throttling avoids replication lag |
| 4 | Verify no NULLs remain: SELECT COUNT(*) FROM users WHERE country IS NULL | Don't trust the rake task — confirm |
| 5 | Add CHECK constraint NOT VALID, then VALIDATE (see operation #7 above) | Validates without an exclusive lock |
| 6 | Set change_column_null :users, :country, false + set default for new rows | Instant; Postgres uses the validated check |
This is 3-5 deploys spread over a day or two. It feels slow. The alternative is the 8-minute lock that brings down checkout on Black Friday, so we take the slow path.
Backfilling without taking down replication
Backfills are where most teams hurt themselves the second time. The rules:
- Batch. Update 1,000-5,000 rows per transaction, not the whole table.
- Throttle. Sleep 100-500ms between batches. Without throttling, you'll saturate replication and cause read-replica lag, which often looks like an outage to your reporting/analytics consumers.
- Run outside migrations. Backfills go in a rake task or a background job, not in db/migrate/. Migrations should be metadata-only; data changes belong in code you can rerun, monitor, and stop.
- Use in_batches, not find_each, for updates. in_batches issues a bulk UPDATE per batch; find_each updates row-by-row.
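namespace :backfill do
  desc "Backfill users.country in throttled batches"
  task country: :environment do
    total = User.where(country: nil).count
    done = 0
    User.where(country: nil).in_batches(of: 2_000) do |batch|
      # update_all returns the number of rows it changed. Calling batch.size
      # afterwards would re-count rows that are still NULL and return 0.
      done += batch.update_all(country: "US")
      puts "Backfilled #{done}/#{total}"
      sleep 0.2 # throttle so replicas can keep up
    end
  end
end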
This is the pattern we used on every backfill in the RankLoop SaaS rebuild. RankLoop runs uptime-sensitive customer dashboards — taking the DB down during deploy isn't an option, so every schema change followed the 6-step pattern above.
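It helps to know roughly how long a throttled backfill will run before starting it. The arithmetic can be sketched as below; backfill_estimate is a hypothetical helper, and seconds_per_batch (the time each UPDATE takes) is an assumption you would measure on your own database. With it left at zero the result is a lower bound.

```ruby
# Rough estimate of backfill wall-clock time.
# seconds_per_batch is the measured time per UPDATE; 0.0 gives a lower bound.
def backfill_estimate(rows:, batch_size:, sleep_seconds:, seconds_per_batch: 0.0)
  batches = (rows.to_f / batch_size).ceil
  total_seconds = batches * (sleep_seconds + seconds_per_batch)
  { batches: batches, minutes: (total_seconds / 60.0).round(1) }
end

# The 50M-row users table with the batch size and sleep used above.
p backfill_estimate(rows: 50_000_000, batch_size: 2_000, sleep_seconds: 0.2)
```

For the 50M-row table this comes to 25,000 batches and a little over 83 minutes of sleep time alone, which is why the backfill belongs in a task you can stop and resume.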
Three production failure modes we've seen
"It worked in staging"
Staging has 10k rows. Production has 50M. The same migration takes 200ms in staging and 8 minutes in production. The fix isn't faster staging. It's rehearsing the migration against a production-sized clone and timing it (EXPLAIN ANALYZE helps for backfill queries, but it cannot explain DDL such as ALTER TABLE), and using the lock_timeout setting so the migration fails fast rather than hangs. Our Rails performance optimization guide covers production-size testing in more detail.
The deploy timeout that becomes a deploy lock
CI/CD runs the migration with a 5-minute timeout. The migration takes 8 minutes. CI kills the process, but Postgres doesn't roll back immediately: the transaction stays open, holding the lock. New deploys queue. The whole pipeline is jammed for 30 minutes. A StrongMigrations.statement_timeout set below the CI timeout prevents this, because Postgres cancels the statement itself and the transaction rolls back cleanly. The 1-hour value in the initializer above is only a backstop and would not fire before a 5-minute CI limit.
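One way to keep these limits consistent is a small sanity check run in CI. This is a sketch under one assumption: that the goal is for Postgres to cancel the statement before CI kills the process. timeout_problems is a hypothetical helper and is not part of strong_migrations.

```ruby
# Returns a list of problems with a set of timeouts (all in seconds).
# Intended ordering: lock_timeout < statement_timeout < ci_timeout.
def timeout_problems(lock_timeout:, statement_timeout:, ci_timeout:)
  problems = []
  unless lock_timeout < statement_timeout
    problems << "lock_timeout should be shorter than statement_timeout"
  end
  unless statement_timeout < ci_timeout
    problems << "statement_timeout should be shorter than the CI timeout"
  end
  problems
end

# A 10-second lock timeout and 1-hour statement timeout against a 5-minute CI limit.
puts timeout_problems(lock_timeout: 10, statement_timeout: 3600, ci_timeout: 300)
```

With a 1-hour statement timeout and a 5-minute CI limit, the check reports the second problem: CI would kill the process long before Postgres cancels the statement.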
Backfill that takes down read replicas
Engineer writes a one-liner backfill: User.where(country: nil).update_all(country: "US"). Single statement, single transaction, but the WAL stream from this one statement is gigabytes. Read replicas fall 30+ minutes behind. Analytics dashboards show 30-minute-old data, customer support thinks the system is broken. Always batch and throttle, never single-statement on tables over 100k rows.
Where this fits with Rails deployment
Zero-downtime migrations only matter if your deploy process is itself zero-downtime. Most of the migration patterns above assume a rolling deploy with at least two app instances. If you're still single-instance, the migration safety helps but you'll still have request-drop windows during the deploy itself. Our Kamal deployment guide covers the deploy-side of zero-downtime, and our Active Record best practices post covers the data-access patterns that make migrations less frequent in the first place.
For legacy Rails apps where migrations have accumulated over years, our legacy Rails modernization writeup covers how to introduce strong_migrations into a codebase that already has 200+ historical migrations without flagging all of them as unsafe (the start_after config does this).
External references worth bookmarking
- strong_migrations on GitHub — the gem README is the most up-to-date reference for which operations are flagged and the recommended alternatives.
- PostgreSQL CREATE INDEX docs — the authoritative reference on CONCURRENTLY, its limitations, and what happens when it fails.
- GitLab — avoiding downtime in migrations — GitLab runs one of the largest production Rails apps on Postgres and has the most detailed public playbook we've found.
FAQ: Zero-downtime Rails migrations
Does strong_migrations slow down development?
No. It checks migrations when they run, so you see the warning the first time you run the migration in development, and it only flags operations it considers unsafe. It also gives you copy-pasteable alternatives in the error message. The dev-time cost is the 30 seconds to read the warning; the prod-time savings are the 8-minute outage you avoid.
Can I add a NOT NULL column with a default in one migration on Postgres 11+?
For static defaults on small tables (under 1M rows), yes — Postgres 11+ stores the default as catalog metadata without rewriting the table. For dynamic defaults like gen_random_uuid(), no — those trigger a full rewrite. The safe rule across all table sizes is still to split it into the 6-step pattern.
What's the difference between disable_ddl_transaction! and a normal migration?
Rails wraps every migration in a transaction by default so the schema change rolls back cleanly on failure. CREATE INDEX CONCURRENTLY can't run in a transaction (Postgres rule), so concurrent index migrations must call disable_ddl_transaction!. Use it sparingly — outside a transaction, partial failures don't roll back automatically.
How do I add an index to a 500M-row table without downtime?
Use add_index ..., algorithm: :concurrently in a migration with disable_ddl_transaction!. Expect the build to take minutes to hours depending on table size and write volume. After it completes, query the pg_index catalog to verify indisvalid = true. A concurrent build that fails leaves behind an invalid index that the planner won't use and that has to be dropped and rebuilt.
Should I use the Rails strong_migrations gem or write custom CI checks?
Start with strong_migrations. It's the de-facto standard, covers 90%+ of unsafe operations, and is maintained by Andrew Kane (PgHero, pretender, dozens of production-tested gems). Custom CI checks make sense as a layer on top — e.g., a check that no PR adds a migration over 50 lines or a migration that touches more than one table.
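The two example checks could be sketched like this. It is a rough illustration, not a robust parser: migration_violations and TABLE_CALL are hypothetical names, the list of DSL calls is a partial subset, and table names are detected with a regular expression that only sees symbol arguments.

```ruby
# Migration DSL calls whose first argument is a table name.
# Illustrative subset only; extend it for your own codebase.
TABLE_CALL = /\b(?:add_column|remove_column|rename_column|change_column_null|change_column|add_index|remove_index|create_table|drop_table|change_table)\s*\(?\s*:(\w+)/

# Returns human-readable violations for one migration's source text.
def migration_violations(source, max_lines: 50)
  violations = []

  line_count = source.lines.count
  violations << "#{line_count} lines (limit #{max_lines})" if line_count > max_lines

  tables = source.scan(TABLE_CALL).flatten.uniq
  violations << "touches #{tables.size} tables: #{tables.join(', ')}" if tables.size > 1

  violations
end

sample = <<~MIGRATION
  class Example < ActiveRecord::Migration[7.1]
    def change
      add_column :users, :country, :string
      add_index :orders, :user_id
    end
  end
MIGRATION

puts migration_violations(sample)
```

In CI you would read each changed file under db/migrate/ and fail the build when the returned list is non-empty.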
How we can help
At TechVinta, we ship Rails apps that handle production schema evolution as a continuous, zero-downtime activity, not a quarterly emergency. Most of our engagements with established Rails teams start with a migration audit — we look at the last 50 migrations, identify the unsafe patterns, and stand up strong_migrations + a deployment workflow that prevents them going forward.
Running Rails in production and worried about your next big migration? Talk to our Rails DevOps team or get a free estimate — we'll audit your migration history and propose a plan within 48 hours.