migrating a rails app from mongo to postgresql

my team and i recently completed a database migration from mongodb to postgresql for one of our rails apps. the service is a graphql api built on rails 7, backed by a mongodb database (an m40 cluster managed through mongo’s atlas platform) with ~500gb of data, and we performed a live zero-downtime migration to a db.m5.2xlarge RDS instance running in our own aws account. the application is organized like a pretty standard rails app: all data is represented by rails models and data access goes through an object mapping layer, mongo’s object document mapper (ODM) mongoid.

the requirements for this project were pretty straightforward:

  1. stop using mongo
  2. don’t take our service down to do an offline migration (given the amount of data we needed to move, the maintenance window we would’ve needed would have been way too long anyway based on some of our initial tests)

our high level approach was the double writing pattern: dual write to both data stores, put reads behind dynamic feature flags, backfill the tables one collection at a time, switch the reads over to the new database, and then cut off the old reads and writes.

this is a very common technique in service-to-service migrations when teams undertake monolith to microservice transitions (which were all the rage five to ten years ago, though the trend has been reversing as of late), and the same process applies to switching data stores within a single service. the only difference is that the new read/write code in the service points at a new data store instead of a new api / service.

setup phase

we started by setting up an initial connection to postgres and adding some basic tooling:

  • set up the postgres database and the rails integration. our infrastructure teams spun up our new postgres instance on RDS, sized comparably to the current storage on atlas. in the rails app, we set up the active record ORM alongside the existing mongoid ODM and updated both our development and CI setups to spin up a postgres image
  • set up data transfer / backfilling utility scripts that extract mongo document data for a given collection, transform it into a postgres compatible format, and insert it into the postgres database. for example, nested documents become normalized foreign key relationships
  • set up feature flagging (we used flipper) to dynamically control the read switch. double writing was not behind switches, but we made sure to wrap our new writes in catch-all exception handling so they could never interrupt requests. a rough sketch of the flag-gated read path follows this list
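
here’s roughly what the flag-gated read path looked like. this is a simplified sketch rather than our actual code: the flag name, models, and helper are made up for illustration.

```ruby
# hypothetical helper showing the flipper-gated read switch: reads come from
# postgres only when the per-resource flag is enabled, otherwise they keep
# hitting mongo through the existing mongoid model.
class OrderFinder
  def self.find(id)
    if Flipper.enabled?(:read_orders_from_postgres)
      Postgres::Order.find_by!(mongo_id: id.to_s)  # new active record model
    else
      Order.find(id)                               # existing mongoid model
    end
  end
end
```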

double writes

we divvied up most of the work by resource type and tackled them in an order based on some combination of entity complexity (lots of relationships, deeply nested) and data volume (getting an early start on the largest collections was important since we had deadlines to hit).

for each resource in the system, we did the following:

  • create active record equivalents of the current ODM models. this means bringing over model level unit tests, validations, and any database level constraints. to uniquely identify migrated data, we made sure to include a mongo_id column on every new table
  • set up dual writes. most of the writes happen through graphql mutation resolvers at the graphql API layer, so this involved adding adjacent active record write logic (there’s a rough sketch of this after the list)
  • duplicate existing unit and functional tests to cover the new models and code
  • set up the backfilling code. the shared migration script was sufficient for most of our data (simple batch read, transform, bulk insert), but a handful of models with more complex entity relationships needed their own migration logic
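
to give a feel for what the dual writes looked like inside a mutation resolver, here’s a stripped-down sketch. the mutation, models, and fields are hypothetical, and the real resolvers handled far more attributes and associations.

```ruby
# hypothetical graphql-ruby mutation showing the dual write pattern: the
# mongoid write remains the source of truth, while the active record write
# is best-effort and can never fail the request.
class Mutations::CreateOrder < Mutations::BaseMutation
  argument :customer_id, ID, required: true
  field :order, Types::OrderType, null: true

  def resolve(customer_id:)
    order = Order.create!(customer_id: customer_id)  # existing mongoid write

    begin
      Postgres::Order.create!(
        mongo_id:    order.id.to_s,  # lets the backfill and verifier match records later
        customer_id: customer_id
      )
    rescue StandardError => e
      # catch-all: a failed postgres write must never interrupt the request
      Rails.logger.error("dual write failed for order #{order.id}: #{e.message}")
    end

    { order: order }
  end
end
```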

backfill and read rollout

  • once dual writing had been enabled for a while and we were confident there were no issues with the new data, run the backfill scripts. depending on the collection, this took anywhere from minutes to days
  • upon backfill completion, verify the migration using a custom built data verifier script that ensures all the mongo documents were successfully transferred. this script knew how to compare both simple flat documents and deeply nested relationships by using the rails model level reflection API (there’s a rough sketch after this list)
  • finally, switch the reads from mongo to postgres. this was done through flipper, so no additional deploys were necessary
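
for a simple flat model, the verification boiled down to something like the sketch below. the names are invented, and the real script also walked associations (via active record’s reflect_on_all_associations) to compare nested relationships, which i’m leaving out here.

```ruby
# hypothetical, heavily simplified verifier for a flat collection: for every
# mongo document, confirm a postgres row exists with the same mongo_id and
# that a handful of shared attributes match.
class OrderVerifier
  CHECKED_ATTRIBUTES = %w[customer_id status total_cents].freeze

  def self.verify(batch_size: 1_000)
    missing    = []
    mismatched = []

    Order.all.asc(:_id).no_timeout.each_slice(batch_size) do |docs|
      rows = Postgres::Order.where(mongo_id: docs.map { |d| d.id.to_s }).index_by(&:mongo_id)

      docs.each do |doc|
        row = rows[doc.id.to_s]
        if row.nil?
          missing << doc.id
          next
        end

        diffs = CHECKED_ATTRIBUTES.reject { |attr| doc[attr] == row[attr] }
        mismatched << [doc.id, diffs] if diffs.any?
      end
    end

    { missing: missing, mismatched: mismatched }
  end
end
```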

cleanup

  • once all dual writing is set up and all reads are done against postgres, remove the double writing and keep only the postgres active record reads and writes.
  • remove all traces of mongo
  • celebrate!

challenges

no project is without its challenges / setbacks, and wow, we had a number to deal with (and overcome!). we hit issues at every stage of the sdlc:

  • coordinating with other teams making changes to the service. we had to enact a code freeze because we kept running into instances of people introducing new writes without the flags / dual writing we required
  • wading through hard to understand business logic with low test coverage. we needed to create active record equivalents of a lot of writes, but some of them were fairly complex (very stateful, lots of conditions) and involved coordinating multiple domain models
  • keeping the new active record models, tests, and scripts isolated. we couldn’t just delete the current application code, so the new models needed to live alongside the old ones. we wanted to preserve the model names as much as possible, but you can’t have two models with the same name in models/, so we introduced a postgres namespace across the board to house the new code. this turned out to be a fantastic solution that made it easy both to add new models and to delete the old ones later
  • database schema migration automation problems. we initially ran the new rails schema migrations by hand, but when we switched to automating them with k8s/helm, we accidentally made the migrations run as one-off jobs (instead of pre-release hooks). as a result, deploys still succeeded despite failing migrations
  • some of our collections are large, so our backfill scripts needed to run anywhere from several hours to several days. this increases the likelihood of running into issues mid transfer, so it’s important for the scripts to be idempotent and resumable. for idempotency, we added a mongo_id column to all of our postgres tables referencing the primary key of the migrated mongo record, and in most backfilling cases (with only a couple of exceptions) we skip the insert if that mongo id is already migrated. for resumability, we always read mongo documents ordered by their primary key (lucky for us, the first four bytes of the 12 byte object id are the creation timestamp) and log the last key of each batch as a checkpoint to use later as a cursor (there’s a rough sketch of this after the list)
  • running backfills set off alerts because of elevated reads / writes against postgres, which was in the call path of all existing requests. we ended up creating a read only mongo replica off of our primary in atlas to use for backfilling. unfortunately, while this solved the contention issue, it introduced new problems around data consistency. for example, there was an instance where i ran the backfill against an outdated replica and ended up inserting stale records into the new database. luckily the verifier detected the missing records and i was able to drop the table and re-run the backfill against a fresh, up to date instance
  • missing key constraints in mongo and duplicate records. a number of collections contained dupes due to missing uniqueness indices, so when we added the appropriate uniqueness constraints to the new postgres tables, the backfilling process blew up because the mongo data was bad. this required some data cleanup, and one of my teammates wrote a handy de-duping script using mongo’s aggregation API that gathers the dupes for any given document key combination into lists and keeps the latest while purging the rest (also sketched after the list)
    • one minor snafu we ran into was that the aggregation does a lot of the grouping of documents in memory on a node, and in one instance this caused a memory spike that degraded average performance while the script was running
    • based on the logs, we still see a good number of duplicate insert errors caused by race conditions between requests attempting to modify the same resource at the same time, which probably explains why we had so many dupes in the old database to begin with. most of these cases can be ignored, but it would be good to figure out why they’re happening so often
  • bad data being inserted into our postgres database due to incorrect new code. for example, there was a situation where we were writing a UTC offset attribute into mongo through the ODM, and when this got carried over to the active record class, it only wrote positive UTC offset values and dropped all negative offsets due to a bad guard clause i added. oopsie
    • we also had more subtle bugs like timestamps not being properly updated. for example, in active record we needed an explicit .touch to bump the updated timestamp when no attributes changed but clients expected it to move; mongoid did this out of the box
  • data divergence in the dual writing code during upserts, caught by the verifier. some records had fields that accrued values over time, but once dual writing was introduced and a new request executed it, only the most recent data in the payload was inserted into the new database (the values already accrued on the field in mongo were not carried over). unfortunately, this gap wasn’t addressed by our backfilling either, because the backfill skips records that were already dual written, so the historical values were never carried over for those records
    • to illustrate with a scenario: say a mongo record was created before dual writing and its field accrues value 1. time passes. we release the dual writing code. a new request upserts the same record, this time with value 2. two writes happen: one to mongo, which ends up with [1, 2], and one to postgres, which only has [2] (the most recent value). this is also sketched after the list
    • to fix these issues, we wrote one off data sync / repair tasks for the diverged records. pretty much any record that gets upserted and whose backfilling strategy was an insert_all (which skips on conflict) was a candidate for divergence.
  • contending with ongoing performance problems in the service and trying to differentiate between degradation caused by our new code and what was already there (turns out, a little bit of both!)
  • when rolling out reads for a single high traffic collection, the entire service went down for a solid 5-10 minutes, and i couldn’t access the flipper UI to roll back because none of the pods were responsive. it turned out to be caused by missing indexes: full table scans against the new table pegged the RDS CPU at 100%
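
to make the idempotent / resumable backfill idea concrete, here’s a stripped-down sketch of what one of the per-collection scripts looked like in spirit. the model names and checkpoint logging are invented, and real collections needed more transformation than a straight column copy.

```ruby
# hypothetical backfill sketch: read mongo documents in _id order (object ids
# sort roughly by creation time), bulk insert into postgres, skip anything
# already migrated, and log the last _id of each batch as a resume cursor.
class OrderBackfill
  BATCH_SIZE = 1_000

  def self.run(resume_from: nil)
    scope = Order.all.asc(:_id)
    scope = scope.where(:_id.gt => BSON::ObjectId.from_string(resume_from)) if resume_from

    scope.no_timeout.each_slice(BATCH_SIZE) do |docs|
      rows = docs.map do |doc|
        {
          mongo_id:    doc.id.to_s,
          customer_id: doc.customer_id,
          status:      doc.status,
          created_at:  doc.created_at,
          updated_at:  doc.updated_at
        }
      end

      # the unique index on mongo_id makes this idempotent: already-migrated
      # rows (including dual-written ones) are skipped on conflict
      Postgres::Order.insert_all(rows, unique_by: :mongo_id)

      Rails.logger.info("order backfill checkpoint: #{docs.last.id}")
    end
  end
end
```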
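
the de-duping idea from the duplicate records challenge, sketched with mongo’s aggregation API. the collection and key fields here are hypothetical: group by the would-be unique key, collect the _ids, keep the newest document, and purge the rest.

```ruby
# hypothetical de-duping sketch using the aggregation pipeline through the
# driver collection: find key combinations with more than one document,
# keep the newest one, delete the others.
duplicate_groups = Order.collection.aggregate([
  { '$sort'  => { '_id' => 1 } },  # oldest first, so the last pushed _id is the newest
  { '$group' => {
      '_id'   => { 'customer_id' => '$customer_id', 'sku' => '$sku' },
      'ids'   => { '$push' => '$_id' },
      'count' => { '$sum' => 1 }
    } },
  { '$match' => { 'count' => { '$gt' => 1 } } }
], allow_disk_use: true)  # lets large groups spill to disk instead of spiking memory

duplicate_groups.each do |group|
  stale_ids = group['ids'][0..-2]  # everything except the newest
  Order.collection.delete_many('_id' => { '$in' => stale_ids })
end
```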
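
and the upsert divergence scenario, expressed as code. everything here is hypothetical and only meant to show the shape of the problem: the mongo upsert accrues values with $push, while the dual-written active record upsert only ever sees the latest payload.

```ruby
# before dual writing: the mongo record has already accrued value 1
Order.collection.update_one(
  { 'external_ref' => 'abc' },
  { '$push' => { 'values' => 1 } },
  upsert: true
)  # mongo: values == [1]

# after dual writing ships, a new request upserts value 2 into both stores
Order.collection.update_one(
  { 'external_ref' => 'abc' },
  { '$push' => { 'values' => 2 } },
  upsert: true
)  # mongo: values == [1, 2]

Postgres::Order.upsert(
  { external_ref: 'abc', values: [2] },  # only the latest payload is known here
  unique_by: :external_ref
)  # postgres: values == [2], diverged from mongo
```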

we did a pretty great job managing these issues as a team, and right now we’re fully on postgres and it looks like it’s running smoothly, so yay!

mongodb read preferences

MongoDB read preferences give you control over read behavior when using replica sets. Writes always go to the primary, but reads can be configured to go to the primary or to secondaries based on various criteria. In most versions of mongo the read preference defaults to primary in the client, but you should check your driver version for the default.
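
In a rails app the read preference is typically configured on the client, either in mongoid.yml or directly on a ruby driver client. A sketch with made-up hosts:

```ruby
require 'mongo'

# read preference set at the client level (it can also be set per collection
# or per operation); hosts and database are hypothetical
client = Mongo::Client.new(
  ['db1.example.com:27017', 'db2.example.com:27017', 'db3.example.com:27017'],
  database: 'app_production',
  read: { mode: :secondary_preferred }
)

# override on a single collection handle when you need the freshest data
client[:orders, read: { mode: :primary }].find(status: 'active').first
```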

primary

Read from the primary. If the primary is down, the read fails. This is useful if you need the latest successful write and have zero tolerance for stale data. The terms “latest” and “successful” are a bit of a sliding scale depending on the read concern. For example, with read concern local on a primary, it’s possible to read writes from other clients that have not yet been acknowledged back to them. In practical terms, this means the data could be lost in the event of a rollback, since it may not have been replicated to a majority of nodes.

Regardless of a write’s write concern, other clients using "local" or "available" read concern can see the result of a write operation before the write operation is acknowledged to the issuing client.

https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/#read-uncommitted

But I thought mongo maintained a WAL using an on-disk journal? Why would data be rolled back at all?

Yes, the journal is used for data recovery in the event of a node shutdown. But when a primary node goes down, a new node is elected as the primary, and when the previous primary rejoins the cluster as a secondary it may have writes that are ahead of the new primary. Since the new primary is the source of truth, those writes need to be rolled back to stay consistent with it!

If you want data with the highest durability level, use the majority read concern. With read concern majority, only data that has been acknowledged by a majority of nodes is returned. However, this sacrifices some data freshness: the latest write from a client may not have been replicated to a majority of nodes yet, so you may read an older value even though you’re reading from the primary.

Another factor to consider when choosing this setting is your primary node’s capacity and current load – if your primary is close to capacity on cpu or memory, switching reads to the primary may impact writes.
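
For completeness, here’s how a majority read concern might be specified with the ruby driver (client-level options; the hosts and collection are hypothetical):

```ruby
# read from the primary, but only return data acknowledged by a majority of nodes
client = Mongo::Client.new(
  ['db1.example.com:27017'],
  database: 'app_production',
  read: { mode: :primary },
  read_concern: { level: :majority }
)

client[:orders].find(status: 'active').first
```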

primaryPreferred

Read from the primary, but if the primary is unavailable, the read goes to a secondary. This setting is appropriate when you want the most recent write but can tolerate occasional stale reads from a secondary. Replication lag can range pretty widely, from single digit milliseconds to minutes depending on write volume, so keep in mind that during periods of high replication lag you’re more likely to be reading stale data.

secondary

All reads go to secondaries. Not all secondaries though – just one, chosen by a server selection process. The server selection works like this:

First, there’s a default threshold value of 15ms (localThresholdMS). This isn’t the whole window used for server selection though – it’s only part of it. The other part is the lowest average network round trip time (RTT) among the eligible nodes. So if the closest node has an RTT of 100ms, the candidate set is every node whose RTT falls within 100 + 15 = 115ms, and the driver randomly selects a node from that set to forward the request to.
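
A toy illustration of that selection window (not the driver’s actual implementation, just the arithmetic described above):

```ruby
# toy sketch of "lowest RTT + threshold" server selection, with made-up RTTs
LOCAL_THRESHOLD_MS = 15

rtts = { 'node-a' => 100, 'node-b' => 108, 'node-c' => 180 }  # avg RTT in ms

window   = rtts.values.min + LOCAL_THRESHOLD_MS             # 100 + 15 = 115
eligible = rtts.select { |_node, rtt| rtt <= window }.keys   # ["node-a", "node-b"]

chosen = eligible.sample  # the read is forwarded to one of these at random
```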

This read preference can make sense if your system is very read heavy AND you can tolerate stale reads. Scaling out with many secondary nodes is one way to spread your read load and improve read availability. At the same time, it may add load on your primary (due to the increased oplog syncing), increase inconsistency and the likelihood of stale data because there are now more nodes that can be out of sync with writes, and potentially increase write latency, particularly for writes that require majority acknowledgement, because you’ll need N / 2 + 1 acknowledgements.

secondaryPreferred

All reads go to secondaries, but if none are available, the read goes to the primary. The same concerns as the “secondary” read preference apply here. The additional factor to consider is whether your primary can afford to take on this extra load.

nearest

It doesn’t matter whether the node is a primary or a secondary – just pick the one with the lowest latency, based on the server selection algorithm I mentioned in the secondary read preference section.