This is Part 4 of a 5-part series on Evernote’s transition to Google Cloud Platform (GCP). To read this series from the beginning, see Part 1.
In planning the migration of the Evernote service, we wanted to minimize user downtime wherever possible, but also knew our existing architecture would not support a zero-downtime migration. We had the following requirements as inputs when planning the migration methodology and maintenance windows:
- We wanted to minimize the duration we were running in a ‘split site’ configuration.
- We wanted to start slowly, migrating single shards to begin with and ramping up to 8 shards in parallel.
- We wanted an observation window (~48 hours) once a small number of shards were running in Google so we could carefully observe the environment in steady state.
- We wanted a max of 32 shards in parallel – any more and we would’ve considered it too risky. The blast radius would be too big to manage if things went wrong.
- We had to move UserStore during a weekend maintenance window of under one hour to minimize potential customer impact as this would be full downtime for the whole service.
- We wanted no more than 30 minutes of downtime (less if possible) for the migration on a per-shard basis. User downtime is never good and something we are always trying to prevent. In this case, however, our application would not support transparent failover of shards. In the end, the impact on users was reduced by the fact that most Evernote clients sync on a roughly 15-minute cycle, meaning some users would not notice the downtime.
- We wanted to have a plan that allowed us the flexibility to respond to unknowns that are inevitable in a project of this size.
- We recognized that, as we run a truly global service, there were never going to be ‘out of business hours’ maintenance windows.
Taking this all into account and our expected shard migration rate, we settled on the following initial plan which was communicated to users via our forum and social media:
Shard migration windows would be 09:00 to 21:00 PST (Pacific Standard Time) each day:
- December 8 & 9 – Shard Migrations
- December 11 – UserStore Migration
- December 14–16 – Shard Migrations
From December 16–31, we would all be on close watch. All Operations and Engineering staff would be on hand 24/7 to rapidly respond to issues and failures.
In parallel, starting December 1, we also began migrating all of the non-customer-impacting services such as image recognition and our frontend traffic.
Technical Deep Dive on the Migration
When planning the migration of NoteStores, we were able to leverage prior art from our Great Shard Migration. In order to move between hardware versions in the physical datacenter, we had developed a process and scripting to move shards onto new hardware. While this was not totally reusable, it did form the core of our new cloud migration methodology detailed here.
The first thing we needed to do was to create the new shard in Google Cloud and then start the initial data sync to that new shard.
Create empty shard
We began by creating a new NoteStore using Google’s Deployment Manager and our existing Puppet build scripts. This had the same name and configuration as the production shard but had no customer-specific data yet and was not configured in our load-balancing platform.
There were two datasets on a shard containing user-specific data that needed to be copied over:
- MySQL database containing all user note text from that shard:
  - Copy the latest production MySQL backup
  - Prepare the backup
  - Start the instance
  - Set up a replication user on both source and destination
  - Start replication between the nodes, beginning at the point of backup
- Lucene Indexes – Unlike MySQL, Lucene does not have a native way to replicate its data so we needed to use custom tools in order to move this dataset. We wrote a tool around the ubiquitous rsync utility which we ran from cron on each of the source hosts, and sent metrics/logging into our usual Graphite/Splunk ingestion pipeline.
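The rsync wrapper described above could look roughly like the following sketch. The paths, host naming scheme, and metric prefix are hypothetical stand-ins, not Evernote’s actual tooling:

```python
import shlex
import time

def build_rsync_cmd(shard, src_root="/var/lib/lucene", dest_root="/var/lib/lucene"):
    """Build the rsync invocation for one shard's Lucene indexes.

    Hypothetical layout: the real tool ran from cron on each source host.
    --delete keeps the destination an exact mirror of the source, and -z
    compresses data over the bandwidth-constrained VPN links.
    """
    src = f"{src_root}/{shard}/"
    dest = f"gcp-{shard}:{dest_root}/{shard}/"
    return ["rsync", "-a", "-z", "--delete", "--stats", src, dest]

def graphite_metric(shard, seconds, prefix="migration.lucene"):
    """Format a Graphite plaintext-protocol line for the sync duration,
    suitable for the usual metrics ingestion pipeline."""
    return f"{prefix}.{shard}.sync_seconds {seconds} {int(time.time())}"

cmd = build_rsync_cmd("shard-042")
print(shlex.join(cmd))
print(graphite_metric("shard-042", 812))
```

Because the first sync moves the full index and later runs only move deltas, the same command can simply be re-run from cron until cutover.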
We started the build and datasync processes about 5–10 days before the expected service cutover of a batch of shards.
We iterated as we went through this process, updating our approach:
- The original MySQL setup was to leverage the daily backups and binlogs to prepare a fresh restore as a mysql replication target. This worked, but it turned out our backup servers couldn’t handle the dozens of processes we wanted to have running in parallel.
- In order to meet our parallelism requirements, we set up a new GCS bucket as a backup target. We rewrote our ansible playbook to instead do an encrypted and compressed streaming innobackupex backup and write that directly to the GCS bucket.
- When the backups from each shard were available, we used a new playbook to attach additional PD storage to the replication target, then download, uncompress, and prepare the backup.
- A further playbook would run a tool we had written during previous migrations called “genesis” which would verify the MySQL configuration, restore the backup, change passwords, sanity check start/stop, upgrade the database, and finally start MySQL replication.
- While our Lucene replication tooling was rock solid, we immediately noticed that we were saturating our switch uplinks and, more importantly, our VPN connectivity from our datacenter to GCP. Due to some technical details of our shard-to-VPN-tunnel mapping, we had to orchestrate how many Lucene replications we started at once. (The initial syncs were very bandwidth-heavy, with subsequent syncs only replicating changes since the last sync.)
- We tweaked the mapping a bit, taught ansible to batch replication sets accordingly, ran and watched the initial Lucene sync, and then set up a cron sync before moving to the next set.
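The batching described above can be illustrated with a minimal sketch that groups shards by VPN tunnel and releases them in capped batches. The tunnel-mapping function and batch size here are assumptions for illustration, not Evernote’s actual network layout:

```python
from collections import defaultdict

def tunnel_for(shard_id, num_tunnels=4):
    # Hypothetical mapping: the real shard-to-VPN-tunnel assignment
    # was dictated by the datacenter network design.
    return shard_id % num_tunnels

def batches(shard_ids, per_tunnel=2, num_tunnels=4):
    """Yield batches of shards such that no batch starts more than
    `per_tunnel` initial Lucene syncs over any single VPN tunnel."""
    by_tunnel = defaultdict(list)
    for s in shard_ids:
        by_tunnel[tunnel_for(s, num_tunnels)].append(s)
    while any(by_tunnel.values()):
        batch = []
        for tunnel, pending in by_tunnel.items():
            take, by_tunnel[tunnel] = pending[:per_tunnel], pending[per_tunnel:]
            batch.extend(take)
        yield batch

for batch in batches(range(10)):
    print(batch)
```

Each yielded batch would kick off its initial syncs and, once those completed, set up the lightweight cron-driven delta syncs before the next batch started.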
The objective here was to cut the service over safely and quickly, minimizing downtime wherever possible:
- Stop the production shard and remove it from the load balancer, ensuring no customers can make updates to it.
- Run final incremental data sync.
- Start new shard (with synchronized data).
- Run a set of tools (what we called ‘confidence’) to check data was successfully replicated, up-to-date, checksums match, and more.
- Run QA validation script on new shard – This validates the shard is functioning as expected.
- Kick off backup of new shard.
- Open new shard to users.
- Close watch of logs, transactions and monitoring alerts.
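A ‘confidence’ check of the kind mentioned above might, in minimal form, compare row counts and checksums between source and destination. The table names and checksum scheme below are hypothetical (a production tool would more likely lean on something like MySQL’s CHECKSUM TABLE):

```python
import hashlib

def table_checksum(rows):
    """Order-independent checksum over a table's rows (illustrative
    scheme: XOR of per-row SHA-256 digests)."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).hexdigest()
        digest ^= int(h[:16], 16)
    return digest

def confidence(source_tables, dest_tables):
    """Return the list of tables whose row count or checksum differs."""
    failures = []
    for name, src_rows in source_tables.items():
        dst_rows = dest_tables.get(name, [])
        if len(src_rows) != len(dst_rows):
            failures.append((name, "row count"))
        elif table_checksum(src_rows) != table_checksum(dst_rows):
            failures.append((name, "checksum"))
    return failures

src = {"notes": [(1, "a"), (2, "b")], "notebooks": [(1, "default")]}
dst = {"notes": [(2, "b"), (1, "a")], "notebooks": []}
print(confidence(src, dst))  # the empty 'notebooks' copy fails the row-count check
```

An empty result means the destination matches; any failure would hand the shard straight to the fix team before it was opened to users.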
In order to achieve the velocity we needed, this could not be a manual process. Once we had confirmed the process would work, we used ansible to automate it.
Our scripting allowed us to migrate each of our shards (containing 300k users on average) without human intervention, in under 30 minutes. We ramped the parallelism to enable us to run the migration of 32 shards in parallel. This gave us the ability to migrate ~10m users in under 30 minutes.
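The automated cutover can be sketched as a fixed sequence of steps run per shard, with parallelism capped at 32 shards. The step names below are stand-ins for the real ansible-driven tasks:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 32  # beyond this the blast radius was considered too risky

# Stand-ins for the real ansible tasks; each returns True on success.
STEPS = [
    "remove_from_load_balancer",
    "final_incremental_sync",
    "start_new_shard",
    "run_confidence_checks",
    "run_qa_validation",
    "kick_off_backup",
    "open_to_users",
]

def run_step(shard, step):
    # Hypothetical: the real tooling would shell out to ansible here.
    return True

def cutover(shard):
    """Run every cutover step in order, aborting on the first failure
    so the shard can be handed to the fix team."""
    for step in STEPS:
        if not run_step(shard, step):
            return (shard, step)
    return (shard, "done")

shards = [f"shard-{i:03d}" for i in range(64)]
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(cutover, shards))
print(sum(1 for _, status in results if status == "done"), "shards migrated")
```

With ~300k users per shard, 32 shards in flight at once works out to roughly 10 million users per sub-30-minute window, matching the numbers above.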
During these migrations, the operations team divided into very specific roles to ensure nothing went wrong:
Project Manager – This individual coordinated the entire migration, and made decisions on items such as batch size, when to start a batch, and how to communicate progress outside the team.
Lead Engineer – This individual had complete control over migrating a batch: they determined confidence prior to starting a migration, controlled the tools to run the migration, and made all the troubleshooting decisions.
Monitoring Engineers – One or two engineers who constantly watched our monitoring systems for anything abnormal, as well as to let everyone know when monitoring indicated a migration was complete. These engineers also ran our QA tools and had final say on the success of each migration.
Fix Team – We had 2–4 engineers on hand at all times to investigate and fix any problems identified by the lead or monitoring engineers.
The UserStore migration followed the same basic steps used in the MySQL migration of NoteStore, with the major difference being the need to update the JDBC URL on all of the NoteStores to point at the new host.
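Repointing the NoteStores amounts to rewriting the host portion of a JDBC connection string on every node. A minimal sketch (the URL shown is a generic JDBC MySQL URL, not Evernote’s actual configuration):

```python
import re

def repoint_jdbc_url(url, new_host):
    """Replace the host in a jdbc:mysql:// URL, keeping the port,
    database name, and query parameters intact."""
    return re.sub(r"(jdbc:mysql://)[^:/]+", r"\g<1>" + new_host, url)

old = "jdbc:mysql://userstore.example.com:3306/userstore?useSSL=true"
print(repoint_jdbc_url(old, "userstore-gcp.example.com"))
# → jdbc:mysql://userstore-gcp.example.com:3306/userstore?useSSL=true
```

In practice a change like this would be pushed to every NoteStore by configuration management before the shards were restarted against the new UserStore.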
As the UserStore sits in the middle of the whole Evernote service, we planned a specific one-hour maintenance window for its migration early one Sunday morning. Since it is a high-transaction database backing many operations across our service, we ensured we had a large group of Operations and Engineering resources in the office to monitor the migration. We also had a dedicated team from Google present, ready to jump in if we saw issues (they also brought in Sunday morning donuts for the team!).
This was probably the most intensive hour of the whole migration as the whole service was down and the clock was ticking.
The migration path for our other services was usually simpler as we could create the new service in parallel and gradually cut over in most instances. We were able to reuse some of our existing Puppet code and patterns to make the build easier, but also saw this as an opportunity to clean up many years of Puppet code and simplify.
So how did we do against our original plan?
Overall we were able to stick to our original plan fairly well, although we did deviate twice:
- One extra restart of UserStore on December 15 to update parameters to ensure it ran well in our new environment.
- One extra shard migration day (December 19) to migrate the final 96 shards. We could have rushed and pushed these final shards through but we decided to take a safer approach and extend the overall window by one day.
These changes were communicated via forums and social media.
Data on the shard migration
Each bubble on the diagram below represents a batch of users moved simultaneously; the bigger the bubble, the more users in the batch. The y-axis shows the number of users moved per minute, and the x-axis is simply the batch number. Each color represents a different day.
- 11/23 – We were migrating test user shards that run in our production environment to finalize procedure and tooling
- 12/08 – No shards migrated, as we found some final bugs in the code that needed to be fixed
- 12/09 – Start small and build up from there
- 12/14 – We hit our stride and moved 77 million users in a single day.
- The largest batch was #33: 9,463,546 users in 24 minutes (394,314 users per minute).
- The fastest batch was #37: 9,244,151 users in 14 minutes (660,297 users per minute).
- The shortest batch was #43: 4,042,851 users in 11 minutes (367,532 users per minute).
- The longest batch was #51: 147,765 users in 30 minutes (4,926 users per minute).
- No single user was out of service for longer than 30 minutes.
In Part 5 we discuss the Conclusion and Future.
If you have any followup questions please join us on the Evernote Forums.