Since the service's inception in 2008, Evernote has owned, configured, and maintained its own servers and network. This approach gave us the ability to build the service we wanted, the way we wanted to build it. But it was also limiting: difficult to scale, slow to upgrade, and expensive to maintain. While that infrastructure was well suited to supporting Evernote at the time, it lacked the speed and flexibility we needed for tomorrow.
The decision to move to the public cloud was an easy business decision and an exciting one for all of us at Evernote. Since we made the first announcement, we have been working in the background to complete the migration out of our physical datacenters into our new home on Google Cloud Platform (GCP). And, seemingly in the blink of some very weary eyes, it was over: we completed the migration from a standing start in 70 days!
For those of you who want to understand how we approached and executed the migration or are even considering such a migration, we want to give you a deeper view of what we did and how we did it so quickly. This is not intended to be a full runbook of the migration, but should cover the key decisions we had to make and how we approached them.
It’s probably best that we start at the beginning with an overview of the Evernote service.
The Evernote Service
Our service consists of the following building blocks.
The Shard (NoteStore)
This is the core unit of capacity and service at Evernote. It is the building block that stores and serves users' notes. Each shard can support upwards of 300,000 Evernote users. A shard contains:
- Front-end web service layer – the Tomcat-based layer that Evernote clients connect to.
- Data storage layer – a MySQL database that contains the user's notes.
- Search index – a server-side Lucene search index of user content.
There are 762 of these shards, or NoteStores, which in aggregate manage 200 million user accounts and store approximately 5 billion user notes.
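As a rough sketch of what a fixed pool of shards implies for user placement (the names and the modulo scheme below are illustrative assumptions, not Evernote's actual code), a deterministic user-to-shard mapping might look like this:

```python
NUM_SHARDS = 762           # NoteStores, per the figures above
USERS_PER_SHARD = 300_000  # approximate capacity ceiling per shard

def shard_for_user(user_id: int) -> str:
    """Hypothetical stable mapping of a user to a NoteStore shard.

    In practice the authoritative user-to-shard mapping lives in a
    central database; a simple modulo is used here only to illustrate
    the idea of a fixed pool of numbered shards.
    """
    return f"shard-{user_id % NUM_SHARDS:03d}"

# Every user deterministically lands on exactly one of the 762 shards.
print(shard_for_user(12345))  # shard-153
```

Note that 762 shards at roughly 300,000 users each gives headroom above the 200 million accounts the service manages.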
User and Preferences Database (UserStore)

A MySQL-based central user and preferences database that stores all centralized information about users and manages their authentication. Because this database manages all user state and authentication for the service, it is the most critical and complex part of the service, and one we are always extra careful with.
User Attachment Storage (Resources)
We have a separate file storage layer that is used to store the 5 billion user attachments (we call them resources). This consists of 206 self-contained WebDav servers. When you upload an attachment to Evernote, your shard saves that attachment to two different local WebDav servers. A further copy is sent to offsite WebDav servers located in our remote disaster recovery datacenter.
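A minimal sketch of that replica-placement idea, assuming a hash-based choice of servers (the server names and the placement function are hypothetical; the real system tracks placements explicitly):

```python
import hashlib

LOCAL_WEBDAVS = [f"webdav-{i:03d}" for i in range(206)]      # 206 servers, per the text
OFFSITE_WEBDAVS = [f"dr-webdav-{i:03d}" for i in range(206)]  # remote DR site (assumed naming)

def replica_targets(resource_id: str) -> tuple[str, str, str]:
    """Hypothetical placement: two distinct local copies plus one offsite copy.

    A hash of the resource id picks the primary server; the secondary is
    the next server in the ring, which guarantees the two local copies
    never land on the same WebDav.
    """
    h = int(hashlib.sha256(resource_id.encode()).hexdigest(), 16)
    primary = h % len(LOCAL_WEBDAVS)
    secondary = (primary + 1) % len(LOCAL_WEBDAVS)  # always differs from primary
    return LOCAL_WEBDAVS[primary], LOCAL_WEBDAVS[secondary], OFFSITE_WEBDAVS[primary]

local_a, local_b, offsite = replica_targets("note-attachment-42")
assert local_a != local_b  # the two local copies never share a server
```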
Front-end load balancing
We run a high-availability, load balancing cluster which both accelerates and terminates TLS and is responsible for routing a user’s request to the appropriate backend shard where their note data resides.
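Conceptually, once TLS is terminated, the routing step reduces to a lookup from user to backend shard. A toy sketch of that lookup, with hypothetical hostnames and an in-memory table standing in for the real mapping:

```python
# Hypothetical routing table as the load balancer might consult it;
# the authoritative user-to-shard mapping comes from the central
# user database, not a hardcoded dict.
USER_TO_SHARD = {
    "alice": "shard-101.example.internal",
    "bob": "shard-455.example.internal",
}

def route(user: str) -> str:
    """Return the backend shard host serving this user's note data.

    TLS has already been terminated at the load balancing layer by the
    time this lookup runs; unknown users produce an explicit error
    rather than landing on an arbitrary backend.
    """
    try:
        return USER_TO_SHARD[user]
    except KeyError:
        raise LookupError(f"no shard registered for user {user!r}")

print(route("alice"))  # shard-101.example.internal
```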
Lastly, we have another ~200 Linux servers that perform caching and batch processing functions, such as handwriting and text recognition.
Our Migration Options
With a service the size of Evernote, architecting the migration to the cloud is always going to be a complex undertaking with multiple dependent decisions that need to be made. We wanted to move quickly and iterate, so we took the approach of building a strawman based on key, strategic decisions. Then, where possible, we tested to see if the strawman was a viable plan. This allowed us to rapidly iterate our plan.
We started by understanding what needed to change, knowing there would be components that did not simply (or directly) translate into our new cloud world. We split the components of our environment into two categories:
- Lift and Shift – These are systems that would look broadly the same in their new home in GCP. The Shards, UserStore, and most of the supporting services were all in this group. They were Linux-based in our physical datacenter and would migrate to similar Linux virtual machines in the cloud.
- Transform – User Attachment Storage, our load balancing layer, and the Recognition service (Reco) would need to go through significant transformation as part of the migration. Either there was no equivalent of that service in the cloud, or a much more efficient alternative existed there.
We then needed to consider our migration options. For Evernote, there were two obvious options:
- Big Bang – a migration with a single point at which you cut over 100% from the old environment to the new. This approach was more popular (probably because it was often the only option) in the days when everyone ran their own datacenters and application architectures were monolithic and less distributed. It is also known as the 'cross your fingers, close your eyes, press the button, and hope everything is OK' plan!
- Phased cutover – a divide-and-conquer approach where you migrate services in phases or waves, grouped either by service or by user. Such models also often allow you to 'test' or validate parts of the migration before committing to the whole move.
As we reviewed our specific situation, it was clear that 'Big Bang' was not an option. Even with the best planning, cutting over in one move and hoping we had covered all bases would have been too great a risk. We also knew our application had not been designed to run 'split' between multiple locations, though there were indications that we could probably have made a 'split' environment work for a period of time.
So we needed to find a happy medium between the two extremes. We wanted, if possible, to plan for something we termed an 'accelerated phased cutover': the overall service migration would occur during an intense period of under 20 days, with very specific phases within the overall migration window that minimized risk at each step. Such a plan would also allow for rollback points if things simply did not work as expected.
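One way to picture an accelerated phased cutover with rollback points is as a sequence of waves, each followed by a validation gate. This is a hypothetical sketch (the wave size and health-check mechanism are assumptions; only the 762-shard total comes from the text above):

```python
# Split the 762 shards into waves of up to 100 (assumed wave size).
WAVES = [list(range(start, min(start + 100, 762))) for start in range(0, 762, 100)]

def run_cutover(waves, healthy):
    """Migrate wave by wave, stopping at a rollback point on failure.

    `healthy` is a callable standing in for post-wave validation.
    Returns the list of shard ids that remain migrated.
    """
    migrated = []
    for wave in waves:
        migrated.extend(wave)      # cut this wave over to GCP
        if not healthy(wave):      # validate before committing further
            for shard in wave:     # roll only the failed wave back
                migrated.remove(shard)
            break
    return migrated

# With every health check passing, all 762 shards end up migrated.
assert len(run_cutover(WAVES, lambda w: True)) == 762

# If a wave fails validation, earlier waves stay migrated and only
# the failing wave is rolled back.
assert run_cutover(WAVES, lambda w: w[0] < 100) == list(range(100))
```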
In Part 2 we discuss Protecting Customer Data on GCP.
If you have any followup questions please join us on the Evernote Forums.