This is the final post in a 5-part series on Evernote’s transition to Google Cloud Platform (GCP). To read this series from the beginning, see Part 1.
All of the planning and processes explored in parts 1-4 of this series got us ready to build the next generation of Evernote. So how were we able to achieve the migration so quickly?
To stand any chance of achieving the objective, we needed to minimize distractions, which are always a drain on time and can lead to failure. We had to say ‘no’ to other, less important requests coming in from the business, and in order to do that we needed to ensure the whole company knew why we were doing this. We found brownbag sessions were the best format for this, as they provided a forum for people who wanted to know more and an opportunity for them to ask questions.
Having the right partner
In the early days of the project, when we were discussing how quickly we could migrate, we did not want to settle for the normal; we wanted to set a big, hairy, audacious goal. The great thing about working with Google on this was that they never shied away from that challenge. Every time we found something that could have derailed the plan, the Google team stepped up and delivered.
It was great to see a team of engineers from two separate companies working for a common shared goal.
Making it real, and as quickly as possible
Conventional wisdom states that you should start these types of migrations with something small and non-critical because this is where you learn. This is true to an extent, but you have to be careful. Our advice is to move through this phase rapidly and consider migrating parts of the environment that matter as this is the only way you can truly shake out all of the issues you will encounter.
There are some fundamental issues with starting small and outside the critical path:
- Something small – You will never hit the issues and edge cases associated with size or scale. You can always try to benchmark or load test, but it is usually impossible to recreate the exact conditions you will see in production, and these tests often lull you into a false sense of security.
- Something non-critical (less important) – By definition, this is something you care about less, so you are more willing to accept odd issues and failures. These often don’t get fixed and can turn around and hurt you when you are trying to run production services.
Starting small is a great way to gain familiarity with a new platform and perform basic testing. It is a step you need to complete, but you should not assume that it confirms your production service will run.
During the migration we tried, wherever possible, to begin test migrations of production services as soon as we reasonably could.
In order to do this, it helps if you have an application/system where you can separately migrate individual parts or components. Newer, more modern microservice-based architectures lend themselves well to this paradigm, but it can be extremely difficult for older, monolithic application stacks.
In short, a high-traffic component outside of the user’s critical path which fails gracefully would be an ideal place to start.
Throughout this migration project we used automation to accelerate wherever possible. In some places we leveraged existing tooling; in others, we created custom tools specifically for this migration:
- Creation of instances within GCP – We created tools around Google’s Cloud Deployment Manager to build reusable templates for provisioning the bare instances, and leveraged our existing Puppet infrastructure for configuration management.
- Migration of 12 billion files – We created both the migration script and the overall control framework that allowed us to copy files at the highest rate possible, 24/7. We had approximately 5 billion user files to copy, and some of them also had associated metadata files, such as image thumbnails. In total, we copied 12 billion files.
- Shard migration and cutover – We created new Ansible-based tooling to do the initial seeding of data and then further scripting to fully automate the steps to migrate a shard into the cloud.
There was no way we could have completed this migration in such a short space of time without focusing efforts on automation.
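As a rough illustration of the kind of control framework described above, the core of a parallel file-copy tool is a worker pool that retries transient failures and reports what could not be copied. This is a hypothetical sketch, not Evernote’s actual migration tooling; all names and defaults are illustrative.

```python
# Hypothetical sketch of a parallel file-copy control loop. Evernote's real
# framework is not public; worker counts and retry limits here are invented.
import concurrent.futures
import os
import shutil

def copy_one(src, dst, retries=3):
    """Copy a single file, retrying transient failures a few times."""
    for attempt in range(retries):
        try:
            parent = os.path.dirname(dst)
            if parent:
                os.makedirs(parent, exist_ok=True)
            shutil.copy2(src, dst)
            return True
        except OSError:
            if attempt == retries - 1:
                return False
    return False

def migrate(file_pairs, workers=32):
    """Copy (src, dst) pairs concurrently; return the list of failed sources."""
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(copy_one, s, d): s for s, d in file_pairs}
        for fut in concurrent.futures.as_completed(futures):
            if not fut.result():
                failed.append(futures[fut])
    return failed
```

A real framework would add checkpointing and rate reporting on top of a loop like this, so a multi-week copy can resume after interruptions.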
During our early planning we recognized that our network bandwidth between our current datacenters and GCP was going to constrain us in multiple ways. It was going to limit our ability to copy a large volume of data and constrain our ability to split the service between our existing datacenter and Google cloud.
We identified this as a constraint that could negatively impact multiple streams of work. Knowing this, we put extra time and effort early in the project to maximize the available bandwidth between sites. We actually went as far as placing interconnect orders with multiple cloud providers before we had made our final selection to save precious days.
What did we learn?
Overall, we could not be happier about how the migration went. Having said that, we had a number of learnings:
Your application will successfully run in the cloud but…
The cloud has reached a level of feature maturity. In most cases, your application will happily run. That said, you do need to ‘tune’ your application to run in its new home, where it will encounter a different set of constraints. Plan for this in advance, and allow a period of ‘tuning’ right after you migrate.
For example, we had to tune our environment to better cope with disk throttling. When running in our own datacenter, we could run backups as fast as possible: we were not running in a shared environment or from shared storage devices, so over time we had configured our backups to consume all available disk IO. In GCP, however, an automatically started backup would cause the virtual machine to be IO throttled, which in turn caused upstream application slowness. We fixed this by reining in the backups so they no longer run at full rate.
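Reining in a backup amounts to capping its read throughput so it stays under the platform’s IO limits. The sketch below shows one simple way to do this, by sleeping whenever the copy gets ahead of a byte-per-second budget; the function names and rate are hypothetical, not our actual tooling.

```python
# Hypothetical sketch: rate-limit a backup-style copy so it cannot consume
# all available disk IO. 'max_bytes_per_sec' is an illustrative knob.
import time

def throttled_copy(read_chunk, write_chunk, max_bytes_per_sec, chunk_size=1 << 20):
    """Copy data chunk by chunk, sleeping as needed to honor the byte budget."""
    start = time.monotonic()
    copied = 0
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        write_chunk(chunk)
        copied += len(chunk)
        # If we are ahead of the allowed rate, sleep until we are back on budget.
        expected = copied / max_bytes_per_sec
        elapsed = time.monotonic() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
    return copied
```

The same idea applies whether the throttle lives in a custom script or in the backup tool’s own bandwidth-limit option, when one exists.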
We have also had to tune our environment to better cope with Live Migration. This is a great Google backend feature that lets them migrate VMs between hosts to allow for maintenance and, in most cases, is invisible to the customer. We have found that UserStore and some of the earlier, more active Evernote shards are unhappy while being live migrated, as they are high-IO machines. Even a very small blackout or brownout window is enough to cause disruption. We have tuned MySQL and our application to better manage this. It’s not yet perfect, but it’s something we will continue to refine with Google’s help.
Customer communication is key
We needed a better way to send upcoming maintenance communications to our customers. We did not want to blast millions of users with emails they would probably not read. So, we decided to use our Evernote Community forums and social media to communicate the upcoming maintenance windows to our users. While this worked well, we did have a small number of users who were not aware of the maintenance and were impacted. This is a gap we would like to close, and we are considering in-app communication to better accomplish this in the future.
While not unexpected, we have been chasing down a variety of small bugs and hiccups associated with running in a new environment. While some are more painful than others, we have ultimately been able to work around them by tuning existing configurations or working closely with Google to close the gaps.
So what now?
Now that we have completed our migration, we can look to the future. One of the reasons we moved to the cloud was to continue to remove friction from our Engineering processes and give engineers access to a much wider toolset to build with. We know that over time, we will migrate away from Virtual Machine-based building blocks and embrace containers and serverless technology in the drive to focus on what’s important.
We are already starting to see the benefits.
Evernote Hack Week
Every year, Evernote runs a Hack Week (usually in January). This is an opportunity for everyone in the company to focus their efforts on something that they believe is important but may not be in our formal product backlog. It’s been amazing to see what people can create when creativity is unleashed and teams and ideas are allowed to self-form with minimal oversight.
In the past, the Operations team has spent time supporting the event, building servers and so forth. This year was very different – we simply gave the engineers access to a separate GCP project and left them to it. They were able to easily provision services to support their projects with minimal added workload for Operations staff, who were all but sequestered to complete the migration project.
Last year, we also made the decision to move away from our existing, monolithic application architecture to a microservice-based one. Adopting containers is a natural step towards that goal. We did some extensive research and eventually landed on Kubernetes as our orchestration framework. Running a production grade Kubernetes cluster can be a daunting task, but fortunately Google Container Engine (GKE) simplifies the whole process.
We currently have a GKE cluster running within our production project, supporting our next-generation search infrastructure. Watch for a follow-up post on that specific subject.
Despite improvements to our current architecture, there is still a lot of overhead associated with Operating System instances (1000+), whose only purpose is to run the Evernote code. In the ideal situation, we would just deploy code and Google would be responsible for running that code at scale.
In that scenario, we would only use compute power to service the request from the user and no more. While there is a long way to go in this space, we are investigating Google Cloud Functions as our core building block. This would allow us to focus on writing code/features to solve user issues.
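To make the “just deploy code” idea concrete, a function-as-a-service handler is typically a small, stateless entry point that receives a request and returns a response, with the platform handling scaling. The sketch below keeps the core logic as a plain function so it is easy to test; the note-summarizing behavior and all names are purely illustrative, not a real Evernote service.

```python
# Hypothetical sketch of the small, single-purpose handler style that
# function-as-a-service platforms like Google Cloud Functions encourage.
# In a real deployment the entry point would receive the platform's request
# object; here it is a thin shim over testable core logic.
import json

def summarize_note(note):
    """Return a lightweight summary of a note payload (illustrative logic)."""
    body = note.get("body", "")
    return {
        "title": note.get("title", "(untitled)"),
        "words": len(body.split()),
        "preview": body[:40],
    }

def handle_request(raw_body):
    """Entry-point shim: parse a JSON body and return a JSON response string."""
    note = json.loads(raw_body)
    return json.dumps(summarize_note(note))
```

Keeping the logic separate from the entry point is what lets the same code run behind a VM today and a managed function tomorrow.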
Last but not least – How do you name a cloud project?
Every great project needs a name. After much deliberation and consideration we came down to a decision between two. Our Software Engineering team had proposed ‘Cloudy Meatballs’ named after the book, “Cloudy with a Chance of Meatballs” by Judi Barrett. Our Operations team liked the codename “Bob Ross,” after the famed artist and TV personality who would paint nice fluffy clouds! The only way to solve such a deadlock was to take this decision to a company vote. Project “Bob Ross” won and we were off to a flying start.
We were also very lucky that our People Ops team found the following socks on Amazon!
We wear them proudly.
If you have any followup questions please join us on the Evernote forums.