This is Part 3 of a 5-part series on Evernote’s transition to Google Cloud Platform (GCP). To read this series from the beginning, see Part 1.
Our System Architecture
Our next big decision was to agree on the overall system architecture we would be building towards. We agreed that the following were important requirements/considerations:
- The Evernote application was architected for all services to run out of a single datacenter site serving clients globally (this excludes our separate China service).
- There was a requirement to always keep two copies of customer data in separate geographic regions.
- There was a requirement to minimize network latency between our current datacenter location in Northern California and the new production location in GCP. This would give us maximum flexibility during the migration and hopefully support ‘split site’ running for a short period of time.
Having taken all of these considerations into account, we settled on the following:
- Primary location (serving production traffic): US-West1
- Secondary location (disaster recovery site): US-Central1
We would also split all production services between two zones in US-West1 to further protect against failures.
As Google explains, ‘Google designs zones to be independent from each other: a zone usually has power, cooling, networking, and control planes that are isolated from other zones, and most single failure events will affect only a single zone.’
This would mean we would serve traffic out of US-West1 and keep a second copy of all customer data in US-Central1. In the event of a disaster scenario hitting US-West1, we would be able to restore service using the data we have in US-Central1.
Some of our future considerations revolve around how we could re-architect our application to be more resilient and how we could serve traffic out of multiple regions at the same time to further reduce the time needed to recover from a disaster scenario. We are also considering how we can better use the global footprint of GCP to improve user latency when accessing the Evernote service.
At this point, we had defined the requirements and made a number of strategic decisions. We now needed to get down into the detailed engineering.
Technical transformation
Maximizing our datacenter-to-GCP network connectivity
We identified very early on in the project that network connectivity between our existing datacenter and GCP was going to be both the key to our success and a constraint. To give ourselves maximum flexibility to migrate data and services, the network interconnect plan needed to achieve the following:
- Encrypt traffic flows between our current datacenter and GCP.
- Allow either location to act as the ‘front door’ for customer traffic, or split traffic between two ‘front doors’ as necessary.
- Allow the service to be split across locations, with some parts running in our physical datacenter and others in GCP.
- Maximize bandwidth between sites to enable the large data copy.
We needed maximum flexibility to ensure we could run the ‘front door’ traffic 100% through our existing datacenter and physical load balancing platform while all backend Evernote services were running from GCP (and vice versa), and also while copying 3 PB of data to GCP.
Our existing external network connectivity was designed to handle our peak loads with plenty of headroom (an order of magnitude above peak). However, it was not designed with enough capacity to evacuate all of our data in a timely manner.
Additionally, our internal network was not structured to funnel such a large number of requests out to an external service (such as Google Cloud Storage). As things stood, uploading all of our data would have taken over a year and would have degraded our users’ experience the entire time. We had a lot of work to do.
First, we had to establish Private Network Interconnects (PNIs), or direct links, between Evernote’s network and GCP. This effectively doubled the amount of bandwidth we could use and gave us dedicated links separate from our user traffic, in effect building a private highway between Evernote and Google.
Second, we had to determine where within our datacenter the data would originate. We store multiple copies of everything, so we needed to figure out which copy we’d use. We had to carve a path through our network to move data from hundreds of servers onto the private highway to GCP as efficiently as possible without impacting normal operations. This involved carefully coordinating and sequencing the server data copy jobs and routing the requests through a series of load balancers and proxy servers to ensure the data took the correct path out of the network.
During the first month of the project, our network engineering team was running hard to stay ahead of the data copy (and other tasks) that were just about to start. If they had not delivered on time, the whole project would have been at risk.
Could we run split-site?
Up to this point, our application had only run within a single datacenter. In this environment, latency (round-trip delay) between nodes (servers) was always sub-millisecond. If we were going to successfully split the running of our application between our existing datacenter and GCP, we needed to know how the application would perform with 20–50 milliseconds of latency between nodes. This increase in latency is due to the speed of light and the hundreds of miles the data packets have to traverse between our datacenter and GCP.
Obviously, this was not something we wanted to run into issues with in the middle of our migration. To reduce potential customer pain, we wanted to test it out first. During the planning phase of the project, we decided to use a server-side tool (tc) to introduce artificial network latency and simulate the expected geographic, speed-of-light latency. We progressively ramped our NoteStore fleet up to 50ms of simulated delay and left it in place for four days. During this time, we monitored our application KPIs and compared them with our baseline data. Most of our API calls were slightly slower, but they stayed within acceptable bounds and did not affect the user experience.
This was a huge milestone for the project: we had validated that our application could successfully run in a split site configuration. This meant that our ‘Accelerated Phased Cutover’ idea might just work.
Load balancing (physical to HAProxy)
In our datacenter, we ran and managed a high-availability cluster of traditional load balancing devices. As we were moving into a cloud environment, having physical load balancers was not an option, so we started investigating virtual load balancer solutions.
In an ideal world, we would have deployed a single load balancing (LB) layer based solely on GCP services. This was not an option for us due to our reliance on inspection of cookies, headers, and URL patterns to route requests to the proper shards. This parsing is not something we can currently do within the Google Load Balancing platform alone. We evaluated and benchmarked the options and settled upon a solution built with Google’s Network Load Balancer product and a Linux-based HAProxy farm.
The Google Network Load Balancer would be our entry point for customer traffic and would balance the traffic to our HAProxy farm that would perform the specific routing of customer traffic to their specific shard.
Having completed all of the usual lab testing and validations, we wanted to test our new solution using real-world traffic without having to ‘swing’ all that traffic through the new front door. We needed a way to do partial/phased testing, only increasing the test user set once the previous test was successful. At this point, the backend of the Evernote service was still running within our old, physical datacenter and the split-site LB traffic would be routed there via our private VPN links. We decided to do the following:
- Direct Evernote employees via the new ‘front door.’ We updated our corporate DNS to point the Evernote employees to the new front door.
- Utilize Dyn Traffic Manager to incrementally ramp up user traffic to the new front door.
Using these two methods, we were able to test our new load balancing platform before any of the other services had been confirmed as successfully running in GCP. As with our split site testing, it was a relief to be able to complete testing of components individually and start to gain confidence that it would work.
Reco service (UDP -> PubSub)
When you add an attachment or resource to Evernote, the service attempts to extract textual information from it if it’s an image or PDF. The service that performs that action is called Reco (short for the word ‘recognition’).
Due to various architectural limitations in the past, the Reco servers used a polling pattern to get a list of new resources to process. However, as one would imagine, having multiple Reco servers periodically polling every NoteStore server could lead to a significant overhead on the NoteStores and the resource datastores supporting them. This was compounded as more Reco servers would be added to support the ever-increasing rate of new attachments.
To reduce the overhead and the subsequent latency that occurred as more Reco servers were added, the Reco servers were redesigned to use Multicast to learn when new resources on a NoteStore were available.
However, Multicast is not supported in GCP Compute Engine networks. As a result, we redesigned the application to have a different communication architecture.
We re-architected the application and removed the need to track jobs and broadcast the statuses of NoteStores with attachments to recognize. Instead, we decided to use a very scalable and reliable queuing mechanism, PubSub. The NoteStores now notify Reco servers of work to do by producing a job in a PubSub queue. Each Reco server consumes new additions by simply subscribing to a particular PubSub queue and acknowledging when it has completed the recognition job on the resource. We also support priorities by creating multiple queues and having Reco servers prioritize processing of resources based on the priority of the channel.
In this way, we significantly simplified the architecture by using a cloud-based queuing mechanism and redesigning the application to rely on both the availability of jobs in the queue and the speed of notification. Furthermore, in the interest of only managing and maintaining resources that are in use, we are in the process of migrating the Reco service to support autoscaling.
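To make that concrete, here is a minimal sketch of the publish/consume pattern using the Cloud Pub/Sub Java client. The topic and subscription names, the priority attribute, and the runRecognition placeholder are illustrative assumptions rather than our actual identifiers or code:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class RecoQueueSketch {

  // NoteStore side: publish a recognition job for a newly added resource.
  static void publishRecognitionJob(String projectId, String resourceId) throws Exception {
    Publisher publisher = Publisher.newBuilder(TopicName.of(projectId, "reco-jobs")).build();
    try {
      PubsubMessage job = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(resourceId))
          .putAttributes("priority", "normal") // separate queues/attributes can carry priority
          .build();
      publisher.publish(job).get(); // block until Pub/Sub acknowledges the publish
    } finally {
      publisher.shutdown();
    }
  }

  // Reco side: subscribe and ack each job only after recognition has finished.
  static Subscriber startRecoWorker(String projectId) {
    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      String resourceId = message.getData().toStringUtf8();
      runRecognition(resourceId); // placeholder for the actual text-extraction work
      consumer.ack();             // unacked jobs are redelivered, so crashes aren't lost
    };
    Subscriber subscriber = Subscriber.newBuilder(
        ProjectSubscriptionName.of(projectId, "reco-jobs-sub"), receiver).build();
    subscriber.startAsync().awaitRunning();
    return subscriber;
  }

  private static void runRecognition(String resourceId) {
    // Text extraction from the image/PDF would happen here.
  }
}
```

Because each job is acknowledged only after recognition completes, a Reco server crash simply results in Pub/Sub redelivering the message to another subscriber.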
User attachment storage (from multiple WebDavs to Google Cloud Storage)
We had 12 billion user attachments and metadata files to copy from the original WebDavs to their new home in Google Cloud Storage.
Considering the volume of data we needed to copy, this was on the project’s critical path, so we wasted no time in starting this massive copy effort. The service continued reading from and writing to the existing WebDav farm while we copied the resources to their new home in the background.
One of the first obstacles we had to work around was the fact that our current datacenter network was not designed to copy hundreds of terabytes per day across thousands of nodes. As a result, engineering time was required to establish multiple secure egress paths to GCP’s network (detailed above). Since we were going to be copying from hundreds of servers at a time, we also had to be careful not to DDoS ourselves and to ensure we protected the service for our users.
The resource-migrator
We developed a Java application that would run directly on each file server (WebDav). The WebDavs were split into directory trees based on their physical RAID arrays. The resource-migrator would walk the directory trees and upload each resource file to Google Cloud Storage (GCS).
To ensure the successful upload of a given resource, we passed a locally computed hash along with the contents of the file to the GCS API. GCS has a feature that independently computes its own hash and compares it to the provided hash. In the event of a mismatch, the GCS API returns an HTTP 400 BAD REQUEST error code and the resource-migrator would retry. If repeated errors occurred, the failure would be logged as something to be fixed later and the resource-migrator would move on.
Through performance testing, we found that we were bound both by IOPS (input/output operations per second) on the RAID arrays and by the overall CPU of the WebDav host. To avoid impacting the customer experience, we settled on two concurrent instances of the resource-migrator per WebDav (one instance for each RAID array), each running 40 threads in parallel. This allowed us to comfortably upload 80 resource files concurrently without negatively impacting production user traffic.
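As a rough sketch, the per-file upload loop (assuming the google-cloud-storage Java client) looks something like the following; the bucket wiring, object naming, and three-attempt retry limit are illustrative, while the 40-thread pool and the server-side MD5 check mirror the approach described above:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class ResourceMigratorSketch {

  private static final Storage storage = StorageOptions.getDefaultInstance().getService();
  private static final int MAX_ATTEMPTS = 3;

  // Walk one RAID array's directory tree and upload every file with 40 parallel workers.
  static void migrateTree(Path root, String bucket) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(40);
    try (Stream<Path> files = Files.walk(root)) {
      files.filter(Files::isRegularFile)
           .forEach(file -> pool.submit(() -> uploadWithHashCheck(file, bucket)));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS); // wait for the in-flight uploads to drain
  }

  // Upload a single resource, letting GCS verify our locally computed MD5 server-side.
  static void uploadWithHashCheck(Path file, String bucket) {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        byte[] contents = Files.readAllBytes(file);
        String localMd5 = Base64.getEncoder()
            .encodeToString(MessageDigest.getInstance("MD5").digest(contents));
        BlobInfo blob = BlobInfo.newBuilder(BlobId.of(bucket, file.toString()))
            .setMd5(localMd5)
            .build();
        // md5Match() makes GCS reject the write (HTTP 400) if its own hash differs from ours.
        storage.create(blob, contents, Storage.BlobTargetOption.md5Match());
        return; // success
      } catch (StorageException | IOException | NoSuchAlgorithmException e) {
        if (attempt == MAX_ATTEMPTS) {
          System.err.println("giving up on " + file + ": " + e.getMessage()); // fix later
        }
      }
    }
  }
}
```

Running two such instances per WebDav, one per RAID array, gives the 80 concurrent uploads mentioned above.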
Now that we had our worker code in the resource-migrator, we needed a control layer to manage it at scale: enter the migration-orchestrator.
The migration-orchestrator
The resource-migrator itself is a small application that must be started and stopped for each directory tree in our WebDav clusters. Given the hundreds of trees to migrate, we needed a control layer to orchestrate the migration across the fleet.
Using shell scripts, we were able to integrate with our existing inventory and fleet management tools to track, start, stop, and resume instances of the resource-migrator across our fleet of WebDavs.
Given the constraints of no more than two instances per WebDav and no more than 20 instances per physical server cabinet (due to network constraints), the migration-orchestrator had to be datacenter aware and able to intelligently start/stop/resume n instances of the resource-migrator across a given set of cabinets, with minimal human hand-holding or effort.
At a high level, the migration-orchestrator needs to:
- Provide a central console to manage all resource-migrator jobs (start/stop/restart)
- Maintain a list of all jobs and identify candidates for migration (filesystems that were actively taking writes were not eligible to be migrated at first)
- Be datacenter- and host-aware in order to avoid overloading resources and impacting production traffic
- Sustain consistent 24/7 throughput across concurrently running jobs
At full speed, we were able to run between 100 and 120 instances of the resource-migrator in parallel, all controlled by the migration-orchestrator.
Updating our application to talk to GCS
We then needed to consider how to update our application code to use GCS, rather than WebDav, to read and write resources. We decided to add multiple switches that allowed us to turn specific GCS read/write features on and off. These switches could also be enabled on a subset of shards, allowing us to roll out updated code in a safe and controlled manner.
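Conceptually, such a switch layer can be as simple as a per-shard lookup that each GCS code path consults before taking the new path. The class, feature names, and configuration wiring below are hypothetical, not our actual implementation:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical feature-switch gate; names and wiring are illustrative only.
public final class StorageSwitches {

  public enum Feature { GCS_ASYNC_WRITE, GCS_OPPORTUNISTIC_READ, GCS_PRIMARY_WRITE }

  // For each switch, the set of shard IDs it is currently enabled on.
  private final Map<Feature, Set<String>> enabledShards;

  public StorageSwitches(Map<Feature, Set<String>> enabledShards) {
    this.enabledShards = enabledShards;
  }

  // Each GCS read/write code path checks its switch for the current shard before proceeding.
  public boolean isEnabled(Feature feature, String shardId) {
    return enabledShards.getOrDefault(feature, Collections.emptySet()).contains(shardId);
  }
}
```

A caller would then guard each new code path, for example only attempting a GCS read when isEnabled(Feature.GCS_OPPORTUNISTIC_READ, shardId) returns true, which is what allowed features to be enabled shard by shard.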
Cutting the service over
Moving the service to a new storage backend was a sensitive operation. We did extensive testing in our development and staging environments, but there’s only so much that can be validated before throwing real production load at a system.
To do this cutover cleanly, safely, and with minimal user impact, we broke the transition down into several isolated steps. To understand these, let’s first walk through the original flow for uploading a resource (sketched in code after the list):
- A user sends a request to save a resource to the Evernote service.
- The service receives the resource and kicks off two processes:
  - Ensures the resource is written to two primary storage WebDavs
  - Adds the resource to an asynchronous job queue
- Only after these two operations have been completed does the service respond to the user that the resource has been successfully saved.
- The asynchronous job is then processed in the background and the resource is copied to a third, remote WebDav in our offsite backup facility.
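For reference, here is a minimal, hypothetical sketch of that original flow; the WebDavStore interface and the in-memory queue below stand in for our real storage layer and job system:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class OriginalUploadFlow {

  // Placeholder standing in for Evernote's real WebDav write path.
  interface WebDavStore { void write(String resourceId, byte[] contents); }

  private final WebDavStore primaryWebDavA;
  private final WebDavStore primaryWebDavB;
  private final BlockingQueue<String> offsiteBackupQueue = new LinkedBlockingQueue<>();

  public OriginalUploadFlow(WebDavStore primaryWebDavA, WebDavStore primaryWebDavB) {
    this.primaryWebDavA = primaryWebDavA;
    this.primaryWebDavB = primaryWebDavB;
  }

  // Returns only after both primary WebDav writes succeed and the offsite backup job is
  // queued; the caller then tells the user the resource was saved. A background worker
  // later drains the queue and copies each resource to the remote backup WebDav.
  public void saveResource(String resourceId, byte[] contents) throws InterruptedException {
    primaryWebDavA.write(resourceId, contents);
    primaryWebDavB.write(resourceId, contents);
    offsiteBackupQueue.put(resourceId);
  }
}
```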
Phase 1
Asynchronous GCS Writes
The first step we took was to add a job to the asynchronous backup queue to write resources to GCS. This benefited us in several ways:
- New resources would automatically be uploaded to GCS, saving us time with the resource-migrator.
- Additionally, this enabled us to start pushing production levels of Write traffic to the GCS API. This was critical in testing our own code paths to ensure they worked as desired at large volumes. It also gave us insight into the failure cases and rates of GCS.
- Most importantly, all of this occurs completely outside of the user’s critical path. Any failures here would be 100% transparent to the user. This gave us the flexibility and freedom to iterate and improve our code without negatively impacting the user experience.
Opportunistic GCS Reads
At this point, we were confident in the Write performance and success rate of our code and GCS. The next step was to reach the same level of confidence with Read performance and success.
To accomplish this, we opportunistically read resources from GCS. If a given resource had not been migrated yet, or if there was a failure of any kind, no problem — the service would immediately fall back to reading from the primary WebDav locations.
Since this is in the user’s critical path for downloading resources, there was a chance of slightly increased latency when a GCS read failed and the fallback occurred. Generally speaking, any such delays were measured in milliseconds and would not have been noticeable to users.
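The read path itself is a simple try-then-fall-back, roughly like this sketch (assuming the google-cloud-storage Java client; the bucket layout and WebDavStore interface are placeholders):

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

public class OpportunisticResourceReader {

  // Placeholder for the existing WebDav read path.
  interface WebDavStore { byte[] read(String resourceId); }

  private final Storage storage = StorageOptions.getDefaultInstance().getService();
  private final WebDavStore primaryWebDav;
  private final String bucket;

  public OpportunisticResourceReader(WebDavStore primaryWebDav, String bucket) {
    this.primaryWebDav = primaryWebDav;
    this.bucket = bucket;
  }

  // Try GCS first; if the object is missing or the read fails for any reason,
  // fall back to the primary WebDav copy so the user never sees the error.
  public byte[] readResource(String resourceId) {
    try {
      return storage.readAllBytes(BlobId.of(bucket, resourceId));
    } catch (StorageException e) {
      return primaryWebDav.read(resourceId);
    }
  }
}
```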
Initial learnings
In these first two rounds, we learned a lot about how our new storage backend and code performed. We quickly uncovered some rare failure cases (roughly 1 in 1,000,000) and determined that the built-in retry logic within the GCS Java SDK was not sufficient for our needs and the level of service we want to provide our users. We were able to quickly iterate on our code to handle these failure cases and provide a more robust backend service.
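By way of illustration, an extra retry layer on top of the SDK might look something like the sketch below; the attempt count, backoff values, and status-code check are illustrative rather than our actual tuning:

```java
import com.google.cloud.storage.StorageException;
import java.util.concurrent.Callable;

public final class GcsRetry {

  // Wrap a GCS call with bounded, exponential-backoff retries on top of the SDK's own retry logic.
  public static <T> T withRetries(Callable<T> gcsCall) throws Exception {
    final int maxAttempts = 5;  // illustrative values, not production tuning
    long backoffMillis = 100;
    StorageException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return gcsCall.call();
      } catch (StorageException e) {
        last = e;
        if (!e.isRetryable() && e.getCode() < 500) {
          throw e;              // don't keep retrying clearly non-transient client errors
        }
        Thread.sleep(backoffMillis);
        backoffMillis *= 2;     // back off a little longer before each new attempt
      }
    }
    throw last;
  }
}
```

A caller wraps each storage operation, e.g. GcsRetry.withRetries(() -> storage.readAllBytes(blobId)).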
Phase 2
After learning and iterating through several cycles, we were ready to fully commit. The next phase was to sever our reliance on our old WebDavs. We were still in the process of migrating resources, so we had to leave the opportunistic Reads in place, but it was time to stop depending on the WebDavs for new resources.
GCS Writes in the critical path
First, we had to modify the resource upload flow (sketched after the list below). Where a resource was previously uploaded to two primary WebDav locations before the write was confirmed, we now uploaded directly to GCS and reduced the WebDav location count to one. At this point:
- If GCS had a critical failure and all retries failed, this would ultimately surface a ‘failed upload’ error to the user (our clients auto-retry, so don’t worry!). This reflects the end state we will be in once everything is said and done.
- While we reduced the number of WebDav Writes, this does not decrease the durability of data. GCS provides greater data redundancy than our previous system of WebDavs, so this is a big win!
- In the immediate term, reducing the WebDav count helps keep request latency down (two Writes vs. three), while still giving us the peace of mind of maintaining a primary WebDav copy during the transition.
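Schematically, the new upload path looks something like this hypothetical sketch (the WebDavStore interface and bucket wiring are placeholders, and retries/error handling are omitted for brevity):

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class CriticalPathUpload {

  // Placeholder for the single remaining primary WebDav write path.
  interface WebDavStore { void write(String resourceId, byte[] contents); }

  private final Storage storage = StorageOptions.getDefaultInstance().getService();
  private final WebDavStore remainingPrimaryWebDav;
  private final String bucket;

  public CriticalPathUpload(WebDavStore remainingPrimaryWebDav, String bucket) {
    this.remainingPrimaryWebDav = remainingPrimaryWebDav;
    this.bucket = bucket;
  }

  // GCS is now in the critical path: if this write (and its retries) fails, the whole
  // upload fails and the client retries. One WebDav copy is kept only during the transition.
  public void saveResource(String resourceId, byte[] contents) {
    storage.create(BlobInfo.newBuilder(BlobId.of(bucket, resourceId)).build(), contents);
    remainingPrimaryWebDav.write(resourceId, contents);
  }
}
```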
Disabling WebDav Writes altogether
After letting the system run in this state for some time and closely monitoring performance and health, we were ready to stop WebDav Writes altogether:
- Disable the asynchronous backup job to our offsite location (Remember, GCS not only maintains multiple copies of a given resource, but also stores them in geographically diverse areas, providing us with greater resiliency in the event of catastrophic failure or natural disasters).
- Disable Primary WebDav Writes.
At this point, the upload flow was simply:
- A user sends a request to save a resource to the Evernote service.
- The service receives the resource and ensures a successful Write to GCS.
- The service responds to the user that the resource has been successfully saved.
In the background, GCS is fully handling storing of multiple copies of the file in geographically diverse regions.
Final verification
At the time of writing, this resource migration is at 99.2% complete; it’s the long pole in the metaphorical migration tent. Let’s just say the ‘refresh’ button has been getting a workout!
The opportunistic GCS Reads with WebDav fallback will remain in place not only until the migration is complete, but until we run through one final sanity check to ensure that everything was moved successfully.
To perform this check, we will iterate through our databases of resources and query GCS with a “do you have this?” request, verifying one last time that each resource exists and that its stored file hash matches. Compared to the resource migration, this should be relatively quick: on the order of weeks instead of months as we check each of the 12 billion files to ensure your data is safe and secure.
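Conceptually, each check is just a metadata lookup plus a hash comparison, along these lines (assuming we recorded a base64-encoded MD5 per resource; the class, method, and parameter names are illustrative):

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class FinalVerification {

  private final Storage storage = StorageOptions.getDefaultInstance().getService();

  // "Do you have this?": fetch the object's metadata (no data download needed) and
  // compare GCS's stored MD5 against the hash we recorded for the resource.
  boolean verify(String bucket, String objectName, String expectedMd5Base64) {
    Blob blob = storage.get(BlobId.of(bucket, objectName)); // returns null if the object is missing
    return blob != null && expectedMd5Base64.equals(blob.getMd5());
  }
}
```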
Testing a production shard running in the cloud
Again, there were multiple steps that led up to this. Earlier in the project we had created multiple iterations of the full Evernote technology stack running in GCP to test functionality. What we needed to do before considering the full migration was prove that a shard could run successfully in the cloud while serving full user load. We wanted to shake out any edge cases that the previous testing might have missed.
Just before Thanksgiving weekend, we successfully migrated two active user shards into GCP and had them running there for a couple of hours before moving them back to the datacenter. We had passed our final, blocking, pre-migration check and had a green light for the migration.
In Part 4, we discuss Planning and Executing the Migration.
If you have any follow-up questions, please join us on the Evernote Forums.