Update: October 4 outage, a big migration, and a better infrastructure

Yesterday, Evernote experienced an outage between 4:00 am and 6:30 am EDT. By 9:30 am, all logged-in users could use the app as normal. To deter malicious behavior, we temporarily capped the number of login attempts that could be made from a single account. At 11:30 am today, October 5, we removed that cap.

As a result of these actions, some Evernote customers may have experienced the following issues: the inability to log into Evernote, the inability to sync changes, and an overall unresponsive and slow product experience.

Why did this happen?

A particularly important (and tricky) migration

The outage was triggered when we migrated a large database responsible for managing logins, new registrations, and information related to the user profile away from Evernote’s legacy backend infrastructure—the Monolith.

Over the last few months, we have migrated many pieces of the backend infrastructure to a more modern and reliable system. Yesterday, we performed the final Monolith data migration. With this last piece complete, we can now move forward with transitioning the whole system to a new infrastructure and finally bring Evernote’s backend foundations to a much healthier state.

Moving away from the Monolith in favor of a more decentralized infrastructure has been a goal for the Evernote team for many years, and a few months ago we finally took on this daunting, yet essential, task.

If the name Monolith sounds familiar to you, it’s because I mentioned it a while back when discussing how we’re future-proofing Evernote’s foundations. Essentially, the Monolith is a huge, single Java Enterprise Edition application that powers the storage and synchronization of multiple entities within Evernote. The Monolith is extremely heavy and difficult to manage, and it has accumulated significant technical debt over the years.

This last migration was by far the biggest and most technically challenging Monolith migration to date, and also one of the most significant. With it behind us, we’ve finally unlocked enormous potential to make quicker and more impactful improvements going forward.

A deep dive into the outage

To minimize service disruption, we scheduled the database migration for 3:30 am EDT on October 4. However, as we finished the migration and reconnected the clients, the Monolith wasn’t able to process the higher-than-usual number of inbound requests. Unfortunately, we were forced to restart the entire system and gradually reintroduce traffic.

Here’s the timeline of what happened (all times refer to Eastern Daylight Time):

October 4

03:30 am

  • The team began the planned migration

04:00 am

  • With the migration complete, our servers were overwhelmed by the number of inbound requests as the clients attempted to reconnect
  • The Monolith was unable to scale to meet the higher demand
  • We immediately decided to temporarily block client requests so the servers could restart, which led to the outage
  • During this time, customers may have experienced the following issues:
    - A slow and unresponsive in-app experience
    - The inability to sync notes, seeing either an ineffective sync status or an error message stating that sync is temporarily paused
    - (If not logged in) The inability to log into Evernote, with either a 429 error (Too Many Requests) or a message stating that Evernote was temporarily unavailable
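
To illustrate why many clients reconnecting at the same moment can overwhelm a server, and one common way clients spread their retries out, here is a minimal sketch of exponential backoff with jitter. This is not Evernote's actual client code: the function name `backoff_delay` and the `base`, `cap`, and `rng` parameters are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based).

    Illustrative only: the ceiling doubles with each failed attempt, up to
    `cap`, and the actual wait is a random fraction of that ceiling so that
    many clients do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Under these assumptions, a client that receives a 429 response would wait `backoff_delay(n)` seconds before its n-th retry rather than retrying immediately.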

06:30 am

  • Once the servers were restarted, we began allowing the clients to connect again, prioritizing note-saving functionality
  • At this point, some users started seeing their Evernote functionality return to normal
  • Over the next three hours, we methodically and gradually increased traffic until the majority of Evernote’s core functions were restored
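
The gradual reintroduction of traffic described above can be sketched as a time-based ramp. This is an assumption-laden illustration, not the mechanism we actually ran: the linear schedule, the `save_note` request kind, the rule that note-saving requests are always admitted, and all function names are hypothetical.

```python
import hashlib


def admitted_fraction(minutes_since_start: float,
                      ramp_minutes: float = 180.0) -> float:
    """Fraction of clients admitted, rising linearly from 0.0 to 1.0.

    The three-hour default mirrors the timeline above; the linear shape
    is an assumption.
    """
    if minutes_since_start <= 0:
        return 0.0
    return min(1.0, minutes_since_start / ramp_minutes)


def client_bucket(client_id: str) -> float:
    """Map a client ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32


def should_admit(client_id: str, request_kind: str,
                 minutes_since_start: float) -> bool:
    """Decide whether to serve a request during the ramp-up."""
    # Assumption: note-saving requests are prioritized by always being
    # admitted once the servers are back up.
    if request_kind == "save_note" and minutes_since_start > 0:
        return True
    return client_bucket(client_id) < admitted_fraction(minutes_since_start)
```

Because each client hashes to a fixed bucket, a client that is admitted at one point in the ramp stays admitted as the fraction grows, which avoids connections flapping on and off.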

09:30 am

  • All logged-in users could use the app as normal
  • In order to continue to protect our system, we temporarily limited the number of login attempts that could be made from a single account
    - This was done to prevent potential bad actors from attacking our system and compromising its stability or the security of our customers' data
    - Unfortunately, these rate limits applied to bad actors and legitimate users alike; as a result, some customers attempting to log into Evernote may have experienced difficulties
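
A per-account cap on login attempts, like the one described above, can be sketched as a sliding-window counter. This is a minimal illustration under stated assumptions, not our production limiter: the class name `LoginAttemptLimiter`, the in-memory storage, and the threshold values in the usage note are all hypothetical.

```python
from collections import defaultdict, deque


class LoginAttemptLimiter:
    """Allow at most `max_attempts` logins per account per time window."""

    def __init__(self, max_attempts: int, window_seconds: float) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        # Timestamps of recent allowed attempts, keyed by account ID.
        self._attempts = defaultdict(deque)

    def allow(self, account_id: str, now: float) -> bool:
        """Return True if the attempt is allowed, False if it is limited."""
        attempts = self._attempts[account_id]
        # Drop timestamps that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # a server would typically answer with a 429 here
        attempts.append(now)
        return True
```

For example, `LoginAttemptLimiter(max_attempts=3, window_seconds=60.0)` would reject a fourth attempt on the same account within one minute. The limiter cannot tell who is making the attempts, which is why a cap like this affects legitimate users as well as bad actors.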

October 5

11:30 am

  • We introduced abuser-specific rate limits and returned global rate limits to their normal threshold
  • We observed positive changes in Evernote’s performance compared to before the outage, meaning the migration had its intended effect
    - The overall latency is lower than before the database migration
    - RTE is registering a lower error rate than before the database migration

As of this time, the outage is over and the migration is considered complete. If you’re still experiencing login or sync issues, please contact our team to receive direct support.

A better process going forward

With the incident behind us, we’re performing a full retrospective of the events leading up to and during the outage, so we can improve our processes in the future. This reflection won’t just focus on technical aspects, but on our communications as well—we should have been much more effective and timely in communicating with you.

In particular, we’ll ensure that status.evernote.com is updated immediately in these circumstances. We did not update that page until 10:30 am yesterday (hours after the issues began), and many confused customers reached out to me after experiencing problems without any further information available on the website. For this, I apologize.

Outages are never pleasant, and we appreciate your patience as we’ve worked on finding a resolution. In this case, the outage was an unfortunate consequence of an otherwise exciting development for the future of Evernote. I hope to share more such (positive!) updates with you soon.

– Federico