Evernote Tech

Making Sense of Unstructured Data with Google Cloud Natural Language API

This article was written by Anirban Kundu, Anupom Syam, and Li Wang

Evernote started with the aspiration of building a second brain for our users. The first step on this journey was enabling them to “remember everything” by capturing and accessing their ideas, thoughts, and memories at any time, anywhere. We’re now embarking on the next step of that journey by using Machine Learning (ML) to not only help people archive their thoughts but also to process and take action on them—to think.

ML offers a way to automatically recognize a user’s intent, retrieve the data that matters in the moment, and surface it to that user in a useful context. Using this technology, we foresee a time when the Evernote app can make recommendations, improve organization, and ultimately optimize productivity.

Google Cloud Natural Language API

Since Evernote recently chose Google Cloud Platform as its cloud services provider, we’ve been exploring advanced functionalities such as Google Cloud Natural Language API. In our early testing, we’ve found that Cloud Natural Language API can significantly reduce complexity in our ML pipeline environment by providing syntactical meaning across various languages, and mapping context and meaning to entities when appropriate.

Doing this in our own environment would be a huge effort, not only in handling our data rate but also in the time spent verifying the accuracy of the syntax and entity extraction portions of a Natural Language Processing (NLP) system and constantly updating the language packs. Even with a limited training dataset of anonymized data, we have been able to use Cloud Natural Language API together with our document classifier and entity extractor to build and train some interesting use cases that have the potential to improve Evernote and enhance the productivity of our users in the near future.

To prove that we could help our users bring the ideas they store in Evernote to life, we’ve experimented with a pair of simple examples we know to be prevalent in Evernote: managing travel itineraries and identifying action items in meeting notes. Using anonymized data, read only by machines, we set out to see if we could both find structure in unstructured syntax and extract semantic intent.

Example 1: Extracting data from airline tickets

In this example, the key question was how to train an effective model with a small amount of data (e.g., the 20 flight tickets we had on hand) so that the model could recognize unseen flight tickets from different airlines and cope with novel wording and structures.

Flight Ticket in Evernote

First, we needed a way to classify whether or not the note contains flight information. Classification is important because running extraction over all notes would be costly. And while it might be possible for us to collect hundreds of flight ticket examples for training a classification model, this approach won’t scale well. This is because we plan to explore many other categories in the coming months, and it will be expensive to manually collect data in each category. The following flowchart shows the steps we took to tackle this project:

Info Extraction Flowchart

We started by making use of unlabeled data: 13 million anonymized note titles. We first sent these note titles to Cloud Natural Language API for entity analysis, which identifies entities such as airline names (in the example shown above, “United Airlines”). Then, we built a word2vec model from the parsed note titles and constructed a doc2vec model from the small collection of flight tickets we had. The doc2vec approach relies on word2vec’s ability to support linear operations over word vectors. The resulting doc2vec model represents the flight ticket category and can capture kinds of flight tickets that are absent from our limited examples, such as tickets from unseen airlines or tickets with novel wording and structure.
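One plausible preprocessing step between entity analysis and word2vec training is to collapse each multi-word entity into a single token, so that a phrase like “United Airlines” receives one vector. The sketch below is only an illustration under that assumption: the function names, the underscore-joining convention, and the idea of passing entity names as plain strings (rather than full API entity objects) are ours for this example, not a description of the actual pipeline.

```python
import re

def merge_entities(title, entity_names):
    """Replace each multi-word entity found in `title` with one underscore-joined token."""
    # Handle longer names first so a long entity is not split by a shorter one.
    for name in sorted(entity_names, key=len, reverse=True):
        if " " not in name:
            continue  # single-word entities are already one token
        joined = name.replace(" ", "_")
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags=re.IGNORECASE)
        title = pattern.sub(lambda match: joined, title)
    return title

def tokenize(title, entity_names):
    """Merge entities, then lowercase and split into tokens for embedding training."""
    merged = merge_entities(title, entity_names)
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9_]+", merged)]

# Hypothetical note title and entity names, written by hand for this example.
tokens = tokenize("Your United Airlines itinerary to Boston",
                  ["United Airlines", "Boston"])
```

Here `tokens` keeps “united_airlines” as a single vocabulary item, which is what would let a similarity query return other airline phrases rather than the separate words “united” and “airlines.”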

For example, in our word2vec model, the top words/phrases that are similar to “United Airlines” were “Spirit Airlines,” “Virgin America,” “Delta Air Lines,” “British Airways,” etc. Therefore, even though there are no tickets from British Airways in our sample set of flight tickets, the model we built will be able to identify these potential tickets.
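To make the generalization argument concrete, here is a minimal sketch of scoring a note against a category in embedding space. It is deliberately simplified: the 3-dimensional vectors are invented by hand rather than learned, and averaging word vectors into a centroid is a stand-in for the doc2vec model described above, not a reproduction of it. All names in the code are illustrative.

```python
import math

# Toy 3-d embeddings invented for illustration; a trained word2vec model
# would learn much higher-dimensional vectors from the note titles.
TOY_VECTORS = {
    "united_airlines": [0.90, 0.10, 0.00],
    "british_airways": [0.85, 0.15, 0.05],
    "flight":          [0.80, 0.20, 0.10],
    "confirmation":    [0.50, 0.50, 0.10],
    "recipe":          [0.00, 0.10, 0.90],
    "pasta":           [0.05, 0.00, 0.95],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary (None if none are)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def category_score(tokens, centroid, vectors):
    """Similarity between a tokenized note and the category centroid."""
    vec = mean_vector(tokens, vectors)
    return 0.0 if vec is None else cosine(vec, centroid)

# The sample tickets mention only United Airlines.
samples = [["united_airlines", "flight", "confirmation"],
           ["united_airlines", "flight"]]
flight_centroid = mean_vector([t for doc in samples for t in doc], TOY_VECTORS)

unseen_ticket = ["british_airways", "flight", "confirmation"]
unrelated_note = ["pasta", "recipe"]
ticket_score = category_score(unseen_ticket, flight_centroid, TOY_VECTORS)
recipe_score = category_score(unrelated_note, flight_centroid, TOY_VECTORS)
```

Because “british_airways” sits close to “united_airlines” in this toy space, the unseen ticket scores far higher against the flight centroid than the recipe note does, even though no British Airways ticket was in the samples.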

One question that came up was whether it would be effective to build the word2vec model from public datasets such as Wikipedia in order to capture the similarity between, say, American Airlines and British Airways. The problem is that when we apply similar techniques to other categories, such as recipes and grocery lists, misspellings and special terms will be prevalent. In these cases, a word2vec model built from anonymized note data can offer more accurate information, which public datasets won’t capture.

Example 2: Finding actions in unstructured content

Up until now, we’ve discussed how to extract structured content out of unstructured data. In addition to that, we wanted to investigate whether we could extract semantic actions out of unstructured content, such as tasks hidden within freeform text. If successful, such a process could be used to make suggestions that tasks should be added to a to-do list, assigned an owner, and/or given a due date.

To help us identify tasks, we used the Syntactic Analysis capability of Cloud Natural Language API. By analyzing both the parts-of-speech of the words in a sentence and the dependency tree structure (or dependency grammar) of the sentence, we were able to identify whether or not a sentence contains a task. We have made use of the parsing information provided by Cloud Natural Language API in a variety of ways to help achieve this.

In the most trivial form of this example, we attempted to extract tasks by identifying verbs acting as imperatives in the present tense.

Sentence Tree

For example, in the sentence above, the system correctly identifies a task that needs to be acted upon: it recognizes “pick” as an imperative verb and as an action that hasn’t already occurred.

In addition, the system can correctly identify a task even if an imperative verb can also be used as a noun. For example, in the phrase “Address this task tonight”, Cloud Natural Language API can correctly identify the verb as “address,” thus allowing us to easily identify that as an action item.

Address Sentence Tree

Furthermore, since the API helps us distinguish nouns from verbs, the system can correctly recognize that there are no actions in the phrase “The address of the White House is 1600 Pennsylvania Ave.”

Address with Number Sentence Tree
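The two “address” sentences above can be used to sketch the simplest rule: treat a sentence as a task when its root is a verb, the verb is not in the past tense, and the root has no subject among its children. The token dictionaries below use the field names of the API’s REST syntax response (`text.content`, `partOfSpeech.tag`, `dependencyEdge.headTokenIndex`, `dependencyEdge.label`), but the annotations are written by hand for illustration and real API output may differ. The helper names and the rule itself are our assumptions, not the production logic.

```python
def tok(content, tag, head, label, tense="TENSE_UNKNOWN"):
    """Build one token dict shaped like an entry in the syntax response."""
    return {
        "text": {"content": content},
        "partOfSpeech": {"tag": tag, "tense": tense},
        "dependencyEdge": {"headTokenIndex": head, "label": label},
    }

def root_index(tokens):
    """Index of the token labeled ROOT, or None if there is none."""
    for i, t in enumerate(tokens):
        if t["dependencyEdge"]["label"] == "ROOT":
            return i
    return None

def children(tokens, head):
    """Tokens whose dependency head is `head` (excluding the head itself)."""
    return [t for i, t in enumerate(tokens)
            if i != head and t["dependencyEdge"]["headTokenIndex"] == head]

def is_imperative_task(tokens):
    """True when the root is a non-past verb with no subject child."""
    root = root_index(tokens)
    if root is None:
        return False
    pos = tokens[root]["partOfSpeech"]
    if pos["tag"] != "VERB" or pos["tense"] == "PAST":
        return False
    labels = {c["dependencyEdge"]["label"] for c in children(tokens, root)}
    return "NSUBJ" not in labels and "NSUBJPASS" not in labels

# "Address this task tonight" (hand-annotated; "Address" is the root verb)
address_task = [
    tok("Address", "VERB", 0, "ROOT"),
    tok("this", "DET", 2, "DET"),
    tok("task", "NOUN", 0, "DOBJ"),
    tok("tonight", "NOUN", 0, "TMOD"),
]

# "The address of the White House is 1600 Pennsylvania Ave" (hand-annotated;
# "address" is a noun and the subject of the root verb "is")
white_house = [
    tok("The", "DET", 1, "DET"),
    tok("address", "NOUN", 6, "NSUBJ"),
    tok("of", "ADP", 1, "PREP"),
    tok("the", "DET", 5, "DET"),
    tok("White", "NOUN", 5, "NN"),
    tok("House", "NOUN", 2, "POBJ"),
    tok("is", "VERB", 6, "ROOT", tense="PRESENT"),
    tok("1600", "NUM", 9, "NUM"),
    tok("Pennsylvania", "NOUN", 9, "NN"),
    tok("Ave", "NOUN", 6, "ATTR"),
]
```

With these annotations, `is_imperative_task(address_task)` is true and `is_imperative_task(white_house)` is false: the same surface word leads to different outcomes because the decision depends on part of speech and tree position rather than on the string “address.”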

Also, the system has been trained to not only identify imperatives without a subject, but to also extract subjects who are being asked to execute a task in the future. For example, in the sentence “Philip will mow the lawn,” we can now not only understand that the task of “mow[ing] the lawn” should be tracked, but also that it should be assigned to Philip. This is possible because we’ve trained the system to analyze the child nodes of the root verb in the sentence dependency tree, and detect children who are auxiliaries like “will,” “shall,” and so forth.

Auxiliaries Sentence Tree

On the other hand, if the child nodes of the root verb include a passive nominal subject or a passive auxiliary, the system doesn’t pick the sentence up as an action item.
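The auxiliary and passive rules can be sketched together by inspecting the children of the root verb. As before, the tokens follow the field names of the API’s REST syntax response but are annotated by hand, and the function name, the set of auxiliaries, and the second example sentence are assumptions made for this illustration.

```python
FUTURE_AUXILIARIES = {"will", "shall"}  # illustrative, not an exhaustive list

def tok(content, tag, head, label):
    """Build one token dict shaped like an entry in the syntax response."""
    return {
        "text": {"content": content},
        "partOfSpeech": {"tag": tag},
        "dependencyEdge": {"headTokenIndex": head, "label": label},
    }

def find_assigned_task(tokens):
    """Return (assignee, root verb) for 'X will <verb> ...' sentences, else None."""
    root = next((i for i, t in enumerate(tokens)
                 if t["dependencyEdge"]["label"] == "ROOT"), None)
    if root is None or tokens[root]["partOfSpeech"]["tag"] != "VERB":
        return None
    kids = [t for i, t in enumerate(tokens)
            if i != root and t["dependencyEdge"]["headTokenIndex"] == root]
    labels = {k["dependencyEdge"]["label"] for k in kids}
    # Passive constructions are not treated as action items.
    if labels & {"NSUBJPASS", "AUXPASS"}:
        return None
    has_future_aux = any(
        k["dependencyEdge"]["label"] == "AUX"
        and k["text"]["content"].lower() in FUTURE_AUXILIARIES
        for k in kids
    )
    subjects = [k["text"]["content"] for k in kids
                if k["dependencyEdge"]["label"] == "NSUBJ"]
    if has_future_aux and subjects:
        return subjects[0], tokens[root]["text"]["content"]
    return None

# "Philip will mow the lawn" (hand-annotated)
active = [
    tok("Philip", "NOUN", 2, "NSUBJ"),
    tok("will", "VERB", 2, "AUX"),
    tok("mow", "VERB", 2, "ROOT"),
    tok("the", "DET", 4, "DET"),
    tok("lawn", "NOUN", 2, "DOBJ"),
]

# "The lawn will be mowed" (hypothetical passive counterpart, hand-annotated)
passive = [
    tok("The", "DET", 1, "DET"),
    tok("lawn", "NOUN", 4, "NSUBJPASS"),
    tok("will", "VERB", 4, "AUX"),
    tok("be", "VERB", 4, "AUXPASS"),
    tok("mowed", "VERB", 4, "ROOT"),
]
```

For the active sentence the function yields the pair of assignee and verb, which is enough to propose a task owned by Philip; the passive sentence yields nothing even though it contains “will,” because the root’s children include a passive subject and a passive auxiliary.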

We still have a lot of work to do. We’re now working on a system to identify when a task is due. Also, when the subject is not mentioned in the sentence itself but is referred to in the surrounding context of the action, we want to identify that subject correctly. Moreover, we still need to improve the accuracy of the model. To that end, we are thinking about a UI that would allow our users to give the system feedback on tasks that shouldn’t have been caught or that were missed. Lastly, we’re investigating how to bring such a feature to languages other than English.

We at Evernote have been having a great time using and learning about Google Cloud Natural Language API. We have been testing the extraction system on our own company’s meeting notes to see how well it would do, and we have been pleasantly surprised by the accuracy and consistency with which we can identify tasks. Similarly, we tried it on our own notes containing flight tickets and found it very accurate and useful. As such, we believe that our users will also enjoy these features. We see potential applications for the same system in other types of notes, such as hotel bookings, recipes, and grocery lists. The extracted information will also enable real-time, context-driven resurfacing of information and semantic search based on a knowledge engine. We look forward to continued progress in using these technologies to improve the Evernote experience for all our users in the near future.
