The Evernote Indexing System is designed to extend Evernote search capabilities beyond text documents into media files. Its task is to scan those files and bring any textual information they contain into the searchable domain. Currently it processes images, PDFs, and digital ink documents, with provisions to extend the service to other media types. The produced index is delivered as an XML or PDF document containing recognized words, alternative spellings, associated confidence levels, and their location rectangles.
The Indexing System is implemented as a farm of dedicated Debian64 servers, each running an AMP processor and multiple ENRS processes – usually one per CPU core of the server. ENRS (EN Recognition Server) is implemented as a set of native libraries wrapped in a Java6 web server application. It currently houses two components, AIR and ANR: the first handles various image types and PDFs, while the second is dedicated to digital ink documents. AMP communicates with the servers through a simple HTTP REST API, which allows flexible system configuration while maintaining the high throughput essential for passing large media files around.
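To illustrate the kind of REST exchange described above, here is a minimal sketch of how an AMP node might submit a media file to an ENRS process over HTTP. The endpoint path, query parameter, and host names are hypothetical, invented for illustration; the actual ENRS API is not documented here.

```python
# Hypothetical sketch: an AMP node posts a raw media file to an ENRS worker.
# The /recognize endpoint and 'type' parameter are illustrative assumptions.
from urllib.request import Request

def build_recognition_request(host: str, port: int, media: bytes,
                              media_type: str) -> Request:
    """Build an HTTP POST carrying the raw media file to an ENRS worker."""
    url = f"http://{host}:{port}/recognize?type={media_type}"
    return Request(url, data=media,
                   headers={"Content-Type": "application/octet-stream"},
                   method="POST")

req = build_recognition_request("enrs-01", 8080, b"\x89PNG", "image")
```

Posting the whole file in one request keeps the per-file overhead low, which matters when large media files dominate the traffic.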
AMP retrieves resources from the user store shards and returns the created indexes. These are included in the search index for the EN Web Service, and passed to Evernote phone/desktop clients to facilitate in-media searches locally. To minimize the extra traffic imposed on shards already busy with user requests, AMPs broadcast queue information to each other, forming a single distributed media processor optimized for the current EN Service load and processing priorities. The Evernote Indexing System is resilient enough to remain operational even if only one component of each type remains functional (currently there are 37 AMP processors and over 500 ENRS server processes in operation, processing around 2 million media files a day).
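The queue-information broadcast can be pictured with a small sketch. Assuming each AMP periodically announces its pending queue depth (the node names and the routing rule below are illustrative, not the actual AMP protocol), new work can simply be routed to the least-loaded node:

```python
# Illustrative sketch of load-aware routing across the AMP farm.
# Each AMP broadcasts its queue depth; work goes to the shortest queue.
def pick_least_loaded(queue_depths: dict) -> str:
    """Return the AMP node with the shortest pending queue."""
    return min(queue_depths, key=queue_depths.get)

depths = {"amp-01": 120, "amp-02": 35, "amp-03": 78}
target = pick_least_loaded(depths)  # -> "amp-02"
```

Because every node sees every other node's load, the farm behaves like a single distributed processor rather than a set of independent workers.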
Let’s have a closer look at the AIR part of the ENRS server. AIR’s recognition philosophy differs from that of ubiquitous OCR systems: its goal is to produce a comprehensive search index rather than printable text. Its focus is on finding as many words as possible in an image of any kind and quality. It also has the flexibility to produce alternative readings for incomplete, unclear, or blurred words.
To deal with real-world images, the AIR server does its processing in multiple ‘passes’, focusing on different assumptions in each of them. The image may be huge, but contain just a few words. It may contain scattered words at different orientations. Fonts may be very small and quite large in the same area. Text may alternate between black-on-white and white-on-black. It could be a mix of different languages and alphabets. For Asian languages, horizontal and vertical lines may be present in the same area. Similar-intensity font colors may blend into the same gray levels under standard OCR processing. Printed text may include handwritten comments. Ad material may be warped, slanted, or changing size on the go. And that’s just to name a few of the problems that AIR servers currently face about two million times a day.
Below is a diagram of the main building block of the AIR server — a single ‘pass’. Depending on the call parameters, it specializes in a different kind of processing (scale, orientation, etc.), but the general scheme stays the same. It starts with the preparation of a set of images specific to the pass – scaled, converted to grayscale, binarized – depending on the pass. Then image graphics, tables, markup, and other non-text artifacts are removed as much as possible to let the system focus on actual words. After candidate words are detected, they are assembled into proposed text lines and blocks.
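The structure of a single pass can be sketched as a chain of stage functions, each transforming the working data before handing it to the next. The toy stages below (operating on a string instead of pixels) are stand-ins purely to show the flow; the real stages are the image-processing steps described above.

```python
# Minimal sketch of the single-'pass' structure: prepare images, strip
# non-text artifacts, detect candidate words, group into lines and blocks.
from functools import reduce

def run_pass(image, stages):
    """Apply the pass stages in order; each stage transforms the data."""
    return reduce(lambda data, stage: stage(data), stages, image)

# Toy stand-in stages on a string 'image', for illustration only:
stages = [
    lambda img: img.lower(),           # prepare: scale / grayscale / binarize
    lambda img: img.replace("#", ""),  # remove graphics, tables, markup
    lambda img: img.split(),           # detect candidate words
    lambda words: [words],             # assemble into lines and blocks
]
result = run_pass("Hello # World", stages)  # -> [['hello', 'world']]
```

Keeping every pass on the same stage skeleton is what lets the scheduler vary only the parameters (scale, orientation, binarization) between passes.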
Each line of each block then passes through a number of recognition engines, including some developed internally and others licensed from vendors. Employing multiple recognition engines is important not only because they specialize in different types of text and languages, but also because it enables a ‘voting’ mechanism: analyzing the alternatives produced by diverse engines for the same word allows better suppression of false recognitions and gives more confidence to consensus variants. Those confident answers become the pillars on which the final block of the ‘pass’ bases its text line reconstruction – re-deciding the structure of text lines and word segmentation, and purging most of the less-confident variants to reduce search false positives.
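The voting idea can be sketched as follows. Assume each engine returns a list of (word, confidence) alternatives for the same image region; the scoring rule below (summing confidences, so agreement between engines lifts a variant) is an illustrative simplification, not AIR's actual formula:

```python
# Hedged sketch of cross-engine 'voting' over word alternatives.
from collections import defaultdict

def vote(engine_outputs):
    """Combine (word, confidence) alternatives from several engines.

    A variant's score is its summed confidence across engines, so
    consensus readings naturally rise to the top of the ranking.
    """
    scores = defaultdict(float)
    for alternatives in engine_outputs:
        for word, conf in alternatives:
            scores[word] += conf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = vote([[("evernote", 0.8), ("evemote", 0.2)],
               [("evernote", 0.7)],
               [("evemote", 0.4), ("evernote", 0.5)]])
# top-ranked variant: "evernote"
```

A variant seen by only one engine can still win, but it needs a high individual confidence to outscore a reading that several engines agree on.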
The number of passes is determined initially by the image rendering and analysis module, but as recognition progresses this number may be increased or reduced. For a clean scan of a document, running only the standard OCR processing may be enough. A snapshot of a complex scene taken by a phone camera under poor lighting conditions may require deep analysis, with the full set of passes, to retrieve most of the textual data. Lots of colored words on a complex background may require additional passes specifically tailored to color separation. The presence of small blurred text will require expensive reverse-digital-filtering techniques to restore the text image before any recognition is attempted. And once all passes are complete, it is time for another critical part of the AIR processing to take the stage – final results assembly. On complex images, different passes may have produced a wide variety of interpretations of the same areas. All these conflicts need to be reconciled: the best interpretations selected, most of the incorrect alternatives rejected, and the final blocks and lines of text built.
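The pass-planning logic described above might look roughly like the sketch below. The condition names and pass names are hypothetical stand-ins for whatever features the analysis module actually extracts; the point is only that the pass list grows with image difficulty.

```python
# Illustrative sketch of pass planning; all flag and pass names are
# invented for this example, not the actual AIR heuristics.
def plan_passes(clean_scan=False, complex_scene=False,
                colored_text=False, blurred_small_text=False):
    passes = ["standard_ocr"]
    if clean_scan:
        return passes                     # a clean scan needs nothing more
    if complex_scene:
        passes += ["multi_scale", "multi_orientation", "inverse_text"]
    if colored_text:
        passes.append("color_separation")
    if blurred_small_text:
        passes.append("deblur_restore")   # reverse digital filtering first
    return passes

plan_passes(complex_scene=True, colored_text=True)
```

Because the plan is re-evaluated as recognition progresses, passes can be added when early results look poor, or dropped when the standard pass already covers the image.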
Once the internal document structure is finalized, only the last step is left: creating the requested output format. For PDF documents it is still a PDF, with images replaced by text boxes of recognized words. For all other input documents it is an XML index containing the list of recognized words and their bounding boxes (or stroke lists, for digital ink documents). This location information makes it possible to highlight the searched word over the source image, or over the text of an ink document, when a user searches for the document containing it.
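As a rough illustration of the XML index shape described above, the sketch below emits words with confidences and bounding rectangles. The element and attribute names (`recoIndex`, `item`, `t`) are invented for this example and are not a specification of the actual output format.

```python
# Hypothetical sketch of an XML search index: each item carries a bounding
# rectangle and one or more recognized-text alternatives with confidences.
import xml.etree.ElementTree as ET

def build_index(words):
    """words: list of (text, confidence, (x, y, w, h)) tuples."""
    root = ET.Element("recoIndex")
    for text, conf, (x, y, w, h) in words:
        item = ET.SubElement(root, "item",
                             x=str(x), y=str(y), w=str(w), h=str(h))
        ET.SubElement(item, "t", w=str(conf)).text = text
    return ET.tostring(root, encoding="unicode")

xml = build_index([("coffee", 87, (10, 20, 120, 30)),
                   ("coffer", 41, (10, 20, 120, 30))])
```

Storing several alternatives per rectangle is what lets a client highlight the right region even when the top reading was not the one the user typed.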