For a company that stores hundreds of billions of files, search is vital to Dropbox, both for its customers and for internal use. As a result, the storage provider has overhauled its search with machine learning capabilities.
The new platform, called Nautilus, had four goals at launch: delivering top-class performance, scalability and reliability; providing intelligent document ranking and retrieval; offering flexibility for customising document-indexing and query-processing pipelines; and wrapping it all up in a reliable, secure package.
The architecture is based at a high level on indexing and serving. Indexing, naturally, is a key part of any search system: it collects, parses and stores data for retrieval. The serving function uses the index to return results for user queries. This design is by no means uncommon, but the sheer scale involved demands more. Dropbox generates 'offline' builds of the search index every few days on average, and puts together 'index mutations' that can be applied to both the live index and a persistent document store in near real time, within approximately a few seconds.
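To make the split between offline builds and index mutations concrete, here is a minimal Python sketch. It is not Dropbox's code: the names `Mutation`, `LiveSearch` and `offline_build` are hypothetical, the whitespace tokeniser is a simplification, and plain dictionaries stand in for the live index and the persistent document store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mutation:
    """Hypothetical index mutation: new text for a document, or a deletion."""
    doc_id: str
    text: Optional[str]  # None marks a deletion


class LiveSearch:
    """Toy stand-in for a live index plus a persistent document store."""

    def __init__(self):
        self.documents = {}  # doc_id -> text (document store stand-in)
        self.index = {}      # token -> set of doc_ids (live index stand-in)

    def apply(self, mutation):
        # Remove the old version of the document from both structures.
        old_text = self.documents.pop(mutation.doc_id, None)
        if old_text is not None:
            for token in set(old_text.lower().split()):
                postings = self.index[token]
                postings.discard(mutation.doc_id)
                if not postings:
                    del self.index[token]
        # Write the new version, unless the mutation is a deletion.
        if mutation.text is not None:
            self.documents[mutation.doc_id] = mutation.text
            for token in set(mutation.text.lower().split()):
                self.index.setdefault(token, set()).add(mutation.doc_id)

    def search(self, query):
        # Return the ids of documents containing every query token.
        tokens = query.lower().split()
        if not tokens:
            return set()
        return set.intersection(*(self.index.get(t, set()) for t in tokens))


def offline_build(snapshot):
    """Rebuild the whole index from a snapshot, as a periodic batch job might."""
    live = LiveSearch()
    for doc_id, text in snapshot.items():
        live.apply(Mutation(doc_id, text))
    return live


live = offline_build({"a": "quarterly budget report", "b": "holiday photos"})
live.apply(Mutation("b", "budget for holiday"))  # an edit arrives after the build
live.apply(Mutation("a", None))                  # a deletion arrives after the build
```

In this sketch the offline build and the mutation stream share one code path, so a document edited after the last build becomes searchable as soon as its mutation is applied.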
The machine learning element comes in through search ranking. Whereas Dropbox's retrieval engine returns a large set of matching documents 'without worrying too much about how relevant each document is to the user', as the company puts it, the ranking engine aims to predict which items the user wants at that moment.
"The ranking engine is powered by a ML model that outputs a score for each document based on a variety of signals," wrote Diwaker Gupta, engineering manager at Dropbox, in a blog post. "Some signals measure the relevance of the document to the query, while others measure the relevance of the document to the user at the current moment in time."
As can be expected with ML, the system can learn as it goes along, while the company is at pains to note that no personally identifiable information – rather, anonymised 'click' data – is used.
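As a rough illustration of scoring documents from signals and learning from click data, the Python sketch below fits one weight per signal. The article does not say which model Dropbox uses, so the logistic regression here is a swapped-in stand-in, and the signal names (`text_match`, `freshness`) and the click data are invented for the example.

```python
import math


def score(weights, signals):
    """Weighted sum of a document's signals; higher means more relevant."""
    return sum(weights[name] * value for name, value in signals.items())


def train(examples, signal_names, epochs=200, learning_rate=0.5):
    """Fit one weight per signal from (signals, clicked) pairs.

    Plain logistic regression trained by stochastic gradient ascent,
    used only as a simple example of learning 'importance weights'.
    """
    weights = {name: 0.0 for name in signal_names}
    for _ in range(epochs):
        for signals, clicked in examples:
            predicted = 1.0 / (1.0 + math.exp(-score(weights, signals)))
            for name in signal_names:
                weights[name] += learning_rate * (clicked - predicted) * signals[name]
    return weights


def rank(weights, candidates):
    """Order candidate document ids from highest to lowest score."""
    return sorted(candidates, key=lambda doc_id: score(weights, candidates[doc_id]),
                  reverse=True)


# Invented, anonymised-style click data: (signals for a shown result, was it clicked?)
click_log = [
    ({"text_match": 0.9, "freshness": 0.9}, 1),
    ({"text_match": 0.9, "freshness": 0.1}, 0),
    ({"text_match": 0.2, "freshness": 0.8}, 1),
    ({"text_match": 0.3, "freshness": 0.2}, 0),
]
weights = train(click_log, ["text_match", "freshness"])

# Two candidates that match the query equally well but differ in recency.
candidates = {
    "old": {"text_match": 0.9, "freshness": 0.1},
    "new": {"text_match": 0.9, "freshness": 0.9},
}
ranking = rank(weights, candidates)
```

Because the clicked results in this made-up log are the recent ones, the fitted weights favour the freshness signal and the newer candidate is ranked first. Adding a further signal only requires another entry in each signals dictionary, with no manual tuning of its weight.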
"The main advantage of using an ML-based solution for ranking is that we can use a large number of signals, as well as deal with new signals automatically," added Gupta. "For example, you could imagine manually defining an 'importance' for each type of signal we have available to us. This might be doable if you only have a handful of signals, but as you add tens or hundreds or even thousands, this becomes impossible to do in an optimal way.
"This is exactly where ML shines: it can automatically learn the right set of 'importance weights' to use for ranking documents, such that the most relevant ones are shown to the user," said Gupta. "For example, by experimentation, we determined that freshness-related signals contribute significantly to more relevant results."
A further blog post noted an interesting aspect of Dropbox's traffic: it is dominated by writes rather than reads. In other words, files are updated far more frequently than they are searched for. As a result, the company uses an 'exploded' posting list format. "The exploded representation has the main benefit of handling index mutations particularly efficiently," the company wrote.
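The article does not spell out what the 'exploded' format looks like. One plausible reading, assumed here rather than confirmed by the source, is that each (token, document) pair is stored as its own sorted key instead of each token mapping to one long list. The Python sketch below illustrates that idea, with the hypothetical class `ExplodedIndex` and a sorted in-memory list standing in for a sorted key-value store.

```python
import bisect


class ExplodedIndex:
    """Toy posting store with one sorted key per (token, doc_id) pair.

    Assumption: this is only one possible interpretation of an 'exploded'
    posting list; a real system would use a sorted key-value store.
    """

    def __init__(self):
        self._keys = []  # sorted list of (token, doc_id) tuples

    def add(self, token, doc_id):
        # A write touches a single key, not a whole per-token list.
        key = (token, doc_id)
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def remove(self, token, doc_id):
        # A delete also touches a single key.
        key = (token, doc_id)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]

    def postings(self, token):
        # A read is a prefix scan over the keys that start with the token.
        start = bisect.bisect_left(self._keys, (token,))
        doc_ids = []
        for key_token, doc_id in self._keys[start:]:
            if key_token != token:
                break
            doc_ids.append(doc_id)
        return doc_ids


index = ExplodedIndex()
index.add("budget", "doc1")
index.add("budget", "doc3")
index.add("budget", "doc2")
index.add("photos", "doc2")
index.remove("budget", "doc3")
```

Under this reading, updates are cheap single-key operations while reads pay for a range scan, a trade-off that would suit a workload where files change far more often than they are searched.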
This is an interesting development when considering other infrastructure overhauls the company has undertaken. Dropbox's S-1 filing, released when the company went public earlier this year, mentioned 'infrastructure optimisation': in particular, spending two and a half years moving away from Amazon Web Services (AWS) to its own solution, known as 'Magic Pocket'.
Nautilus replaces Firefly, which was Dropbox's search tool for the previous three years.