Nautilus is our search engine for finding documents and other files in Dropbox. Introduced in 2018, Nautilus uses a conventional keyword-based approach that, while functional, has some inherent shortcomings. Because Nautilus has limited contextual understanding of what someone may be looking for, users must recall a file’s exact name or the specific keywords it contains. For instance, a search for “employment contract” may overlook relevant “job agreement” or “offer letter” documents, because Nautilus does not capture their contextual similarity. And for multilingual users, Nautilus expects queries and documents to be in the same language, hindering efficient retrieval when dealing with content in different languages.
To mitigate these limitations, we considered techniques such as stemming, spelling correction, and query expansion for improved flexibility. However, we wondered if we could elevate the Dropbox search experience further. Could it be possible to help users find their content without needing to know the exact search term?
Enter semantic search. Rather than rely on exact keyword matches, semantic search aims to better understand the relationship between user queries and document content. This functionality ultimately enables Dropbox users to locate crucial information more quickly, so they can spend less time searching and more time focusing on the task at hand.
For multilingual users, semantic search also unlocks another capability: cross-lingual search. This advanced feature allows users to search in one language and receive relevant results in other languages, further enhancing accessibility and usability.
We’re excited to share that Dropbox now supports semantic search (powered by Nautilus), adding the aforementioned capabilities. We rolled it out internally in early 2024, and then externally as an experiment for a subset of Pro and Essential users in May 2024. With this release, we observed a nearly 17% reduction in empty search sessions (measured by ZRR, or zero-results rate) and a 2% lift in search session success (measured by qCTR, or qualified click-through rate). Based on these positive results, we made semantic search generally available to all Pro and Essential users in August 2024, with availability for Business users coming in early 2025.
Staying true to our AI Principles, we performed the evaluation of different embedding models in-house with pre-trained models, and we did not train on any user data. Below, we’ll introduce the key concepts behind semantic search, and walk you through our approach for selecting one essential component: the text embedding model.
Understanding semantic search: A new paradigm
Semantic search is designed to retrieve information based on meaning and intent, going beyond the limitations of simple keyword matching. While the terms "semantic search" and "vector search" are sometimes used interchangeably, within the context of this post we define "semantic search" as the entire flow—from user query input to search results output—and "vector search" as the specific step of retrieving items based on vector similarity. We plan to introduce the concept of semantic search for multiple file types, but in this first iteration of semantic search at Dropbox, we focused on supporting only text files.
At its core, semantic search relies on vector search as the underlying technology to interpret and process unstructured data such as text, images, audio, and video. This process involves transforming content into numeric representations called embeddings, which capture the rich features of the data in a high-dimensional space. These embeddings, or vectors, are dense lists of numbers encoding features across hundreds or thousands of dimensions. During training, machine learning models refine these dimensions to capture nuanced patterns—whether visual details, document structure, or associations among words. While individual dimensions are often abstract, certain aspects may implicitly capture features like sentiment, syntax, or relationships between concepts. This dense representation enables a context-aware understanding that drives more meaningful retrieval.
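To make this concrete, here’s a minimal sketch of how semantically related phrases land close together in embedding space. It uses the open-source sentence-transformers library with one of the pretrained multilingual models from our evaluation below; this is an illustration, not our production pipeline:

```python
# Illustration: semantically related phrases produce similar embeddings.
# Uses the open-source sentence-transformers library; not our production setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embs = model.encode(["employment contract", "job agreement", "vacation photos"])

# Related phrases score high on cosine similarity...
print(cosine(embs[0], embs[1]))  # "employment contract" vs. "job agreement"
# ...while unrelated phrases score low.
print(cosine(embs[0], embs[2]))  # "employment contract" vs. "vacation photos"
```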
Vector search provides the flexibility to store different types of embeddings, for different file types, based on the specific needs of the application. By selecting or training models that capture the most relevant features for the task at hand—such as visual characteristics for images or semantic relationships for text—we can tailor the embeddings to support a wide range of content types and use cases.
Building on this foundation, semantic search operates by transforming user queries into embeddings and then performing vector search to retrieve results that align with the query's intent, rather than its literal terms. The search begins when a user enters a query, which is converted into an embedding and compared against stored embeddings using vector search. Related items naturally cluster together in the vector space, even when the formats differ—whether text, images, or diagrams. Using nearest-neighbor algorithms, vector search identifies results based on meaning, ensuring that the retrieved content aligns with the user's intent. For example, a search for “guides and resources” could surface instructional PDFs, training videos, and visual diagrams—regardless of the exact labels for each item.
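The core retrieval step can be sketched in just a few lines. The example below brute-forces nearest neighbors with a dot product over normalized embeddings and an invented three-document corpus; at Dropbox scale, an approximate nearest-neighbor index would stand in for the brute-force scan:

```python
# Sketch of vector search: embed the query, rank documents by similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = [
    "Onboarding guide for new hires",
    "Q3 revenue forecast spreadsheet",
    "Training video transcript: intro to the design system",
]
doc_embs = model.encode(docs, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
query_emb = model.encode(["guides and resources"], normalize_embeddings=True)[0]
scores = doc_embs @ query_emb

for i in np.argsort(-scores)[:2]:  # top-2 nearest neighbors
    print(f"{scores[i]:.3f}  {docs[i]}")
```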
This similarity-based retrieval approach unlocks a wide range of possibilities, from linking related documents and multimedia to uncovering insights across diverse content. Vector search enhances our interaction with information by establishing meaningful connections between items, supporting a flexible and integrated search experience. Through its ability to understand meaning and context, vector search provides the technical foundation for a search system that can adapt to varied applications, delivering more relevant and context-aware results.
Identifying the right model for our purposes
Central to semantic search (and, for our use case, to searching text documents) is the document embedding model, which maps each document and query to its respective embedding. Implementing semantic search at Dropbox scale would require mapping both new and existing documents to their embeddings and storing them in our indices. For a search platform as large as Dropbox, which indexes more than a trillion documents and exabytes of data, this could get computationally expensive very quickly! Overall performance, in terms of both latency and quality, would hinge on the model we used.
Since computing the embedding for a user’s search query would now be in the critical path for retrieving their documents, speed would be paramount. Also, a faster model would mean we could index our users’ documents faster, leading to time and cost savings down the line.
At the same time, the quality of our embeddings needed to be high enough to benefit from the vector search paradigm. Whatever document embedding model we chose would need to generate embeddings that were similar enough to match documents with their relevant queries, but still discriminative enough to filter out irrelevant documents.
We decided to perform a thorough evaluation of available models to find the one most suitable for our purposes, measuring each model with speed and search quality in mind.
Integrating the Massive Text Embedding Benchmark
Fortunately, assessing document embedding models is an undertaking that has already been embraced by the research community. The Massive Text Embedding Benchmark (MTEB) is an open-source benchmark that assesses document embedding models across eight evaluation tasks and 56 datasets, some of them multilingual. MTEB has published over 2,000 results to its leaderboard since its release in 2022.
We adapted MTEB and leveraged its reranking task to fit our needs as follows:
- Customizations for Dropbox infrastructure. We added adapters to enable the evaluation of models running in our in-house inference services, in addition to models executed inline. We also added adapters to allow streaming datasets that reside in our infrastructure.
- Multiple embeddings per document. MTEB by default presumes that a single embedding is generated per document, as the sizes of documents in the public datasets are fairly consistent. In production, however, our users’ documents can range from tiny to very large, which led us to hypothesize that we should generate multiple embeddings per document. We implemented several strategies for splitting, or chunking, a given document into individual chunks (see the sketch after this list). Chunking is further configurable with a parameter specifying the overlap between consecutive chunks. We also explored summarization to ensure we respect the model’s limit on input size. The resulting embeddings can then be used directly or in aggregate.
- Optimizations for storage and precision. To optimize storage costs, we reduced the precision (full 32-bit floating point, half float, quarter float, fixed point of various bit depths) and dimensionality (via Gaussian random projections) of MTEB's full-sized embeddings.
- Files versus documents. Whereas the public datasets in MTEB consist of unnamed documents identified by their content, documents in Dropbox are named by our users—and filenames are quite significant in retrieval tasks. We crafted various approaches to incorporate embeddings of the filenames into MTEB when applicable.
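As one illustration of the chunking strategies described above, here is a simplified fixed-size chunker with configurable overlap. Whitespace splitting stands in for the embedding model’s real tokenizer, and production chunkers can also respect sentence or section boundaries:

```python
# Simplified fixed-size chunking with overlap between consecutive chunks.
# Whitespace tokenization is a stand-in for the embedding model's tokenizer.
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap  # consecutive chunks share `overlap` tokens
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        chunk = tokens[start : start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```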
Constructing novel datasets for evaluation
Among the evaluation tasks that MTEB offers, we were most interested in the reranking and retrieval tasks. While the performance of various models on these two tasks was readily available on the leaderboard, the tasks primarily offered English-only datasets, which did not match the distribution of the Dropbox corpus. To address any potential distribution shift between the public datasets and the Dropbox corpus, we leveraged our existing ML-powered search platform, Nautilus, to construct our own MTEB-compatible datasets for more precise evaluation.
We created a Kubeflow pipeline to generate a custom dataset purely for the purpose of evaluating the different embeddings models and configurations. With this pipeline, we extract anonymized query-document pairs from Dropbox search logs and put them in an MTEB-compatible format within our in-house service for hosting L0 datasets. We have strict security and access controls in place to ensure that only Dropboxers who need access to this data can access it. Moreover, as per our data retention policies, the datasets get deleted after 30 days.
We also extended the pipeline to unlock multilingual evaluation. At the time of our benchmarking, the public retrieval datasets were English-only (the MIRACL dataset has since bridged this gap). To diversify our evaluation criteria beyond English, we built a set of multilingual datasets (Spanish, French, German, Japanese, Korean) from Dropbox search logs.
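For reference, each sample in these datasets follows MTEB’s reranking convention of a query paired with positive and negative documents; the values below are invented for illustration:

```python
# Illustrative shape of one MTEB-style reranking sample built from
# anonymized search logs. Field names follow MTEB's reranking convention;
# the values are invented.
sample = {
    "query": "contrato de trabajo",  # the user's search query
    "positive": [                    # documents the user engaged with
        "Acme employment agreement, signed March 2023 ...",
    ],
    "negative": [                    # documents surfaced but skipped
        "Team offsite planning notes ...",
    ],
}
```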
Evaluation and model selection
From an exhaustive evaluation of 11 models—including four multilingual ones—we selected multilingual-e5-large as the top performer. This model not only excelled on our Dropbox datasets, but at the time of benchmarking also stood out as the best multilingual model on the MTEB public leaderboard across various tasks.
Below are our benchmark results on our custom datasets for multilingual models, using a configuration of two embeddings per document: one for title with path, and another for the first chunk of the text content. Metrics reported are mean reciprocal rank (MRR) and mean average precision (MAP), where higher is better.
| Model | English | Japanese | Spanish | Korean | German |
|---|---|---|---|---|---|
| paraphrase-multilingual-mpnet-base-v2 | MRR: 0.3299<br>MAP: 0.3462 | MRR: 0.2245<br>MAP: 0.2448 | MRR: 0.2367<br>MAP: 0.2568 | MRR: 0.2338<br>MAP: 0.2546 | MRR: 0.2879<br>MAP: 0.3078 |
| paraphrase-multilingual-MiniLM-L12-v2 | MRR: 0.3108<br>MAP: 0.3278 | MRR: 0.2628<br>MAP: 0.2804 | MRR: 0.2043<br>MAP: 0.2273 | MRR: 0.2374<br>MAP: 0.2584 | MRR: 0.2355<br>MAP: 0.2533 |
| multilingual-e5-large | MRR: 0.5044<br>MAP: 0.5133 | MRR: 0.4265<br>MAP: 0.4386 | MRR: 0.3350<br>MAP: 0.3524 | MRR: 0.4003<br>MAP: 0.4118 | MRR: 0.3305<br>MAP: 0.3432 |
| multilingual-e5-base | MRR: 0.4492<br>MAP: 0.4603 | MRR: 0.3659<br>MAP: 0.3795 | MRR: 0.3330<br>MAP: 0.3511 | MRR: 0.3817<br>MAP: 0.3957 | MRR: 0.3405<br>MAP: 0.3535 |
While our evaluation primarily focused on the reranking and retrieval tasks, we knew that other teams at Dropbox might be interested in using MTEB to evaluate additional tasks. To extend its functionality beyond vector search, we linked MTEB with the rest of our infrastructure and added additional key parameters for evaluation. This means that other Dropbox initiatives, such as Dropbox Dash or our AI-powered file summaries and conversational features, can now leverage MTEB to evaluate document embedding models for their applications as well.
Putting the model into production
In order to make the best use of our storage and compute resources, putting multilingual-e5-large into production required some tradeoffs.
Based on available storage capacity, we started by setting an upper bound of 4KB of vector-search-related metadata per document. This gave us room to adjust the number of embeddings per document, as well as the dimensionality of those embeddings and their numerical precision.
In compression experiments, reducing precision to 8-bit per channel resulted in a manageable 1KB per embedding with a marginal impact on quality. However, reducing dimensionality adversely affected quality. Given that two embeddings fit our storage constraints, we opted to maintain the full dimension of the embeddings to preserve data integrity.
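To make the arithmetic concrete: multilingual-e5-large produces 1,024-dimensional embeddings, so at 8 bits per channel a single embedding occupies roughly 1KB, plus a few bytes for the scale factor described next. Two such embeddings per document fit comfortably within the 4KB bound.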
The final quantization format we adopted is a slight variation over the standard 8-bit: we first scale the embedding so that the maximum over the magnitudes of individual channels is exactly 1.0. We store the scalar separately as a 32-bit float (4 bytes)—and the scaled embedding, which lies in [-1, 1], can be converted to 8-bit fixed point by remapping the range to 8-bit signed integer and rounding. We found that this scheme minimized the error on cosine similarity over the query-document pairs in our dataset.
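In code, the scheme looks roughly like the sketch below, which maps the scaled [-1, 1] range onto signed 8-bit integers using 127 as the integer scale; the exact rounding and storage details in production may differ:

```python
# Sketch of scale-then-quantize: store a float32 scale plus int8 channels.
import numpy as np

def quantize(emb: np.ndarray) -> tuple[np.float32, np.ndarray]:
    scale = np.abs(emb).max()  # max channel magnitude (assumes a nonzero embedding)
    scaled = emb / scale       # now lies in [-1, 1]
    q = np.clip(np.round(scaled * 127), -127, 127).astype(np.int8)
    return np.float32(scale), q  # 4 bytes + 1 byte per channel

def dequantize(scale: np.float32, q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) / 127) * scale

def cos(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# The error on cosine similarity should be marginal:
rng = np.random.default_rng(0)
a, b = rng.normal(size=1024), rng.normal(size=1024)
print(cos(a, b))
print(cos(dequantize(*quantize(a)), dequantize(*quantize(b))))
```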
As for managing our compute resources, we had to consider:
- The maximum number of characters per document
- Constraints on the number of document chunks
- How to balance chunk sizes to maintain contextual relevance without an overwhelming amount of processing
- Different document embedding strategies (file path and/or content embeddings)
Our findings ultimately favored a dual approach: storing separate embeddings for the file path (directory path plus filename) and for the document content up to a limit of 512 tokens. While this method doesn't encompass the entire document, focusing on the initial 512 tokens significantly lowered costs and processing demands, with just two embeddings per document.
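A simplified version of this dual-embedding strategy might look like the sketch below. The e5 model family expects a “passage:” prefix on indexed text (and “query:” on queries), and whitespace splitting again approximates real tokenizer truncation:

```python
# Sketch: two embeddings per document, one for the file path and one for
# the first ~512 tokens of content. Truncation is approximated here.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

def embed_document(path: str, content: str, max_tokens: int = 512):
    head = " ".join(content.split()[:max_tokens])
    return model.encode(
        [f"passage: {path}", f"passage: {head}"],
        normalize_embeddings=True,
    )

path_emb, content_emb = embed_document(
    "/Contracts/2023/acme-employment-agreement.pdf",
    "This Employment Agreement is entered into by ...",
)
```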
A more relevant Dropbox search experience
Search engines have come a long way from simple keyword-based queries. With vector search, we can use machine learning to actually figure out the meaning behind queries and content. This makes it much easier for users to find the files or data they need quickly. It also doesn’t matter what language they use—whether in their queries or content—which is a big deal.
Getting to this point took a lot of careful planning and decision-making. By using multilingual-e5-large with just two embeddings per document, we think we’ve struck the right balance between search quality, speed, and efficiency. But it's not just about being faster. The real value of vector search is in helping our users work more effectively, get more done, and find Dropbox easier to use. We’re also not done yet. We’re committed to giving our users the best tools for modern work, and you can expect to hear more about our efforts to make search in Dropbox even better.
Acknowledgements for this project (in alphabetical order): Aditya Jayaraman, Alex Yin, Daniel Cardenas, Jongmin Baek, Kedar Rudre, Marta Mendez, Matt Barta, Mingye Xia, Morgan Zerby, Muye Gao, Prasang Upadhyaya, Sarah Andrabi, Yidi Zhang, Zhangfan Dong.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit dropbox.com/jobs to see our open roles, and follow @LifeInsideDropbox on Instagram and Facebook to see what it's like to create a more enlightened way of working.