To meet low-latency serving requirements, large-scale recommenders are often deployed to production as multi-stage systems. The goal of the first stage (candidate retrieval) is to sift through a large (>100M elements) corpus of candidate items and retrieve a relevant subset (~hundreds) of items for downstream ranking and filtering tasks. To optimize this retrieval task, we consider two core objectives:

- During model training, find the best way to compile all knowledge into embeddings.
- During model serving, retrieve relevant items fast enough to meet latency requirements.

Figure 2: The two-tower encoder model is a specific type of embedding-based search where one deep neural network tower produces the query embedding and a second tower computes the candidate embedding. Source: Announcing ScaNN: Efficient Vector Similarity Search

Calculating the dot product between the two embedding vectors determines how close (similar) the candidate is to the query. While these capabilities help achieve useful embeddings, we still need to resolve the retrieval latency requirements. To this end, the two-tower architecture offers one more advantage: the ability to decouple inference of query and candidate items. This decoupling means all candidate item embeddings can be precomputed, reducing the serving computation to (1) converting queries to embedding vectors and (2) searching for similar vectors among the precomputed candidates.

As candidate datasets scale to millions (or billions) of vectors, the similarity search often becomes a computational bottleneck for model serving. Relaxing the search to approximate distance calculations can yield significant latency improvements, but we need to minimize the impact on search accuracy (i.e., relevance, recall).
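To make the decoupling concrete, here is a minimal NumPy sketch of the serving path: the candidate embeddings are assumed to be precomputed offline by the candidate tower, so at request time we only need the query embedding and a dot-product top-k search. The shapes, the `retrieve_top_k` helper, and the random data are illustrative assumptions, not part of any particular system.

```python
import numpy as np

# Hypothetical precomputed candidate embeddings: in a real deployment these
# would be produced offline by the candidate tower, not sampled at random.
rng = np.random.default_rng(seed=0)
candidates = rng.standard_normal((100_000, 64)).astype(np.float32)  # 100k items, 64-dim

def retrieve_top_k(query_embedding: np.ndarray, k: int = 100) -> np.ndarray:
    """Exact (brute-force) retrieval: score every candidate by dot product
    and return the indices of the k highest-scoring items, best first."""
    scores = candidates @ query_embedding          # one dot product per candidate
    top_k = np.argpartition(scores, -k)[-k:]       # unordered top-k in O(n)
    return top_k[np.argsort(scores[top_k])[::-1]]  # sort only the k winners

# At serving time, only the query tower runs; here we stand in for its output.
query = rng.standard_normal(64).astype(np.float32)
top_items = retrieve_top_k(query, k=10)
```

Note that the exact scan touches every candidate, which is precisely the linear cost that motivates the approximate search discussed next.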
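The latency/accuracy trade-off can be quantified with recall@k against the exact results. The sketch below uses a simple IVF-style approximation (partition candidates into buckets around sampled centroids, then score only the buckets closest to the query); ScaNN's actual pipeline is more sophisticated, and all sizes, bucket counts, and the random data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
candidates = rng.standard_normal((20_000, 32)).astype(np.float32)
query = rng.standard_normal(32).astype(np.float32)
k = 50

# Ground truth: exact brute-force top-k by dot product.
exact_scores = candidates @ query
exact_top = set(np.argsort(exact_scores)[-k:])

# IVF-style approximation: assign each candidate to its best-matching
# centroid, then at query time probe only the n_probe closest buckets.
n_buckets, n_probe = 64, 8
centroids = candidates[rng.choice(len(candidates), n_buckets, replace=False)]
assignments = np.argmax(candidates @ centroids.T, axis=1)

probe = np.argsort(centroids @ query)[-n_probe:]   # buckets nearest the query
subset = np.flatnonzero(np.isin(assignments, probe))
approx_top = set(subset[np.argsort(candidates[subset] @ query)[-k:]])

# Recall@k: fraction of the true top-k recovered by the approximate search,
# after scoring only a fraction of the corpus.
recall = len(exact_top & approx_top) / k
```

Tuning `n_probe` moves you along the trade-off curve: more probed buckets means more candidates scored (higher latency) but higher recall.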