ESPN: Memory-Efficient Multi-vector Information Retrieval
Recent advances in large language models have demonstrated remarkable effectiveness in information retrieval (IR) tasks. While many neural IR systems encode queries and documents into single-vector representations, multi-vector models improve retrieval quality by producing multi-vector representations and performing similarity search at the granularity of individual tokens. However, these models increase the memory requirements of retrieval indices by an order of magnitude, which makes multi-vector IR models increasingly difficult to scale. We introduce Embedding from Storage Pipelined Network (ESPN), where we offload the entire re-ranking embedding tables to SSDs and reduce the memory requirements by 5-16x. We design a flexible software prefetcher applicable to any hierarchical clustering-based search, achieving hit rates exceeding 90%. ESPN improves SSD-based retrieval by up to 6.4x and end-to-end throughput by 68%, maintaining near-memory levels of query latency even for large query batch sizes. The code is available at https://github.com/susavlsh10/ESPN-v1.
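To make the single-vector versus multi-vector distinction concrete, the following is a minimal sketch of token-level scoring. It assumes a ColBERT-style late-interaction score (sum over query tokens of the maximum dot product against any document token), which the abstract does not name explicitly. The function names and toy vectors are hypothetical and are not taken from the ESPN code base.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query: Vector, doc: Vector) -> float:
    """Single-vector retrieval: one embedding per query and per document."""
    return dot(query, doc)


def late_interaction_score(query_tokens: List[Vector],
                           doc_tokens: List[Vector]) -> float:
    """Multi-vector retrieval (assumed late-interaction form):
    each query token is matched to its most similar document token,
    and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional embeddings, one vector per token (illustrative only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains exact token matches
doc_b = [[0.5, 0.5], [0.5, 0.5]]              # only partial matches

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
print(score_a, score_b)
```

In this sketch `doc_a` is stored as three vectors rather than one. Keeping one embedding per token is what lets the score reward exact token matches, and it is also why the index grows with document length.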
Tue 25 Jun (displayed time zone: Windhoek)
16:00 - 17:00 | ISMM: Session 4 - Potpourri | ISMM 2024 at Iceland | Chair(s): Tony Hosking, Australian National University
16:00 | 20m | Talk | SSRD: Shapes and Summaries for Race Detection in Concurrent Data Structures (Remote) | ISMM 2024 | Xiaofan Sun (University of California at Riverside), Rajiv Gupta (University of California at Riverside)
16:20 | 20m | Talk | A Heuristic for Periodic Memory Allocation with Little Fragmentation to Train Neural Networks | ISMM 2024
16:40 | 20m | Talk | ESPN: Memory-Efficient Multi-vector Information Retrieval | ISMM 2024