TY - GEN
T1 - METIS
T2 - 31st ACM Symposium on Operating Systems Principles, SOSP 2025
AU - Ray, Siddhant
AU - Pan, Rui
AU - Gu, Zhuohan
AU - Du, Kuntai
AU - Feng, Shaoting
AU - Ananthanarayanan, Ganesh
AU - Netravali, Ravi
AU - Jiang, Junchen
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/10/12
Y1 - 2025/10/12
AB - RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge causes higher response delay. Prior work focuses either on reducing the response delay (e.g., better scheduling of RAG queries) or on maximizing quality (e.g., tuning the RAG workflow), but it falls short of systematically balancing the tradeoff between the delay and quality of RAG responses. To balance both quality and response delay, this paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods. Using four popular RAG-QA datasets, we show that compared to the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by 1.64–2.54× without sacrificing generation quality.
KW - LLM inference
KW - RAG systems
KW - scheduling
UR - https://www.scopus.com/pages/publications/105020852221
UR - https://www.scopus.com/pages/publications/105020852221#tab=citedBy
U2 - 10.1145/3731569.3764855
DO - 10.1145/3731569.3764855
M3 - Conference contribution
AN - SCOPUS:105020852221
T3 - SOSP 2025 - Proceedings of the 2025 ACM SIGOPS 31st Symposium on Operating Systems Principles
SP - 606
EP - 622
BT - SOSP 2025 - Proceedings of the 2025 ACM SIGOPS 31st Symposium on Operating Systems Principles
PB - Association for Computing Machinery, Inc
Y2 - 13 October 2025 through 16 October 2025
ER -