Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Unlike traditional machine learning models, recent large language models (LLMs) can solve multiple tasks across multiple dialogues and multi-modal data sources. These unique characteristics, together with their large model size, make LLM deployment challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource constraints, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and the multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (combining response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new curated dataset containing reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO against several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
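The abstract describes the RCRL offloading decision only at a high level. The sketch below is a minimal, hypothetical illustration of that idea in Python: a Lagrangian-relaxed, bandit-style learner that picks an inference location (local vs. cloud) and a modality set per task, trading response quality against latency and usage cost under a cost budget. Every name and number here (the toy step function, COST_BUDGET, the weights, the one-step update) is an assumption for illustration; it is not the paper's actual RCRL algorithm or M4A1 data.

import random

random.seed(0)

LOCATIONS = ["local", "cloud"]
MODALITIES = ["text", "text+image"]
ACTIONS = [(loc, mod) for loc in LOCATIONS for mod in MODALITIES]
TASKS = ["simple", "hard"]  # toy stand-in for per-task/dialogue context

def step(task, action):
    """Toy environment (invented numbers): (quality, latency_s, usage_cost) for one query."""
    loc, mod = action
    base = {"simple": {"local": 0.80, "cloud": 0.90},
            "hard":   {"local": 0.40, "cloud": 0.90}}[task][loc]
    # Assume image context helps hard tasks more than simple ones.
    img_gain = {"simple": 0.02, "hard": 0.08}[task] if mod == "text+image" else 0.0
    quality = min(1.0, base + img_gain)
    latency = {"local": 0.2, "cloud": 1.5}[loc] + (0.5 if mod == "text+image" else 0.0)
    cost = 0.0 if loc == "local" else (0.05 if mod == "text+image" else 0.02)
    return quality, latency, cost

Q = {(t, a): 0.0 for t in TASKS for a in ACTIONS}
lam = 0.0           # Lagrange multiplier pricing the usage-cost budget
COST_BUDGET = 0.03  # allowed average usage cost per query (assumed)
ALPHA, EPS, ETA, W_LAT = 0.1, 0.1, 0.01, 0.1

for _ in range(20000):
    task = random.choice(TASKS)
    # Epsilon-greedy choice of (location, modality) for this query.
    action = (random.choice(ACTIONS) if random.random() < EPS
              else max(ACTIONS, key=lambda a: Q[(task, a)]))
    quality, latency, cost = step(task, action)
    # Lagrangian-relaxed reward: quality minus weighted latency and priced cost.
    reward = quality - W_LAT * latency - lam * cost
    Q[(task, action)] += ALPHA * (reward - Q[(task, action)])  # one-step value update
    lam = max(0.0, lam + ETA * (cost - COST_BUDGET))           # dual ascent on the budget

for task in TASKS:
    print(task, "->", max(ACTIONS, key=lambda a: Q[(task, a)]))
# With these toy numbers: simple tasks stay on the local LLM, hard tasks
# offload to the cloud with image context, as long as the learned multiplier
# prices cloud cost below its quality gain.

In the paper's setting the constraint is enforced over long-term resource usage and the state would track dialogue and task context; this toy treats each query independently purely to keep the sketch short.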

Original language: English (US)
Title of host publication: MobiHoc 2025 - Proceedings of the 2025 International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing
Publisher: Association for Computing Machinery, Inc
Pages: 201-210
Number of pages: 10
ISBN (Electronic): 9798400713538
State: Published - Oct 23 2025
Externally published: Yes
Event: 26th International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, MobiHoc 2025 - Houston, United States
Duration: Oct 27 2025 – Oct 30 2025

Publication series

Name: MobiHoc 2025 - Proceedings of the 2025 International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing

Conference

Conference: 26th International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, MobiHoc 2025
Country/Territory: United States
City: Houston
Period: 10/27/25 – 10/30/25

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
  • Computer Science Applications
  • Hardware and Architecture

Keywords

  • inference offloading
  • large language model
  • multi-modal
  • reinforcement learning
  • resource constraint
