Abstract
Hybrid language models (HLMs) are inference-time architectures that combine the low-latency efficiency of small language models (SLMs) on clients (edge devices) with the high accuracy of large language models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking the LLM only when the local SLM's prediction is uncertain, i.e., when the model exhibits low confidence or high entropy in its token-level probability distribution. However, each such offload transmits a token-level probability distribution to the LLM for refinement, and this frequent offloading incurs substantial communication overhead, particularly in bandwidth-constrained environments. To address this challenge, we propose federated learning (FL)-enabled HLM (FedHLM), a communication-efficient HLM framework that integrates uncertainty-aware inference with FL. The key innovation lies in collaboratively learning token-level uncertainty thresholds that determine when SLM predictions require LLM assistance. Instead of relying on static or hand-tuned thresholds, FedHLM uses FL to optimize thresholds across clients in a distributed manner while preserving data privacy. In addition, embedding-based token representations enable semantic similarity comparisons during peer-to-peer (P2P) resolution, allowing clients to reuse tokens inferred by similar peers without involving the LLM. Moreover, we propose hierarchical model aggregation to reduce redundant token transmissions: at the edge-server level, client updates are aggregated to refine local routing policies, while global coordination across clusters further synchronizes decision boundaries. This layered approach captures and resolves repeated uncertainty patterns locally, significantly reducing unnecessary LLM queries.
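The uncertainty-aware routing described above can be sketched as a simple entropy gate. The function names and the example threshold below are illustrative assumptions for exposition, not the paper's actual API or learned values:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a token-level probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route_token(probs, threshold):
    """Decide whether a token stays on the local SLM or is offloaded.

    Returns "slm" when the SLM's predictive entropy is at or below the
    (collaboratively learned) threshold, i.e., the prediction is confident;
    otherwise returns "llm" to request refinement from the server.
    """
    return "slm" if token_entropy(probs) <= threshold else "llm"

# A peaked distribution is kept local; a near-uniform one is offloaded.
route_token([0.9, 0.05, 0.05], 0.5)        # -> "slm"
route_token([0.25, 0.25, 0.25, 0.25], 0.5) # -> "llm"
```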
Extensive simulations on large-scale news classification tasks demonstrate that FedHLM reduces LLM token transmissions by more than 95% with negligible accuracy loss, highlighting its potential for scalable and efficient edge artificial intelligence (AI) deployment.
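As a rough illustration of the FL step, per-client uncertainty thresholds could be aggregated with a FedAvg-style weighted average. This sketch assumes a scalar threshold per client and data-size weights, details the abstract does not specify:

```python
def aggregate_thresholds(client_thresholds, client_weights):
    """FedAvg-style weighted average of per-client uncertainty thresholds.

    client_thresholds: locally tuned entropy thresholds, one per client.
    client_weights: relative weights, e.g., per-client token counts
                    (assumed positive).
    """
    total = float(sum(client_weights))
    return sum(t * w for t, w in zip(client_thresholds, client_weights)) / total

# Each client would then adopt the aggregated threshold for the next round,
# so no raw data leaves the device, only the scalar threshold update.
```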
| Original language | English (US) |
|---|---|
| Pages (from-to) | 53574-53592 |
| Number of pages | 19 |
| Journal | IEEE Internet of Things Journal |
| Volume | 12 |
| Issue number | 24 |
| DOIs | |
| State | Published - Dec 2025 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Signal Processing
- Information Systems
- Hardware and Architecture
- Computer Science Applications
- Computer Networks and Communications
Keywords
- Federated learning (FL)
- hybrid language models (HLMs)
- large language models (LLMs)
- mobile edge computing
- small language models (SLMs)
Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission