Speech & Audio AI

VoxServe: Streaming-Centric Serving System for Speech Language Models

Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
Published: January 30, 2026
Authors: 7
Word Count: 6,713

VoxServe is a unified serving system that improves the latency, throughput, and streamability of real-time speech model deployment.

Abstract

Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.
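The "streaming-aware scheduling" the abstract mentions can be illustrated with a toy sketch. The idea, stated generally, is that a streaming speech request only needs to generate audio faster than it is played back, so the scheduler can prioritize the requests closest to stalling. The sketch below (all names and the earliest-deadline-first policy are illustrative assumptions, not VoxServe's actual implementation) ranks requests by their playback slack:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    audio_generated_s: float  # seconds of audio produced so far
    elapsed_s: float          # wall-clock seconds since streaming began

def playback_slack(req: Request) -> float:
    # Seconds of audio buffered ahead of real-time playback;
    # if this reaches zero, the listener hears a stall.
    return req.audio_generated_s - req.elapsed_s

def pick_batch(requests: list[Request], batch_size: int) -> list[Request]:
    # Serve the requests closest to stalling first: an
    # earliest-deadline-first policy on playback slack.
    return sorted(requests, key=playback_slack)[:batch_size]

reqs = [
    Request("a", audio_generated_s=3.0, elapsed_s=1.0),  # 2.0 s of slack
    Request("b", audio_generated_s=1.5, elapsed_s=1.2),  # 0.3 s of slack
    Request("c", audio_generated_s=2.0, elapsed_s=1.9),  # 0.1 s of slack
]
print([r.rid for r in pick_batch(reqs, batch_size=2)])  # → ['c', 'b']
```

A throughput-only scheduler would batch as many requests as possible regardless of deadlines; a streaming-aware one like this trades some batching efficiency for the guarantee that no active stream underruns.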

Key Takeaways

  • 1. VoxServe enhances performance and simplifies speech model deployment through a single unified serving framework.

  • 2. Achieves 10-20x higher throughput than existing implementations at comparable latency.

  • 3. A model-execution abstraction decouples model architecture from system-level optimizations, enabling holistic optimizations across diverse SpeechLMs.

Limitations

  • Requires powerful GPUs for optimal performance.

  • Performance gains vary depending on the model architecture.

Keywords

Speech Language Models, streaming settings, low latency, high throughput, streamability, model-execution abstraction, streaming-aware scheduling, asynchronous inference pipeline
