VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Nanjing University, Tencent Youtu Lab, CASIA, Fourier Intelligence Inc.

VITA-E enables concurrent and nearly real-time interruptible human-robot interaction through a dual-model framework.

Abstract

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm: they cannot see, hear, speak, and act concurrently, nor handle real-time user interruptions dynamically. This hinders seamless human-robot collaboration and results in an inflexible, unresponsive user experience. To address these limitations, we introduce VITA-E, a novel human-robot interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the robot to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, in which the VLM is fine-tuned to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments on a physical humanoid robot demonstrate that VITA-E reliably handles complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving extremely high success rates on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable robotic assistants. Our homepage is https://lxysl.github.io/VITA-E/.

System Architecture

Architecture: VITA-E follows two design principles: VLM-as-controller and a dual-model core. The VLM (System-2) handles high-level understanding, while a diffusion-based action expert (System-1) handles low-level motor control. The first generated token controls the system state: [RES] for a voice-only response, [ACT] to enter action mode with the instruction after [INST] as the semantic goal, [HALT] for an immediate stop, and [END] for task completion. The Standby Model listens continuously and can preempt the Active Model, or run concurrently with it, at any time.
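
As a rough illustration of the VLM-as-controller flow, the sketch below routes a generated response based on its first control token. It is a minimal sketch, assuming a stream of decoded text pieces and a hypothetical `system` object whose `speak`, `start_action`, `stop_action`, and `finish_action` hooks stand in for the robot's speech and motion interfaces; none of these names come from the released VITA-E code.

# Minimal first-token dispatch sketch (hypothetical interfaces, not the released VITA-E API).
from typing import Iterator

def dispatch(token_stream: Iterator[str], system) -> None:
    """Route one VLM response according to its first (control) token."""
    first = next(token_stream)            # control token: [RES], [ACT], [HALT], or [END]
    rest = "".join(token_stream)          # remaining decoded text of the response

    if first == "[RES]":
        # Voice-only reply: speak the text, no motion is triggered.
        system.speak(rest)
    elif first == "[ACT]":
        # Spoken acknowledgement plus an action instruction, separated by [INST].
        speech, _, instruction = rest.partition("[INST]")
        system.speak(speech.strip())
        system.start_action(instruction.strip())   # handed to the System-1 action expert
    elif first == "[HALT]":
        # Emergency stop: cancel the current action immediately, then confirm verbally.
        system.stop_action()
        system.speak(rest)
    elif first == "[END]":
        # A multi-step action sequence has finished.
        system.finish_action()
        system.speak(rest)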

Interaction Modes

Interaction modes: speech interruption, concurrent speech+action, task switching, and emergency stop.

Speech interruption success rate: 100%
Emergency stop success rate: 100%
Task switching success rate: 93.3%
Average voice response latency: 2.26 s
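
To make the concurrency and interruption behavior behind these numbers concrete, here is a minimal, self-contained sketch of how an Active/Standby split can be wired with two threads: one loop stands in for the Active Model driving the current action, the other for the Standby Model that keeps listening and can preempt it. The toy control loop, the scripted utterances, and all names are illustrative assumptions, not the paper's implementation.

# Active/Standby concurrency sketch: the Standby loop can speak while the Active loop acts,
# and a shared event implements the unified interruption mechanism. All names are illustrative.
import threading
import time

stop_event = threading.Event()

def active_loop(instruction: str) -> None:
    """Stand-in for the Active Model: execute the instruction until done or preempted."""
    for step in range(100):                   # stand-in for the System-1 low-level control loop
        if stop_event.is_set():
            print(f"action '{instruction}' halted at step {step}")
            return
        time.sleep(0.05)                      # one simulated control step
    print(f"action '{instruction}' completed")

def standby_loop(utterances) -> None:
    """Stand-in for the Standby Model: listen continuously; reply concurrently or preempt."""
    for first, payload in utterances:         # stand-in for ASR plus the VLM's first-token output
        if first == "[RES]":
            print("speak while acting:", payload)
        elif first in ("[ACT]", "[HALT]"):
            stop_event.set()                  # unified interruption: stop the current action
            print("preempt with:", first, payload)
            return

worker = threading.Thread(target=active_loop, args=("pick up the toy",))
worker.start()
standby_loop([("[RES]", "I see an apple on the table."), ("[HALT]", "Stopping immediately.")])
worker.join()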

Key Features

Dual-model interaction core
Special control tokens
Speech-action concurrency
Unified interruption mechanism
Compatible with mainstream VLAs

Special Control Tokens

Token | Description | Example Model Output
[RES] | Signals a voice-only response; generated as the first token for conversational replies. | [RES] I see an apple on the table.
[ACT] | Signals that the response includes a physical action; generated as the first token to enter action mode. | [ACT] Okay, I will put the toy in the box. [INST] Pick up toy and place in box.
[INST] | Separates the spoken part of an action response from the internal action instruction that follows; used after [ACT]. | (see the [ACT] example above)
[HALT] | Commands an immediate stop of the current action; generated as the first token for emergency stops. | [HALT] Stopping immediately.
[END] | Signals that a multi-step action sequence has been successfully completed. | [END] The action is finished.
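
The examples in the table assume each response is decoded into a single string that starts with a control token. The small parser below splits such a string into the control token, the spoken text, and (for [ACT] responses) the action instruction after [INST]; the function name and return shape are illustrative, not part of the released code.

# Parse one decoded model output into (control_token, spoken_text, action_instruction).
CONTROL_TOKENS = ("[RES]", "[ACT]", "[HALT]", "[END]")

def parse_output(text: str) -> tuple[str, str, str | None]:
    """Split a raw model output string following the special-token format above."""
    token = next((t for t in CONTROL_TOKENS if text.startswith(t)), "[RES]")
    body = text.removeprefix(token).strip()
    if token == "[ACT]" and "[INST]" in body:
        speech, instruction = body.split("[INST]", 1)
        return token, speech.strip(), instruction.strip()
    return token, body, None

print(parse_output("[ACT] Okay, I will put the toy in the box. [INST] Pick up toy and place in box."))
# -> ('[ACT]', 'Okay, I will put the toy in the box.', 'Pick up toy and place in box.')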

Experiments

We evaluate VITA-E on the Fourier GR2 humanoid platform and on the LIBERO benchmark. It is competitive on fundamental pick-and-place tasks and reliably handles speech-action concurrency, interruption, and emergency stops in interactive scenarios.

Comparison on two manipulation tasks: success rates over 30 trials.

LIBERO benchmark: competitive performance with frozen VLM.

BibTeX

@article{liu2025vitae,
  title={VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting},
  author={Liu, Xiaoyu and Fu, Chaoyou and Yan, Chi and Wu, Chu and Gao, Haihan and Zhang, Yi-Fan and Dong, Shaoqi and Qian, Cheng and Luo, Bin and Yang, Xiuyong and Li, Guanwu and Cai, Yusheng and Shen, Yunhang and Jiang, Deqiang and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran},
  journal={arXiv preprint arXiv:2510.XXXXX},
  year={2025}
}