Learning to Manipulate by Watching Humans: A Decoupled Vision-Language-Driven Imitation Framework


Thanh Nguyen Canh
Thanh Tuan Tran
Haolan Zhang
Ziyan Gao
Xiem HoangVan
Nak Young Chong
School of Information Science, JAIST, Japan
VNU University of Engineering and Technology, Vietnam
Department of Robotics, Hanyang University, Korea
2026.

[Paper]
[Code]


Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures that couple visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel "Human-to-Robot" imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by "watching" and "imitating". Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to classify actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments using a UR5e robotic arm across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and attains BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. These results demonstrate strong generalization, particularly to previously unseen object categories.
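The decoupling described above can be sketched as two independent stages connected only by a symbolic command. The sketch below is illustrative, not our implementation: the callable names (`classify_action`, `identify_object`) and the per-action policy table are hypothetical interfaces standing in for the TSM classifier, the VLM-based object module, and the trained TD3 policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Command:
    """Symbolic output of Stage 1, the only interface between the stages."""
    action: str  # one of: reach, pick, move, put
    obj: str     # interacted object named by the VLM


def understand(frames: List,
               classify_action: Callable,
               identify_object: Callable) -> Command:
    """Stage 1 (Video Understanding). Because the stages are decoupled,
    any action classifier / object identifier pair can be swapped in."""
    return Command(action=classify_action(frames), obj=identify_object(frames))


def imitate(cmd: Command, policies: Dict[str, Callable]) -> bool:
    """Stage 2 (Robot Imitation): dispatch the command to a per-action
    policy (e.g., one trained with TD3) and report success."""
    return policies[cmd.action](cmd.obj)
```

Since Stage 2 consumes only the `Command`, the manipulation policies never see raw pixels, which is what lets the pipeline generalize to novel objects without retraining the controller.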


Paper

Thanh Nguyen Canh, Thanh Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

Learning to Manipulate by Watching Humans: A Decoupled Vision-Language-Driven Imitation Framework

2026.

[pdf]    

Overview



Settings


The proposed Video Understanding architecture consists of two parallel branches: (a) an Interacted Object Understanding Module and (b) an Action Understanding Module. Initially, the raw input frames F = {f1, f2, ..., fn} are downsampled to F̃ = {f̃1, f̃2, ..., f̃m} (where n > m) to reduce runtime and match the frame rate of the training data used by the Action Understanding Module. (a) The Interacted Object Understanding Module processes F̃ to extract a subset of keyframes F̂ = {f̂1, f̂2, ..., f̂k} (where m > k). These keyframes are analyzed by our Object Selection algorithm and Vision-Language Models (VLMs) to accurately identify the specific objects involved in the interaction. (b) The Action Understanding Module is built on a CNN with a ResNet-50 [47] backbone augmented with Temporal Shift Modules (TSM), which shift feature channels along the temporal dimension to capture fine-grained motion dynamics for action classification.
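The channel-shift operation at the heart of TSM can be illustrated in a few lines. This is a minimal NumPy sketch of the standard (offline) TSM shift, not our training code; the fraction of shifted channels (`shift_div`) follows the common default of 1/8 forward and 1/8 backward.

```python
import numpy as np


def temporal_shift(x: np.ndarray, shift_div: int = 8) -> np.ndarray:
    """Shift a fraction of feature channels along the time axis so a 2D CNN
    can mix information across neighboring frames at zero extra FLOPs.

    x: feature tensor of shape (T, C, H, W) -- T frames, C channels.
    1/shift_div of the channels take features from the next frame,
    another 1/shift_div from the previous frame; the rest are unchanged.
    Boundary frames are zero-padded.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]               # shift from future frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift from past frame
    out[:, 2 * fold:] = x[:, 2 * fold:]          # majority of channels: no shift
    return out
```

Inserting this shift before the convolutions in each ResNet block is what gives a per-frame 2D backbone its temporal receptive field.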


Experimental Results





Video-to-command generation performance (BLEU scores) on standard object sets (bold and underline are best and second best, respectively).


Comparison of DRL algorithm performance across four manipulation actions.



Box plots comparing average rewards for four manipulation actions.







Reach action with UR5 robot (simulation).







Pick action with UR5 robot (simulation).







Move action with UR5 robot (simulation).







Put action with UR5 robot (simulation).







Reach action with UF850 robot (simulation).







Pick action with UF850 robot (simulation).







Move action with UF850 robot (simulation).







Put action with UF850 robot (simulation).




Real-world experiment with the Reach and Pick actions using the UF850 robot.

Code


 [github]


Citation


1. Canh T. N., Tran T. T., Zhang H., Gao Z., HoangVan X., Chong N. Y. Learning to Manipulate by Watching Humans: A Decoupled Vision-Language-Driven Imitation Framework. 2026.

@article{canh2026human,
author = {Canh, Thanh Nguyen and Tran, Thanh Tuan and Zhang, Haolan and Gao, Ziyan and HoangVan, Xiem and Chong, Nak Young},
title = {Learning to Manipulate by Watching Humans: A Decoupled Vision-Language-Driven Imitation Framework},
journal = {},
year = {2026},
address = {},
month = {},
doi = {}
}




Acknowledgements

This work was supported by JST SPRING, Japan Grant Number JPMJSP2102.
This webpage template was borrowed from https://akanazawa.github.io/cmr/.