👋 Hi! I am Siteng Huang (黄思腾 in Chinese). I work at DAMO Academy, Alibaba Group, as an Algorithm Expert through the AliStar program. I received my Ph.D. degree from Zhejiang University in June 2024 through a joint program with Westlake University, where I was affiliated with the Machine Intelligence Laboratory (MiLAB) and advised by Prof. Donglin Wang. During my Ph.D. studies, I also spent a wonderful internship at TongYi Lab, Alibaba Group. Before that, I received my B.Eng. degree from the School of Computer Science, Wuhan University, in June 2019.

🔬 My research has centered on the perception, understanding, reasoning, and generation of multimodal data (including images, videos, language, dynamics, etc.) from both the internet and the physical world. I also focus on efficient AI (in terms of data, time, parameters, memory, etc.) for multimodal applications. I have published 30+ papers on the above topics at top-tier international AI conferences and journals. Recently, I have devoted myself to the development of multimodal generative, embodied, and unified foundation models.

📢 News

2026/02/10 [RynnBrain] We presented RynnBrain, an embodied foundation model grounded in physical reality, including dense (2B, 8B) and MoE (30B) variants, alongside three specialized models: RynnBrain‑Plan (manipulation planning), RynnBrain‑Nav (navigation), and RynnBrain‑CoP (spatial reasoning). See GitHub and the Chinese report from 机器之心.
2026/01/31 [ICRA’26] RynnVLA-001, the VLA foundation model, got accepted for ICRA 2026!
2026/01/22 [Talk] I gave a talk titled "Physical AI Ecosystem: Tackling the Key Barriers to Embodied Intelligence" at the AAAI-26 Interactive Industry Sessions.
2025/12/11 [Preprint] We released HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a “think-while-acting” paradigm for long-horizon manipulation! Project page and Code are available!
2025/11/24 [Preprint] We released RynnVLA-002, an upgraded version of WorldVLA and a more powerful unified VLA and world model! Get videos and code on GitHub!
2025/11/08 [AAAI’26] 4 papers got accepted for AAAI 2026! They include the training-free MLLM inference acceleration methods FiCoCo and GlobalCom², the dexterous grasping policy AffordDex, and the tiny-scale VLA VLA-Adapter.
2025/10/14 [Preprint] We released RoboSimGS, a novel Real2Sim2Real framework that converts multi-view real-world images into scalable, high-fidelity, and physically interactive simulation environments for robotic manipulation! An overview video can be found on the Project page!
2025/09/19 [NeurIPS’25] SSR got accepted for NeurIPS 2025! The work transforms raw depth data into structured, interpretable textual CoT, enhancing spatial reasoning capabilities of MLLMs. See Project page and Github!
2025/09/12 [Preprint] We released VLA-Adapter, which reduces reliance on large-scale VLMs and extensive pre-training by using a lightweight Policy module with Bridge Attention, achieving SOTA performance and fast inference speed with minimal computational resources! The checkpoint is available! See Project page for more details. It was ranked #1 Paper of the Day on Hugging Face Papers! 2025/11/08 VLA-Adapter got accepted for AAAI 2026 as an Oral!
2025/08/13 [Preprint] We released AffordDex, a universal grasping policy for dexterous hands with an inherent understanding of both motion priors and object affordances! Grasping videos can be found on the Project page! 2025/11/08 AffordDex got accepted for AAAI 2026!
2025/08/08 [DAMO RynnBot] We open-sourced RynnEC, a video MLLM for embodied cognition tasks; RynnVLA-001, a VLA model based on a pretrained video generation model; and RynnRCP, a complete set of robot service protocols and frameworks! 2025/08/11 We released the technical blog for RynnVLA-001! 2025/09/19 We released the technical report for RynnVLA-001!
2025/08/02 [CoRL’25] Long-VLA, a novel framework designed to enhance VLA models for challenging long-horizon robotic manipulation tasks, got accepted for CoRL 2025!
2025/07/24 [DAMO RynnBot] We released RynnBot PlayGround Beta, a platform that provides data management, SOTA VLA models, model training and validation, cloud-edge collaborative deployment, and more! Stay tuned for our continued progress!
2025/06/27 [Preprint] We released WorldVLA, an autoregressive action world model that unifies action and image understanding and generation! Code is available!
2025/06/26 [ICCV’25] CARP, Coarse-to-fine AutoRegressive Prediction for visuomotor policy learning, got accepted for ICCV 2025! The approach produces highly accurate and smooth robot actions, achieving up to a 10% improvement in success rate and 10x faster inference compared to state-of-the-art policies. Paper, code, and cool videos can be found on the Project page!
2025/05/22 [Preprint] We released VARD, a novel RL fine-tuning method on diffusion-based generative models for both protein structure and text-to-image synthesis, enhancing sample quality with improved efficiency, effective mitigation of reward hacking, and broad applicability.
2025/05/07 [Preprint] We released OpenHelix, a low-cost open-source dual-system VLA with systematic empirical evaluations on the core design elements. Code and List of papers are available!
2025/03/31 [Preprint] We released Unicorn to explore the question: can high-quality multimodal training data be synthesized purely from text?
2025/03/28 [Survey Preprint] We released Exploring the Evolution of Physics Cognition in Video Generation: A Survey, which dives deep into the development of physics cognition in video generation, from basic perception to active cognition! List of papers is available!
2025/03/11 [TCSVT’25] M2IST, a novel Multi-Modal Interactive Side-Tuning method that effectively addresses the challenges of insufficient multi-modal interaction and high GPU memory consumption, got accepted for IEEE Transactions on Circuits and Systems for Video Technology! Code is available!
2025/02/24 [Preprint] We released Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control!
2025/01/28 [ICRA’25] QUART-Online, a novel latency-free quadruped MLLM that achieves real-time inference while boosting the success rate across various tasks by 65%, got accepted for ICRA 2025! See Project page.
2025/01/23 [ICLR’25] ToCa, a token-wise feature caching method that achieves a 2x acceleration for PixArt-α, OpenSora, and DiT while maintaining nearly lossless generation quality, got accepted for ICLR 2025! Code is available!
2025/01/10 [Preprint] We released GlobalCom², a “global-to-local” approach for training-free acceleration of high-resolution MLLMs with the AnyRes strategy. Code is available! 2025/11/08 GlobalCom² got accepted for AAAI 2026!

📝 Publications

†: Equal contribution ✉: Corresponding author

Peer-reviewed Conference

Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li, "RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation". In Proceedings of the 2026 IEEE International Conference on Robotics and Automation. [arXiv] [huggingface paper] [technical blog] [github]

(Oral) Yihao Wang†, Pengxiang Ding†, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang, "VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model". In Proceedings of the 40th AAAI Conference on Artificial Intelligence. [arXiv] [huggingface paper](#1 Paper of the day) [project page] [huggingface] [Twitter@AK]

Haoyu Zhao†, Linghao Zhuang†, Xingyue Zhao†, Cheng Zeng, Haoran Xu, Yuming Jiang, Jun Cen, Kexiang Wang, Jiayan Guo, Siteng Huang✉, Xin Li, Deli Zhao, Hua Zou✉, "Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors". In Proceedings of the 40th AAAI Conference on Artificial Intelligence. [arXiv] [huggingface paper] [project page]

Yuhang Han†, Xuyang Liu†, Zihan Zhang, Pengxiang Ding, Junjie Chen, Honggang Chen, Donglin Wang, Qingsen Yan, Siteng Huang✉, "Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration". In Proceedings of the 40th AAAI Conference on Artificial Intelligence. [arXiv] [project page] [huggingface paper] [github]

Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Siteng Huang, Honggang Chen, "Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models". In Proceedings of the 40th AAAI Conference on Artificial Intelligence. [arXiv] [github]

Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang, "SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning". In Proceedings of the 39th Annual Conference on Neural Information Processing Systems. [pdf] [project page] [github] [huggingface paper]

Yiguo Fan, Shuanghao Bai, Xinyang Tong, Pengxiang Ding, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, Zhaoxin Fan, Badong Chen, Donglin Wang, "Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation". In Proceedings of the 9th Annual Conference on Robot Learning. [arXiv] [project page]

Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang, "CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction". In Proceedings of the International Conference on Computer Vision 2025. [arXiv] [project page] [huggingface paper]

Xinyang Tong, Pengxiang Ding, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Yiguo Fan, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu, "QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning". In Proceedings of the 2025 IEEE International Conference on Robotics and Automation. [arXiv] [project page]

Chang Zou†, Xuyang Liu†, Ting Liu, Siteng Huang, Linfeng Zhang, "Accelerating Diffusion Transformers with Token-wise Feature Caching". In Proceedings of the 13th International Conference on Learning Representations. [arXiv] [github]

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang, "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference". In Proceedings of the 39th AAAI Conference on Artificial Intelligence. [arXiv] [pdf] [project page] [Chinese intro (Zhihu)] [github] [demo] [video (Youtube)] [机器之心] [Twitter@AK]

Can Cui†, Siteng Huang†, Wenxuan Song, Pengxiang Ding, Min Zhang, Donglin Wang, "ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification". In Proceedings of the 32nd ACM International Conference on Multimedia. [arXiv] [github] [OpenReview]

(Oral) Yang Liu†, Pengxiang Ding†, Siteng Huang, Min Zhang, Han Zhao, Donglin Wang, "PiTe: Pixel-Temporal Alignment for Large Video-Language Model". In Proceedings of the European Conference on Computer Vision 2024. [arXiv] [github] [dataset]

Pengxiang Ding, Han Zhao, Wenxuan Song, Wenjie Zhang, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang, "QUAR-VLA: Vision-Language-Action Model for Quadruped Robots". In Proceedings of the European Conference on Computer Vision 2024. [arXiv]

(Oral) Ting Liu†, Xuyang Liu†, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, Yue Hu, "DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding". In Proceedings of the IEEE International Conference on Multimedia and Expo 2024. [arXiv] [github]

Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang, "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [dataset] [project page] [poster (CVPR 2024)]

Biao Gong†, Siteng Huang†, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu, "Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [poster (CVPR 2024)]

Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang, "Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [github] [poster (CVPR 2024)] [poster (VALSE 2024)]

Xuyang Liu†, Siteng Huang†, Yachen Kang, Honggang Chen, Donglin Wang, "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders". In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv] [code] [poster]

Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen, "Prompt-based Distribution Alignment for Unsupervised Domain Adaptation". In Proceedings of the 38th AAAI Conference on Artificial Intelligence. [arXiv]

Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang, "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023. [project page] [arXiv] [open access] [video (Youtube)] [github] [ModelScope] [poster] [slide]

Siteng Huang, Qiyao Wei, Donglin Wang, "Reference-Limited Compositional Zero-Shot Learning". In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. [project page] [arXiv] [video (Google Drive)] [github] [slide]

Min Zhang, Siteng Huang, Wenbin Li, Donglin Wang, "Tree Structure-Aware Few-Shot Image Classification via Hierarchical Aggregation". In Proceedings of the European Conference on Computer Vision 2022. [arXiv] [Chinese intro] [github]

Min Zhang, Siteng Huang, Donglin Wang, "Domain Generalized Few-shot Image Classification via Meta Regularization Network". In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. [pdf] [github]

Zifeng Zhuang, Xintao Xiang, Siteng Huang, Donglin Wang, "HINFShot: A Challenge Dataset for Few-Shot Node Classification in Heterogeneous Information Network". In Proceedings of the 2021 ACM International Conference on Multimedia Retrieval. [pdf]

Zhengyu Chen, Jixie Ge, Heshen Zhan, Siteng Huang, Donglin Wang, "Pareto Self-Supervised Training for Few-Shot Learning". In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [arXiv] [open access]

Siteng Huang, Min Zhang, Yachen Kang, Donglin Wang, "Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition". In Proceedings of the 35th AAAI Conference on Artificial Intelligence. [project page] [arXiv] [code] [poster] [slide]

Siteng Huang, Donglin Wang, Xuehan Wu, Ao Tang, "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting". In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. [project page] [pdf] [code] [poster] [slide]

Peer-reviewed Journal

Xuyang Liu†, Ting Liu†, Siteng Huang✉, Yi Xin, Yue Hu, Long Qin, Donglin Wang, Honggang Chen✉, "M2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension". IEEE Transactions on Circuits and Systems for Video Technology, 2025. [arXiv] [github]

Preprints & Under Submission

Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang✉, Donglin Wang✉, "HiF-VLA: Hindsight, Insight and Foresight for Vision-Language-Action models". arXiv preprint arXiv:2512.09928. [pdf] [huggingface paper] [github] [project page]

Jun Cen†, Siteng Huang†, Yuqian Yuan†, Kehan Li†, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Hao Luo, Fan Wang, Xin Li, Deli Zhao, Hao Chen, "RynnVLA-002: A Unified Vision-Language-Action and World Model". arXiv preprint arXiv:2511.17502. [pdf] [huggingface paper] [github] [Twitter@AK]

Haoyu Zhao†, Cheng Zeng†, Linghao Zhuang†, Yaxi Zhao, Shengke Xue, Hao Wang, Xingyue Zhao, Zhongyu Li, Kehan Li, Siteng Huang✉, Mingxiu Chen, Xin Li, Deli Zhao, Hua Zou✉, "High-Fidelity Simulated Data Generation for Real-World Zero-Shot Robotic Manipulation Learning with Gaussian Splatting". arXiv preprint arXiv:2510.10637. [pdf] [huggingface paper] [project page]

Yichen Han, Yuhang Han, Siteng Huang, Guanyu Liu, Zhengpeng Zhou, Bojun Liu, Yujia Zhang, Isaac N Shi, Lewei He, Tianyu Shi, "MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization". arXiv preprint arXiv:2509.11361. [pdf] [github]

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen, "Variation-aware Vision Token Dropping for Faster Large Vision-Language Models". arXiv preprint arXiv:2509.01552. [pdf] [github]

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, Hao Chen, "WorldVLA: Towards Autoregressive Action World Model". arXiv preprint arXiv:2506.21539. [pdf] [huggingface paper] [github]

Fengyuan Dai†, Zifeng Zhuang†, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, Fajie Yuan, "VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL". arXiv preprint arXiv:2505.15791. [pdf] [huggingface paper]

Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, Han Zhao, Siteng Huang, Donglin Wang, "OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation". arXiv preprint arXiv:2505.03912. [pdf] [project page] [github] [Awesome List] [huggingface paper]

Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang, "Unicorn: Text-Only Data Synthesis for Vision Language Model Training". arXiv preprint arXiv:2503.22655. [pdf] [github]

Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang✉, Donglin Wang✉, "Exploring the Evolution of Physics Cognition in Video Generation: A Survey". arXiv preprint arXiv:2503.21765. [pdf] [github] [huggingface paper]

Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang, "Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration". arXiv preprint arXiv:2502.14795. [pdf]

Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang, "Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation". arXiv preprint arXiv:2412.09265. [pdf] [project page]

Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang, "Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning". arXiv preprint arXiv:2408.17083. [pdf]

Ting Liu†, Xuyang Liu†, Siteng Huang, Liangtao Shi, Zunnan Xu, Yi Xin, Quanjun Yin, Xiaohong Liu, "Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference". arXiv preprint arXiv:2405.14700. [pdf] [github]

Thesis

Siteng Huang, "Model Transfer for Multimodal Understanding and Generation". Zhejiang University, 2024.

💻 Internship Experience

Research Intern - DAMO Academy & TongYi Lab, Alibaba Group (阿里巴巴达摩院 & 通义实验室)
- Time: March 2022 - July 2024.
- Fundamental Visual Intelligence Team for Tongyi Wanxiang (WanX).

💼 Services

Conference Reviewer

Annual Conference on Neural Information Processing Systems (NeurIPS)
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
IEEE/CVF International Conference on Computer Vision (ICCV)
European Conference on Computer Vision (ECCV)
British Machine Vision Conference (BMVC)
Annual Meeting of the Association for Computational Linguistics (ACL)
AAAI Conference on Artificial Intelligence (AAAI)
International Joint Conference on Artificial Intelligence (IJCAI)
ACM International Conference on Multimedia (ACMMM)
Conference on Robot Learning (CoRL)
IEEE International Conference on Robotics and Automation (ICRA)
IEEE International Conference on Multimedia and Expo (ICME)
ACM International Conference on Multimedia Retrieval (ICMR)
Asian Conference on Computer Vision (ACCV)
International Conference on Pattern Recognition (ICPR)

Journal Reviewer

IEEE Robotics and Automation Letters (RA-L)
IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
IEEE Transactions on Multimedia (TMM)
ACM Transactions on Intelligent Systems and Technology (ACM TIST)
ACM Transactions on Information Systems (ACM TOIS)
Journal of Visual Communication and Image Representation (JVCI)
Concurrency and Computation: Practice and Experience (CPE)

Program Committee for Conferences and Workshops

Session Chair, AAAI 2026
Session Chair, The First Westlake Robot Learning Symposium

😉 Misc

Feel free to follow me on XiaoHongShu and Zhihu.
