👋 Hi! I am Siteng Huang (黄思腾 in Chinese). I work at DAMO Academy as an Algorithm Expert in Hangzhou. I received my Ph.D. degree from Zhejiang University in June 2024, through a joint program with Westlake University, where I was affiliated with the Machine Intelligence Laboratory (MiLAB) and advised by Prof. Donglin Wang. Before that, I received my B.Eng. degree from the School of Computer Science, Wuhan University in June 2019.

🔬 My research has centered on the perception, understanding, reasoning, and generation of multimodal data (including images, videos, language, dynamics, etc.) from both the internet and the physical world. I also focus on efficient AI (in terms of data, time, parameters, memory, etc.) when building multimodal applications. I have published 20+ papers on these topics at top-tier international AI conferences. Recently, I have devoted myself to developing multimodal generative, embodied, and unified foundation models.

🌟 I am honored to have supervised several self-motivated visiting students and research assistants in their research and publications. If you are seeking any form of academic cooperation, please feel free to email me at siteng.huang[AT]gmail.com (replace [AT] with @). Additionally, I maintain close cooperation with MiLAB at Westlake University. This top-tier robot learning lab is actively looking for visiting students and RAs (please refer to Recruitment). In particular, if you would like to work with me there, please also send me a copy when sending your CV to the lab. Visiting students working with me may do so remotely.

📢 News

  • 2024/12/13 [Preprint] We released Score and Distribution Matching Policy, which transforms diffusion-based policies into single-step generators through a two-stage optimization process: score matching ensures alignment with true action distributions, and distribution matching minimizes KL divergence for consistency. The project page is available.
  • 2024/12/10 [Preprint] We released CARP, Coarse-to-fine AutoRegressive Prediction for visuomotor policy learning. The approach produces highly accurate and smooth robot actions, achieving up to a 10% improvement in success rate and 10x faster inference compared to state-of-the-art policies. The project page with cool videos is available. Code will be available soon!
  • 2024/12/10 [AAAI’25] Cobra, the first Mamba-based MLLM for efficient inference, got accepted for AAAI 2025! See the project page.
  • 2024/11/27 [Preprint] We released a new work on token reduction for MLLM inference acceleration. It proposes a unified paradigm to demystify popular existing works and guide future designs, and further offers FiCoCo, a suite of methods grounded in the paradigm. The project page is available. Code will be available soon!
  • 2024/09/09 [New Start] Joined Alibaba DAMO Academy as an Algorithm Expert!
  • 2024/07/16 [MM’24] One paper (ProFD) got accepted for ACM MM 2024. Congratulations to all collaborators!
  • 2024/07/09 [Scholar’24] Google Scholar released the 2024 Scholar Metrics. Our paper “DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting” ranked 7th among CIKM 2019 papers by citations, and 13th within five years.
  • 2024/07/01 [ECCV’24] Two papers (PiTe and QUAR-VLA) got accepted for ECCV 2024. 2024/08/12 PiTe got accepted as an Oral paper!
  • 2024/06/04 [Graduation] I successfully defended my dissertation. So many thanks to my Ph.D. committee (Prof. Xiaogang Jin, Prof. Mai Xu, Prof. Changxin Gao, Prof. Fajie Yuan, Prof. Peidong Liu, Prof. Xiaofei Li) and my advisor!
  • 2024/03/29 [VALSE’24] Troika got accepted as a VALSE 2024 Poster! 2024/05/05 Our Cobra was selected for VALSE 2024 Annual Progress Representation. Thanks to the committee for the approval!
  • 2024/03/13 [ICME’24] One paper (DARA) about parameter-efficient tuning for visual grounding got accepted for ICME 2024 (Oral).
  • 2024/02/27 [Award] Named a Zhejiang University 2024 Outstanding Graduate!
  • 2024/02/27 [CVPR’24] Three papers (ADI, Troika, SimM) as first/co-first author got accepted for CVPR 2024. Congratulations to all collaborators!
  • 2023/12/13 [ICASSP’24] One paper (VGDiffZero) on diffusion model-based zero-shot visual grounding got accepted for ICASSP 2024. Congratulations to all collaborators!
  • 2023/12/09 [AAAI’24] One paper on VLM-based unsupervised domain adaptation got accepted for AAAI 2024.
  • 2023/04/02 [ICMR’23] One paper (RL-CZSL) about reference-limited compositional learning got accepted for ICMR 2023. Congratulations to all collaborators!
  • 2023/02/28 [CVPR’23] One paper (VoP) about parameter-efficient text-video retrieval got accepted for CVPR 2023. Congratulations to all collaborators!

📝 Publications

†: Equal contribution ✉: Corresponding author

Peer-reviewed Conference

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang, "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference". arXiv preprint arXiv:2403.14520. [arXiv] [pdf] [project page] [Chinese intro (Zhihu)] [github] [demo] [video (Youtube)] [机器之心] [Twitter@AK]

Can Cui†, Siteng Huang†, Wenxuan Song, Pengxiang Ding, Min Zhang, Donglin Wang, "ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification". In Proceedings of the 32nd ACM International Conference on Multimedia. [arXiv] [github] [OpenReview]

(Oral) Yang Liu†, Pengxiang Ding†, Siteng Huang, Min Zhang, Han Zhao, Donglin Wang, "PiTe: Pixel-Temporal Alignment for Large Video-Language Model". In Proceedings of the European Conference on Computer Vision 2024. [arXiv] [github] [dataset]

Pengxiang Ding, Han Zhao, Wenxuan Song, Wenjie Zhang, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang, "QUAR-VLA: Vision-Language-Action Model for Quadruped Robots". In Proceedings of the European Conference on Computer Vision 2024. [arXiv]

(Oral) Ting Liu†, Xuyang Liu†, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, Yue Hu, "DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding". In Proceedings of the IEEE International Conference on Multimedia and Expo 2024. [arXiv] [github]

Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang, "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [dataset] [project page] [poster (CVPR 2024)]

Biao Gong†, Siteng Huang†, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu, "Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [poster (CVPR 2024)]

Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang, "Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [github] [poster (CVPR 2024)] [poster (VALSE 2024)]

Xuyang Liu†, Siteng Huang†, Yachen Kang, Honggang Chen, Donglin Wang, "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders". In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv] [code] [poster]

Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen, "Prompt-based Distribution Alignment for Unsupervised Domain Adaptation". In Proceedings of the 38th AAAI Conference on Artificial Intelligence. [arXiv]

Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang, "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023. [project page] [arXiv] [open access] [video (Youtube)] [github] [ModelScope] [poster] [slide]

Siteng Huang, Qiyao Wei, Donglin Wang, "Reference-Limited Compositional Zero-Shot Learning". In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. [project page] [arXiv] [video (Google Drive)] [github] [slide]

Min Zhang, Siteng Huang, Wenbin Li, Donglin Wang, "Tree Structure-Aware Few-Shot Image Classification via Hierarchical Aggregation". In Proceedings of the European Conference on Computer Vision 2022. [arXiv] [Chinese intro] [github]

Min Zhang, Siteng Huang, Donglin Wang, "Domain Generalized Few-shot Image Classification via Meta Regularization Network". In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. [pdf] [github]

Zifeng Zhuang, Xintao Xiang, Siteng Huang, Donglin Wang, "HINFShot: A Challenge Dataset for Few-Shot Node Classification in Heterogeneous Information Network". In Proceedings of the 2021 ACM International Conference on Multimedia Retrieval. [pdf]

Zhengyu Chen, Jixie Ge, Heshen Zhan, Siteng Huang, Donglin Wang, "Pareto Self-Supervised Training for Few-Shot Learning". In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [arXiv] [open access]

Siteng Huang, Min Zhang, Yachen Kang, Donglin Wang, "Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition". In Proceedings of the 35th AAAI Conference on Artificial Intelligence. [project page] [arXiv] [code] [poster] [slide]

Siteng Huang, Donglin Wang, Xuehan Wu, Ao Tang, "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting". In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. [project page] [pdf] [code] [poster] [slide]

Preprints & Under Submission

Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang, "Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation". arXiv preprint arXiv:2412.09265. [pdf] [project page]

Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang, "CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction". arXiv preprint arXiv:2412.06782. [pdf] [project page] [huggingface paper]

Yuhang Han†, Xuyang Liu†, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang✉, "Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration". arXiv preprint arXiv:2411.17686. [pdf] [project page] [huggingface paper]

Chang Zou†, Xuyang Liu†, Ting Liu, Siteng Huang, Linfeng Zhang, "Accelerating Diffusion Transformers with Token-wise Feature Caching". arXiv preprint arXiv:2410.05317. [pdf] [github]

Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang, "Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning". arXiv preprint arXiv:2408.17083. [pdf]

Xuyang Liu†, Ting Liu†, Siteng Huang, Yue Hu, Quanjun Yin, Donglin Wang, Honggang Chen, "M2IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension". arXiv preprint arXiv:2407.01131. [pdf]

Ting Liu†, Xuyang Liu†, Siteng Huang, Liangtao Shi, Zunnan Xu, Yi Xin, Quanjun Yin, Xiaohong Liu, "Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference". arXiv preprint arXiv:2405.14700. [pdf] [github]

Thesis

Siteng Huang, "Model Transfer for Multimodal Understanding and Generation". Zhejiang University, 2024.

💻 Internship Experience

💼 Services

Conference Reviewer

  • IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • IEEE/CVF International Conference on Computer Vision (ICCV)
  • European Conference on Computer Vision (ECCV)
  • AAAI Conference on Artificial Intelligence (AAAI)
  • International Joint Conference on Artificial Intelligence (IJCAI)
  • IEEE International Conference on Multimedia and Expo (ICME)
  • ACM International Conference on Multimedia Retrieval (ICMR)
  • Asian Conference on Computer Vision (ACCV)
  • International Conference on Pattern Recognition (ICPR)

Journal Reviewer

  • IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
  • IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
  • ACM Transactions on Intelligent Systems and Technology (ACM TIST)
  • Journal of Visual Communication and Image Representation (JVCI)
  • Concurrency and Computation: Practice and Experience (CPE)

Program Committee for Conferences and Workshops

  • Session Chair, The First Westlake Robot Learning Symposium

😉 Misc

Feel free to follow me on Zhihu and XiaoHongShu.