📝 Publications

†: Equal contribution ✉: Corresponding author

Peer-reviewed Conference

Xinyang Tong, Pengxiang Ding, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Yiguo Fan, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu, "QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning". In Proceedings of the 2025 IEEE International Conference on Robotics and Automation. [arXiv] [project page]

Chang Zou†, Xuyang Liu†, Ting Liu, Siteng Huang, Linfeng Zhang, "Accelerating Diffusion Transformers with Token-wise Feature Caching". In Proceedings of the 13th International Conference on Learning Representations. [arXiv] [github]

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang, "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference". In Proceedings of the 39th AAAI Conference on Artificial Intelligence. [arXiv] [pdf] [project page] [Chinese intro (Zhihu)] [github] [demo] [video (YouTube)] [机器之心] [Twitter@AK]

Can Cui†, Siteng Huang†, Wenxuan Song, Pengxiang Ding, Min Zhang, Donglin Wang, "ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification". In Proceedings of the 32nd ACM International Conference on Multimedia. [arXiv] [github] [OpenReview]

(Oral) Yang Liu†, Pengxiang Ding†, Siteng Huang, Min Zhang, Han Zhao, Donglin Wang, "PiTe: Pixel-Temporal Alignment for Large Video-Language Model". In Proceedings of the European Conference on Computer Vision 2024. [arXiv] [github] [dataset]

Pengxiang Ding, Han Zhao, Wenxuan Song, Wenjie Zhang, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang, "QUAR-VLA: Vision-Language-Action Model for Quadruped Robots". In Proceedings of the European Conference on Computer Vision 2024. [arXiv]

(Oral) Ting Liu†, Xuyang Liu†, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, Yue Hu, "DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding". In Proceedings of the IEEE International Conference on Multimedia and Expo 2024. [arXiv] [github]

Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang, "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [dataset] [project page] [poster (CVPR 2024)]

Biao Gong†, Siteng Huang†, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu, "Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [poster (CVPR 2024)]

Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang, "Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [github] [poster (CVPR 2024)] [poster (VALSE 2024)]

Xuyang Liu†, Siteng Huang†, Yachen Kang, Honggang Chen, Donglin Wang, "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders". In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv] [code] [poster]

Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen, "Prompt-based Distribution Alignment for Unsupervised Domain Adaptation". In Proceedings of the 38th AAAI Conference on Artificial Intelligence. [arXiv]

Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang, "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023. [project page] [arXiv] [open access] [video (YouTube)] [github] [ModelScope] [poster] [slide]

Siteng Huang, Qiyao Wei, Donglin Wang, "Reference-Limited Compositional Zero-Shot Learning". In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. [project page] [arXiv] [video (Google Drive)] [github] [slide]

Min Zhang, Siteng Huang, Wenbin Li, Donglin Wang, "Tree Structure-Aware Few-Shot Image Classification via Hierarchical Aggregation". In Proceedings of the European Conference on Computer Vision 2022. [arXiv] [Chinese intro] [github]

Min Zhang, Siteng Huang, Donglin Wang, "Domain Generalized Few-shot Image Classification via Meta Regularization Network". In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. [pdf] [github]

Zifeng Zhuang, Xintao Xiang, Siteng Huang, Donglin Wang, "HINFShot: A Challenge Dataset for Few-Shot Node Classification in Heterogeneous Information Network". In Proceedings of the 2021 ACM International Conference on Multimedia Retrieval. [pdf]

Zhengyu Chen, Jixie Ge, Heshen Zhan, Siteng Huang, Donglin Wang, "Pareto Self-Supervised Training for Few-Shot Learning". In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [arXiv] [open access]

Siteng Huang, Min Zhang, Yachen Kang, Donglin Wang, "Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition". In Proceedings of the 35th AAAI Conference on Artificial Intelligence. [project page] [arXiv] [code] [poster] [slide]

Siteng Huang, Donglin Wang, Xuehan Wu, Ao Tang, "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting". In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. [project page] [pdf] [code] [poster] [slide]

Peer-reviewed Journal

Xuyang Liu†, Ting Liu†, Siteng Huang✉, Yi Xin, Yue Hu, Long Qin, Donglin Wang, Honggang Chen✉, "M2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension". IEEE Transactions on Circuits and Systems for Video Technology, 2025. [arXiv] [github]

Preprints & Under Submission

Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang, "Unicorn: Text-Only Data Synthesis for Vision Language Model Training". arXiv preprint arXiv:2503.22655. [pdf] [github]

Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang✉, Donglin Wang✉, "Exploring the Evolution of Physics Cognition in Video Generation: A Survey". arXiv preprint arXiv:2503.21765. [pdf] [github] [huggingface paper]

Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang, "Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration". arXiv preprint arXiv:2502.14795. [pdf]

Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen, "Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models". arXiv preprint arXiv:2501.05179. [pdf] [github]

Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang, "Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation". arXiv preprint arXiv:2412.09265. [pdf] [project page]

Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang, "CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction". arXiv preprint arXiv:2412.06782. [pdf] [project page] [huggingface paper]

Yuhang Han†, Xuyang Liu†, Zihan Zhang, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang✉, "Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration". arXiv preprint arXiv:2411.17686. [pdf] [project page] [huggingface paper] [github]

Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang, "Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning". arXiv preprint arXiv:2408.17083. [pdf]

Ting Liu†, Xuyang Liu†, Siteng Huang, Liangtao Shi, Zunnan Xu, Yi Xin, Quanjun Yin, Xiaohong Liu, "Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference". arXiv preprint arXiv:2405.14700. [pdf] [github]

Thesis

Siteng Huang, "Model Transfer for Multimodal Understanding and Generation". Zhejiang University, 2024.