📝 Publications

†: Equal contribution ✉: Corresponding author

Peer-reviewed Conference

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang, "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference". In Proceedings of the 39th AAAI Conference on Artificial Intelligence. [arXiv] [pdf] [project page] [Chinese intro (Zhihu)] [github] [demo] [video (YouTube)] [机器之心] [Twitter@AK]

Can Cui†, Siteng Huang†, Wenxuan Song, Pengxiang Ding, Min Zhang, Donglin Wang, "ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification". In Proceedings of the 32nd ACM International Conference on Multimedia. [arXiv] [github] [OpenReview]

(Oral) Yang Liu†, Pengxiang Ding†, Siteng Huang, Min Zhang, Han Zhao, Donglin Wang, "PiTe: Pixel-Temporal Alignment for Large Video-Language Model". In Proceedings of the European Conference on Computer Vision 2024. [arXiv] [github] [dataset]

Pengxiang Ding, Han Zhao, Wenxuan Song, Wenjie Zhang, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang, "QUAR-VLA: Vision-Language-Action Model for Quadruped Robots". In Proceedings of the European Conference on Computer Vision 2024. [arXiv]

(Oral) Ting Liu†, Xuyang Liu†, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, Yue Hu, "DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding". In Proceedings of the IEEE International Conference on Multimedia and Expo 2024. [arXiv] [github]

Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang, "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [dataset] [project page] [poster (CVPR 2024)]

Biao Gong†, Siteng Huang†, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu, "Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [poster (CVPR 2024)]

Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang, "Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. [arXiv] [project page] [github] [poster (CVPR 2024)] [poster (VALSE 2024)]

Xuyang Liu†, Siteng Huang†, Yachen Kang, Honggang Chen, Donglin Wang, "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders". In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv] [code] [poster]

Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen, "Prompt-based Distribution Alignment for Unsupervised Domain Adaptation". In Proceedings of the 38th AAAI Conference on Artificial Intelligence. [arXiv]

Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang, "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023. [project page] [arXiv] [open access] [video (YouTube)] [github] [ModelScope] [poster] [slide]

Siteng Huang, Qiyao Wei, Donglin Wang, "Reference-Limited Compositional Zero-Shot Learning". In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. [project page] [arXiv] [video (Google Drive)] [github] [slide]

Min Zhang, Siteng Huang, Wenbin Li, Donglin Wang, "Tree Structure-Aware Few-Shot Image Classification via Hierarchical Aggregation". In Proceedings of the European Conference on Computer Vision 2022. [arXiv] [Chinese intro] [github]

Min Zhang, Siteng Huang, Donglin Wang, "Domain Generalized Few-shot Image Classification via Meta Regularization Network". In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. [pdf] [github]

Zifeng Zhuang, Xintao Xiang, Siteng Huang, Donglin Wang, "HINFShot: A Challenge Dataset for Few-Shot Node Classification in Heterogeneous Information Network". In Proceedings of the 2021 ACM International Conference on Multimedia Retrieval. [pdf]

Zhengyu Chen, Jixie Ge, Heshen Zhan, Siteng Huang, Donglin Wang, "Pareto Self-Supervised Training for Few-Shot Learning". In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [arXiv] [open access]

Siteng Huang, Min Zhang, Yachen Kang, Donglin Wang, "Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition". In Proceedings of the 35th AAAI Conference on Artificial Intelligence. [project page] [arXiv] [code] [poster] [slide]

Siteng Huang, Donglin Wang, Xuehan Wu, Ao Tang, "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting". In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. [project page] [pdf] [code] [poster] [slide]

Preprints & Under Submission

Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen, "Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration". arXiv preprint arXiv:2501.05179. [pdf] [github]

Xinyang Tong, Pengxiang Ding, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Yiguo Fan, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu, "QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning". arXiv preprint arXiv:2412.15576. [pdf] [project page]

Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang, "Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation". arXiv preprint arXiv:2412.09265. [pdf] [project page]

Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang, "CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction". arXiv preprint arXiv:2412.06782. [pdf] [project page] [huggingface paper]

Yuhang Han†, Xuyang Liu†, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang✉, "Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration". arXiv preprint arXiv:2411.17686. [pdf] [project page] [huggingface paper] [github]

Chang Zou†, Xuyang Liu†, Ting Liu, Siteng Huang, Linfeng Zhang, "Accelerating Diffusion Transformers with Token-wise Feature Caching". arXiv preprint arXiv:2410.05317. [pdf] [github]

Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang, "Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning". arXiv preprint arXiv:2408.17083. [pdf]

Xuyang Liu†, Ting Liu†, Siteng Huang, Yue Hu, Quanjun Yin, Donglin Wang, Honggang Chen, "M2IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension". arXiv preprint arXiv:2407.01131. [pdf]

Ting Liu†, Xuyang Liu†, Siteng Huang, Liangtao Shi, Zunnan Xu, Yi Xin, Quanjun Yin, Xiaohong Liu, "Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference". arXiv preprint arXiv:2405.14700. [pdf] [github]

Thesis

Siteng Huang, "Model Transfer for Multimodal Understanding and Generation". Zhejiang University, 2024.