
A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving

Jialin Li, Shuqi Wu, Ning Wang

Abstract


Re-Identification (ReID) is a critical technology in intelligent perception systems, especially within autonomous driving, where onboard cameras must identify pedestrians across views and time in real time to support safe navigation and trajectory prediction. However, the presence of uncertain or missing input modalities, such as RGB, infrared, sketches, or textual descriptions, poses significant challenges to conventional ReID approaches. While large-scale pre-trained models offer strong multimodal semantic modeling capabilities, their computational overhead limits practical deployment in resource-constrained environments. To address these challenges, we propose a lightweight Uncertainty Modal Modeling (UMM) framework, which integrates a multimodal token mapper, a synthetic modality augmentation strategy, and a cross-modal cue interactive learner. Together, these components enable unified feature representation, mitigate the impact of missing modalities, and extract complementary information across different data types. Additionally, UMM leverages CLIP's vision-language alignment ability to fuse multimodal inputs efficiently without extensive fine-tuning. Experimental results demonstrate that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios.
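The abstract names three components but gives no implementation detail here, so the following is a minimal PyTorch sketch of how they could fit together; all shapes, layer choices, and the modality-dropout scheme are assumptions for illustration, not the authors' method. A frozen CLIP encoder would supply the per-modality feature vectors that the random tensors stand in for.

```python
# Hypothetical sketch of the UMM pipeline described in the abstract.
# Feature dimensions, layer choices, and the masking scheme are
# assumptions; in the paper's setting, a frozen CLIP encoder would
# produce the per-modality features that `feats` stands in for here.
import torch
import torch.nn as nn

class MultimodalTokenMapper(nn.Module):
    """Project each modality's feature vector into a shared token space."""
    def __init__(self, dims: dict, token_dim: int = 512):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, token_dim) for m, d in dims.items()})

    def forward(self, feats: dict) -> torch.Tensor:
        # feats: {modality: (B, dim)} -> tokens: (B, num_modalities, token_dim)
        return torch.stack([self.proj[m](x) for m, x in feats.items()], dim=1)

def synthetic_modality_dropout(tokens: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Zero out whole modality tokens at random to simulate missing inputs
    (a stand-in for the synthetic modality augmentation strategy)."""
    B, M, _ = tokens.shape
    keep = (torch.rand(B, M, 1, device=tokens.device) > p).float()
    return tokens * keep

class CrossModalCueLearner(nn.Module):
    """Let the surviving modality tokens exchange complementary cues."""
    def __init__(self, token_dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(token_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)  # residual fusion across modalities

if __name__ == "__main__":
    # Frozen-CLIP-style 512-d features for RGB, infrared, sketch, and text.
    feats = {m: torch.randn(4, 512) for m in ["rgb", "ir", "sketch", "text"]}
    mapper = MultimodalTokenMapper({m: 512 for m in feats})
    learner = CrossModalCueLearner()
    tokens = synthetic_modality_dropout(mapper(feats))
    fused = learner(tokens).mean(dim=1)  # (4, 512) identity embedding
    print(fused.shape)
```

In a real system the dropout would be applied only during training, so that the cue learner is forced to recover identity-relevant information from whichever modalities survive at inference time.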

Keywords


Person re-identification; multimodal learning; CLIP; uncertainty modal modeling; autonomous driving

DOI: http://dx.doi.org/10.70711/aitr.v2i10.7149
