Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification (2405.17790v1)
Abstract: Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits real-world applications. This paper strives to resolve this problem by proposing a novel instruct-ReID task that requires the model to retrieve images according to given image or language instructions. Instruct-ReID is the first exploration of a general ReID setting, in which six existing ReID tasks can be viewed as special cases by assigning different instructions. To facilitate research on this new instruct-ReID task, we propose a large-scale OmniReID++ benchmark equipped with diverse data and comprehensive evaluation methods, e.g., task-specific and task-free evaluation settings. In the task-specific evaluation setting, gallery sets are categorized according to specific ReID tasks. We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework. For the task-free evaluation setting, where target person images are retrieved from task-agnostic gallery sets, we further propose a new method called IRM++ with novel memory bank-assisted learning. Extensive evaluations of IRM and IRM++ on the OmniReID++ benchmark demonstrate the superiority of our proposed methods, achieving state-of-the-art performance on 10 test sets. The datasets, the model, and the code will be available at https://github.com/hwz-zju/Instruct-ReID