VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization (2404.19652v4)
Abstract: Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video text spotting method on ICDAR2015 Video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and less data. The code and datasets will be made available at https://VimTextSpotter.github.io.
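The abstract reports video-level gains on the MOTA metric. For reference, MOTA (Multiple Object Tracking Accuracy) is the standard tracking measure that aggregates false negatives, false positives, and identity switches over all frames, normalized by the total number of ground-truth objects:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_{t}\mathrm{GT}_t}
```

The Tasks-aware Adapter is described only at a high level here, so the PyTorch sketch below is a hypothetical illustration of the general bottleneck-adapter idea it builds on (Houlsby et al., cited below), not VimTS's actual module; the class name, the per-task routing, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


class TaskAwareAdapter(nn.Module):
    """Hypothetical per-task bottleneck adapter.

    Illustrative sketch only: the VimTS abstract does not specify its
    adapter architecture, so the per-task routing below is an assumption.
    """

    def __init__(self, d_model: int = 256, bottleneck: int = 64, num_tasks: int = 3):
        super().__init__()
        # One small down/up projection pair per task
        # (e.g., detection, recognition, tracking).
        self.down = nn.ModuleList([nn.Linear(d_model, bottleneck) for _ in range(num_tasks)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, d_model) for _ in range(num_tasks)])
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # The residual connection keeps the shared backbone features intact;
        # only the small adapter weights need to be trained for each task.
        return x + self.up[task_id](self.act(self.down[task_id](x)))


# Usage: adapt shared transformer features for a hypothetical tracking task (id 2).
adapter = TaskAwareAdapter()
features = torch.randn(8, 100, 256)  # (batch, queries, channels)
adapted = adapter(features, task_id=2)
```

With d_model = 256 and bottleneck = 64, each task adds roughly 33k parameters (two 256x64 projections plus biases), which is consistent with the abstract's claim of converting a single-task model into a multi-task one with minimal additional parameters.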
- X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan, “FOTS: Fast oriented text spotting with a unified network,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5676–5685, 2018.
- M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai, “Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting,” in Proc. Eur. Conf. Comp. Vis., pp. 706–722, 2020.
- Y. Liu, C. Shen, L. Jin, T. He, P. Chen, C. Liu, and H. Chen, “ABCNet v2: Adaptive Bezier-curve network for real-time end-to-end text spotting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 8048–8064, 2022.
- X. Zhang, Y. Su, S. Tripathi, and Z. Tu, “Text spotting transformers,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9519–9528, 2022.
- Y. Kittenplon, I. Lavi, S. Fogel, Y. Bar, R. Manmatha, and P. Perona, “Towards weakly-supervised text spotting using a multi-task transformer,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4604–4613, 2022.
- D. Peng, X. Wang, Y. Liu, J. Zhang, M. Huang, S. Lai, J. Li, S. Zhu, D. Lin, C. Shen, et al., “SPTS: Single-point text spotting,” in Proc. ACM Int. Conf. Multimedia, pp. 4272–4281, 2022.
- W. Yu, Y. Liu, X. Zhu, H. Cao, X. Sun, and X. Bai, “Turning a CLIP model into a scene text spotter,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–12, 2024.
- D. Karatzas, L. Gomez-Bigorda, et al., “ICDAR 2015 competition on robust reading,” in Proc. IAPR Int. Conf. Document Analysis Recog., pp. 1156–1160, 2015.
- D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras, “ICDAR 2013 robust reading competition,” in Proc. IAPR Int. Conf. Document Analysis Recog., pp. 1484–1493, 2013.
- C.-K. Ch’ng, C. S. Chan, and C.-L. Liu, “Total-Text: Toward orientation robustness in scene text detection,” Int. J. Document Analysis Recogn., pp. 1–22, 2019.
- Y. Liu, L. Jin, S. Zhang, C. Luo, and S. Zhang, “Curved scene text detection via transverse and longitudinal sequence connection,” Pattern Recogn., vol. 90, pp. 337–345, 2019.
- Y. Zhao, W. Wu, Z. Li, J. Li, and W. Wang, “FlowText: Synthesizing realistic scene text video with optical flow estimation,” in Proc. IEEE Int. Conf. Multimedia and Expo, pp. 1517–1522, 2023.
- A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2315–2324, 2016.
- M. Huang, J. Zhang, D. Peng, H. Lu, C. Huang, Y. Liu, X. Bai, and L. Jin, “ESTextSpotter: Towards better scene text spotting with explicit synergy in transformer,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 19495–19505, 2023.
- H. Ouyang, Q. Wang, Y. Xiao, Q. Bai, J. Zhang, K. Zheng, X. Zhou, Q. Chen, and Y. Shen, “CoDeF: Content deformation fields for temporally consistent video processing,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2024.
- M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” Int. J. Comput. Vision, vol. 116, no. 1, pp. 1–20, 2016.
- H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting with convolutional recurrent neural networks,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 5238–5246, 2017.
- T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun, “An end-to-end textspotter with explicit alignment and attention,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5020–5029, 2018.
- M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai, “Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 532–548, 2021.
- W. Feng, W. He, F. Yin, X.-Y. Zhang, and C.-L. Liu, “TextDragon: An end-to-end framework for arbitrary shaped text spotting,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 9076–9085, 2019.
- S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao, “Towards unconstrained end-to-end text spotting,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 4704–4714, 2019.
- W. Wang, E. Xie, X. Li, X. Liu, D. Liang, Y. Zhibo, T. Lu, and C. Shen, “PAN++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5349–5367, 2022.
- W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen, “Efficient and accurate arbitrary-shaped text detection with pixel aggregation network,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 8440–8449, 2019.
- Y. Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, “ABCNet: Real-time scene text spotting with adaptive Bezier-curve network,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9809–9818, 2020.
- S. Fang, Z. Mao, H. Xie, Y. Wang, C. Yan, and Y. Zhang, “ABINet++: Autonomous, bidirectional and iterative language modeling for scene text spotting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 7123–7141, 2023.
- S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7098–7107, 2021.
- X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable transformers for end-to-end object detection,” in Proc. Int. Conf. Learn. Representations, 2021.
- M. Huang, Y. Liu, Z. Peng, C. Liu, D. Lin, S. Zhu, N. Yuan, K. Ding, and L. Jin, “SwinTextSpotter: Scene text spotting via better synergy between text detection and text recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4593–4603, 2022.
- Y. Liu, J. Zhang, D. Peng, M. Huang, X. Wang, J. Tang, C. Huang, D. Lin, C. Shen, X. Bai, et al., “SPTS v2: Single-point scene text spotting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 15665–15679, 2023.
- M. Ye, J. Zhang, S. Zhao, J. Liu, T. Liu, B. Du, and D. Tao, “DeepSolo: Let transformer decoder with explicit points solo for text spotting,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 19348–19357, 2023.
- X.-C. Yin, Z.-Y. Zuo, S. Tian, and C.-L. Liu, “Text detection, tracking and recognition in video: A comprehensive survey,” IEEE Trans. Image Process., vol. 25, no. 6, pp. 2752–2773, 2016.
- X. Wang, Y. Jiang, S. Yang, X. Zhu, W. Li, P. Fu, H. Wang, and Z. Luo, “End-to-end scene text recognition in videos based on multi frame tracking,” in Proc. IAPR Int. Conf. Document Analysis Recog., vol. 1, pp. 1255–1260, 2017.
- Z. Cheng, J. Lu, Y. Niu, S. Pu, F. Wu, and S. Zhou, “You only recognize once: Towards fast video text spotting,” in Proc. ACM Int. Conf. Multimedia, pp. 855–863, 2019.
- Z. Cheng, J. Lu, B. Zou, L. Qiao, Y. Xu, S. Pu, Y. Niu, F. Wu, and S. Zhou, “FREE: A fast and robust end-to-end video text spotter,” IEEE Trans. Image Process., vol. 30, pp. 822–837, 2020.
- P. X. Nguyen, K. Wang, and S. Belongie, “Video text detection and recognition: Dataset and benchmark,” in Proc. Winter Conf. Appl. Comp. Vision, pp. 776–783, 2014.
- X. Rong, C. Yi, X. Yang, and Y. Tian, “Scene text recognition in multiple frames based on text tracking,” in Proc. IEEE Int. Conf. Multimedia and Expo, pp. 1–6, 2014.
- W. Wu, D. Zhang, Y. Cai, S. Wang, J. Li, Z. Li, Y. Tang, and H. Zhou, “A bilingual, open-world video text dataset and end-to-end video text spotter with transformer,” in Proc. Advances in Neural Inf. Process. Syst. Track on Datasets and Benchmarks, pp. 1–10, 2021.
- F. Zeng, B. Dong, Y. Zhang, T. Wang, X. Zhang, and Y. Wei, “MOTR: End-to-end multiple-object tracking with transformer,” in Proc. Eur. Conf. Comp. Vis., pp. 659–675, 2022.
- W. Wu, D. Zhang, Y. Fu, C. Shen, H. Zhou, Y. Cai, and P. Luo, “End-to-end video text spotting with transformer,” Int. J. Comput. Vision, pp. 1–11, 2024.
- X. Zu, H. Yu, B. Li, and X. Xue, “Towards accurate video text spotting with text-wise semantic reasoning,” in Proc. Int. Joint Conf. Artificial Intell., pp. 1858–1866, 2023.
- E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7167–7176, 2017.
- K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Universal domain adaptation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2720–2729, 2019.
- C. Ge, R. Huang, M. Xie, Z. Lai, S. Song, S. Li, and G. Huang, “Domain adaptation via prompt learning,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–11, 2023.
- D. Chen, L. Lu, Y. Lu, R. Yu, S. Wang, L. Zhang, and T. Liu, “Cross-domain scene text detection via pixel and image-level adaptation,” in Proc. Int. Conf. Neural Inf. Process., pp. 135–143, 2019.
- W. Wu, N. Lu, E. Xie, Y. Wang, W. Yu, C. Yang, and H. Zhou, “Synthetic-to-real unsupervised domain adaptation for scene text detection in the wild,” in Proc. Asian Conf. Comp. Vis., 2020.
- Y. Chen, W. Wang, Y. Zhou, F. Yang, D. Yang, and W. Wang, “Self-training for domain adaptive scene text detection,” in Proc. IAPR Int. Conf. Document Analysis Recog., pp. 850–857, 2021.
- W. Yu, Y. Liu, W. Hua, D. Jiang, B. Ren, and X. Bai, “Turning a CLIP model into a scene text detector,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6978–6988, 2023.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., pp. 8748–8763, 2021.
- Y. Zhang, S. Nie, W. Liu, X. Xu, D. Zhang, and H. T. Shen, “Sequence-to-sequence domain adaptation network for robust text image recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2740–2749, 2019.
- F. Zhan, C. Xue, and S. Lu, “GA-DAN: Geometry-aware domain adaptation network for scene text detection and recognition,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 9105–9115, 2019.
- W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Deep direct regression for multi-oriented scene text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 745–753, 2017.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in Proc. Int. Conf. Mach. Learn., pp. 2790–2799, 2019.
- Y. Liu, J. Wu, and Y. Fu, “Collaborative tracking learning for frame-rate-insensitive multi-object tracking,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 9964–9973, 2023.
- J. Xiao, X. Shang, A. Yao, and T.-S. Chua, “NExT-QA: Next phase of question-answering to explaining temporal actions,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9777–9786, 2021.
- C. Xu, S.-H. Hsieh, C. Xiong, and J. J. Corso, “Can humans fly? Action understanding with multiple classes of actors,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2264–2273, 2015.
- Y. Zhang, H. Doughty, L. Shao, and C. G. Snoek, “Audio-adaptive activity recognition across video domains,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 13791–13800, 2022.
- G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari, “Actor and observer: Joint modeling of first and third-person videos,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7396–7404, 2018.
- M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, “A database for fine grained activity detection of cooking activities,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1194–1201, 2012.
- M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2929–2936, 2009.
- H. Kuehne, A. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 780–787, 2014.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. Eur. Conf. Comp. Vis., pp. 213–229, 2020.
- S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “DAB-DETR: Dynamic anchor boxes are better queries for DETR,” in Proc. Int. Conf. Learn. Representations, 2022.
- H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 2980–2988, 2017.
- H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 658–666, 2019.
- L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 3836–3847, 2023.
- A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 4015–4026, 2023.
- X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 1905–1914, 2021.
- Y. Cheng, L. Li, Y. Xu, X. Li, Z. Yang, W. Wang, and Y. Yang, “Segment and track anything,” arXiv preprint arXiv:2305.06558, 2023.
- Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in Proc. Eur. Conf. Comp. Vis., pp. 402–419, 2020.
- M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
- W. Wang, Y. Zhou, J. Lv, D. Wu, G. Zhao, N. Jiang, and W. Wang, “TPSNet: Reverse thinking of thin plate splines for arbitrary shape scene text representation,” in Proc. ACM Int. Conf. Multimedia, pp. 5014–5025, 2022.
- M. Ye, J. Zhang, S. Zhao, J. Liu, B. Du, and D. Tao, “DPText-DETR: Towards better scene text detection with dynamic points in transformer,” in Proc. AAAI Conf. Artificial Intell., pp. 3241–3249, 2023.
- N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, et al., “ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification (RRC-MLT),” in Proc. IAPR Int. Conf. Document Analysis Recog., vol. 1, pp. 1454–1459, 2017.
- B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2963–2970, 2010.
- X. Zhao, K.-H. Lin, Y. Fu, Y. Hu, Y. Liu, and T. S. Huang, “Text from corners: a novel approach to detect text and caption in videos,” IEEE Trans. Image Process., vol. 20, no. 3, pp. 790–799, 2010.
- X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, “Robust text detection in natural scene images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 970–983, 2013.
- V. Khare, P. Shivakumara, R. Paramesran, and M. Blumenstein, “Arbitrarily-oriented multi-lingual text detection in video,” Multimedia Tools Appl., vol. 76, pp. 16625–16655, 2017.
- L. Wang, Y. Wang, S. Shan, and F. Su, “Scene text detection and tracking in video with background cues,” in Proc. ACM on Int. Conf. on Multimedia Retrieval, pp. 160–168, 2018.
- P. Shivakumara, L. Wu, T. Lu, C. L. Tan, M. Blumenstein, and B. S. Anami, “Fractals based multi-oriented text detection system for recognition in mobile video images,” Pattern Recogn., vol. 68, pp. 158–174, 2017.
- L. Wu, P. Shivakumara, T. Lu, and C. L. Tan, “A new technique for multi-oriented scene text line detection and tracking in video,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1137–1152, 2015.
- H. Yu, Y. Huang, L. Pi, C. Zhang, X. Li, and L. Wang, “End-to-end video text detection with online tracking,” Pattern Recogn., vol. 113, p. 107791, 2021.
- W. Feng, F. Yin, X.-Y. Zhang, and C.-L. Liu, “Semantic-aware video text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1695–1705, 2021.
- X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An efficient and accurate scene text detector,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5551–5560, 2017.
- X. Wang, Y. Jiang, Z. Luo, C.-L. Liu, H. Choi, and S. Kim, “Arbitrary shape scene text detection with adaptive text region representation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6449–6458, 2019.
- M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, “Real-time scene text detection with differentiable binarization,” in Proc. AAAI Conf. Artificial Intell., pp. 11474–11481, 2020.
- M. Liao, Z. Zou, Z. Wan, C. Yao, and X. Bai, “Real-time scene text detection with differentiable binarization and adaptive scale fusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 919–931, 2022.
- W. Wu, Y. Zhang, Y. He, L. Zhang, Z. Lou, H. Zhou, and X. Bai, “DSText V2: A comprehensive video text spotting dataset for dense and small text,” Pattern Recogn., vol. 149, p. 110177, 2024.
- X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” in Proc. Eur. Conf. Comp. Vis., pp. 474–490, 2020.
- I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. Int. Conf. Learn. Representations, 2019.
- Y. Shi, D. Peng, W. Liao, Z. Lin, X. Chen, C. Liu, Y. Zhang, and L. Jin, “Exploring OCR capabilities of GPT-4V(ision): A quantitative and in-depth evaluation,” arXiv preprint arXiv:2310.16809, 2023.
- J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-VL: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
- H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023.
- Q. Ye, H. Xu, J. Ye, M. Yan, H. Liu, Q. Qian, J. Zhang, F. Huang, and J. Zhou, “mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration,” arXiv preprint arXiv:2311.04257, 2023.
- Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai, “Monkey: Image resolution and text label are important things for large multi-modal models,” arXiv preprint arXiv:2311.06607, 2023.
- Y. Liu, B. Yang, Q. Liu, Z. Li, Z. Ma, S. Zhang, and X. Bai, “TextMonkey: An OCR-free large multimodal model for understanding document,” arXiv preprint arXiv:2403.04473, 2024.
- X. Dong, P. Zhang, Y. Zang, Y. Cao, B. Wang, L. Ouyang, X. Wei, S. Zhang, H. Duan, M. Cao, W. Zhang, Y. Li, H. Yan, Y. Gao, X. Zhang, W. Li, J. Li, K. Chen, C. He, X. Zhang, Y. Qiao, D. Lin, and J. Wang, “InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model,” arXiv preprint arXiv:2401.16420, 2024.
- Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.