
NoteLLM-2: Multimodal Large Representation Models for Recommendation (2405.16789v2)

Published 27 May 2024 in cs.IR

Abstract: LLMs have demonstrated exceptional proficiency in text understanding and embedding tasks. However, their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. While leveraging existing Multimodal LLMs (MLLMs) for such tasks is promising, challenges arise because MLLMs are released later than the corresponding LLMs and are inefficient in representation tasks. To address these issues, we propose an end-to-end fine-tuning method that customizes the integration of any existing LLM and vision encoder for efficient multimodal representation. Preliminary experiments revealed that fine-tuned LLMs often neglect image content. To counteract this, we propose NoteLLM-2, a novel framework that enhances visual information. Specifically, we propose two approaches: first, a prompt-based method that segregates visual and textual content, employing a multimodal In-Context Learning strategy to balance focus across modalities; second, a late fusion technique that directly integrates visual information into the final representations. Extensive experiments, both online and offline, demonstrate the effectiveness of our approach. Code is available at https://github.com/Applied-Machine-Learning-Lab/NoteLLM.
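
The abstract names two mechanisms for strengthening visual information: a prompt-based multimodal in-context learning strategy and a late-fusion step that merges visual features into the final representation. The authors' actual implementation lives in the linked repository; the snippet below is only a minimal sketch of what a late-fusion head could look like for contrastive I2I retrieval. The module names, dimensions, and the sigmoid gate are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionHead(nn.Module):
    """Illustrative sketch (not the paper's code): fuse a pooled visual
    embedding with the LLM's text representation via a learned gate,
    then L2-normalize the result for contrastive item-to-item retrieval."""

    def __init__(self, text_dim=4096, vision_dim=1024, embed_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)      # hypothetical projection of the LLM hidden state
        self.vision_proj = nn.Linear(vision_dim, embed_dim)  # hypothetical projection of vision-encoder features
        self.gate = nn.Linear(2 * embed_dim, embed_dim)      # assumed gating mechanism, one of many possible fusions

    def forward(self, text_hidden, vision_pooled):
        t = self.text_proj(text_hidden)       # (B, D) e.g. last-token hidden state from the LLM
        v = self.vision_proj(vision_pooled)   # (B, D) pooled features from the vision encoder
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        fused = g * t + (1 - g) * v           # late fusion: inject visual signal directly into the representation
        return F.normalize(fused, dim=-1)     # unit-norm note embedding for similarity search

if __name__ == "__main__":
    head = LateFusionHead()
    text = torch.randn(4, 4096)
    vision = torch.randn(4, 1024)
    print(head(text, vision).shape)  # torch.Size([4, 768])
```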
