LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (2311.17043v1)
Abstract: In this work, we present a novel method, called LLaMA-VID, to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates the visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. In general, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is shown to surpass previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
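To make the dual-token idea concrete, the sketch below shows one simplified way a single frame could be compressed into a context token and a content token. It is a minimal illustration, not the paper's implementation: the actual method generates the context token with a text-guided attention module and decouples the content token from an instruction-aware query, while here the cross-attention, the mean pooling, and the function name frame_to_two_tokens are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def frame_to_two_tokens(visual_feats: torch.Tensor,
                        text_query: torch.Tensor) -> torch.Tensor:
    """Compress one video frame into two tokens (a simplified sketch).

    visual_feats: (N, D) patch embeddings of the frame (e.g. from a ViT encoder).
    text_query:   (M, D) embeddings of the user instruction.
    Returns a (2, D) tensor: [context_token, content_token].
    """
    d = visual_feats.shape[-1]

    # Context token: aggregate visual features weighted by their relevance
    # to the user's query (one cross-attention step, then averaged over
    # the query tokens). This stands in for the paper's text-guided context attention.
    attn = F.softmax(text_query @ visual_feats.T / d ** 0.5, dim=-1)   # (M, N)
    context_token = (attn @ visual_feats).mean(dim=0, keepdim=True)    # (1, D)

    # Content token: a query-agnostic summary of the frame itself.
    # Mean pooling is an assumption; any downsampling of the visual
    # embedding would fill the same role.
    content_token = visual_feats.mean(dim=0, keepdim=True)             # (1, D)

    return torch.cat([context_token, content_token], dim=0)            # (2, D)


if __name__ == "__main__":
    D = 768
    frame_feats = torch.randn(256, D)   # 256 patch embeddings for one frame
    query_feats = torch.randn(12, D)    # 12 instruction-token embeddings
    tokens = frame_to_two_tokens(frame_feats, query_feats)
    print(tokens.shape)                 # torch.Size([2, 768])
```

Under this scheme an hour-long video at 1 frame per second would occupy roughly 7,200 tokens in the LLM context rather than hundreds of thousands, which is what allows existing frameworks to scale to hour-long inputs.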