Many-Shot In-Context Learning in Multimodal Foundation Models

(arXiv:2405.09798)
Published May 16, 2024 in cs.LG , cs.AI , cs.CL , and cs.CV

Abstract

Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstrating examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL .

Overview

  • The study leverages recently enlarged context windows to explore many-shot in-context learning (ICL) with multimodal foundation models, demonstrating substantial performance improvements across varied datasets.

  • Two state-of-the-art multimodal models, GPT-4o and Gemini 1.5 Pro, are tested across 10 datasets from different domains; Gemini 1.5 Pro shows higher ICL data efficiency than GPT-4o on most datasets, and batch querying yields substantial cost savings.

  • The research highlights both practical and theoretical implications, including enhanced model adaptability and efficiency, and suggests directions for future studies to further optimize multimodal foundation models.

Introduction

As LLMs continue to evolve, their ability to perform a wide range of tasks from only a few in-context examples has been a significant highlight. Recent growth in context window sizes has opened up new opportunities, particularly in the "many-shot" in-context learning (ICL) regime. This study investigates the performance improvements brought by many-shot ICL using two state-of-the-art multimodal models: GPT-4o and Gemini 1.5 Pro.
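
To make the setup concrete, below is a minimal sketch of how a many-shot multimodal prompt can be assembled, assuming an OpenAI-style chat API with inline base64 images. The file paths, label names, and prompt wording are illustrative placeholders, not the paper's exact format.

```python
import base64
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Base64-encode an image file for inline inclusion in the prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_many_shot_messages(demos, query_path, class_names):
    """Interleave (image, label) demonstrations, then append the query image."""
    content = [{"type": "text",
                "text": f"Classify each image as one of: {', '.join(class_names)}."}]
    for path, label in demos:  # hundreds to ~2,000 pairs in the many-shot regime
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}})
        content.append({"type": "text", "text": f"Answer: {label}"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode_image(query_path)}"}})
    content.append({"type": "text", "text": "Answer:"})
    return [{"role": "user", "content": content}]

# Example usage (demos is a list of (image_path, label) tuples):
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_many_shot_messages(demos, "query.jpg", class_names))
```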

Methods and Datasets

Models Used

  1. GPT-4o: OpenAI's multimodal model, known for robust performance across language and vision tasks.
  2. Gemini 1.5 Pro: A newer entrant supporting a context window of up to one million tokens, significantly larger than most alternatives.

Datasets

The study benchmarks performance across 10 datasets spanning multiple domains and tasks, including natural imagery, medical imagery, remote sensing, and molecular imagery. Here's the breakdown:

  • Natural Imagery: TerraIncognita, Oxford Pets, DTD
  • Medical Imagery: HAM10000, FIVES, CheXpert, Camelyon17
  • Remote Sensing: UCMerced, EuroSAT
  • Molecular Imagery: DrugOOD Assay

Each dataset is evaluated using performance metrics like accuracy and F1 score, and models are tested on multi-class, multi-label, and fine-grained classification tasks.
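
As one illustration of the evaluation, the sketch below computes accuracy and macro-averaged F1 from predicted class labels using scikit-learn; the label values are made up for the example.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground-truth and predicted class labels parsed from model outputs.
y_true = ["melanoma", "nevus", "nevus", "melanoma"]
y_pred = ["melanoma", "nevus", "melanoma", "melanoma"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro-averaging weights every class equally, which matters for
# imbalanced datasets such as the medical ones benchmarked here.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```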

Key Findings

Performance Improvements

Many-Shot Effectiveness:

  • Gemini 1.5 Pro consistently exhibited log-linear performance improvements as more demonstrating examples were added, with notable gains on datasets like HAM10000 (+23%), FIVES (+29%), and EuroSAT (+38%).
  • GPT-4o also improved with many-shot ICL but was less stable than Gemini 1.5 Pro, with performance dipping at intermediate numbers of examples before recovering (a V-shaped trend).

ICL Data Efficiency:

  • Gemini 1.5 Pro outperformed GPT-4o in terms of ICL data efficiency on most datasets, with the highest efficiency observed on EuroSAT (see the measurement sketch below).
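
The log-linear trends suggest a natural way to quantify ICL data efficiency: fit performance against the logarithm of the number of demonstrating examples and read off the slope. The sketch below, with made-up accuracy values, shows one plausible formalization of this measurement.

```python
import numpy as np

# Hypothetical accuracies at increasing numbers of demonstrating examples.
shots = np.array([1, 5, 10, 50, 100, 500, 1000])
accuracy = np.array([0.42, 0.48, 0.51, 0.58, 0.61, 0.68, 0.71])

# Fit accuracy as a linear function of log(shots). The slope measures
# performance gained per multiplicative increase in demonstrations,
# serving as a proxy for ICL data efficiency.
slope, intercept = np.polyfit(np.log(shots), accuracy, deg=1)
print(f"ICL data efficiency (slope): {slope:.4f}")
```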

Impact of Batch Querying

Submitting a single query at a time proved suboptimal for many datasets. Batching (grouping multiple queries into a single call) showed the following (see the sketch after this list):

  • Minimal to no degradation in performance for Gemini 1.5 Pro across large batch sizes.
  • Substantial latency and cost savings.
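
Below is a hedged sketch of how a batched prompt might be constructed, reusing the encode_image helper from the prompt-construction sketch above. The numbering scheme and instruction wording are assumptions, not the paper's exact protocol.

```python
def build_batched_messages(demos, query_paths, class_names):
    """Share one demonstration block across a batch of queries in a single call."""
    content = [{"type": "text",
                "text": f"Classify each numbered image as one of: {', '.join(class_names)}. "
                        "Reply with one line per image, e.g. '1: <label>'."}]
    for path, label in demos:  # the (potentially very long) demo context is sent once
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}})
        content.append({"type": "text", "text": f"Answer: {label}"})
    for i, path in enumerate(query_paths, start=1):  # e.g. up to 50 queries per call
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}})
    return [{"role": "user", "content": content}]
```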

Cost and Latency Analysis

Many-shot ICL can be computationally expensive because of its long input contexts. Batching amortizes that fixed context across many queries (see the illustration after this list):

  • A nearly 35x reduction in latency and a 10x reduction in cost were observed for HAM10000.
  • For TerraIncognita, latency was reduced by 20x and cost by 45x.
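
These savings follow from simple amortization: the large, fixed demonstration context is paid for once per call, so its cost is split across every query in the batch. The sketch below illustrates the arithmetic with made-up token counts and pricing, not the paper's measured figures.

```python
# Back-of-the-envelope amortization of a many-shot context across a batch.
demo_tokens = 500_000   # hypothetical tokens for ~1,000 image+label demonstrations
query_tokens = 500      # hypothetical tokens per additional query image
price_per_mtok = 5.0    # hypothetical $ per 1M input tokens

def cost_per_query(batch_size: int) -> float:
    """Total input cost of one call, divided across its batched queries."""
    total_tokens = demo_tokens + batch_size * query_tokens
    return total_tokens / 1_000_000 * price_per_mtok / batch_size

print(f"batch=1:  ${cost_per_query(1):.3f} per query")
print(f"batch=50: ${cost_per_query(50):.3f} per query")
# The fixed demonstration context dominates the total, so per-query cost
# falls roughly in proportion to the batch size.
```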

Implications and Future Directions

Practical Implications:

  1. Operational Efficiency: Many-shot ICL can significantly enhance the adaptability of multimodal models, allowing for quick adaptation to new tasks without the need for extensive fine-tuning.
  2. Cost Reduction: The ability to batch queries effectively reduces computational costs and inference latency, making the deployment of these models more feasible in real-world applications.

Theoretical Implications:

  1. Model Robustness: The findings suggest that with larger context windows, models can leverage more in-context examples, leading to improved robustness and performance consistency.
  2. Understanding Model Behavior: Investigating the reasons behind the improvements seen with batch querying, such as domain and class calibration, provides deeper insights into how models can be further optimized.

Looking Ahead

Many-shot ICL represents a substantial stride for multimodal foundation models. As context window sizes continue to expand, the ability to leverage a large number of demonstrating examples will likely improve further. Ongoing research should explore:

  • Comparative studies with traditional fine-tuning to evaluate performance and efficiency trade-offs.
  • Detailed investigation of biases and hallucinations in the context of many-shot ICL.
  • Extension to other tasks and open-source multimodal models.

In summary, this study underscores the capability of multimodal foundation models to benefit significantly from many-shot ICL, paving the way for more efficient and adaptable AI applications.
