
Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning

(2312.08900)
Published Dec 14, 2023 in cs.LG

Abstract

This paper introduces a novel Parameter-Efficient Fine-Tuning (PEFT) framework for multi-modal, multi-task transfer learning with pre-trained language models. PEFT techniques such as LoRA, BitFit and IA3 have demonstrated comparable performance to full fine-tuning of pre-trained models for specific downstream tasks, all while demanding significantly fewer trainable parameters and reduced GPU memory consumption. However, in the context of multi-modal fine-tuning, the need for architectural modifications or full fine-tuning often becomes apparent. To address this we propose Context-PEFT, which learns different groups of adaptor parameters based on the token's domain. This approach enables LoRA-like weight injection without requiring additional architectural changes. Our method is evaluated on the COCO captioning task, where it outperforms full fine-tuning under similar data constraints while simultaneously offering a substantially more parameter-efficient and computationally economical solution.

Overview

  • The paper introduces Context-PEFT, a novel framework for fine-tuning LLMs efficiently across multiple modalities and tasks.

  • Context-PEFT builds on Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, BitFit, and IA3 to adapt LLMs without significant changes to the model architecture.

  • This approach leverages context-driven adaptors and has been evaluated on the COCO captioning task, demonstrating improved performance and computational efficiency.

  • Findings suggest that context-specific adaptors are more effective than context-agnostic ones, especially in attention layers.

  • Context-PEFT is highlighted as a potential solution for resource-constrained environments, opening avenues for further research into its application in various modalities.

Introduction to Context-PEFT

The paper presents Context-PEFT, a novel framework for fine-tuning pre-trained LLMs that enables efficient transfer learning across multiple modalities and tasks. The need for Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA, BitFit, and IA3 has grown as LLMs become larger and more computationally demanding. However, when dealing with multi-modal data, PEFT often requires significant architectural adjustments. Context-PEFT addresses this challenge by streamlining adaptation without changing the model architecture: it learns different groups of adaptor parameters depending on each token's domain. The approach is evaluated on the COCO captioning task and shows improved performance and computational efficiency over full fine-tuning.
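To make the core idea concrete, here is a minimal sketch of context-conditional LoRA-style injection in PyTorch. This is a hypothetical illustration under assumed dimensions and naming (ContextLoRALinear, the rank and scaling choices are ours), not the authors' implementation: each token carries a context id (e.g. text vs. image), and the low-rank update added to a frozen linear layer is selected per token by that id.

```python
import torch
import torch.nn as nn

class ContextLoRALinear(nn.Module):
    """Frozen linear layer with per-context LoRA-style low-rank updates.

    Hypothetical sketch: one (A, B) adaptor pair per context (e.g. 0 = text,
    1 = image); the update applied to each token is selected by that token's
    context id, so no architectural change to the backbone is required.
    """

    def __init__(self, base: nn.Linear, num_contexts: int = 2,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # pre-trained weights stay frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_contexts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_contexts, rank, d_out))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, ctx_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); ctx_ids: (batch, seq) integer context per token
        A = self.A[ctx_ids]                       # (batch, seq, d_in, rank)
        B = self.B[ctx_ids]                       # (batch, seq, rank, d_out)
        delta = torch.einsum("bsd,bsdr,bsro->bso", x, A, B)
        return self.base(x) + self.scale * delta


# Example: image tokens first (context 1), then text tokens (context 0).
layer = ContextLoRALinear(nn.Linear(768, 768))
x = torch.randn(2, 10, 768)
ctx = torch.cat([torch.ones(2, 4, dtype=torch.long),
                 torch.zeros(2, 6, dtype=torch.long)], dim=1)
out = layer(x, ctx)                               # shape (2, 10, 768)
```

Because only the adaptor tensors require gradients, the trainable parameter count scales with the number of contexts and the chosen rank rather than with the backbone size.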

Previous Work and PEFT Techniques

PEFT has become popular because it allows large pre-trained language models to be adapted without extensive resource requirements. Techniques can be categorized by how they alter the model architecture and introduce new weights: some train only a subset of the existing parameters, while others introduce learnable vectors that adjust activations within layers. Vision-language models come in various forms, some using dual-encoder designs for classification and retrieval, and others, such as Large Vision Language Models (LVLMs), being better suited to generative tasks like captioning. Among LVLMs, those using causal attention have shown particular promise for adapting pre-trained text-only models to multi-modal tasks.
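The sketch below illustrates the two categories mentioned above using standard, publicly known definitions (hypothetical helper names, not tied to this paper's code): a BitFit-style setup trains only existing bias terms, while an IA3-style adaptor adds a learned vector that rescales a frozen layer's activations.

```python
import torch
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """BitFit-style setup: freeze every weight, train only the bias terms."""
    for name, param in model.named_parameters():
        param.requires_grad_(name.endswith("bias"))

class IA3Scale(nn.Module):
    """IA3-style adaptor: a learned vector rescales a frozen layer's output."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scale = nn.Parameter(torch.ones(base.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) * self.scale
```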

Our Context-PEFT Approach

The proposed Context-PEFT framework freezes the vision encoder and adapts its image embeddings for the language model, distinguishing itself by using a context-driven PEFT method rather than full fine-tuning. The method is compatible with a range of PEFT techniques and needs no auxiliary data beyond standard datasets during training and evaluation. The emphasis throughout is on efficiency: maintaining high performance with a reasonable computational footprint.
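A hypothetical sketch of the input pipeline, assuming a ViT-style frozen encoder and a decoder-only language model (the projection layer, context-id constants, and function name are illustrative assumptions): image features are projected into the LM embedding space, prepended to the caption embeddings, and every token is tagged with the context id that later selects its adaptor group.

```python
import torch
import torch.nn as nn

TEXT_CTX, IMAGE_CTX = 0, 1   # context ids that select the adaptor group per token

def build_multimodal_inputs(image_feats: torch.Tensor,
                            text_ids: torch.Tensor,
                            text_embed: nn.Embedding,
                            proj: nn.Linear):
    """Project frozen vision-encoder features into the LM embedding space,
    prepend them to the caption embeddings, and tag each token with a context id."""
    img_emb = proj(image_feats)                    # (batch, n_img, d_model)
    txt_emb = text_embed(text_ids)                 # (batch, n_txt, d_model)
    tokens = torch.cat([img_emb, txt_emb], dim=1)  # image tokens first, then text
    ctx_ids = torch.cat([
        torch.full(img_emb.shape[:2], IMAGE_CTX, dtype=torch.long),
        torch.full(txt_emb.shape[:2], TEXT_CTX, dtype=torch.long),
    ], dim=1)
    return tokens, ctx_ids
```

The returned ctx_ids tensor is what a context-conditional adaptor (as in the earlier sketch) would consume alongside the token embeddings.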

Evaluation and Findings

The researchers conducted rigorous testing across various PEFT methods, finding that context-specific adaptors yielded better results than context-agnostic variants. Adapting feed-forward layers often led to improvements, but the attention layers saw the largest gains from context-specific adaptation, suggesting the importance of controlling inter-modality interactions. An analysis of vision transformer sizes showed that while higher-quality image tokens from larger encoders improved performance, Context-PEFT still improved outcomes relative to the number of trainable parameters. Finally, inspection of attention maps revealed the model's ability to attend to semantically relevant regions of the images, further validating the method and suggesting potential applications in tasks such as panoptic segmentation.

The findings underline Context-PEFT's potential in settings constrained by data, compute resources, and operational budgets. The framework provides a competitive alternative to full fine-tuning, particularly where resources are limited. Future work could extend Context-PEFT to other modalities and tasks, further improving its versatility and effectiveness.
