Less is More: Selective Layer Finetuning with SubTuning (2302.06354v3)

Published 13 Feb 2023 in cs.LG and cs.AI

Abstract: Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values. We demonstrate that "subset finetuning" (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce. Therefore, SubTuning allows deploying new tasks at minimal computational cost, while enjoying the benefits of finetuning the entire model. This yields a simple and effective method for multi-task learning, where different tasks do not interfere with one another, and yet share most of the resources at inference time. We demonstrate the efficiency of SubTuning across multiple tasks, using different network architectures and pretraining methods.

Summary

The paper demonstrates that selectively finetuning a subset of layers often matches full finetuning and can outperform it in low-data and distribution-shift scenarios.
The paper employs a finetuning profile and a greedy algorithm to identify layers that offer the highest accuracy gains with minimal computational cost.
The paper provides theoretical insights suggesting that training fewer parameters yields a tighter generalization bound, and shows how SubTuning enables efficient multi-task deployment.

This paper introduces SubTuning, a parameter-efficient transfer learning method that serves as a middle ground between full finetuning (training all parameters) and linear probing (training only the final classification head). The core idea is to selectively finetune only a carefully chosen subset of layers from a pretrained model while keeping the rest frozen. This approach aims to balance model adaptation capacity with parameter efficiency, proving particularly beneficial in low-data regimes, under distribution shifts, and for efficient multi-task deployment.

Key Concepts and Implementation

Finetuning Profile: To understand which layers are most important for a given downstream task, the paper introduces the "finetuning profile".
- Generation: This profile is created by systematically finetuning only one layer (or a small block of consecutive layers) at a time, along with the task-specific head, while keeping all other pretrained layers frozen. The performance (e.g., accuracy) is plotted against the layer/block being finetuned.
- Insights: Experiments across different architectures (ResNet, ViT), pretraining methods (Supervised, DINO), and datasets (CIFAR, Flowers102) reveal that layer importance is task-dependent and doesn't simply correlate with depth or parameter count (Figure 1). Optimal layers are often found in the middle or later stages, but not necessarily the very last ones.
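
The profile-generation loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `finetuning_profile`, `most_useful_block`, and the `train_and_eval` callback are hypothetical names, and the scores are made-up placeholders standing in for real training runs.

```python
from typing import Callable, Dict, Sequence


def finetuning_profile(
    blocks: Sequence[str],
    train_and_eval: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Map each block to the accuracy obtained when only that block
    (plus the task head) is trained and all other blocks stay frozen."""
    return {block: train_and_eval([block]) for block in blocks}


def most_useful_block(profile: Dict[str, float]) -> str:
    """Return the block whose single-block finetuning scored highest."""
    return max(profile, key=profile.get)


# Placeholder scores (assumption): NOT results from the paper.
_placeholder_scores = {"block1": 0.71, "block2": 0.78, "block3": 0.83, "block4": 0.80}


def fake_train_and_eval(unfrozen: Sequence[str]) -> float:
    """Stand-in for a real finetune-then-validate run on one block."""
    return _placeholder_scores[unfrozen[0]]


profile = finetuning_profile(list(_placeholder_scores), fake_train_and_eval)
print(most_useful_block(profile))
```

In practice `train_and_eval` would wrap an actual training loop, so each profile entry costs one short finetuning run.
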
Greedy SubTuning Algorithm: Since finetuning single blocks might not be optimal, and testing all combinations is computationally infeasible ($O(\text{num\_layers}^k)$), a greedy approach is proposed for selecting a subset of $k$ layers.
- Procedure: The algorithm iteratively selects the layer that provides the largest marginal improvement in validation accuracy when added to the currently selected set of layers to be finetuned. The process stops when the improvement falls below a threshold $\epsilon$ or a maximum number of layers $k$ is reached. The computational cost is $O(\text{num\_layers} \cdot k)$ subset evaluations, each of which involves a short training run.
- Pseudocode (Algorithm 1):

def greedy_subset_selection(model, all_layers, validation_data, epsilon, max_layers):
    """Greedily pick up to max_layers layers to finetune (sketch of Algorithm 1)."""
    selected = set()  # Layers chosen for finetuning so far
    # Baseline with no extra layers unfrozen: only the head is trained (linear probing)
    best_accuracy = evaluate(model, selected, validation_data)

    for _ in range(max_layers):
        iteration_best_accuracy = float("-inf")
        best_layer_to_add = None

        # Try adding each remaining layer to the current selection
        for layer in all_layers:
            if layer in selected:
                continue
            candidate = selected | {layer}
            # Temporarily finetune the candidate layers and evaluate
            current_accuracy = evaluate(model, candidate, validation_data)

            if current_accuracy > iteration_best_accuracy:
                iteration_best_accuracy = current_accuracy
                best_layer_to_add = layer

        # Keep the best layer only if it improves accuracy by more than epsilon
        if best_layer_to_add is not None and iteration_best_accuracy > best_accuracy + epsilon:
            selected.add(best_layer_to_add)
            best_accuracy = iteration_best_accuracy
        else:
            # No remaining layer improves accuracy enough, so stop early
            break

    return selected  # The selected set of layers


def evaluate(model, layers_to_finetune, validation_data):
    """Placeholder: finetune the given layers plus the head, return validation accuracy."""
    # 1. Freeze all layers
    # 2. Unfreeze layers in layers_to_finetune and the final head
    # 3. Train the unfrozen layers on a portion of the training data
    # 4. Evaluate on validation_data and return the accuracy as a float
    raise NotImplementedError("Supply a task-specific training and evaluation loop")

* Note: The evaluate function involves a mini-training loop on the specified subset of layers using a portion of the training data (or cross-validation splits) to estimate the performance gain on held-out validation data.
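
To make the cost gap between exhaustive and greedy search concrete, the short calculation below counts how many subset evaluations each strategy needs. It is an illustrative sketch: the function names are hypothetical, the 16-block network is an assumed example, and the greedy count assumes no early stopping at the $\epsilon$ threshold.

```python
from math import comb


def exhaustive_evaluations(num_layers: int, k: int) -> int:
    """Number of distinct size-k subsets an exhaustive search must evaluate."""
    return comb(num_layers, k)


def greedy_evaluations(num_layers: int, k: int) -> int:
    """Evaluations used by greedy selection: round i tries every
    layer that has not been selected yet (num_layers - i of them)."""
    return sum(num_layers - i for i in range(k))


# Assumed example: 16 candidate blocks, select up to k = 3 of them.
print(exhaustive_evaluations(16, 3))  # 560
print(greedy_evaluations(16, 3))      # 45
```

Each evaluation is itself a short training run, so the gap in wall-clock time grows in proportion to these counts.
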

Theoretical Motivation: The paper provides a theoretical justification suggesting that SubTuning can lead to better generalization, especially when the number of training examples $m$ is small. For standard finetuning, the generalization error scales roughly with the total number of parameters $r$ as $O\left(\frac{\sqrt{r}\Delta}{\sqrt{m}}\right)$. SubTuning, by training only $r' \ll r$ parameters, can potentially achieve an error bound of $O\left(\frac{\sqrt{r'}\Delta \log(kL)}{\sqrt{m}}\right)$, where $L$ is the total number of layers, $k$ is the number of selected layers, and the $\log(kL)$ factor comes from the greedy selection process. This implies a lower sample complexity requirement for achieving good generalization.
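
As a worked reading of these two bounds, one can solve each for the number of samples needed to reach a target error. The symbol $\gamma$ for the target error is introduced here for illustration only, and constants hidden by the $O(\cdot)$ notation are ignored:

```latex
\frac{\sqrt{r}\,\Delta}{\sqrt{m}} \le \gamma
\;\iff\;
m \ge \frac{r\,\Delta^{2}}{\gamma^{2}},
\qquad
\frac{\sqrt{r'}\,\Delta\,\log(kL)}{\sqrt{m}} \le \gamma
\;\iff\;
m \ge \frac{r'\,\Delta^{2}\,\log^{2}(kL)}{\gamma^{2}}.
```

Under this reading, the SubTuning bound requires fewer samples whenever $r' \log^{2}(kL) < r$, which holds comfortably when the selected layers contain only a small fraction of the model's parameters.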

Applications and Results

Low-Data Regime:
- VTAB-1k Benchmark: SubTuning was evaluated on datasets like CIFAR-100, Flowers102, Caltech101, and DMLab using only 1k training examples. It often outperformed full finetuning (FT), linear probing (LP), Head2Toe (H2T), and LoRA, particularly with ResNet-50 (Table 1). For ViT-B/16, it was highly competitive.
- Dataset Size Impact: Experiments on CIFAR-10 subsets showed that for very small datasets, finetuning later blocks is more beneficial, while for larger datasets, including earlier blocks becomes more advantageous (Figure 2).
- Active Learning: SubTuning combined with margin-based active learning outperformed full finetuning in selecting informative samples when the labeling budget is limited (Appendix B.1, Figure 3).
Distribution Shift and Data Corruption:
- CIFAR-10-C: SubTuning was tested on adapting a CIFAR-10 model to various corruptions in CIFAR-10-C using 1k corrupted samples for finetuning. It significantly outperformed full finetuning and Surgical finetuning (which finetunes large consecutive blocks) on average across 14 corruption types (Table 2).
- Layer Selection: The greedy selection often chose a mix of early, middle, and late blocks, contradicting simpler heuristics. Notably, the final or penultimate block was often selected first, indicating its high importance for adaptation (Figure 4).
Efficient Multi-Task Learning (MTL):
- Motivation: SubTuning avoids both the high compute and memory cost of running multiple fully finetuned models and the complexity and performance degradation of traditional MTL.
- Inference Strategy: When deploying a new task (model $f_{\tilde{\theta}}$) alongside an existing one ($f_{\theta}$), SubTuning allows sharing the frozen layers.
  - Computation is shared up to the first finetuned layer ($\ell_{\text{start}}$).
  - Only the finetuned layers ($\ell_{\text{start}}$ to $\ell_{\text{end}}$) require separate weights and computation (doubled compute/IO for these layers).
  - If finetuned layers are not the final ones, computation can be "merged" after $\ell_{\text{end}}$. The outputs from the two branches (original and finetuned) are concatenated along the batch dimension and processed by the remaining shared frozen layers. This doubles the FLOPs for subsequent layers but reuses their weights (no increase in IO). See Figure 5.
- Trade-offs: Experiments showed significant accuracy gains over linear probing with minimal added latency compared to full finetuning (Figure 6). The optimal layers for the accuracy-latency trade-off depend on the specific hardware and workload (compute vs. IO bound).
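
The shared-prefix, branch, and merge strategy described above can be illustrated with a toy sketch. This is not the paper's implementation: layers are modeled as plain Python functions over a list of floats, all names (`scale`, `shift`, `run`, `shared_inference`) are hypothetical, and every layer is assumed to act on each example independently so that batch concatenation does not change results.

```python
from typing import Callable, List, Sequence, Tuple

Batch = List[float]
Layer = Callable[[Batch], Batch]


def scale(a: float) -> Layer:
    """Toy per-example layer: multiply every value by a."""
    return lambda batch: [a * v for v in batch]


def shift(b: float) -> Layer:
    """Toy per-example layer: add b to every value."""
    return lambda batch: [v + b for v in batch]


def run(layers: Sequence[Layer], batch: Batch) -> Batch:
    """Apply the layers in order."""
    for layer in layers:
        batch = layer(batch)
    return batch


def shared_inference(
    base: List[Layer], tuned: List[Layer], start: int, end: int, batch: Batch
) -> Tuple[Batch, Batch]:
    """Serve the original and new task; `tuned` replaces base[start..end]."""
    prefix = run(base[:start], batch)               # shared, computed once
    base_branch = run(base[start:end + 1], prefix)  # original weights
    tuned_branch = run(tuned, prefix)               # finetuned weights
    merged = base_branch + tuned_branch             # concatenate along batch
    out = run(base[end + 1:], merged)               # shared weights, 2x batch
    n = len(batch)
    return out[:n], out[n:]


base = [scale(2.0), shift(1.0), scale(3.0), shift(-1.0)]
tuned = [shift(5.0), scale(0.5)]  # finetuned copies of layers 1 and 2
x = [1.0, 2.0]
old_task, new_task = shared_inference(base, tuned, 1, 2, x)
separate_new = run([base[0], *tuned, base[3]], x)  # new task run on its own
```

The outputs match what two separately executed models would produce, while the prefix is computed once and the suffix weights are loaded once for the doubled batch.
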

Additional Implementation Aspects (Appendix)

Siamese SubTuning: An enhancement for MTL where the final classification head receives concatenated features from both the original frozen path and the SubTuned path, often improving performance, especially in low-data settings (Appendix B.2).
Pruning: SubTuning can be combined with channel pruning on the finetuned layers to further reduce parameter count and potentially runtime, with graceful degradation in accuracy (Appendix B.3).
Initialization: Using the pretrained weights for the selected layers (instead of random re-initialization) is crucial for fast convergence and optimal performance (Appendix B.4).

Conclusion

SubTuning presents a practical and effective method for parameter-efficient transfer learning. By identifying and finetuning only the most relevant layers for a downstream task using the finetuning profile and a greedy selection algorithm, it achieves strong performance, particularly in data-scarce or distribution shift scenarios. Its key advantage lies in enabling efficient deployment of multiple specialized tasks derived from a single pretrained backbone with minimal computational overhead, offering a flexible alternative to full finetuning and linear probing.