CLAP: Enhancing Linear Probing for Efficient Few-Shot Learning in Vision-Language Models

A CVPR Paper Review and Cliff’s Notes

Harpreet Sahota
4 min read · Jun 11, 2024

Few-shot learning has become increasingly important for adapting large pre-trained vision-language models (VLMs) like CLIP to downstream tasks with limited labeled data.

However, current state-of-the-art methods for this efficient transfer learning (ETL) scenario often make unrealistic assumptions and require impractical per-task hyperparameter tuning. In their recent paper, “A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,” the authors discuss these issues and propose a novel approach called CLAP (CLass-Adaptive linear Probe). CLAP outperforms existing methods across various benchmarks and operates under more realistic and practical constraints.

In this post, I’ll review the key insights from the paper, examine the limitations of current methods, and show how CLAP addresses these challenges to push the state of the art in few-shot adaptation of VLMs.

The Problem

Existing Adapters show strong performance only in specific experimental setups and with extensive hyperparameter tuning based on a large labeled corpus. Outperforming a well-designed Linear Probing (ZS-LP) baseline requires unrealistic hyperparameter optimization for each target task. Source

Adapting large pre-trained VLMs to downstream tasks with only a few labeled examples is challenging. Current state-of-the-art ETL methods for this few-shot adaptation scenario have limitations:

  • Existing methods assume access to a large corpus of labeled data for hyperparameter tuning, which is unrealistic in a few-shot setting.
  • They require carefully tuning hyperparameters for each specific task on a large validation set.
  • The hyperparameters optimized for one task don’t generalize well to other tasks.
  • They can dramatically underperform simple zero-shot predictions in the presence of distribution shifts.

The Solution

The authors have proposed a novel approach to fit real-world scenarios. They introduce the CLAP (CLass-Adaptive linear Probe) objective, which adaptively constrains the learned prototypes to retain prior zero-shot knowledge using only the few available support shots, and uses a homogeneous learning configuration across tasks. Source

The authors propose a new approach called CLAP:

  • It builds on a well-designed Linear Probing (LP) baseline initialized with CLIP’s zero-shot (ZS) class prototypes. This ZS-LP baseline already outperforms more complex ETL methods.
  • To further improve ZS-LP, CLAP introduces a constrained optimization objective that penalizes large deviations of the learned class prototypes from the original zero-shot prototypes during adaptation.
  • It uses an Augmented Lagrangian Multiplier (ALM) method to optimize the constrained objective. The ALM method is adapted to use class-wise penalty multipliers (rather than sample-wise) to handle class imbalance and work with data augmentation (see the sketch after this list).
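
To make this concrete, here’s a minimal PyTorch-style sketch of the idea as described above: a linear probe whose weights start at CLIP’s zero-shot text prototypes, trained with cross-entropy plus a class-wise penalty on how far each prototype drifts from its zero-shot counterpart. All names (zs_prototypes, rho, the multiplier update schedule) are my own illustrative choices, and the exact class-adaptive ALM update in the paper differs in its details, so treat this as a sketch rather than the authors’ implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative inputs (names are mine, not from the paper's code):
#   zs_prototypes: [num_classes, dim] L2-normalized CLIP text embeddings (zero-shot prototypes)
#   features:      [num_support, dim] L2-normalized image features of the few-shot support set
#   labels:        [num_support] integer class labels

def clap_style_adaptation(zs_prototypes, features, labels,
                          steps=300, lr=0.01, temperature=0.07, rho=1.0):
    num_classes, _ = zs_prototypes.shape

    # ZS-LP: the linear probe is initialized with the zero-shot class prototypes.
    prototypes = torch.nn.Parameter(zs_prototypes.clone())

    # One penalty multiplier per class (class-wise, not sample-wise).
    lambdas = torch.ones(num_classes)

    optimizer = torch.optim.SGD([prototypes], lr=lr, momentum=0.9)

    for _ in range(steps):
        logits = features @ F.normalize(prototypes, dim=-1).t() / temperature
        ce_loss = F.cross_entropy(logits, labels)

        # Constraint term: penalize deviation of each learned prototype from
        # its zero-shot counterpart, weighted by that class's multiplier.
        deviation = ((prototypes - zs_prototypes) ** 2).sum(dim=-1)  # [num_classes]
        penalty = (lambdas * deviation).sum()

        loss = ce_loss + penalty
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Crude augmented-Lagrangian-style multiplier update (illustrative only;
        # the paper derives a specific class-adaptive rule).
        with torch.no_grad():
            lambdas = (lambdas + rho * deviation.detach()).clamp(min=0.0)

    return F.normalize(prototypes.detach(), dim=-1)
```

In the actual method, the same homogeneous training configuration is reused for every task and dataset, which is what removes the per-task hyperparameter search the authors criticize.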

CLAP has several advantages over existing ETL methods:

  • It performs consistently across various tasks and datasets with the same configuration. No per-task hyperparameter tuning is needed.
  • It substantially outperforms state-of-the-art ETL approaches in all evaluated scenarios.
  • The efficient linear probing architecture enables faster adaptation with less computation.
  • The realistic adaptation setting without relying on a large validation set makes it more practical.

Key Contributions

  • Empirically showing that SoTA ETL methods require unrealistic per-task hyperparameter tuning and can underperform simple baselines without it.
  • Proposing CLAP, a principled approach to improve Linear Probing for few-shot adaptation of VLMs. It constrains deviations from zero-shot prototypes and eliminates hyperparameter tuning.

Results

Results show CLAP delivers consistently strong performance across tasks with a fixed configuration, substantially outperforming state-of-the-art ETL methods in all cases.

CLAP is evaluated on:

  • Few-shot adaptation on 11 classification datasets
  • Domain generalization scenarios
  • Comparison to full fine-tuning methods
  • Ablation studies validating design choices

Final Thoughts

This work highlights key issues with current few-shot adaptation methods for VLMs and proposes a novel, principled approach to address them. CLAP’s strong performance with a fixed configuration across various tasks and datasets demonstrates the importance of designing methods that can adapt realistically, without relying on impractical hyperparameter tuning. The authors hope this moves the field towards more practical and robust ETL solutions.

Learn more here:

If you’ll be at CVPR this year, be sure to come and say “Hi!”

