The Latest in Visual Prompt Learning: A Review from Fudan University Researchers
In an exciting development within the realm of artificial intelligence, a team of researchers from Fudan University has published a comprehensive review article focusing on visual prompt learning in the special issue titled "Latest Advances in Artificial Intelligence Generated Content" of Frontiers of Information Technology & Electronic Engineering, Vol. 25, No. 1, 2024. This review highlights the deep interplay between vision and language information in Vision-Language Models (VLMs) and underscores how effective prompt learning is driving innovations across various application areas.
Understanding Vision-Language Models (VLMs)
At the heart of contemporary VLMs is cross-modal alignment, the task of learning the relationships between images and text. These models leverage large-scale image-text pairs to refine their performance. A prominent example is Contrastive Language-Image Pre-training (CLIP), which uses contrastive learning to train image and text encoders in a shared embedding space. This framework enables capabilities like zero-shot and few-shot learning and is instrumental in image-text retrieval tasks. Similarly, ALIGN (A Large-scale ImaGe and Noisy-text embedding) improves robustness by scaling training to a much larger but noisier corpus of image-text pairs, while Bootstrapping Language-Image Pre-training (BLIP) and the Vision-and-Language Transformer (ViLT) improve efficiency through bootstrapped caption generation and streamlined architectures, respectively. Collectively, these models serve as powerful feature encoders, laying the groundwork for effective prompt learning.
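To ground the zero-shot capability mentioned above, the following is a minimal sketch of CLIP-style zero-shot classification built on the Hugging Face transformers library; the checkpoint name, image path, class names, and prompt template are illustrative assumptions rather than details taken from the review.

```python
# Minimal sketch: CLIP zero-shot classification via hand-written text prompts.
# Checkpoint, image path, and prompt template are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # any RGB image
class_names = ["cat", "dog", "car"]                    # candidate categories
prompts = [f"a photo of a {c}" for c in class_names]   # hand-crafted prompt template

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores -> probabilities over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```

Because the class names only enter through the text prompts, new categories can be recognized simply by editing the prompt list, which is the zero-shot behavior that prompt learning methods then seek to optimize.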
The Two Facets of Visual Prompt Learning
Visual prompt learning can be categorized into two primary types: language prompting and visual prompting.
- Language Prompting: This approach increases a model's adaptability to novel categories by learning continuous text contexts. Models like CoOp and CoCoOp exemplify this category, while Prompt Learning with Optimal Transport (PLOT) introduces a sophisticated mechanism for multi-prompt alignment, particularly for intricate attributes (a minimal sketch of both facets appears after this list).
- Visual Prompting: This variant focuses on adapting pre-trained models through techniques such as image perturbations or masks. Techniques like VP and MAE-VQGAN illustrate this methodology, while innovations like Class-Aware Visual Prompt Tuning (CAVPT) and Iterative Label Mapping based Visual Prompting (ILM-VP) enhance performance by merging textual with visual features. Multi-modal prompting models, including MaPLe and Instruction-ViT, uniquely integrate visual and language cues, demonstrating superior efficiency in cross-modal tasks.
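The sketch below contrasts the two facets on top of a frozen backbone: a CoOp-style learnable text context prepended to class-name embeddings, and a VP-style learnable pixel perturbation added to input images. Both modules are simplified illustrations under assumed tensor shapes, not the exact formulations of the cited methods.

```python
# Simplified illustration of the two facets of visual prompt learning.
# Shapes and initialization are assumptions; the backbone itself stays frozen.
import torch
import torch.nn as nn

class LanguagePrompt(nn.Module):
    """CoOp-style: learnable context vectors prepended to class-name embeddings."""
    def __init__(self, n_ctx: int, dim: int, class_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context
        self.register_buffer("class_embeds", class_embeds)       # frozen [C, dim]

    def forward(self) -> torch.Tensor:
        # Returns [C, n_ctx + 1, dim]: "[ctx_1 ... ctx_M] <class>" per class.
        ctx = self.ctx.unsqueeze(0).expand(self.class_embeds.size(0), -1, -1)
        return torch.cat([ctx, self.class_embeds.unsqueeze(1)], dim=1)

class VisualPrompt(nn.Module):
    """VP-style: a learnable additive perturbation applied to every input image."""
    def __init__(self, image_size: int = 224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return images + self.delta  # only delta is trained; the encoder is frozen

# Only the prompt parameters receive gradients during adaptation.
class_embeds = torch.randn(10, 512)   # stand-in for frozen class-token embeddings
lang_prompt = LanguagePrompt(n_ctx=16, dim=512, class_embeds=class_embeds)
vis_prompt = VisualPrompt()
optimizer = torch.optim.SGD(
    list(lang_prompt.parameters()) + list(vis_prompt.parameters()), lr=1e-3
)
```

In both cases the pre-trained encoders remain untouched; the prompt parameters are the only task-specific weights, which is what makes these methods attractive for adapting large VLMs.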
The Role of Prompt-Guided Generative Models
In recent advances, prompt-guided generative models, particularly diffusion models, have integrated seamlessly with VLMs to facilitate semantically controlled image generation, editing, and inpainting. Notable models such as Stable-Diffusion and Imagen have reduced computational demands through the use of latent space diffusion. These frameworks support multi-modal prompts, including text and image masks. Furthermore, models like ControlNet and DreamBooth have enhanced flexibility through conditional controls and few-shot fine-tuning. In the realm of image editing, tools like SmartBrush and Blended-Diff provide precision in inpainting through user interactions with text and masks. This underscores the pivotal role that prompt learning plays in the generation process.
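As a concrete illustration of prompt-guided generation and mask-based inpainting, the sketch below uses the Hugging Face diffusers library. The library choice, checkpoint identifiers, and file names are assumptions for demonstration; the review does not prescribe a specific toolkit.

```python
# Minimal sketch: text-prompted generation and mask-guided inpainting with diffusers.
# Checkpoints and file names are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image: the text prompt steers generation in latent space.
txt2img = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
image = txt2img("a watercolor painting of a lighthouse at dusk").images[0]
image.save("generated.png")

# Inpainting: a text prompt plus a binary mask confines edits to one region.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
).to(device)
init_image = Image.open("generated.png").convert("RGB")
mask = Image.open("mask.png").convert("L")   # white pixels = region to repaint
edited = inpaint(
    prompt="a small sailboat on the water",
    image=init_image,
    mask_image=mask,
).images[0]
edited.save("edited.png")
```

The text prompt and the mask act as complementary prompts: one specifies the desired semantics, the other constrains where the edit is applied, mirroring the multi-modal prompting discussed above.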
Efficient Adaptation through Prompt Tuning
Prompt tuning aims to adapt large-scale models to specific downstream tasks efficiently, without updating all of their parameters. Visual Prompt Tuning (VPT) achieves parameter-efficient adaptation in Vision Transformers (ViTs) by incorporating learnable prompts at the input or intermediate layers. Long-Tailed Prompt Tuning (LPT) enhances classification strategies for long-tailed datasets. Similarly, in VLM contexts, methodologies like TCM and V-VL facilitate better cross-modal interactions through dual-modal prompt generators, while approaches like LoRA and the Adapter series reduce the number of trainable parameters, making them invaluable for resource-limited environments. These advancements point to a growing trend in which computational efficiency does not come at the cost of model performance.
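A VPT-style sketch of this idea appears below: learnable prompt tokens are inserted into the token sequence of a frozen Vision Transformer, and only the prompts and a lightweight head are trained. The encoder here is a stand-in module with assumed dimensions, and VPT's "deep" variant, which also inserts prompts at intermediate layers, is omitted for brevity.

```python
# Minimal VPT-style sketch: prompt tokens inserted after the CLS token of a
# frozen transformer encoder. The encoder below is a stand-in, not a real ViT.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int = 768,
                 n_prompts: int = 10, n_classes: int = 100):
        super().__init__()
        self.encoder = encoder                       # frozen pre-trained blocks
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, n_classes)  # lightweight task head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, 1 + N, D] = [CLS] + patch embeddings from the frozen stem.
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)
        features = self.encoder(tokens)              # frozen transformer blocks
        return self.head(features[:, 0])             # classify from the CLS token

# Only the prompts and the head are trainable -> parameter-efficient adaptation.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 12, batch_first=True), 2
)
model = PromptedViT(encoder)
logits = model(torch.randn(4, 197, 768))   # 4 images, CLS + 196 patches, dim 768
```

Because the frozen backbone is shared across tasks, each new task costs only a handful of prompt vectors and a small head, which is the efficiency argument behind VPT, LoRA, and the Adapter family.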
Future Directions in Prompt Learning Research
The review identifies various intriguing future research directions for prompt learning, such as:
- Addressing domain shifts and enhancing interpretability in image classification tasks.
- Optimizing models like Segment Anything Model (SAM) by integrating domain knowledge for improved semantic segmentation.
- Reducing discrepancies between base classes and novel classes in open-vocabulary object detection.
- Establishing task-prompt relationships in multi-task learning scenarios.
- Incorporating Chain-of-Thought (CoT) methodologies to bolster multi-step reasoning.
- Expanding the application of prompt learning to specialized fields such as medical imaging and weather forecasting.
- Innovating cross-view robust visual prompts for gait recognition tasks.
These identified avenues highlight not just the vast potential for growth in prompt learning but also the promise it holds for developing more interpretable and generalized AI systems in the future.
For those interested in exploring this subject further, the full text of the open-access paper, titled "Prompt learning in computer vision: a survey," authored by Yiming Lei, Jingqi Li, Zilong Li, Yuan Cao, and Hongming Shan, is available from Frontiers of Information Technology & Electronic Engineering.