Saturday, August 2, 2025

Defining Presence, Revealing Vision

Highlights from IIITH’s Participation at CVPR 2025

A large contingent from IIITH's Computer Vision Lab recently showcased their cutting-edge research at the Conference on Computer Vision and Pattern Recognition (CVPR) in Nashville. Acceptance at this prestigious event is a significant milestone for researchers in computer vision, often regarded as more impactful than traditional journal publication. Read on to explore the findings presented and their implications for visual computing.

The Significance of CVPR

For computer vision professionals, CVPR is the holy grail. Achieving paper acceptance at this event is not merely an accomplishment; it's a validation of research quality, relevance, and potential impact. Ranked among the top three conferences in the field globally, CVPR has an acceptance rate of under 30% for submitted papers and a meager 5% for oral presentations, making it a sought-after platform for researchers. The Centre for Visual Information Technology (CVIT) at IIITH has been participating in CVPR since 2008, consistently marking its presence with impressive contributions. This year, the centre presented more than seven papers, reflecting its ongoing commitment to advancing the field.

A New Benchmark for Video Language Models

One of the standout papers, "VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment," authored by Darshana Saravanan and her colleagues, addresses a critical question: can AI analyze short videos the way humans do? The research established a benchmark to evaluate whether leading video-language models, such as Gemini-1.5-Pro and GPT-4o, can comprehend videos in a compositional and contextual manner. The findings revealed that even high-performing models scored poorly, with less than 50% accuracy, compared to 93% for humans. Darshana emphasized a key insight: models were far more easily misled by incorrect answers constructed from real elements of the video than by random distractors. The team's rigorous evaluation method, called StrictVLE, points the way toward improving AI understanding of dynamic content.

Honorable Recognition: Best Paper Award

During the workshops associated with CVPR, Darshana’s other work, "Pseudo-labelling meets Label Smoothing for Noisy Partial Label Learning," was awarded the Best Paper Award at the 12th Workshop on Fine-Grained Visual Categorization. This research addresses the challenge of training image classification models in difficult fields like wildlife monitoring, where obtaining accurate labels is often time-consuming and costly. The algorithm created, PALS, adapts to label inaccuracies based on image similarity, significantly enhancing model training efficiency. Testing across several datasets, the method showed notable improvements, particularly in fine-grained tasks.

Oral Presentations and Innovative Findings

At the First Workshop on Mechanistic Interpretability for Vision, Darshana presented yet another paper on "Investigating Mechanisms for In-Context Vision Language Binding." This work was selected for an oral presentation, emphasizing the depth of research emerging from IIITH.

Moreover, the workshop on AI for Creative Visual Content Generation featured Aishwarya Agarwal, along with her collaborators, who presented "Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis." This paper advocates for a method that permits AI to generate artistic content while staying true to the constraints provided by users.

Insights from Road Safety Research

In a unique exploration of social media's potential for AI training, Prof. Ravikiran Sarvadevabhatla and his team harnessed road event videos shared online, enriching them with descriptive social commentary. Their paper, "RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding From Social Video Narratives," develops a dataset for testing video-language models' understanding of road events. By curating question-and-answer pairs from a wide range of video sources, they hope to break free from conventional dashcam biases and improve road safety education for diverse audiences.

A Game with Purpose

An inventive foray into gaming resulted in the paper "Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback." This project, led by Prof. Sarvadevabhatla and his student Mohd. Hozaifa Khan, explores a computer program that plays Pictionary as a human would. By curating a vast dataset and developing AI agents capable of drawing and guessing, the researchers aim to benchmark multi-modal communication in a fun yet rigorous manner.

Advancements in Object Recognition

Aishwarya Agarwal's quest for improved object recognition led to the paper "TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction." The work trains models to focus on crucial, localized parts of objects rather than relying solely on complete images, an approach that yielded significant performance gains across various datasets and charts an exciting path forward for enhancing AI understanding.

Empowering Younger Researchers

Several undergraduate researchers from IIITH also made their mark at CVPR 2025, a noteworthy feat given the conference's elite status. Vaibhav Agrawal, with his paper "Compass Control: Multi-Object Orientation Control for Text-to-Image Generation," demonstrated precise control over the orientation of individual objects in images produced by text-to-image models. Similarly, Haran Raajesh contributed "Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues," which shows how contextual understanding enables more accurate translation of sign language videos into text.

The Personal Impact of Mentorship and Collaboration

Prof. Vineet Gandhi shared insights about the inspirational journey of his students, particularly focusing on Darshana’s resilience through multiple rejections before finally being recognized for her work. This narrative underscores the collaborative and often challenging nature of research, where the amalgamation of ideas and perseverance can lead to extraordinary results.


These highlights not only mark the impressive achievements of IIITH’s Computer Vision Lab but also emphasize the vibrant research culture that encourages collaboration and innovation. As the field of computer vision continues to evolve, IIITH stands at the forefront, pushing boundaries and redefining possibilities.
