Tuesday, June 24, 2025

Apple Showcases Machine Learning Innovations at CVPR 2025

Apple’s Commitment to Advancing AI and ML at CVPR 2025

Apple continues to lead the way in artificial intelligence (AI) and machine learning (ML) through rigorous fundamental research. As part of its commitment to advancing these fields, Apple actively engages with the broader research community, sharing insights and breakthroughs through publications and conference participation. One of the most significant events for AI and ML advancement is the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), and Apple is delighted to participate as an industry sponsor in Nashville, Tennessee.

Advancements in Computer Vision at CVPR

This year at CVPR 2025, Apple researchers will showcase cutting-edge research across multiple topics in computer vision, such as vision-language models, 3D photogrammetry, large multimodal models, and video diffusion models. Attendees will have the chance to explore these innovations firsthand at Apple’s booth (#1217) during exhibition hours, where live demonstrations will highlight the potential applications of these technologies.

FastVLM: Efficient Vision Encoding for Vision Language Models

One of the highlighted advancements is FastVLM, which addresses a common challenge in vision-language models: the inefficiency of popular visual encoders, particularly vision transformers (ViTs), when processing high-resolution images. Traditional encoders produce a large number of visual tokens and incur high encoding latency, both of which present challenges for real-time applications.
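As a rough illustration of why this matters (the numbers below assume a standard ViT with 16x16 patches and are not taken from the FastVLM paper), the token count of a plain ViT grows quadratically with input resolution:

```python
# Back-of-the-envelope illustration (assumes a standard 16x16-patch ViT; not from the
# paper): the visual token count grows quadratically with input resolution, which
# drives up both encoding latency and the downstream language model's prefill cost.

def vit_token_count(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens a plain ViT produces for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side

for size in (224, 448, 896, 1344):
    print(f"{size}x{size} input -> {vit_token_count(size):5d} visual tokens")
```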

Apple’s researchers will present FastViTHD, a novel hybrid vision encoder designed to optimize encoding time while retaining accuracy. FastVLM significantly improves the accuracy-latency trade-off through its efficient design, making it well-suited for on-device applications that prioritize privacy. Interested developers can access the model checkpoints and even a demo app based on Apple’s MLX framework here.
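For readers who want to quantify that accuracy-latency trade-off themselves, a simple timing harness along the following lines can compare encoders; this is a generic, hypothetical sketch and not part of the FastVLM release:

```python
# Hypothetical profiling helper (not part of the FastVLM release): average wall-clock
# latency of a vision encoder's forward pass, the quantity FastViTHD aims to reduce.
import time
import torch

@torch.no_grad()
def encoding_latency_ms(encoder, image: torch.Tensor, warmup: int = 3, runs: int = 10) -> float:
    """Return the mean forward-pass time in milliseconds after a few warm-up runs."""
    for _ in range(warmup):
        encoder(image)
    start = time.perf_counter()
    for _ in range(runs):
        encoder(image)
    return (time.perf_counter() - start) * 1000.0 / runs

# Example usage with any torch vision encoder that accepts a (1, 3, H, W) tensor:
# latency = encoding_latency_ms(my_encoder, torch.rand(1, 3, 1024, 1024))
```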

Matrix3D: A Unified Approach to Photogrammetry

Photogrammetry, the process of reconstructing 3D scenes from 2D images, faces two main challenges: it typically requires dense collections of input images, and its individual processing tasks are usually handled by separate, disjointed models. At CVPR, Apple will present Matrix3D, a unified model that performs multiple photogrammetry tasks, such as pose estimation and depth prediction, concurrently.

By employing a multi-modal diffusion transformer, Matrix3D integrates modalities such as images and depth maps into a single cohesive model. This approach not only improves the reliability of 3D reconstructions but also allows for full-modality training even when datasets are only partially complete. For interested readers, the code is available here.
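To make the idea of full-modality training with incomplete data concrete, here is a conceptual sketch (an assumption about the general recipe, not Matrix3D's actual code) in which whichever modalities an example happens to have are randomly split into conditioning inputs and prediction targets:

```python
# Conceptual sketch (not Matrix3D's actual code): randomly split the modalities a
# training example happens to have into "observed" conditioning inputs and prediction
# targets, so a single model covers pose estimation, depth prediction, and more.
import random

MODALITIES = ("rgb", "pose", "depth")

def sample_task(available: dict) -> tuple[dict, dict]:
    """Split one example's available modalities into (condition, target) sets."""
    present = [m for m in MODALITIES if m in available]
    if len(present) < 2:
        raise ValueError("need at least two modalities to form a training task")
    n_cond = random.randint(1, len(present) - 1)      # keep at least one target
    cond_keys = set(random.sample(present, n_cond))
    condition = {m: available[m] for m in cond_keys}
    target = {m: available[m] for m in present if m not in cond_keys}
    return condition, target

# An image+pose capture (no depth) can still contribute: depending on the random
# split, the model is asked to predict pose from RGB or RGB from pose.
cond, tgt = sample_task({"rgb": "rgb_tensor", "pose": "pose_tensor"})
print("condition:", sorted(cond), "-> target:", sorted(tgt))
```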

Multimodal Autoregressive Pre-Training of Large Vision Encoders

Apple’s research on multimodal models has led to significant advancements in training methodologies. Vision encoders have traditionally been trained with discriminative objectives, which creates a mismatch when they are later used for generative tasks. The presentation of AIMv2, a family of large vision encoders pre-trained with a multimodal autoregressive framework, addresses this gap.

These models excel at multimodal tasks and perform impressively on visual recognition benchmarks. Notably, AIMv2 reaches this level of performance while seeing significantly fewer samples during pre-training than existing models. Learn more and access the model checkpoints here.
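As a rough sketch of what a multimodal autoregressive pre-training objective can look like (an illustrative assumption, not the AIMv2 implementation), image patches and caption tokens are arranged in one causal sequence, combining a regression term on the next patch with a next-token term on the text:

```python
# Illustrative sketch of a multimodal autoregressive objective (not the AIMv2 code):
# a causal model predicts the next image patch by regression and the next caption
# token by classification, so the whole sequence is trained generatively.
import torch
import torch.nn.functional as F

def autoregressive_multimodal_loss(
    patch_pred: torch.Tensor,   # (B, N_patches, D) predicted next-patch embeddings
    patch_true: torch.Tensor,   # (B, N_patches, D) ground-truth patch embeddings
    text_logits: torch.Tensor,  # (B, N_text, vocab) next-token logits for the caption
    text_ids: torch.Tensor,     # (B, N_text) ground-truth caption token ids
) -> torch.Tensor:
    image_loss = F.mse_loss(patch_pred, patch_true)   # regression on image patches
    text_loss = F.cross_entropy(                      # standard language-model loss on text
        text_logits.reshape(-1, text_logits.size(-1)),
        text_ids.reshape(-1),
    )
    return image_loss + text_loss
```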

World-Consistent Video Diffusion with Explicit 3D Modeling

The potential for realistic video generation through diffusion models is enormous, yet many models struggle with 3D consistency. Apple’s World-Consistent Video Diffusion (WVD) offers a groundbreaking approach by training a diffusion transformer to model the joint distribution of color (RGB) frames and per-pixel spatial (XYZ) coordinates.

This model enables seamless transitions from RGB frame generation to 3D estimation, and it supports capabilities such as single-image-to-3D generation. The flexibility of WVD opens new avenues for applications that demand both efficient and high-quality generated content. More details can be found here.
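One way to picture the joint RGB-XYZ formulation (a simplified assumption, not the released WVD code) is as a diffusion model that denoises six-channel frames, with three color channels and three per-pixel coordinate channels stacked together:

```python
# Simplified illustration (not the WVD implementation): concatenate RGB frames with
# per-pixel XYZ coordinate maps along the channel axis so a single diffusion
# transformer can denoise both streams jointly.
import torch

def pack_rgb_xyz(rgb: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """rgb, xyz: (T, 3, H, W) video tensors -> (T, 6, H, W) joint representation."""
    assert rgb.shape == xyz.shape, "RGB and XYZ streams must be aligned frame by frame"
    return torch.cat([rgb, xyz], dim=1)

# At inference, conditioning on known RGB channels turns the same model into a 3D
# estimator, while sampling both halves from noise generates 3D-consistent video.
frames = pack_rgb_xyz(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64))
print(frames.shape)  # torch.Size([8, 6, 64, 64])
```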

Engaging with the Community

Apple is dedicated to fostering a diverse and inclusive ML research community. To this end, the company proudly sponsors several affinity group events at CVPR, designed to support underrepresented groups. Notable events include workshops hosted by LatinX in CV and Women in Computer Vision, scheduled for June 11 and June 12, respectively.

Experience Apple’s ML Research Live

Attendees at CVPR 2025 will have the opportunity to experience live demonstrations of Apple’s machine learning research at booth #1217. The demonstrations will feature not only FastVLM but also the other models mentioned above, allowing researchers and developers to engage directly with Apple’s innovative technologies.

Expanding Through Collaboration

CVPR serves as a critical platform for researchers to advance the state of the art in computer vision. Through its contributions and engagement at the conference, Apple not only showcases its research but also strengthens its connections with the wider AI and ML community. For a comprehensive overview of Apple’s involvement in CVPR 2025 and detailed schedules, visit this link.

By participating in this event, Apple reaffirms its commitment to driving innovation in AI and ML, inspiring collaboration, and propelling the community forward.
