PerceptionLM: A Fully Open and Reproducible Framework for Detailed Visual Understanding

Vision-language models (VLMs) are now a cornerstone of computer vision research, widely used in both academia and industry. However, many high-performing models remain closed-source, obscuring their data, design, and training recipes. This lack of transparency has led the research community to rely on distillation from black-box models to label training data, achieving strong benchmark results at the cost of measurable scientific progress. Without knowing the details of the teacher model and its data sources, it's difficult to track genuine advancements in the field.

Enter PerceptionLM (PLM), a fully open and reproducible model for transparent research in image and video understanding. Developed by a team from Meta FAIR, UT Austin, MBZUAI, and Meta Reality Labs, PLM aims to demystify the training of VLMs from scratch. The project releases 2.8 million human-labeled fine-grained video question-answer pairs and spatio-temporally grounded video captions, a collection nearly an order of magnitude larger than existing datasets of this kind.

Key Contributions

  1. Open-Access Data and Models: PLM is built without distillation from proprietary models, using a white-box data engine to generate synthetic data. The team identifies critical gaps in video understanding, particularly in spatio-temporal reasoning and fine-grained understanding tasks.
  2. Large-Scale Human-Annotated Data: The release includes the datasets below (illustrative record sketches follow this list):
  • PLM–FGQA: 2.4 million fine-grained video QA pairs focusing on 'how' actions are performed.
  • PLM–STC: 476,000 spatio-temporally grounded video captions with segmentation masks.
  3. PLM–VideoBench: A new benchmark suite evaluating challenging video understanding tasks, focusing on 'what', 'where', 'when', and 'how' aspects of videos.
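To make the two annotation types concrete, here is a rough sketch of what individual records might look like. The field names and values are purely illustrative assumptions for exposition, not the released schema.

```python
# Hypothetical record layouts for PLM-FGQA and PLM-STC annotations.
# Field names and values are illustrative assumptions, not the actual schema.

fgqa_example = {
    "video_id": "example_clip_0001",
    "question": "How does the person fold the dough?",   # 'how'-focused QA
    "answer": "They press it flat, then fold it in thirds toward the center.",
}

stc_example = {
    "video_id": "example_clip_0002",
    "caption": "A hand picks up the red mug and places it on the shelf.",
    "start_time": 3.2,   # temporal grounding, in seconds (assumed)
    "end_time": 7.8,
    "masks": "per-frame segmentation masks for the referred region",
}
```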

Model Architecture

PLM consists of a vision encoder (Perception Encoder) connected to a small-scale LLM decoder (Llama 3, with 1B, 3B, or 8B parameters). It supports high-resolution images (up to 36 tiles) and videos (32 frames), with dynamic tiling and spatial pooling to manage token counts.
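A quick back-of-the-envelope token budget helps show why dynamic tiling and spatial pooling matter. The tile resolution, patch size, and pooling factor below are assumptions for illustration, not confirmed PLM settings; only the 36-tile and 32-frame limits come from the description above.

```python
# Token-budget sketch for PLM-style dynamic tiling and spatial pooling.
# All constants are illustrative assumptions, not confirmed PLM values.

def vision_tokens(num_tiles: int,
                  tile_res: int = 448,      # assumed tile resolution
                  patch_size: int = 14,     # assumed ViT patch size
                  pool: int = 2) -> int:    # assumed 2x2 spatial pooling
    """Tokens the LLM decoder sees for an input split into `num_tiles` tiles."""
    patches_per_side = tile_res // patch_size            # e.g. 448 / 14 = 32
    tokens_per_tile = (patches_per_side // pool) ** 2    # pooled tokens per tile
    return num_tiles * tokens_per_tile

# High-resolution image: up to 36 tiles.
print(vision_tokens(36))        # 36 * 256 = 9216 tokens
# Video: 32 frames, assuming one tile per frame.
print(32 * vision_tokens(1))    # 32 * 256 = 8192 tokens
```

Under these assumed values, a 36-tile image contributes roughly 9,000 vision tokens before any further reduction, which is why pooling is needed to keep the decoder's context budget manageable.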

Training Pipeline

PLM is trained in three stages:

  1. Projector Warm-Up: Freezes the vision encoder and LLM, training only the vision projector on synthetic image data (see the freezing sketch after this list).
  2. Large-Scale Midtraining: Trains on diverse synthetic image and video data (64.7M samples).
  3. Supervised Finetuning: Uses high-quality human-annotated data to tackle challenging video tasks.
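A minimal sketch of how the stages might differ in which parameters update, assuming a PyTorch-style model with `vision_encoder`, `projector`, and `llm` submodules (names illustrative): stage 1 freezes everything except the projector, and the later stages are assumed here to train the full model, differing mainly in their data mix.

```python
import torch.nn as nn

# Submodule names (vision_encoder, projector, llm) are illustrative assumptions.

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter in `module`."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze submodules for training stage 1, 2, or 3."""
    if stage == 1:
        # Projector warm-up: vision encoder and LLM stay frozen.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.llm, False)
        set_trainable(model.projector, True)
    else:
        # Mid-training and supervised finetuning: full model trains (assumption);
        # these stages differ in the data they see, not in this sketch.
        set_trainable(model, True)
```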

Performance

PLM matches state-of-the-art open-weight models (e.g., InternVL2.5) without relying on proprietary distillation. The 8B model outperforms Qwen2.5-VL on 10 image and 15 video benchmarks, with notable gains in perception-focused tasks (+9.1 points), video captioning (+39.8 CIDEr), and fine-grained video QA (+3.8 points).

Why This Matters

PLM sets a new standard for reproducible VLM research by providing:

  • Transparency: Full access to data, training recipes, code, and models.
  • Scalability: Insights into synthetic data scaling laws and critical data gaps.
  • Novel Capabilities: Support for fine-grained QA and region-based dense video captioning, enabling applications like AI coaching and grounded video transcription.

Limitations and Future Work

While PLM excels in many areas, it shows room for improvement in long-video modeling and tasks requiring extensive world knowledge. Future work may explore integrating long-video components and expanding the data mix to include multi-step reasoning and robotics data.

Get Involved

PLM represents a significant step toward transparent, reproducible research in visual perception. With its data, code, and models openly released, it offers a foundation for future advancements in detailed visual understanding.