
Refer to Anything: How Vision-Language Prompts Are Revolutionizing Image Segmentation

In a groundbreaking new paper titled Refer to Anything with Vision-Language Prompts, researchers from the University of Illinois Urbana-Champaign and Adobe introduce a novel approach to image segmentation that could transform how businesses interact with visual data. The work, published on arXiv, presents a framework called Refer to Any Segmentation Mask Group (RAS), which enables highly flexible, multimodal interactions for segmenting objects in images based on arbitrary combinations of text and visual prompts.

The Challenge: Beyond Traditional Segmentation

Current image segmentation models excel at producing high-quality masks for visual entities, but they struggle with complex queries that require understanding both language and visual context. This limitation makes them less effective for real-world applications where users need to interact with images in intuitive, multimodal ways—whether in autonomous driving, robotics, augmented reality, or image editing.

For example, consider tasks like:

  • "Select all spools that are not green."
  • "Find all players wearing red jerseys."
  • "Water bottles similar to."

Traditional models can't handle such nuanced instructions, especially when they involve relationships or comparisons with reference visual entities. Describing these references purely in language is often cumbersome or imprecise, particularly in complex scenes.

Introducing ORES: Omnimodal Referring Expression Segmentation

The paper proposes a new task called Omnimodal Referring Expression Segmentation (ORES), which extends classic referring expression segmentation (RES) by allowing prompts to include both text and reference visual entities. This enables more expressive and practical interactions, as users can now:

  1. Describe targets via text (e.g., category, attribute, position).
  2. Provide visual prompts (e.g., masks of reference objects) to specify relationships that are hard to verbalize.

The key innovation is that ORES outputs groups of masks that satisfy the prompt, rather than just single objects. This is crucial for applications like object removal, editing, or scene understanding, where multiple entities may need to be manipulated simultaneously.
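To make the ORES setup concrete, here is a minimal Python sketch of what a request and its result could look like: a text prompt paired with optional reference masks going in, and a group of masks coming out. The class and function names (OresPrompt, OresResult, remove_objects) are illustrative assumptions for this post, not the paper's actual API.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class OresPrompt:
    """An omnimodal prompt: free-form text plus optional reference masks."""
    text: str                                          # e.g. "water bottles similar to <ref>"
    reference_masks: list = field(default_factory=list)  # binary HxW numpy arrays


@dataclass
class OresResult:
    """ORES output: a *group* of masks (possibly several), not a single mask."""
    masks: list                                        # binary HxW numpy arrays satisfying the prompt


def remove_objects(image: np.ndarray, result: OresResult) -> np.ndarray:
    """Toy downstream use: blank out every selected entity, e.g. for object removal."""
    edited = image.copy()
    for mask in result.masks:
        edited[mask.astype(bool)] = 0
    return edited
```

Because the result is a set of masks rather than one, a downstream editor can manipulate every matched entity in a single operation.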

The RAS Framework: Bridging Segmentation and Language Models

To tackle ORES, the team developed RAS, a framework that combines the strengths of segmentation foundation models (like Meta's Segment Anything Model) and large multimodal models (LMMs). Here's how it works:

  1. Candidate Mask Proposal: A segmentation model (e.g., SAM) proposes a pool of potential masks for an image.
  2. Mask Tokenization: Each candidate mask is converted into a "mask token"—a compact representation that captures the visual entity's features.
  3. Multimodal Comprehension: A mask-centric LMM (based on LLaVA-1.5) processes the tokens alongside text prompts to understand which masks belong to the target group.
  4. Non-Autoregressive Decoding: Unlike traditional LMMs that generate outputs sequentially, RAS uses a more efficient binary classification approach to select relevant masks in one pass.

This design avoids the pitfalls of autoregressive decoding (which struggles with unordered sets) and leverages the semantic understanding of LLMs to interpret complex prompts.
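The following PyTorch sketch illustrates how these pieces could fit together, using stand-in tensors in place of a real SAM proposal stage and a real LLaVA-1.5 backbone. MaskTokenizer and GroupSelectionHead are hypothetical module names invented for this example; the point is the shape of the pipeline (masked pooling producing per-candidate tokens, then a one-pass sigmoid scorer standing in for the non-autoregressive group selection), not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class MaskTokenizer(nn.Module):
    """Turn each candidate mask into a compact 'mask token' via masked average pooling."""
    def __init__(self, feat_dim: int, token_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, token_dim)

    def forward(self, image_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # image_feats: (C, H, W) image features; masks: (N, H, W) binary candidates
        weights = masks / masks.sum(dim=(1, 2), keepdim=True).clamp(min=1.0)
        pooled = torch.einsum("chw,nhw->nc", image_feats, weights)   # (N, C)
        return self.proj(pooled)                                     # (N, token_dim)


class GroupSelectionHead(nn.Module):
    """One-pass (non-autoregressive) selection: score all tokens, keep those above a threshold."""
    def __init__(self, token_dim: int):
        super().__init__()
        self.scorer = nn.Linear(token_dim, 1)

    def forward(self, contextual_tokens: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        probs = torch.sigmoid(self.scorer(contextual_tokens)).squeeze(-1)  # (N,)
        return probs > threshold                                           # boolean membership per candidate


# Toy end-to-end pass with stand-in tensors (no real SAM or LMM involved).
C, H, W, N, D = 256, 64, 64, 12, 512
image_feats = torch.randn(C, H, W)                     # stand-in for image encoder features
candidate_masks = (torch.rand(N, H, W) > 0.5).float()  # stand-in for SAM mask proposals (stage 1)

mask_tokens = MaskTokenizer(C, D)(image_feats, candidate_masks)  # stage 2: mask tokenization
# Stage 3 (omitted here): an LMM such as LLaVA-1.5 would contextualize these tokens
# together with the text prompt; we reuse the raw tokens as a placeholder.
contextual_tokens = mask_tokens
selected = GroupSelectionHead(D)(contextual_tokens)              # stage 4: one-pass group selection
print("selected candidate indices:", selected.nonzero().flatten().tolist())
```

Because every candidate is scored independently in a single forward pass, the output is naturally an unordered set, which is exactly what a mask group is.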

Datasets: MASKGROUPS-2M and MASKGROUPS-HQ

Training RAS required new datasets:

  • MASKGROUPS-2M: A large-scale dataset automatically generated by repurposing annotations from existing datasets (MS-COCO, LVIS, Visual Genome, etc.). It includes 2M mask groups based on categories, attributes, positions, and free-form descriptions (a toy sketch of this grouping idea follows after the list).
  • MASKGROUPS-HQ: A smaller, high-quality dataset with 100K human-annotated mask groups. These annotations cover diverse, creative grouping criteria (e.g., "All animals with eyes showing" or "Objects on [a reference entity indicated by a visual prompt]"), ensuring alignment with real-world use cases.
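As a rough illustration of how MASKGROUPS-2M-style automatic generation can work, the toy sketch below derives category-based and attribute-based mask groups from COCO-style instance annotations. The annotation fields, prompt templates, and example values are invented for this example and are not taken from the paper's actual pipeline.

```python
from collections import defaultdict

# COCO-style instance annotations (illustrative values, not real data)
annotations = [
    {"mask_id": 1, "category": "bottle", "attributes": ["green"]},
    {"mask_id": 2, "category": "bottle", "attributes": ["blue"]},
    {"mask_id": 3, "category": "person", "attributes": ["red jersey"]},
]


def build_category_groups(annots):
    """Group mask ids by category to form (prompt, mask group) training pairs."""
    groups = defaultdict(list)
    for a in annots:
        groups[a["category"]].append(a["mask_id"])
    return [{"prompt": f"all {cat}s", "mask_group": ids} for cat, ids in groups.items()]


def build_attribute_groups(annots, attribute):
    """Negation-style prompt, e.g. 'bottles that are not green'."""
    ids = [a["mask_id"] for a in annots
           if a["category"] == "bottle" and attribute not in a["attributes"]]
    return {"prompt": f"bottles that are not {attribute}", "mask_group": ids}


print(build_category_groups(annotations))
print(build_attribute_groups(annotations, "green"))
```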

Results: State-of-the-Art Performance

RAS outperforms existing models across multiple benchmarks:

  • ORES: RAS achieves a 74.59 cIoU on MASKGROUPS-HQ, significantly surpassing prior GRES models (which can't even process visual prompts).
  • RES/GRES: When adapted to classic referring segmentation tasks, RAS sets new records (77.8 cIoU on RefCOCO and 71.79 cIoU on gRefCOCO).

The paper also highlights RAS's efficiency: its non-autoregressive decoding is 2.13× faster than autoregressive alternatives while being more accurate.
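For readers unfamiliar with the cIoU numbers quoted above: cIoU is commonly computed as the cumulative intersection over the cumulative union across all evaluation samples, the convention used by RefCOCO-style and GRES benchmarks. A minimal sketch, assuming binary NumPy masks:

```python
import numpy as np


def cumulative_iou(pred_masks, gt_masks):
    """cIoU: sum of per-sample intersections divided by sum of per-sample unions, in percent."""
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    return 100.0 * inter / max(union, 1)


# Toy check with two random prediction/ground-truth pairs
rng = np.random.default_rng(0)
preds = [rng.random((32, 32)) > 0.5 for _ in range(2)]
gts = [rng.random((32, 32)) > 0.5 for _ in range(2)]
print(f"cIoU: {cumulative_iou(preds, gts):.2f}")
```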

Business Implications

This work has immediate applications in:

  • Content Creation: Precise object selection/editing in tools like Photoshop.
  • E-Commerce: Automatically segmenting products based on multimodal queries.
  • Autonomous Systems: Enhanced scene understanding for robotics and self-driving cars.

By enabling seamless interaction with visual data through natural language and visual prompts, RAS bridges the gap between human intuition and machine perception—a critical step toward more intuitive AI-powered workflows.

For more details, check out the project page or the full paper on arXiv.