Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Abstract
Group3D is a multi-view open-vocabulary 3D detection framework that integrates semantic constraints into instance construction through semantic compatibility groups, improving accuracy in pose-known and pose-free settings.
Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
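The merge-time gating described above can be illustrated with a minimal sketch. All names, data structures, and thresholds below are hypothetical illustrations, not the paper's actual implementation; the key point is that a merge requires both the semantic gate and the geometric test to pass:

```python
# Hypothetical sketch of semantically gated fragment merging.
# Fragment fields, the IoU criterion, and the threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Fragment:
    box: tuple   # axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax)
    label: str   # open-vocabulary category assigned in some view

def iou_3d(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo

    def vol(t: tuple) -> float:
        return (t[3] - t[0]) * (t[4] - t[1]) * (t[5] - t[2])

    return inter / (vol(a) + vol(b) - inter)

def semantically_compatible(a: Fragment, b: Fragment,
                            groups: list[set[str]]) -> bool:
    """Labels are compatible if they fall in the same compatibility group."""
    return any(a.label in g and b.label in g for g in groups)

def should_merge(a: Fragment, b: Fragment,
                 groups: list[set[str]], iou_thresh: float = 0.25) -> bool:
    # Merge only when BOTH the semantic gate and the geometric test pass:
    # geometry alone can over-merge distinct but overlapping objects.
    return semantically_compatible(a, b, groups) and iou_3d(a.box, b.box) >= iou_thresh
```

With such a gate, two heavily overlapping fragments labeled "sofa" and "lamp" would be kept separate unless some compatibility group deems those labels equivalent, while "sofa" and "couch" fragments placed in the same group can still be merged despite cross-view label variability.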
Community
We are excited to share our recent work "Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection".
TL;DR: We present Group3D, enabling robust open-vocabulary 3D object detection by integrating semantic constraints into multi-view 3D instance construction.
We leverage multimodal large language models (MLLMs) to build scene-adaptive vocabularies and semantic compatibility groups, which guide 3D fragment merging alongside geometric consistency. This semantically aware merging mitigates over-merging and fragmentation, improving performance in multi-view RGB settings, including pose-free and zero-shot scenarios.
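As a rough illustration of the grouping step, a scene vocabulary can be clustered into compatibility groups from pairwise equivalence judgments. In Group3D those judgments come from an MLLM; the `compatible` predicate below is a mocked stand-in, and the union-find clustering is an assumed illustration rather than the paper's method:

```python
# Hypothetical sketch: organizing a scene-adaptive vocabulary into
# semantic compatibility groups via union-find over pairwise judgments.
from itertools import combinations
from typing import Callable

def build_groups(vocab: list[str],
                 compatible: Callable[[str, str], bool]) -> list[set[str]]:
    """Cluster labels into compatibility groups (connected components)."""
    parent = {w: w for w in vocab}

    def find(w: str) -> str:
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for a, b in combinations(vocab, 2):
        if compatible(a, b):
            parent[find(a)] = find(b)  # union the two groups

    groups: dict[str, set[str]] = {}
    for w in vocab:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())

# Mocked MLLM judgment: "sofa" and "couch" may name the same object.
vocab = ["sofa", "couch", "coffee table", "table"]
groups = build_groups(vocab, lambda a, b: {a, b} == {"sofa", "couch"})
# -> [{"sofa", "couch"}, {"coffee table"}, {"table"}]
```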
Similar papers recommended by the Semantic Scholar API:
- JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas (2026)
- ReLaGS: Relational Language Gaussian Splatting (2026)
- UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing (2026)
- SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images (2026)
- Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence (2026)
- Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes (2026)
- Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes (2026)