Meta’s DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos
Researchers at Meta have introduced DGS-LRM (Deformable Gaussian Splats Large Reconstruction Model), a breakthrough in real-time 3D reconstruction from monocular videos. Unlike traditional methods that require hours of per-scene optimization, DGS-LRM reconstructs a dynamic scene's geometry, appearance, and motion in a single forward pass, with inference taking roughly 0.6 seconds on an A100 GPU.
Why This Matters
Dynamic scene reconstruction has applications in AR/VR, robotics, and digital content creation. Until now, most feed-forward models could only handle static scenes, while dynamic scene reconstruction relied on slow, optimization-heavy pipelines. DGS-LRM changes that by predicting deformable 3D Gaussians—a representation that supports high-quality novel view synthesis and long-range 3D tracking—directly from a monocular video.
Key Innovations
- Deformable 3D Gaussian Representation
Each pixel in a video frame is mapped to a deformable 3D Gaussian splat, which includes depth, color, rotation, scale, opacity, and 3D scene flow (motion vectors across timestamps). This allows for realistic warping of objects over time (see the sketch after this list).
- Large-Scale Synthetic Training Data
To overcome the scarcity of real-world training data with ground-truth 3D motion, Meta built a custom synthetic dataset using Kubric, a physics-based simulator. The dataset includes multi-view videos with per-pixel 3D scene flow annotations, enabling the model to generalize to real-world footage.
- Transformer Architecture
DGS-LRM uses a 24-layer transformer with temporal tokenization to efficiently process video inputs. Unlike previous methods that tokenize frames independently, this approach compresses spatiotemporal cubes into tokens, reducing computational overhead (also sketched after this list).
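To make the deformable Gaussian representation concrete, here is a minimal, hypothetical PyTorch-style sketch (not Meta's code) of how per-pixel attributes could be decoded into deformable splats: predicted depth is unprojected to a splat center, and the predicted 3D scene flow translates that center to any other timestamp. The decoder heads, tensor names, and shapes are illustrative assumptions.

```python
import torch

def unproject(depth, K_inv, c2w):
    """Lift per-pixel depth to world-space 3D points (pinhole camera model)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3) homogeneous pixels
    rays = pix @ K_inv.T                                           # camera-space ray directions
    pts_cam = rays * depth[..., None]                              # scale rays by predicted depth
    return pts_cam @ c2w[:3, :3].T + c2w[:3, 3]                    # camera -> world

def decode_gaussians(feats, heads, depth, K_inv, c2w, num_frames):
    """Decode one deformable Gaussian per pixel from transformer features.

    `heads` is a hypothetical dict of small per-attribute decoders.
    """
    H, W = depth.shape
    return {
        "center":   unproject(depth, K_inv, c2w),     # (H, W, 3) splat means
        "color":    heads["color"](feats).sigmoid(),  # (H, W, 3)
        "rotation": heads["rotation"](feats),         # (H, W, 4) quaternion (normalized before rendering)
        "scale":    heads["scale"](feats).exp(),      # (H, W, 3)
        "opacity":  heads["opacity"](feats).sigmoid(),# (H, W, 1)
        # Per-pixel 3D scene flow toward every timestamp: this is what
        # makes the splats deformable rather than static.
        "flow":     heads["flow"](feats).reshape(H, W, num_frames, 3),
    }

def deform_to_frame(gaussians, t):
    """Warp splat centers to timestamp t by adding the predicted scene flow;
    the result can be rendered with any standard 3D Gaussian rasterizer."""
    return gaussians["center"] + gaussians["flow"][:, :, t]
```

In the actual model these attributes are regressed by the 24-layer transformer; the sketch only shows how the per-pixel outputs fit together.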
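The temporal tokenization can be sketched in the same hedged way: instead of one token per spatial patch per frame, the video is cut into spatiotemporal cubes, so the token count drops by the temporal patch size. The patch sizes below are illustrative, not the paper's exact values.

```python
import torch

def tokenize_video(video, p=8, t_p=2):
    """Group a (T, C, H, W) video into cubes of t_p frames x p x p pixels,
    yielding T*H*W / (t_p*p*p) tokens instead of T*H*W / (p*p) per-frame patches."""
    T, C, H, W = video.shape
    cubes = video.reshape(T // t_p, t_p, C, H // p, p, W // p, p)
    cubes = cubes.permute(0, 3, 5, 1, 4, 6, 2)   # (T/t_p, H/p, W/p, t_p, p, p, C)
    return cubes.reshape(-1, t_p * p * p * C)    # one flat token per spatiotemporal cube
```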
Performance Highlights
- Real-Time Inference: Processes 24 FPS video in 0.495 seconds (vs. hours for optimization-based methods).
- State-of-the-Art Quality: Matches optimization-based deformable Gaussian splatting (D3DGS) in reconstruction fidelity while being orders of magnitude faster.
- Accurate 3D Tracking: Predicted scene flow achieves comparable performance to specialized monocular 3D tracking methods like SpatialTracker.
Applications
- AR/VR: Instant 3D scene capture for immersive experiences.
- Content Creation: Rapid digital twin generation for films and games.
- Robotics: Real-time environment mapping for navigation.
Limitations
- Struggles with large motions (due to synthetic training data constraints).
- Requires temporally continuous video (discrete frames degrade performance).
- Novel view quality drops with extreme camera deviations from the input trajectory.
The Big Picture
DGS-LRM represents a leap toward real-time, generalizable 3D reconstruction of dynamic scenes. By combining deformable 3D Gaussians with large-scale synthetic training, Meta has set a new benchmark for feed-forward methods. Future work could address motion scale limitations and improve robustness for in-the-wild videos.
For more details, check out the full paper on arXiv.