
Reflect-then-Plan: A Doubly Bayesian Approach to Offline Model-Based Planning

Offline reinforcement learning (RL) is a powerful tool for training policies when online exploration is costly or unsafe. However, it often struggles with high epistemic uncertainty due to limited data coverage. Most existing methods rely on fixed conservative policies, which can be overly restrictive and lack adaptability. Enter Reflect-then-Plan (RefPlan), a novel approach that unifies uncertainty modeling and model-based (MB) planning through a doubly Bayesian lens.

The Challenge of Offline RL

Offline RL learns from static datasets, but the agent’s inability to gather new experiences means it can’t precisely identify the true Markov decision process (MDP). This leads to high epistemic uncertainty for states and actions outside the data distribution. Traditional methods address this by learning conservative policies that stay close to the data distribution, but these can be inflexible and fail in unexpected states.

How RefPlan Works

RefPlan tackles this by recasting planning as Bayesian posterior estimation. At deployment, it:

  1. Reflects: Updates a belief over environment dynamics using real-time observations.
  2. Plans: Incorporates this uncertainty into MB planning via marginalization, considering a range of possible scenarios beyond the agent’s immediate knowledge (see the sketch after this list).
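
To make the reflect-then-plan loop concrete, here is a minimal Python sketch of the deployment-time control flow. It is not the paper's implementation: `encoder.infer_belief`, `prior_policy.propose`, and `dynamics_model.rollout_returns` are hypothetical interfaces standing in for the learned belief model, the offline-learned policy used to propose action sequences, and the learned dynamics model, and the classic Gym-style `env.step` return signature is assumed.

```python
import numpy as np

def reflect_then_plan(env, encoder, dynamics_model, prior_policy,
                      horizon=10, n_belief_samples=5, n_candidates=64):
    """Run one episode with reflect-then-plan control (illustrative interfaces)."""
    history = []                          # transitions observed so far this episode
    state = env.reset()
    done = False
    while not done:
        # Reflect: update the posterior belief over latent dynamics from the
        # experience gathered so far.
        belief = encoder.infer_belief(history)

        # Plan: score candidate action sequences by expected return, averaged
        # (marginalized) over dynamics hypotheses sampled from the belief.
        candidates = prior_policy.propose(state, horizon, n_candidates)
        scores = np.zeros(n_candidates)
        for z in belief.sample(n_belief_samples):
            scores += dynamics_model.rollout_returns(state, candidates, z)
        best_plan = candidates[int(np.argmax(scores))]

        # Execute only the first planned action, then re-reflect (MPC-style).
        next_state, reward, done, _ = env.step(best_plan[0])
        history.append((state, best_plan[0], reward, next_state))
        state = next_state
```

Because only the first action of the best plan is executed before the belief is refreshed, the agent keeps adapting as new observations arrive, which is exactly what a fixed conservative policy cannot do.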

This approach allows RefPlan to maintain robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics.

Key Innovations

  • Doubly Bayesian Framework: Combines epistemic uncertainty modeling with MB planning in a unified probabilistic framework.
  • Real-Time Adaptation: Uses a variational autoencoder to infer a posterior belief distribution from past experiences at test time.
  • Marginalization Over Uncertainty: Plans by marginalizing over the agent’s epistemic uncertainty, resulting in a posterior distribution over optimized plans (sketched below).
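
The sketch below illustrates what a VAE-style belief encoder for the "Real-Time Adaptation" step might look like. It is a hedged PyTorch example, assuming the class name `BeliefEncoder`, the layer sizes, and the input layout; none of these are taken from the paper's architecture.

```python
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    """Illustrative VAE-style encoder: summarizes the transitions seen so far
    into a Gaussian belief over a latent variable describing the dynamics."""

    def __init__(self, obs_dim, act_dim, latent_dim=16, hidden_dim=128):
        super().__init__()
        # Input at each step: observation, action, and scalar reward.
        self.rnn = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.mean_head = nn.Linear(hidden_dim, latent_dim)
        self.log_std_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, transitions):
        # transitions: (batch, time, obs_dim + act_dim + 1)
        _, h = self.rnn(transitions)          # final hidden state summarizes history
        h = h.squeeze(0)
        return torch.distributions.Normal(self.mean_head(h),
                                          self.log_std_head(h).exp())
```

At test time, latent samples drawn from this belief condition a learned dynamics model, one rollout per sample; averaging the predicted returns over those rollouts is what "marginalizing over epistemic uncertainty" amounts to in practice.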

Empirical Results

RefPlan was tested on standard offline RL benchmarks, showing significant improvements over conservative offline RL policies. Key findings include:

  • Robustness Under Uncertainty: Maintains performance even when initialized in out-of-distribution (OOD) states.
  • Flexibility: Enhances policies learned from diverse offline RL algorithms, including CQL, EDAC, MOPO, COMBO, and MAPLE.
  • Resilience to Data Limitations: Performs well with limited data, outperforming baselines as dataset size decreases.
  • Adaptability to Dynamic Changes: Shows superior resilience to shifts in environment dynamics compared to fixed policies.

Why This Matters

RefPlan addresses a critical limitation in offline RL: the inability to adapt to unseen scenarios. By explicitly modeling and marginalizing over epistemic uncertainty, it provides a more flexible, generalizable, and robust solution for real-world applications where exploration is costly or unsafe.

Future Directions

While RefPlan excels in standard benchmarks, future work could explore its application to more complex models and environments. Additionally, integrating data augmentation techniques could further enhance its adaptability to dynamic changes.

RefPlan represents a significant step forward in offline RL, offering a principled way to handle uncertainty and improve policy performance at test time. For more details, check out the full paper on arXiv.