How AI Can Automatically Shape Rewards from Confounded Data
Reinforcement learning (RL) is a powerful tool for training AI agents to perform complex tasks, from playing games to controlling robots. But one of the biggest challenges in RL is designing the right reward function—the signal that tells the agent whether it’s doing well or not. Traditionally, this requires a lot of manual tweaking by experts. But what if we could automate this process, even when the data we’re learning from is messy or incomplete?
A new paper titled "Automatic Reward Shaping from Confounded Offline Data" by Mingxuan Li, Junzhe Zhang, and Elias Bareinboim introduces a novel method to do just that. The researchers propose a way to automatically design reward functions from offline datasets—collections of past interactions—even when those datasets are contaminated with unobserved confounding bias. In other words, their method works even when the data doesn’t tell the whole story.
The Problem with Reward Shaping
Reward shaping is a technique for speeding up the learning process of RL agents by adding an auxiliary signal to the environment's reward. The idea is to guide the agent toward high-reward states more efficiently. However, designing these shaping functions usually relies on domain expertise and manual effort, which is time-consuming and prone to human bias.
To make matters worse, when the data used to design these shaping functions is confounded—meaning it’s influenced by hidden variables—traditional methods can lead to incorrect or misleading rewards. For example, if a robot learns from a dataset where some actions were influenced by unobserved factors (like wind direction in a navigation task), it might learn suboptimal or even dangerous behaviors.
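To see the issue concretely, here is a toy simulation (not the paper's experimental setup; all numbers are invented) in which a hidden variable playing the role of wind influences both the logged action and the reward. The naive observational average makes the action look far better than it actually is:

```python
# Toy illustration of confounding bias: a hidden variable U skews the log.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.integers(0, 2, size=n)                  # unobserved confounder (e.g., wind)
# The behavior policy that generated the log reacts to U ...
a = np.where(u == 1, rng.binomial(1, 0.9, n), rng.binomial(1, 0.1, n))
# ... and the reward depends on both the chosen action and U.
r = 0.2 * a + 0.6 * u + 0.05 * rng.standard_normal(n)

naive = r[a == 1].mean()                        # observational estimate E[R | A=1]
true_do = 0.2 + 0.6 * u.mean()                  # interventional value E[R | do(A=1)]
print(f"naive   E[R | A=1]     ~ {naive:.2f}")    # ~ 0.74, looks great
print(f"actual  E[R | do(A=1)] ~ {true_do:.2f}")  # ~ 0.50, much less impressive
```

A shaping function built from the naive estimate would push the agent toward actions that only looked good because of the hidden wind.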
A Causal Approach to Reward Shaping
The key innovation in this paper is the use of causal reasoning to handle confounded data. The researchers frame the problem within Confounded Markov Decision Processes (CMDPs), a model that accounts for unobserved variables affecting actions, rewards, and state transitions. They then derive a Causal Bellman Optimal Equation, which provides a robust upper bound on the optimal state values even when the data is confounded.
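The exact derivation is in the paper, but a rough intuition for why confounded data still yields a usable upper bound comes from classic worst-case (Manski-style) reasoning: for rewards bounded in [0, 1], whatever happened in the cases where the logged policy chose differently, the reward there could not have exceeded 1. A minimal sketch of that idea, not the paper's Causal Bellman equation:

```python
# Rough intuition only: with rewards bounded in [0, 1], purely observational
# quantities already bound the effect of intervening with action a:
#   E[R | do(a)]  <=  E[R | A=a] * P(A=a)  +  1 * (1 - P(A=a))
def reward_upper_bound(mean_r_given_a: float, p_a: float, r_max: float = 1.0) -> float:
    """Worst-case upper bound on E[R | do(a)] computable from confounded logs."""
    return mean_r_given_a * p_a + r_max * (1.0 - p_a)

# Example: an action logged 30% of the time with observed average reward 0.8.
print(reward_upper_bound(0.8, 0.3))   # 0.8*0.3 + 1.0*0.7 = 0.94
```

The paper's equation propagates bounds of this flavor through the Bellman recursion, which is how it arrives at upper bounds on optimal state values rather than on a single action's reward.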
This upper bound is then used to construct Potential-Based Reward Shaping (PBRS) functions automatically. PBRS is a class of shaping functions, built from a potential over states, with a well-known guarantee: adding them leaves the optimal policy unchanged, so the agent cannot be steered toward the wrong behavior by the shaped rewards.
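For reference, the PBRS construction (Ng et al., 1999) adds the term gamma * Phi(s') - Phi(s) to the environment reward, where Phi is the potential function; because these terms telescope along any trajectory, the set of optimal policies is preserved. A minimal sketch, assuming the causal value upper bound is available as a callable `v_upper` (an assumed name, not from the paper):

```python
# Potential-based reward shaping (Ng et al., 1999). The added term
# gamma * phi(s_next) - phi(s) cancels out along any trajectory, so the
# optimal policy is unchanged. Here the potential would be the causal value
# upper bound; `v_upper` is just an assumed state -> float callable.
def shaped_reward(r, s, s_next, v_upper, gamma=0.99, terminal=False):
    """Return the PBRS-shaped reward, treating the terminal potential as zero."""
    next_potential = 0.0 if terminal else v_upper(s_next)
    return r + gamma * next_potential - v_upper(s)
```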
Better Learning with Shaped Rewards
The paper also introduces a modified version of Q-UCB, a model-free RL algorithm, that learns from these automatically shaped rewards. The authors prove that their method enjoys a better gap-dependent regret bound (roughly, a measure of how much reward the agent forgoes before it settles on the optimal policy) than learning without shaping.
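As a simplified sketch of what such an agent could look like, the snippet below runs tabular Q-learning with a UCB-style exploration bonus, fed the shaped reward instead of the raw one. The environment interface (`reset`/`step`), the bonus constant `c`, and the logarithmic term are illustrative assumptions, not the paper's exact algorithm or hyperparameters:

```python
# Illustrative sketch only: episodic tabular Q-learning with a UCB bonus
# (in the spirit of Q-UCB), learning from potential-shaped rewards.
import numpy as np

def q_ucb_with_shaping(env, v_upper, n_states, n_actions, H, episodes=500, c=1.0):
    Q = np.full((H, n_states, n_actions), float(H))     # optimistic initialization
    N = np.zeros((H, n_states, n_actions), dtype=int)   # visit counts
    for _ in range(episodes):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r, done = env.step(a)
            # Potential-based shaping with the value upper bound as potential.
            r_shaped = r + (0.0 if done else v_upper(s_next)) - v_upper(s)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                    # Q-UCB style learning rate
            bonus = c * np.sqrt(H**3 * np.log(episodes * H + 1) / t)
            future = 0.0 if (done or h == H - 1) else Q[h + 1, s_next].max()
            Q[h, s, a] += alpha * (r_shaped + future + bonus - Q[h, s, a])
            s = s_next
            if done:
                break
    return Q
```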
In experiments, the researchers tested their approach in custom grid-world environments where wind direction (an unobserved confounder) affects movement. Their method outperformed baselines that used naive reward shaping or no shaping at all, demonstrating faster convergence and better final performance.
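For readers who want to poke at this kind of setting themselves, a bare-bones confounded environment can be mocked up as follows; the layout, wind model, and reward are invented for illustration and are not the authors' benchmark:

```python
# Minimal 1-D "windy corridor". The wind is sampled every step but never
# returned to the agent, so any log generated by a wind-aware behavior policy
# is confounded. Invented for illustration, not the paper's environment.
import numpy as np

class WindyCorridor:
    def __init__(self, length=8, seed=0):
        self.length = length
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                       # 0 = step left, 1 = step right
        wind = self.rng.integers(0, 2)            # hidden confounder
        drift = 1 if wind == 1 else -1            # wind pushes the agent around
        move = (1 if action == 1 else -1) + drift
        self.pos = int(np.clip(self.pos + move, 0, self.length - 1))
        done = self.pos == self.length - 1        # goal: reach the right end
        return self.pos, (1.0 if done else 0.0), done
```

This class matches the `reset`/`step` interface assumed in the Q-UCB sketch above, so the two snippets can be wired together for a quick experiment.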
Why This Matters
This work has significant implications for real-world applications of RL, where offline datasets are often noisy or incomplete. For example:
- Robotics: Robots learning from human demonstrations may encounter confounded data (e.g., unobserved environmental factors). Causal reward shaping helps them still learn safe and efficient policies.
- Healthcare: Treatment policies learned from historical patient data could be biased by unrecorded variables. Causal reward shaping could help avoid harmful recommendations.
- Autonomous Systems: Self-driving cars or drones could use this method to learn from past driving logs, even if some critical sensors were missing in the data.
The Bottom Line
By combining causal inference with reinforcement learning, this paper offers a principled way to automate reward shaping from imperfect data. It’s a step toward making RL more robust and scalable, reducing the need for manual reward engineering. For businesses deploying AI agents, this could mean faster training, safer policies, and better performance—even when the data isn’t perfect.