Jointly learning rewards and policies: an iterative Inverse Reinforcement Learning framework with ranked synthetic trajectories

2.1 Apprenticeship Learning:

A seminal method to learn from expert demonstrations is Apprenticeship Learning, first introduced in [1]. Unlike pure Inverse Reinforcement Learning, the objective here is both to find the optimal reward vector and to infer the expert policy from the given demonstrations. We start with the following observation:
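A minimal sketch of this observation, assuming the linear reward structure of [1], i.e. R(s) = wᵀϕ(s) with a feature map ϕ and discount factor γ: the value of a policy is linear in its feature expectation,

$$
V(\pi) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\Big|\, \pi\Big] \;=\; w^{\top}\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} \phi(s_t) \,\Big|\, \pi\Big] \;=\; w^{\top}\mu(\pi),
$$

so two policies with the same feature expectation have the same value, whatever the weight vector w.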

Mathematically, this can be seen using the Cauchy-Schwarz inequality. This result is actually quite powerful, as it allows us to focus on matching the feature expectations, which guarantees the matching of the value functions regardless of the reward weight vector.
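Concretely, as a sketch under the usual boundedness assumption ‖w‖₂ ≤ 1:

$$
\big|V(\pi) - V(\pi^{*})\big| \;=\; \big|w^{\top}\big(\mu(\pi) - \mu(\pi^{*})\big)\big| \;\le\; \|w\|_{2}\,\big\|\mu(\pi) - \mu(\pi^{*})\big\|_{2} \;\le\; \big\|\mu(\pi) - \mu(\pi^{*})\big\|_{2},
$$

so matching feature expectations up to ϵ matches value functions up to ϵ.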

In practice, Apprenticeship Learning uses an iterative algorithm based on the maximum margin principle to approximate μ(π*) — where π* is the (unknown) expert policy. To do so, we proceed as follows:

  • Start with a (potentially random) initial policy and compute its feature expectation, as well as the expert policy's feature expectation, estimated from the demonstrations via Monte Carlo
  • Given these feature expectations, find the weight vector that maximizes the margin between μ(π*) and the feature expectations of the trained policies (the μ(π)). In other words, we want the weight vector that discriminates as much as possible between the expert policy and the trained ones
  • Once this weight vector w’ is found, use classical Reinforcement Learning, with the reward function approximated by the feature map ϕ and w’, to find the next trained policy
  • Repeat the previous 2 steps until the smallest margin between μ(π*) and the feature expectation μ(π) of any trained policy falls below a certain threshold, meaning that among all the trained policies we have found one that matches the expert feature expectation up to a certain ϵ

Written more formally:

Source: Principles of Robot Autonomy II, lecture 10 ([2])
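As a purely illustrative sketch of this loop in Python: the max-margin step is solved below with a small SciPy program, while run_rl and compute_feature_expectation are hypothetical, environment-specific placeholders (none of these names come from [1] or [2]).

```python
import numpy as np
from scipy.optimize import minimize

def max_margin_weights(mu_expert, mu_policies):
    """Find w with ||w||_2 <= 1 maximizing the smallest margin
    min_j w . (mu_expert - mu_policies[j])."""
    d = len(mu_expert)
    diffs = np.array([mu_expert - mu for mu in mu_policies])  # one row per trained policy

    # Decision variable x = [w (d entries), t (the margin)]; we maximize t.
    objective = lambda x: -x[-1]
    constraints = (
        # w . (mu_expert - mu_j) >= t for every trained policy j
        [{"type": "ineq", "fun": lambda x, a=a: a @ x[:d] - x[-1]} for a in diffs]
        # ||w||_2^2 <= 1 (smooth version of the norm constraint)
        + [{"type": "ineq", "fun": lambda x: 1.0 - x[:d] @ x[:d]}]
    )
    res = minimize(objective, np.zeros(d + 1), constraints=constraints, method="SLSQP")
    return res.x[:d], res.x[-1]  # weight vector, achieved margin

# Toy usage with made-up 3-dimensional feature expectations:
mu_expert = np.array([0.8, 0.1, 0.5])
mus = [np.array([0.2, 0.3, 0.4]), np.array([0.5, 0.2, 0.1])]
w, margin = max_margin_weights(mu_expert, mus)

# Outer loop (pseudocode, since it depends on the environment):
#   while margin > epsilon:
#       new_policy = run_rl(reward=lambda s: w @ phi(s))   # classical RL with the current reward
#       mus.append(compute_feature_expectation(new_policy))
#       w, margin = max_margin_weights(mu_expert, mus)
```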

2.2 IRL with ranked demonstrations:

The maximum margin principle in Apprenticeship Learning does not make any assumption about the relationship between the different trajectories: the algorithm stops as soon as any set of trajectories achieves a narrow enough margin. Yet suboptimality of the demonstrations is a well-known caveat in Inverse Reinforcement Learning, and in particular the variance in demonstration quality. An additional piece of information we can exploit is the ranking of the demonstrations, and consequently a ranking of the feature expectations.

More precisely, consider ranks {1, …, k} (from worst to best) and feature expectations μ₁, …, μₖ. Feature expectation μᵢ is computed from trajectories of rank i. We want our reward function to efficiently discriminate between demonstrations of different quality, i.e.:
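With a linear reward R(s) = wᵀϕ(s), so that expected returns are wᵀμᵢ, a sketch of this requirement is that higher-ranked demonstrations receive strictly higher estimated returns:

$$
w^{\top}\mu_{1} \;<\; w^{\top}\mu_{2} \;<\; \dots \;<\; w^{\top}\mu_{k},
\qquad \text{i.e.}\quad w^{\top}(\mu_{j} - \mu_{i}) > 0 \ \ \text{for all } j > i.
$$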

In this context, [5] presents a tractable formulation of this problem as a Quadratic Program (QP), using once again the maximum margin principle, i.e. maximizing the smallest margin between two different classes. Formally:
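One possible way to write this max-margin objective (a sketch; the exact constraint set used in [5] may differ):

$$
\max_{w,\,m}\; m
\quad \text{s.t.} \quad
w^{\top}(\mu_{j} - \mu_{i}) \;\ge\; m \ \ \text{for all } j > i,
\qquad \|w\|_{2} \le 1.
$$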

This is actually very similar to the optimization problem solved by SVM models for multiclass classification. The complete optimization model is the following (see [5] for details):

Source: [5]
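For comparison, a ranking-SVM-style sketch of such a QP (not necessarily the exact model of [5]), where the slack variables ξᵢⱼ absorb ranking violations and C trades off margin against violations:

$$
\min_{w,\,\xi}\;\; \tfrac{1}{2}\|w\|_{2}^{2} \;+\; C \sum_{i<j} \xi_{ij}
\quad \text{s.t.} \quad
w^{\top}(\mu_{j} - \mu_{i}) \;\ge\; 1 - \xi_{ij}, \qquad \xi_{ij} \ge 0 \ \ \text{for all } i < j.
$$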

2.3 Disturbance-based Reward Extrapolation (D-REX):

Presented in [4], the D-REX algorithm also uses this concept of IRL with ranked preferences, but applies it to generated demonstrations. The intuition is as follows:

  • Starting from the expert demonstrations, imitate them via behavioral cloning, thus obtaining a baseline policy π₀
  • Generate ranked sets of demonstrations with different degrees of performance by injecting different noise levels into π₀: in [4] the authors prove that for two noise levels ϵ and γ such that ϵ > γ (i.e. ϵ is “noisier” than γ), we have with high probability that V[π(· | ϵ)] < V[π(· | γ)], where π(· | x) is the policy resulting from injecting noise x into π₀
  • Given this automated ranking, run an IRL-from-ranked-demonstrations method (T-REX), which approximates the reward function with a neural network trained with a pairwise loss (see [3] for more details)
  • With the reward function approximation R’ obtained from the IRL step, run a classical RL method with R’ to obtain the final policy

More formally:
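As a rough, purely illustrative Python sketch of this pipeline: the helper names and the gym-style environment interface are assumptions, not the API of [4]; the pairwise loss follows the Bradley-Terry form used in T-REX-style reward learning [3].

```python
import numpy as np
import torch
import torch.nn as nn

# Step 2: automatically-ranked demonstrations via noise injection.
# Assumes a classic gym-style env (reset() -> state, step() -> (state, reward, done, info))
# and a behavioral-cloning policy bc_policy(state) -> action obtained in step 1.
def noisy_rollout(env, bc_policy, noise, horizon=200):
    states, state = [], env.reset()
    for _ in range(horizon):
        if np.random.rand() < noise:
            action = env.action_space.sample()   # disturbed (random) action
        else:
            action = bc_policy(state)            # behavioral-cloning action
        state, _, done, _ = env.step(action)
        states.append(state)
        if done:
            break
    return np.array(states, dtype=np.float32)

def generate_ranked_trajectories(env, bc_policy, noise_levels=(0.75, 0.5, 0.25, 0.0), per_level=5):
    # Larger noise => lower expected return (with high probability) => lower rank.
    # ranked[i] holds the trajectories of rank i, from worst (noisiest) to best.
    return [[noisy_rollout(env, bc_policy, eps) for _ in range(per_level)] for eps in noise_levels]

# Step 3: T-REX-style reward learning with a pairwise (Bradley-Terry) loss.
class RewardNet(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, traj):                 # traj: tensor of shape (T, state_dim)
        return self.net(traj).sum()          # predicted return of the whole trajectory

def pairwise_loss(reward_net, traj_low, traj_high):
    # Push the predicted return of the higher-ranked trajectory above the lower-ranked one.
    returns = torch.stack([reward_net(torch.as_tensor(traj_low)),
                           reward_net(torch.as_tensor(traj_high))])
    return nn.functional.cross_entropy(returns.unsqueeze(0), torch.tensor([1]))

# Step 4 (sketch): minimize pairwise_loss over pairs sampled from different ranks,
# then run any standard RL algorithm using the trained RewardNet as R'.
```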
