Barrier Functions Inspired Reward Shaping for Reinforcement Learning

Nilaksh¹∗, Abhishek Ranjan²∗, Shreenabh Agrawal²∗, Aayush Jain¹, Pushpak Jagtap³, Shishir Kolathaya⁴

∗Equal contribution.
¹Indian Institute of Technology (IIT), Kharagpur, {nilaksh404, aayushjain}@kgpian.iitkgp.ac.in
²Indian Institute of Science (IISc), Bangalore, {abhishekr, shreenabhm}@iisc.ac.in
³RBCCPS, IISc, pushpak@iisc.ac.in
⁴CSA & RBCCPS, IISc, shishirk@iisc.ac.in
This work is supported in part by the Google Research Grant, the ARTPARK, and the SERB Grants SRG/2022/001807 and CRG/2021/008115.

Abstract

Reinforcement Learning (RL) has progressed from simple control tasks to complex real-world challenges with large state spaces. While RL excels in these tasks, training time remains a limitation. Reward shaping is a popular solution, but existing methods often rely on value functions, which face scalability issues. This paper presents a novel safety-oriented reward-shaping framework inspired by barrier functions, offering simplicity and ease of implementation across various environments and tasks. To evaluate the effectiveness of the proposed reward formulations, we conduct simulation experiments on CartPole, Ant, and Humanoid environments, along with real-world deployment on the Unitree Go1 quadruped robot. Our results demonstrate that our method leads to 1.4-2.8 times faster convergence and as low as 50-60% actuation effort compared to the vanilla reward. In a sim-to-real experiment with the Go1 robot, we demonstrated better control and dynamics of the bot with our reward framework. We have open-sourced our code at https://github.com/Safe-RL-IISc/barrier_shaping.

I Introduction

Reinforcement Learning (RL) has demonstrated substantial success in various domains, including gaming (e.g., Minecraft [1] and Atari [2]), language model optimization (e.g., Sparrow [3] and InstructGPT [4]), and robotics [5, 6]. However, all of these methods including RLHF [7], which addresses RL’s sample efficiency issues in other domains, can be costly when applied to robotics, including sim-to-real transfer scenarios. Despite having its limitations, reward shaping provides a simpler, more accessible and efficient alternative to address these issues even now. Well-crafted reward functions guide the agent’s behaviour towards desired outcomes, facilitating successful learning of the intended task. Previous works ([8], [9]) have established reward-shaping to accelerate algorithm convergence. Potential-based reward shaping [10] is a well-known work that suggests adding a potential function term initialized to the value function for reward-shaping. However, it needs a good estimate of the value function, which is challenging in the case of sparse reward models and in complex environments due to dimensionality [11], [12], [13].

Figure 1: An example run of Humanoid with vanilla reward (Top) and the exponential barrier reward (Bottom), trained for the same number of time steps. Unlike the vanilla reward, barrier function-based reward leads to more natural and less wasteful movements.

To improve upon existing reward-shaping methods, we introduce Barrier Function (BF) inspired reward shaping, a simple, safety-focused framework that enhances training efficiency and safety. Our approach uses BFs to supplement the base reward in an environment. Intuitively, when a BF is positive, the system’s state is within the safe region. Accordingly, we can construct an inequality using the derivative of the barrier function, which is in turn encoded in the form of a reward. This reward-shaping term encourages the RL agent’s states not only to remain within the safe zone, but also to avoid undesirable behaviors at the limits. In this paper, we propose two BF-based reward formulations: the exponential barrier and the quadratic barrier. We assess our framework across various environments, from CartPole [14] to Humanoid [15]. Additionally, our approach improves the walking performance of agents, as evident in Fig. 1. We also demonstrate sim-to-real transferability by applying our framework to the Unitree Go1 robot [16]. The key highlights of the proposed framework are:

  • A safety-oriented, intuitive and easy-to-implement barrier function-inspired reward shaping framework.

  • The framework leads to faster convergence towards the goal and efficient state exploration by enforcing the system within the safe set.

  • It leads to lower energy expenditure, as the barrier function constrains the states within desired limits and thus avoids extreme actions.

II Related Work

Reward Shaping : Positive linear transformation [17], [18], [19], leverages the principles of utility theory to enhance agent performance by adding rewards for state transitions as the difference between the values of the arbitrary potential functions applied to the respective states. In [20], a Bayesian approach to reward shaping is presented, which incorporates prior beliefs and adapts with experience. While effective, this method is model-based and may not generalize well across different environments. Signal temporal logic, explored in [21] and [22], can serve as a formal specification for reward formulation but is also model-dependent. The safety and stability of the agent and environment remain unaddressed. To the best of our knowledge, for the first time, we explore the agent’s safety and stability using reward shaping that ensures faster convergence along with less energy consumption during training and testing.

It is worth mentioning that, in our formulation, we omit energy terms from the shaped reward as they often clash with the environment’s goals. For example, in CartPole, including an energy term would prioritize having the pendulum vertically down, which isn’t the intended objective. Adjusting energy terms requires extra tuning, which complicates matters for larger robot models. Similarly, alternative reward-shaping approaches such as angle limits and velocity limits may lead to undesirable behavior, keeping the agent within bounds but failing to achieve the main objective. Intuitively, there is a relationship between the positions and velocities that must be respected at the limits. Violating it can restrict exploration and learning, hindering effective policy development. Constraints based on control barrier functions (CBFs) encode this relationship, thereby ensuring that the position and velocity values are not in conflict.

Safety : [23] introduces a framework that combines model-free reinforcement learning with model-based CBF controllers and learned system dynamics to ensure safety during exploration. Since it is model-based, a good knowledge of the system dynamics is required. [24] proposes using reachability constraints to expand the feasible set, resulting in a less conservative policy compared to CBF-based approaches. [25] explores a more data-driven approach using two neural networks to learn the controller and safety barrier certificate simultaneously achieving a verification-in-the-loop synthesis, but requires formal verification (using SMT solvers) to ensure constraint satisfaction. Lastly, [26] discusses learning to restructure an MDP reward function for accelerated reinforcement learning, but it can be computationally intensive. Tan et al. [27] link CBF to a value function to enforce verifiable safety. Our approach aligns with the idea of leveraging model-based constraints for improved performance, but it doesn’t require learning system dynamics, making it more generalizable. We also allow for adjustable constraint levels (soft or hard) to prioritize returns or constraint satisfaction. Furthermore, we eliminate the need for formal verification, as our formulation inherently satisfies constraints.

III Preliminary

III-A Reinforcement Learning

Reinforcement learning (RL) can be described as a discounted Markov decision process (MDP), defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is the deterministic state transition function, $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in (0,1)$ is the discount factor. In RL, the goal of the learning algorithm is to converge on a policy $\pi : \mathcal{S} \rightarrow \mathcal{A}$ that maximizes the total (discounted) reward after performing actions on an MDP, i.e., the objective is to maximize $\sum_{t=0}^{\infty} \gamma^{t} r_{t}$, where $r_{t}$ is the output of the reward function $r$ for the sample at instance $t$. Since a policy maps a state to an action, the value of a policy is evaluated according to the discounted cumulative reward. Given this MDP, we are interested in shaping the rewards by using barrier functions, described in the next section.
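
As a minimal illustration of the objective above, the following Python sketch evaluates the discounted sum for a finite list of rewards. The function name `discounted_return` and the truncation to a finite horizon are assumptions made for illustration; they are not taken from the paper's code.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three steps with reward +1 each and gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```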

III-B Barrier Functions

For practical applications, we often want the system state $s$ to stay within a safe region, denoted as a set $\mathcal{C}$. The set $\mathcal{C}$ is defined as the super-level set of a continuously differentiable function $h : \mathcal{S} \subseteq \mathbb{R}^{n} \rightarrow \mathbb{R}$ satisfying,

$$\mathcal{C} = \{ s \in \mathcal{S} : h(s) \geq 0 \} \tag{1}$$
$$\partial\mathcal{C} = \{ s \in \mathcal{S} : h(s) = 0 \} \tag{2}$$
$$\text{Int}(\mathcal{C}) = \{ s \in \mathcal{S} : h(s) > 0 \}, \tag{3}$$

where $\text{Int}(\mathcal{C})$ and $\partial\mathcal{C}$ denote the interior and the boundary of the set $\mathcal{C}$, respectively. It is assumed that $\text{Int}(\mathcal{C})$ is non-empty and $\mathcal{C}$ has no isolated points, i.e. $\text{Int}(\mathcal{C}) \neq \emptyset$ and $\overline{\text{Int}(\mathcal{C})} = \mathcal{C}$. The set $\mathcal{C}$ is said to be forward invariant (safe) if $\forall\, s(0) \in \mathcal{C} \implies s(t) \in \mathcal{C} \;\; \forall t \geq 0$. We can mathematically verify the safety of the set $\mathcal{C}$ by establishing the existence of a barrier function. We have the following definition of a barrier function (BF) from [28].

Definition III.1

Given the set $\mathcal{C}$ defined by (1)-(3), the function $h$ is called the barrier function (BF) defined on the set $\mathcal{S}$ if there exists an extended class $\mathcal{K}$ function $\kappa$ such that for all $s \in \mathcal{S}$:

$$\dot{h}(s, \dot{s}) + \kappa(h(s)) \geq 0. \tag{4}$$

Here $\kappa : (-\infty, \infty) \rightarrow (-\infty, \infty)$ is a strictly increasing continuous function with $\kappa(0) = 0$; such a $\kappa$ is widely called an extended class $\mathcal{K}$ function. Note that $\dot{h}(s, \dot{s}) := \frac{\partial h}{\partial s}\dot{s}$, where $\dot{s}$ is the time derivative of $s$. Even if the MDP is in discrete time, $\dot{s}$, and consequently $\dot{h}$, can be calculated approximately as $\dot{h} \approx \frac{\partial h}{\partial s}(s_{t} - s_{t-1})/\Delta t$, where $s_{t}, s_{t-1}$ are the samples of the states obtained at time steps $t$ and $t-1$, and $\Delta t$ is the time interval between two samples. It is worth mentioning that the classical definition has the actions or controls $u$ as one of the arguments, along with the model, in (4), thereby making it a control barrier function (CBF). For this paper, we avoid the use of inputs and estimate the derivative of $h$ by using $\dot{s}$. This allows us to use barrier functions in a model-free way and is sufficient for the presented work, as our focus is on reward shaping.
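
The finite-difference estimate described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `estimate_h_dot` and `grad_h_example`, the example barrier $h(s) = 1 - \sum_i s_i^2$, and the choice to evaluate the gradient at the newest sample $s_t$ are assumptions, not details given in the text.

```python
import numpy as np


def estimate_h_dot(grad_h, s_t, s_prev, dt):
    """Approximate h_dot as (dh/ds)(s_t) . (s_t - s_prev) / dt."""
    s_t = np.atleast_1d(np.asarray(s_t, dtype=float))
    s_prev = np.atleast_1d(np.asarray(s_prev, dtype=float))
    s_dot = (s_t - s_prev) / dt  # backward finite difference for s_dot
    return float(np.dot(grad_h(s_t), s_dot))


def grad_h_example(s):
    """Gradient of the hypothetical barrier h(s) = 1 - sum(s**2)."""
    return -2.0 * s


if __name__ == "__main__":
    # The state moves from 0.5 to 0.6 within dt = 0.1, so s_dot is about 1
    # and h_dot is about -2 * 0.6 * 1 = -1.2 (h is decreasing).
    print(estimate_h_dot(grad_h_example, s_t=0.6, s_prev=0.5, dt=0.1))
```

No model of the dynamics or knowledge of the action is needed here: only two consecutive state samples and the sampling interval enter the estimate.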

If we are able to restrict the states of the system $s, \dot{s}$ in such a way that the inequality (4) is satisfied, then we know that the set $\mathcal{C}$ is forward invariant (safe). We can use this idea to shape the reward functions in such a way that any violation of safety causes a loss of reward. We will formally show the reward-shaping methodology in the next section.

IV Reward Shaping Methodology

IV-A Reward Shaping using Barrier Functions

Having described BFs and their associated formal results, we now discuss BF-inspired reward shaping in the context of RL. Reward shaping is a method for engineering a reward function to provide more frequent feedback on appropriate behaviours. To illustrate our framework, we propose the shaped reward $r'$ in (5), which we obtain by adding a term to the traditional vanilla reward $r$.

$$r'(s, \dot{s}) = r(s) + \underbrace{r^{\text{BF}}(s, \dot{s})}_{\text{additional reward}}, \tag{5}$$

Here $r^{\text{BF}}(s, \dot{s})$ is our barrier function inspired reward shaping term. We emphasise that the new reward $r'$ depends on $\dot{s}$. From Definition III.1, $\mathcal{C}$ is forward invariant if and only if there exists a barrier function $h$ such that it satisfies (4). Thus we define $r^{\text{BF}}$ as

$$r^{\text{BF}}(s, \dot{s}) = \dot{h}(s, \dot{s}) + \gamma h(s). \tag{6}$$

In order to satisfy (4), we have taken the extended class $\mathcal{K}$ function $\kappa$ as $\kappa(m) = \gamma m$, $\gamma \in \mathbb{R}_{>0}$. Note that this gain $\gamma$ is distinct from the discount factor of Section III-A, which lies in $(0,1)$. This gives our shaping term $r^{\text{BF}}$ the desirable property of positively rewarding the agent when it is in a set of safe states $\mathcal{C}$ while negatively rewarding otherwise.
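
Equations (5) and (6) translate almost directly into code. The following Python sketch assumes that the values of $h$ and $\dot{h}$ have already been computed for the current sample; the function names `barrier_reward` and `shaped_reward` and the numbers in the usage example are hypothetical.

```python
def barrier_reward(h, h_dot, gamma=1.0):
    """Shaping term of (6): r_BF = h_dot + gamma * h, i.e. kappa(m) = gamma * m."""
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    return h_dot + gamma * h


def shaped_reward(r_vanilla, h, h_dot, gamma=1.0):
    """Shaped reward of (5): r' = r + r_BF."""
    return r_vanilla + barrier_reward(h, h_dot, gamma)


if __name__ == "__main__":
    # Inside the safe set (h > 0) while drifting toward the boundary (h_dot < 0).
    print(shaped_reward(r_vanilla=1.0, h=0.75, h_dot=-0.5))  # 1.25
    # Outside the safe set (h < 0) and still moving outward (h_dot < 0).
    print(shaped_reward(r_vanilla=1.0, h=-0.5, h_dot=-0.5))  # 0.0
```

In this sketch a larger `gamma` weights the position term $h(s)$ more heavily relative to the rate term $\dot{h}$.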

The BF $h : \mathcal{S} \rightarrow \mathbb{R}$ is chosen to be a suitable function that constrains specific quantities in $\mathcal{S}$, such that it would lead to desirable properties like actuation safety and training efficiency. The following section provides specific examples of BF-based reward shaping.

Remark IV.1

Since, in an RL task, our goal is to find a policy $\pi : \mathcal{S} \rightarrow \mathcal{A}$ that maximizes the total (discounted) reward, maximising the proposed reward (5) also encourages the condition (4) in Definition III.1 to hold, and thus promotes safe execution of the task. However, it is essential to note that this approach does not provide a safety guarantee and does not imply safety during training.

IV-B Barrier Function Formulation

Figure 2: Plots illustrating the proposed barrier functions (7)-(8). Dashed lines represent the constraint limits (-1,1).

In an RL task, some state variables should ideally lie within a safe range of values. Violating these bounds can result in undesirable behaviours. To constrain them within their safe bounds, we must use an appropriate barrier function, $h(s, \bm{\delta}) \equiv h(s)$, parameterized by $\bm{\delta}$. We propose two barrier functions: a quadratic function $h_{\text{quad}}$ (7) and an exponential function $h_{\text{exp}}$ (8).

$$h_{\text{quad}}(s, \bm{\delta}) = \sum_{l \in \mathcal{L}} \delta_{a} \left(s_{l} - s^{\text{max}}_{l}\right)\left(s^{\text{min}}_{l} - s_{l}\right) \tag{7}$$
$$h_{\text{exp}}(s, \bm{\delta}) = \sum_{l \in \mathcal{L}} \delta_{a} \left[1 - \left(e^{\delta_{b}\left(s_{l} - s^{\text{max}}_{l}\right)} + e^{\delta_{b}\left(s^{\text{min}}_{l} - s_{l}\right)}\right)\right] \tag{8}$$

where $\bm{\delta} = [\delta_{a}, \delta_{b}]$, $\bm{\delta} \in \mathbb{R}^{2}_{>0}$ is the vector parameter, and $\mathcal{L} = \{l_{0}, l_{1}, l_{2}, \dots\}$ is the set of indices for the elements of the state variables of the model whose values we want to constrain. In particular, $s_{l}$ is the $l^{th}$ element of the state vector $s$, which is the value of the state variable upon which we wish to enforce the bounds $(s^{\text{min}}_{l}, s^{\text{max}}_{l})$. The choice of $\bm{\delta}$ and the bounds $(s^{\text{min}}_{l}, s^{\text{max}}_{l})$ is elaborated in the following section.

Since $h(s, \bm{\delta})$ is positive for an $l \in \mathcal{L}$ only when $s_{l} \in (s_{l}^{\text{min}}, s_{l}^{\text{max}})$, and negative otherwise, $h(s, \bm{\delta})$ qualifies as a barrier function and $r^{\text{BF}}$ can be computed by Eq. (6).

For appropriate values of $\bm{\delta}$, $h_{\text{exp}}$ is flat within the bounds and tapers down sharply outside them, which allows for better exploration while staying within the bounds. In contrast, the shape of $h_{\text{quad}}$ makes it more suitable for tasks that require $s_{l}$ to be constrained near the central values (Fig. 2).
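
To make the difference in shape concrete, the two barriers (7) and (8) can be written as small NumPy functions. This is a sketch under stated assumptions: the function names, the default values `delta_a=1.0` and `delta_b=10.0`, and the bounds $(-1, 1)$ in the usage example are illustrative choices, and the arrays passed in are assumed to contain only the constrained components $s_l$, $l \in \mathcal{L}$.

```python
import numpy as np


def h_quad(s, s_min, s_max, delta_a=1.0):
    """Quadratic barrier (7), summed over the constrained state components."""
    s, s_min, s_max = (np.asarray(x, dtype=float) for x in (s, s_min, s_max))
    return float(np.sum(delta_a * (s - s_max) * (s_min - s)))


def h_exp(s, s_min, s_max, delta_a=1.0, delta_b=10.0):
    """Exponential barrier (8), summed over the constrained state components."""
    s, s_min, s_max = (np.asarray(x, dtype=float) for x in (s, s_min, s_max))
    penalty = np.exp(delta_b * (s - s_max)) + np.exp(delta_b * (s_min - s))
    return float(np.sum(delta_a * (1.0 - penalty)))


if __name__ == "__main__":
    # Compare both barriers at the centre, inside, and outside the bounds (-1, 1).
    for value in (0.0, 0.5, 2.0):
        print(value, h_quad(value, -1.0, 1.0), h_exp(value, -1.0, 1.0))
```

With these illustrative parameters, `h_exp` stays close to `delta_a` over most of the interval and drops steeply past the bounds, whereas `h_quad` already decreases noticeably between the centre and `0.5`.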

Figure 3: Density plot of the quadratic BF reward, $r^{\text{BF}}_{\text{quad}}$, constructed using (6) and (7) with (-1,1) as the bounds on $s$. Notice that the reward depends on $s$ and $\dot{s}$. The solid and dotted contour lines correspond to positive and negative values of the reward, respectively.

The reward $r^{\text{BF}}$ is a function of both $s$ and $\dot{s}$. Thus, it jointly constrains them depending on their values (Fig. 3). To get a more intuitive understanding, consider the following,

$$\begin{split} r^{\text{BF}}(s, \dot{s}) &= \dot{h}(s, \dot{s}) + \gamma h(s) \\ &= \frac{\partial h(s)}{\partial s}\frac{ds}{dt} + \gamma h(s) = \frac{\partial h(s)}{\partial s}\dot{s} + \gamma h(s) \end{split} \tag{9}$$
Figure 4: Energy expended to stabilize initial angles for each reward formulation in the cartpole environment
Figure 5: $\theta_{p}$ and x-position versus time step, showing the control flow when stabilizing a pole angle of -40 in the cart pole environment. The plots show that the vanilla policy struggles to stabilize the pole angle close to zero, while the quadratic policy accomplishes this in a few time steps.

Taking the partial derivative of $r^{\text{BF}}$ with respect to $\dot{s}$ gives,

$$\frac{\partial r^{\text{BF}}(s, \dot{s})}{\partial \dot{s}} = \frac{\partial}{\partial \dot{s}}\left(\frac{\partial h(s)}{\partial s}\dot{s}\right) + \frac{\partial}{\partial \dot{s}}\left(\gamma h(s)\right) = \frac{\partial h(s)}{\partial s} = h'(s) \tag{10}$$

Now, if the barrier function $h(s)$ is taken to be a concave function, then there exists an $s_{0} \in (s^{\text{min}}, s^{\text{max}})$ such that $h'(s) > 0$ for all $s < s_{0}$ and $h'(s) < 0$ for all $s > s_{0}$. Thus, from (10) we can infer that when $s < s_{0}$, i.e. when it is near the lower bound, $(\partial/\partial\dot{s})\, r^{\text{BF}} = h'(s)$ is positive. As an RL policy wants to maximize $r^{\text{BF}}$, it promotes $\dot{s}$ to be positive. Given that $\dot{s}$ denotes the rate of change of $s$ over time, a positive $\dot{s}$ causes $s$ to increase, moving it away from the lower bound $s^{\text{min}}$. Conversely, the opposite effect applies when $s$ is near $s^{\text{max}}$. In both cases, $s$ is promoted to move towards $s_{0}$.

Thus, $r^{\text{BF}}$ works with both $s$ and $\dot{s}$ to promote safety. This effect is lacking with other reward shaping methods that constrain only $s$ within some bounds for safety. It can be verified that both of our proposed barriers $h_{\text{quad}}$ and $h_{\text{exp}}$ are concave functions with $s_{0} = (s^{\text{max}} + s^{\text{min}})/2$.
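
The verification can be sketched for a single constrained variable $s$ with bounds $(s^{\text{min}}, s^{\text{max}})$, dropping the index $l$. This is a worked check under that simplifying assumption, obtained by differentiating (7) and (8) directly:

```latex
\begin{align*}
h_{\text{quad}}'(s) &= \delta_a\left(s^{\text{min}} + s^{\text{max}} - 2s\right),
  & h_{\text{quad}}''(s) &= -2\delta_a < 0, \\
h_{\text{exp}}'(s) &= -\delta_a\delta_b\left(e^{\delta_b(s - s^{\text{max}})} - e^{\delta_b(s^{\text{min}} - s)}\right),
  & h_{\text{exp}}''(s) &= -\delta_a\delta_b^{2}\left(e^{\delta_b(s - s^{\text{max}})} + e^{\delta_b(s^{\text{min}} - s)}\right) < 0.
\end{align*}
```

Both second derivatives are negative because $\delta_a, \delta_b > 0$, so both barriers are concave. Setting $h'(s) = 0$ gives $2s = s^{\text{min}} + s^{\text{max}}$ for the quadratic barrier and $s - s^{\text{max}} = s^{\text{min}} - s$ for the exponential one, so $s_0 = (s^{\text{max}} + s^{\text{min}})/2$ in both cases.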

V Simulation Experiments

We experimentally evaluate our BF-based reward-shaping formulation on OpenAI Gym’s Cartpole [29], and MuJoCo environments like Half-Cheetah [30], Humanoid [15], and Ant [14]. We used the Twin Delayed DDPG (TD3) [31] algorithm with two variants of the BF-based reward shaping: $r^{\text{BF}}_{\text{quad}}$ (7) and $r^{\text{BF}}_{\text{exp}}$ (8), which yield the $\pi^{\text{BF}}_{\text{quad}}$ and $\pi^{\text{BF}}_{\text{exp}}$ policies respectively. The environment’s vanilla reward is used as the baseline. All experiments were performed on a Ryzen Threadripper CPU, 64GB RAM, and an NVIDIA RTX 3080.

V-A Cartpole

Figure 6: The plots for each MuJoCo walker environment, averaged for ten random seeds, show the episodic velocity for each training time step. Since these environments aim to achieve as high a velocity as possible, these plots provide a good metric to judge the training speed.
Figure 7: The episodic energy (in joules) spent by the agent for achieving a particular velocity, averaged over ten random seeds. We see that the policy trained with exponential BF-based reward shaping performs the best across all the environments, consuming the least energy.

Reward Shaping: The state space contains the pole angle $\theta_p$, the pole angular velocity $\omega_p$, the cart's position $x_c$, and its velocity $v_c$. The task consists of balancing a pole attached to a moving cart, i.e., we want $\theta_p = 0$. The vanilla reward is $r = +1$ for every step taken. We define $(\theta_p^{\text{min}}, \theta_p^{\text{max}}) \equiv (s_l^{\text{min}}, s_l^{\text{max}})$ as the pole angle's desired threshold range around the goal ($\theta_p = 0$) at which we want the pendulum to stabilize; hence $h = h(\theta_p, \bm{\delta})$. This bound is taken as the maximum angle deviation $\phi$ from which the pole can balance itself.
According to (5)-(6), the quadratic BF (7) shaped reward is given as follows (taking $\gamma = 1$ and $\theta_p^{\text{max}} = -\theta_p^{\text{min}} = \phi$):

$$
\begin{split}
r'_{\text{quad}} &= 1 + \dot{h}^{\text{cart}}_{\text{quad}}(s, \dot{s}) + \gamma h^{\text{cart}}_{\text{quad}}(s) \\
&= 1 + \delta_a\left((\phi^2 - \theta_p^2) - 2\theta_p\omega_p\right), \quad \delta_a \in \mathbb{R}_{>0}
\end{split}
\tag{11}
$$
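As a rough illustration of (11), the shaped Cartpole reward can be computed directly from the pole angle and its angular velocity. The sketch below is not taken from the authors' released code: the function name `quad_bf_reward` and the default values of `phi` and `delta_a` are hypothetical placeholders, and $\gamma$ is kept as an argument even though (11) fixes it to 1.

```python
def quad_bf_reward(theta_p, omega_p, phi=0.2, delta_a=1.0, gamma=1.0, vanilla=1.0):
    """Quadratic-BF shaped Cartpole reward following the form of (11).

    phi and delta_a are illustrative defaults, not values from the paper.
    """
    # Barrier value: positive inside (-phi, phi), zero on the boundary.
    h = delta_a * (phi**2 - theta_p**2)
    # Time derivative of the barrier along the trajectory (chain rule).
    h_dot = -2.0 * delta_a * theta_p * omega_p
    # Vanilla reward plus the shaping term h_dot + gamma * h.
    return vanilla + h_dot + gamma * h
```

Under this form, a pole that is tilted but rotating back toward the upright position ($\theta_p$ and $\omega_p$ of opposite sign) receives a larger reward than one at the same angle that is rotating away, which is how the shaping term uses both $s$ and $\dot{s}$.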

Results: Fig. 5 shows the energy expended until stabilisation versus the initial angle, for a range of angles. The policy trained with the vanilla reward expends more energy than the policies trained with $r^{\text{BF}}$: for one unit of energy spent by the vanilla policy, the $\pi^{\text{BF}}_{\text{exp}}$ policy spends 0.79 units and the $\pi^{\text{BF}}_{\text{quad}}$ policy spends 0.59 units. Fig. 5 describes the control performance for the same initial state. In contrast to the $\pi^{\text{BF}}_{\text{quad}}$ policy, the vanilla policy exhibits chaotic behaviour with no convergence to $\theta_p = 0$. Thus, our $\pi^{\text{BF}}_{\text{quad}}$ policy maintains safety during deployment. In Fig. 5, "multiple barrier" refers to $r^{\text{BF}}$ constructed on $\omega_p$ and $v_c$ along with $\theta_p$.

V-B MuJoCo Walker Environments

Reward Shaping: For walkers like Half-Cheetah [30], Ant [14], and Humanoid [15], the task is to learn to run as fast as possible. We use the same $r^{\text{BF}}$ formulation for all these environments, as they all have essentially similar state spaces $\mathcal{S}$ containing $x^{w}_{pos}$, $\Theta^{w}$, and $\Omega^{w}$, where $\Theta^{w}$ and $\Omega^{w}$ are the sets of the agent's joint angles and angular velocities, respectively. The vanilla reward $r$ given by (12) has multiple terms that guide the agent to move forward ($r_{\text{forward}}$) without falling ($r_{\text{health}}$) while minimizing the contact forces ($r_{\text{contact}}$).

$$
r = r_{\text{health}} + r_{\text{forward}} + r_{\text{contact}}. \tag{12}
$$

Since the task involves running, we seek to constrain the agent's joint angles $\theta_l \in \Theta^{w}$ within a range of angles $(\theta_l^{\text{min}}, \theta_l^{\text{max}})$ $\forall l \in \mathcal{L}$. In this case, $\mathcal{L}$ is the set of all joint angles. Thus, $\theta_l$ is analogous to $s_l$. These bounds correspond to the range of angles within which a joint can safely turn; violating them could lead to damage to the actuators or collision with other parts of the body. These general bounds are specified in the robot description and do not require additional domain knowledge. The optimal parameters $\bm{\delta}$ for (7)-(8) were found by performing a grid search.
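To make the joint-limit shaping concrete, the sketch below adds a quadratic barrier term for every joint to a vanilla reward. It is an illustration under stated assumptions rather than the authors' implementation: the barrier is assumed to have the form $\delta(\theta^{\text{max}} - \theta)(\theta - \theta^{\text{min}})$, which reduces to the Cartpole case (11) for symmetric bounds; the per-joint terms are assumed to be summed; and the names `joint_barrier_shaping`, `shaped_reward`, `delta`, and `gamma` are hypothetical.

```python
def joint_barrier_shaping(theta, omega, theta_min, theta_max, delta=1.0, gamma=1.0):
    """Sum over joints of (h_dot + gamma * h) for an assumed quadratic barrier."""
    total = 0.0
    for th, om, lo, hi in zip(theta, omega, theta_min, theta_max):
        # h > 0 strictly inside the joint limits and h = 0 on either limit.
        h = delta * (hi - th) * (th - lo)
        # dh/dt = (dh/dtheta) * omega, with dh/dtheta = delta * (hi + lo - 2 * th).
        h_dot = delta * (hi + lo - 2.0 * th) * om
        total += h_dot + gamma * h
    return total


def shaped_reward(vanilla_reward, theta, omega, theta_min, theta_max, **kwargs):
    """Vanilla reward plus the barrier shaping term (summation is an assumption)."""
    return vanilla_reward + joint_barrier_shaping(theta, omega, theta_min, theta_max, **kwargs)
```

For example, `shaped_reward(1.0, [0.1], [0.5], [-0.2], [0.2])` evaluates the single-joint, symmetric-bound case and matches the Cartpole expression in (11) with $\phi = 0.2$ and $\delta_a = 1$.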

Metrics: The metrics used to evaluate the policy performance are (i) the Actuation Coefficient, computed from the episodic energy-velocity curve (Fig. 7), and (ii) the Training Speed, computed from the velocity-timestep curve (Fig. 6). All the experiments were repeated for ten random seeds.

The energy $\mathcal{E}_w$ used by a walker $w$ in a time-step is calculated by

$$
\mathcal{E}_w = \sum_{j \in \mathcal{J}_w} \Delta\theta_j \cdot \tau_j \tag{13}
$$

where $\mathcal{J}_w$ is the set of all joints of the walker, $\tau_j$ is the torque exerted by the agent on joint $j$, and $\Delta\theta_j$ is the change in the joint angle. The episodic energy is the sum of $\mathcal{E}_w$ over an episode. Since the task in these environments is to maximise the velocity $v_w$, we define the Actuation Coefficient (reported in the main results table) as

$$
(\text{Actuation Coefficient})_w = \frac{\mathcal{E}_w^{\text{episodic}}}{\left(v^{\text{mean}}_w\right)^2} \tag{14}
$$

This term represents the energy expended, or the effort made, by the agent to achieve a certain kinetic energy, since kinetic energy scales with the square of the velocity.
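The two definitions (13) and (14) can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the products $\Delta\theta_j \cdot \tau_j$ are taken signed exactly as written in (13), and the mean velocity is assumed to be the plain average of per-step velocities over the episode.

```python
def step_energy(delta_theta, torque):
    """Per-step energy (13): sum over joints of delta_theta_j * tau_j."""
    return sum(dth * tau for dth, tau in zip(delta_theta, torque))


def actuation_coefficient(delta_thetas, torques, velocities):
    """Episodic energy divided by the squared mean velocity, as in (14).

    delta_thetas and torques hold one per-joint list for each time-step;
    velocities holds one forward velocity per time-step (an assumption).
    """
    episodic_energy = sum(step_energy(d, t) for d, t in zip(delta_thetas, torques))
    mean_velocity = sum(velocities) / len(velocities)
    return episodic_energy / mean_velocity**2
```

A lower coefficient means less actuation energy is spent per unit of squared velocity, so two policies can be compared even when they settle at different speeds.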

Results: The main results table shows that, for Humanoid, the $\pi^{\text{BF}}_{\text{exp}}$ policy takes only about 49% of the actuation energy to achieve the same kinetic energy as the vanilla policy. Fig. 7 (Left) shows that the $\pi^{\text{BF}}$ policies reach a higher maximum velocity for the same training time-steps. Fig. 6 (Left) shows that the $\pi^{\text{BF}}_{\text{exp}}$ policy converges to a higher velocity in far fewer training time-steps. The table also shows that the $\pi^{\text{BF}}_{\text{exp}}$ policy converges 1.56 times faster, while the $\pi^{\text{BF}}_{\text{quad}}$ policy is no worse than the vanilla policy. Continuing the trends from the main results table, the $\pi^{\text{BF}}$ policies on Ant and Half-Cheetah behave similarly to those in the Humanoid environment: they show reduced actuation energy, quicker convergence, and increased agent velocity within the same number of time-steps.

VI SIM-TO-REAL ON HARDWARE

Figure 8: Training plots for the Go1 robot in the Isaac Gym simulation environment. The top row shows that our policy $\pi^{\text{BF}}$, trained using $r^{\text{BF}}_{\text{exp}}$, outperforms the policy trained without it (vanilla) in all three tasks: Velocity Tracking, Angle Tracking, and Jumping. The angle tracking reward is inversely proportional to the error in angle. Note that, for both policies, the plots compare the same vanilla reward. Action smoothness is the negative of the change in action. The bottom row provides insights into the safety and efficiency of $\pi^{\text{BF}}$.
Figure 9: The top row shows a sample run of the velocity tracking task using the $\pi^{\text{BF}}$ policy trained for 2.5k episodes. The bottom row shows the performance of the vanilla policy, trained for 5k episodes. Frames are 0.4 s apart. The video can be found here: http://www.stochlab.com/redirects/rewardshaping2023.html.

VI-A Implementation Details

Hardware: We deploy our trained policies on the Unitree Go1 robot [16], a quadruped with 12 DOFs. We rely on the SDK provided by [32] to communicate between our code and the low-level control SDK provided by Unitree. The control frequency is 50Hz for both simulation and hardware.

Simulation and Training: We use an open-source implementation of the Go1 in the Isaac Gym simulator [33]. We train the policies for 2500 episodes on 4000 parallel environments using PPO [34], with both the vanilla reward and $r^{\text{BF}}_{\text{exp}}$ constructed using $h_{\text{exp}}$ (8), following the same formulation as given in Section V-B for the other walkers. The joint angle bounds were taken from the physical specifications of the robot.

Domain Randomization: To enhance sim-to-real transfer, we vary robot and environment attributes during training, such as body mass, motor strength, joint position calibration, ground friction, restitution, and gravity orientation and magnitude, so that the trained policy remains robust across these variations.
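One common way to realise this kind of randomization is to draw a fresh set of physical parameters for each environment instance. The sketch below only illustrates that idea: the attribute names, the numeric ranges, and the choice of independent uniform sampling are all hypothetical placeholders, since the values actually used for the Go1 training are not listed here, and gravity orientation is omitted for brevity.

```python
import random

# Hypothetical ranges for illustration only; not the values used in the paper.
RANDOMIZATION_RANGES = {
    "body_mass_scale": (0.8, 1.2),
    "motor_strength_scale": (0.9, 1.1),
    "joint_calibration_offset_rad": (-0.02, 0.02),
    "ground_friction": (0.4, 1.2),
    "restitution": (0.0, 0.4),
    "gravity_magnitude_scale": (0.95, 1.05),
}


def sample_episode_parameters(rng):
    """Draw one value per attribute, uniformly within its (assumed) range."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in RANDOMIZATION_RANGES.items()
    }


rng = random.Random(0)  # seeded so that a run can be reproduced
params = sample_episode_parameters(rng)
```

In a simulator, a draw like `params` would typically be applied when an environment is created or reset, so that each parallel environment sees slightly different dynamics.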

VI-B Simulation and Training Results

Figure 10: Action values for the two front shoulder flexion joints of the Go1 under the vanilla policy (left) and the $\pi^{\text{BF}}$ policy (right) for a sample run of the velocity tracking task. The action values represent angle targets that are later clipped to a suitable range. It can be seen that the actions for $\pi^{\text{BF}}$ are smoother, more rhythmic, and lower in value.

The vanilla reward comprises three components: target velocity tracking, target angle tracking, and target height jump. As depicted in Fig. 8 (Top Row), our proposed policy $\pi^{\text{BF}}$, trained with the $r^{\text{BF}}_{\text{exp}}$ shaping term added to the vanilla reward, consistently outperforms the vanilla policy across all three tasks. Fig. 8 (Bottom Row) shows that $\pi^{\text{BF}}$ uses only 78% of the energy used by the vanilla policy, as calculated using (13). Furthermore, $\pi^{\text{BF}}$ exhibits improved action smoothness (negative of change in action) and significantly lower maximum action values, thereby enhancing safety for deployment.
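The two safety-related quantities mentioned above can be computed from a logged action sequence. The sketch below is an assumption-laden illustration: the text defines action smoothness only as the negative of the change in action, so measuring that change with the Euclidean norm and averaging it over the episode are choices made here, and both function names are hypothetical.

```python
import math


def action_smoothness(actions):
    """Negative mean Euclidean norm of consecutive action differences.

    actions is a list of per-step action vectors; values closer to zero
    indicate smoother control. The norm and the averaging are assumptions.
    """
    diffs = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(curr, prev)))
        for prev, curr in zip(actions, actions[1:])
    ]
    return -sum(diffs) / len(diffs)


def max_action_magnitude(actions):
    """Largest absolute action value seen in the sequence."""
    return max(abs(a) for step in actions for a in step)
```

For a sequence such as `[[0, 0], [3, 4], [3, 4]]`, the first transition changes the action by a norm of 5 and the second by 0, so the smoothness is the negative of their average.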

VI-C Hardware Results

Vanilla Policy: The vanilla policy (Fig. 9 Bottom) produced unstable movements and limb instability, particularly in the front right limb, during basic commands such as forward motion and re-direction. This led to a fall, from which the robot recovered autonomously, but such behaviour poses risks for practical deployment. The robot's joint movements lacked coordination and efficiency: it frequently deviated from the intended path, and the front hinge joints came dangerously close to the ground.

$\pi^{\text{BF}}$ Policy: The policy trained with the new reward significantly enhanced the Go1 robot's performance, with better balanced and more coordinated movements despite being trained for half the duration of the vanilla policy (Fig. 9 Top). It flawlessly followed RC controller commands, ensuring safety without falls or risky behaviour. The advantages of the $\pi^{\text{BF}}$ policy are further affirmed by Fig. 10, where the $\pi^{\text{BF}}$ policy leads to more consistent and rhythmic action values. This characteristic enhances safety for the actuators and contributes to increased task efficiency.

VII CONCLUSIONS

In this paper we propose barrier function (BF) inspired reward shaping, a safety-oriented, easy-to-implement reward shaping formulation for robotic platforms. This approach is based on theoretical principles of safety provided by barrier functions. The shaping term aims to encourage agents to remain within predefined safe states during training. This enhances training efficiency and ensures safer exploration. To illustrate our formulation process, we proposed two barrier functions: exponential and quadratic.

While prior works, e.g., by Cheng et al. [23], achieved high training efficiency and state safety, their framework needs the system dynamics model. In contrast, our method eliminates this need and is therefore easy to implement in complex environments. However, a limitation of our study is that we have only tested barrier functions of joint angles. Further investigation into other quantities, such as joint angular velocities, could offer a more comprehensive understanding of our reward's effectiveness.

We employed the Unitree Go1 robot [16] as our hardware platform for sim-to-real experiments. Our reward-shaping methodology emerged superior through comparative analysis with [32], revealing smoother, more rhythmic control dynamics. The results indicate that our formulation is an easy way to introduce safety and efficiency in RL training.

References

  • [1] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse domains through world models,” arXiv preprint arXiv:2301.04104, 2023.
  • [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [3] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
  • [4] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
  • [5] S. Arimoto, S. Kawamura, and F. Miyazaki, “Bettering operation of robots by learning,” Journal of Robotic Systems, vol. 1, no. 2, pp. 123–140, 1984.
  • [6] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • [7] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [8] B. Marthi, “Automatic shaping and decomposition of reward functions,” in Proceedings of the 24th International Conference on Machine Learning, ser. ICML ’07.   New York, NY, USA: Association for Computing Machinery, 2007, pp. 601–608.
  • [9] M. Grzes and D. Kudenko, “Plan-based reward shaping for reinforcement learning,” in 2008 4th International IEEE Conference Intelligent Systems, vol. 2, 2008, pp. 10–22–10–29.
  • [10] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in ICML, vol. 99.   Citeseer, 1999, pp. 278–287.
  • [11] Y. Dong, X. Tang, and Y. Yuan, “Principled reward shaping for reinforcement learning via Lyapunov stability theory,” Neurocomputing, vol. 393, pp. 83–90, 2020.
  • [12] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 1-2, pp. 41–77, 2003.
  • [13] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37, 1966.
  • [14] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [15] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2012, pp. 4906–4913.
  • [16] Unitree, “Go1,” 2022. [Online]. Available: https://www.unitree.com/products/go1
  • [17] M. Grzes and D. Kudenko, “Plan-based reward shaping for reinforcement learning,” in 2008 4th International IEEE Conference Intelligent Systems, vol. 2.   IEEE, 2008, pp. 10–22.
  • [18] S. M. Devlin and D. Kudenko, “Dynamic potential-based reward shaping,” in Proceedings of the 11th international conference on autonomous agents and multiagent systems.   IFAAMAS, 2012, pp. 433–440.
  • [19] M. Grzes, “Reward shaping in episodic reinforcement learning,” 2017.
  • [20] O. Marom and B. Rosman, “Belief reward shaping in reinforcement learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
  • [21] A. Balakrishnan and J. V. Deshmukh, “Structured reward shaping using signal temporal logic specifications,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 3481–3486.
  • [22] N. Saxena, G. Sandeep, and P. Jagtap, “Funnel-based reward shaping for signal temporal logic tasks in reinforcement learning,” IEEE Robotics and Automation Letters, 2023.
  • [23] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 3387–3395, Jul. 2019.
  • [24] D. Yu, H. Ma, S. E. Li, and J. Chen, “Reachability constrained reinforcement learning,” 2022.
  • [25] H. Zhao, X. Zeng, T. Chen, Z. Liu, and J. Woodcock, “Learning safe neural network controllers with barrier certificates,” 2020.
  • [26] B. Marthi, “Automatic shaping and decomposition of reward functions,” in Proceedings of the 24th International Conference on Machine learning, 2007, pp. 601–608.
  • [27] D. C. Tan, F. Acero, R. McCarthy, D. Kanoulas, and Z. A. Li, “Your value function is a control barrier function: Verification of learned policies using control theory,” arXiv preprint arXiv:2306.04026, 2023.
  • [28] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European control conference (ECC).   IEEE, 2019, pp. 3420–3431.
  • [29] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics, no. 5, pp. 834–846, 1983.
  • [30] P. Wawrzyński, “A cat-like robot real-time learning to run,” in Adaptive and Natural Computing Algorithms: 9th International Conference, ICANNGA 2009, Kuopio, Finland, April 23-25, 2009, Revised Selected Papers 9.   Springer, 2009, pp. 380–390.
  • [31] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International conference on machine learning.   PMLR, 2018, pp. 1587–1596.
  • [32] G. B. Margolis and P. Agrawal, “Walk these ways: Tuning robot control for generalization with multiplicity of behavior,” Conference on Robot Learning, 2022.
  • [33] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa et al., “Isaac gym: High performance gpu-based physics simulation for robot learning,” arXiv preprint arXiv:2108.10470, 2021.
  • [34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.