A Review of Safe Reinforcement Learning Methods for Modern Power Systems

Tong Su, Tong Wu, Junbo Zhao, Anna Scaglione, Le Xie

This work is supported by the U.S. Department of Energy Solar Energy Technologies Office under award 37770. Tong Su and Junbo Zhao are with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269, USA (e-mail: tongsu@uconn.edu; junbo@uconn.edu). Tong Wu and Anna Scaglione are with the Department of Electrical and Computer Engineering, Cornell Tech, Cornell University, New York City, NY 10044, USA (e-mail: tw385@cornell.edu; as337@cornell.edu). Le Xie is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA (e-mail: le.xie@tamu.edu).

Abstract

Due to the availability of more comprehensive measurement data in modern power systems, there has been significant interest in developing and applying reinforcement learning (RL) methods for operation and control. Conventional RL training is based on trial-and-error and reward feedback interaction with either a model-based simulated environment or a data-driven and model-free simulation environment. These methods often lead to the exploration of actions in unsafe regions of operation and, after training, the execution of unsafe actions when the RL policies are deployed in real power systems. A large body of literature has proposed safe RL strategies to prevent unsafe training policies. In power systems, safe RL represents a class of RL algorithms that can ensure or promote the safety of power system operations by executing safe actions while optimizing the objective function. While different papers handle the safety constraints differently, the overarching goal of safe RL methods is to determine how to train policies to satisfy safety constraints while maximizing rewards. This paper provides a comprehensive review of safe RL techniques and their applications in different power system operations and control, including optimal power generation dispatch, voltage control, stability control, electric vehicle (EV) charging control, buildings’ energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Additionally, the paper discusses benchmarks, challenges, and future directions for safe RL research in power systems.

Index Terms:

Safe reinforcement learning, machine learning, power system operation, power system control, energy management, optimal power generation dispatch, EV charging, voltage control.

Nomenclature

Notations

$\gamma$: Discount factor, $\gamma\in[0,1)$
$\Delta$: Difference operator
$\delta$: Rotor angle
$\epsilon/A$: Inertia parameter of temperature and thermal conductivity of HVAC
$\varepsilon$: Safety constraint bound
$\zeta$: Safety probability ($1-\zeta$ is the risk probability)
$\eta,\eta^{\text{CHP}}_{p/h}$: Efficiency of charging or discharging, electrical/thermal energy efficiency of CHP
$\theta$: Parameters of the policy $\pi_{\theta}$
$\vartheta$: Grid state in the DC-PF approximation
$\bm{\Lambda}^{\text{EV}}_{\text{ch/dis}}$: Charging/selling electricity price of EV
$\bm{\Lambda}^{\text{Ele/Gas/Car}}$: Price of electricity/gas/carbon
$\lambda$: Penalty coefficient or Lagrange multiplier
$\Pi_{S}$: Policy set
$\pi_{\theta}$, $\pi_{\theta}^{\text{adv}}$: Parameterized policy, policy of adversary
$\pi_{\theta}^{k}$, $\pi_{\theta}^{k+\frac{1}{2}}$: Policy at iteration $k$, intermediate policy between iterations $k$ and $k+1$
$\rho_{0}$: $\rho_{0}:\mathcal{S}\rightarrow[0,1]$ is the starting state distribution over $\mathcal{S}$
$\tau$: Trajectory $\tau=(s_{0},a_{0},s_{1},\ldots)$
$\bm{\omega}$: Frequency
$\mathcal{A},\bm{a}$: Action set, action
$a^{\text{SG}}/b^{\text{SG}}/c^{\text{SG}}$: Fuel cost coefficients of SG
$\mathcal{B}/\mathcal{G}/\mathcal{R}$: BESS/SG/RES set
$\mathcal{C},C$: Constraint set $\mathcal{C}=\{(C_{i},\varepsilon_{i})\}^{m}_{i=1}$, constraint cost function $C:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$
$c^{\text{RES/BESS}}$: Cost coefficients of RES/BESS
$\text{ch}/\text{dis}$: Charging/discharging of electricity or thermal for ESS
$\mathbb{D}$: Function to extract the vector of diagonal elements from a matrix
$M,L,\frac{1}{R},D$: Inertia constant, load damping coefficient, speed droop response coefficient, $D=\frac{1}{R}+L$ is the combined frequency response coefficient from synchronous generators and load
$\mathbb{E},E,E_{\text{cap}}$: Expectation function, energy associated with devices, energy capacity of ESS
$\mathcal{E}/\mathcal{N}$: Edge/node set
$f,g,h$: State transition dynamics or the model of the environment, equality constraints with a total number of $m$, inequality constraints with a total number of $n$
$G/N$: Cardinality of the set $\mathcal{G}/\mathcal{N}$
$\bm{g}$: Gas input of CHP or GB
$\mathcal{H}/*$: Hermitian/conjugate for a vector or matrix
$\bm{h}$: Thermal energy generation or load vector
$\bm{i}$: Current phasor vector
$\mathcal{J}_{R}^{\pi_{\theta}}$, $\mathcal{J}_{h_{i}}^{\pi_{\theta}}$: Reward performance, constraint cost performance of inequality constraints
$\mathcal{L}$: Lagrangian
$\mathcal{M}$, $\mathcal{M}_{C}$: MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$, CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$
$\mathbb{P},\mathcal{P}$: Probability function, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the transition matrix, where $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ denotes the probability of state transition from $s_{t}$ to $s_{t+1}$ after taking action $a_{t}$
$P^{\text{Load}}_{\text{his/pre}}$: Historical/current net load forecast
$P_{\text{res}}$: Reserve requirement
$\bm{p}/\bm{q}$: Active/reactive power generation or load vector
$\overline{\bm{p}}^{\text{Gen}}_{e}$: Maximum emergency power generation of generator
$\bm{p}^{\text{Bus}}$: Bus power injection
$p_{ij}/q_{ij}/s_{ij}$: Active/reactive/apparent power for branch $ij$
$\bm{p}_{e}/\bm{p}_{m}$: Electrical/mechanical power
$R$: Reward function $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$
$\bm{R}_{\text{up/down}}$: Ramp-up/down rate of generators
$r_{ij}/x_{ij}$: Resistance/reactance of line $ij$
$\mathcal{S},\bm{s}_{\text{ap}},\bm{s}$: State set, apparent power vector, state
$\bm{S}_{\text{up/down}}$: Start-up/shut-down rate of generators
$\mathcal{T},t$: Time step set of trajectory $\tau$, time instant
$\overline{t}_{\text{up}}/\underline{t}_{\text{up}},t_{\text{tot}}$: Maximum/minimum up time of Gens, total time
$T,H,T^{I/O}$: Temperature, humidity, indoor/outdoor temperature
$\bm{u}_{\text{start/shut/com}}$: Startup/shutdown/commitment status of Gens
$\bm{v}/\bm{\phi}$: Voltage phasor/phase vector, $\bm{v}_{t}=|\bm{v}|\odot e^{\mathfrak{j}\bm{\phi}}$
$\mathbf{Y}/\mathbf{B}$: Admittance/susceptance matrix
$\overline{\ }/\underline{\ }$: Maximum/minimum values of the variable or vector

Abbreviations

AC/DC: Alternating current/direct current
ADN: Active Distribution Network
AMI: Advanced Metering Infrastructure
(B/M/T)ESS: (Battery/Mobile/Thermal) Energy Storage System
CHP: Combined Heat and Power system
(C)MDP: (Constrained) Markov Decision Process
CPO: Constrained Policy Optimization
CPPO: Constraint-controlled PPO
CS: Charging Station
CUP: Conservative Update Policy
DDPG: Deep Deterministic Policy Gradient
DG: Distributed Generation
DER: Distributed Energy Resource
(D/R)NN: (Deep/Recurrent) Neural Network
DSO: Distribution System Operator
(D/R)RL: (Deep/Robust) Reinforcement Learning
EHP: Electric Heat Pump
EV: Electric Vehicle
FACTS: Flexible AC Transmission System
FOCOPS: First Order Constrained Optimization in Policy Space
GCN: Graph Convolution Network
GB: Gas Boiler
Gen: Generator
GP: Gaussian Process
GPT: Generative Pre-trained Transformer
HVAC: Heating, Ventilation and Air-Conditioning
ICNN: Input Convex Neural Network
IPO: Interior-point Policy Optimization
Lag: Lagrangian methods
LLM: Large Language Model
MA(C): Multi-Agent (Constrained)
MIP: Mixed-Integer Programming
MPPT: Maximum Power Point Tracking
PCPO: Projection-based Constrained Policy Optimization
PDO: Primal-Dual Optimization
PILCO: Probabilistic Inference for Learning Control
PMU: Phasor Measurement Unit
PPO: Proximal Policy Optimization
p.u.: per unit
RES: Renewable Energy Source
RCPO: Reward Constrained Policy Optimization
SAC: Soft Actor-Critic
SafePO: Safe Policy Optimization
(SC)(O)PF: (Security Constrained) (Optimal) Power Flow
SG: Synchronous Generator
SoC: State of Charge
TD3: Twin-Delayed Deep Deterministic policy gradient
TL: Thermal Load (such as room heater and water heater)
TR(PO/M): Trust Region (Policy Optimization/Method)
V2G: Vehicle-to-Grid
V, F: Voltage, Frequency

I Introduction

With the extensive integration of RESs, ESSs, and advanced power electronic devices, modern power systems are facing increased uncertainty and complexity, which translate into a higher computational burden when modeling the stochastic, non-linear nature of the control and decision problems. However, thanks to the widespread deployment of smart sensors, such as PMUs, along with advanced communication technologies, a vast amount of power system data can be measured and utilized for state estimation and control. As a result, data-driven approaches like RL have emerged as key candidates for the numerical optimization of power system decision and/or control policies [1], which would otherwise be intractable to derive. Conventionally, RL training is based on trial-and-error and reward feedback interaction with a model-based simulated environment [2] or a data-driven model-free simulated environment [3]. Recently, DRL, which embeds NNs as the policy function, has proven expressive enough to solve complicated control tasks. Additionally, the NN approach is used to reduce computation costs for online implementation: once the NNs are trained, they approximate closed-form solutions and produce results quickly. However, nothing in this training process prevents the exploration of unsafe ranges during training or the execution of unsafe actions when the trained policies are deployed in real power systems. Therefore, the practical application of RL policies cannot be based on vanilla RL training [4].

In 2015, safe RL was first defined as “the process of learning policies that maximize the expectation of the reward in problems, where it is crucial to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes” [5]. Since then, the safe RL literature has received increasing attention. The methods can be coarsely divided into two categories: the first adds to the reward function a safety factor that penalizes safety violations, and the second modifies the exploration process during training by incorporating mechanisms that yield safe policies [5]. Building on these two approaches, numerous safe RL methods have been proposed, and many have been applied and tailored to power system decision and control problems, such as energy management, optimal power generation dispatch, EV charging, voltage control, and others that this paper covers in Section IV.

Reference [6] is currently the only paper that provides an overview of safe RL applications. However, the field is evolving quickly, and we aim to provide first a comprehensive review of safe RL techniques in general and then a deep dive into their applications in power systems. The main contributions of the paper are as follows:

1. This paper provides a comprehensive review of safe RL, covering its fundamental concepts, constraint classifications, existing algorithms, and benchmarks. It details the unique features and limitations of each RL algorithm, providing a foundation for future research endeavors in the domain of safe RL.

2. It then provides a comprehensive review of the application of safe RL in power systems, covering almost all existing papers in this area. It categorizes these papers based on their application domains, listing each paper’s objectives, constraints, implemented safe RL techniques, environment types, and key features.

3. It explores the key challenges and future research opportunities in safe RL for applications within power systems.

The framework of this paper is shown in Fig. 1. The rest of the paper is organized as follows. Section II introduces the CMDP and constraints. Section III provides a detailed introduction and classification of safe RL. Section IV offers a comprehensive review and comparative analysis of safe RL applications in different fields within the power system. Challenges and outlook are discussed in Section V and finally, Section VI concludes the paper.

Figure 1: The framework of safe RL in power system application.

II Constrained Markov Decision Process

II-A Problem formulation

An MDP is defined by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$, whose elements are, respectively, the state space, action space, state transition probability, reward function, initial state distribution, and discount factor. When the decision problem fits an MDP, the objective is to determine the policy $\pi_{\theta}$ that maximizes the expected discounted reward $\mathcal{J}_{R}^{\pi_{\theta}}$, i.e. [4, 7, 8]:

$$\mathcal{J}_{R}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right] \tag{1}$$

where $\tau\sim\pi$ indicates that the distribution over trajectories depends on the policy $\pi$; specifically, $\bm{s}_{0}\sim\rho_{0}$, $\bm{a}_{t}\sim\pi(\cdot|\bm{s}_{t})$, and $\bm{s}_{t+1}\sim\mathcal{P}(\cdot|\bm{s}_{t},\bm{a}_{t})$. Even if the transition probabilities and reward function are fully known, this task is often intractable. The usual approach is therefore to learn the policy using some parametrization.
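As an illustration of (1), the expected discounted reward can be approximated by averaging the discounted return over sampled trajectories. The following is a minimal sketch under stated assumptions: the function names and the reward values are hypothetical, and the infinite horizon is truncated to the length of each sampled trajectory.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one finite (truncated) trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_objective(reward_trajectories, gamma):
    """Monte Carlo estimate of J_R: mean discounted return over trajectories."""
    returns = [discounted_return(r, gamma) for r in reward_trajectories]
    return sum(returns) / len(returns)


# Hypothetical per-step rewards from two sampled trajectories (illustrative only).
trajectories = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]]
J_hat = estimate_objective(trajectories, gamma=0.5)
print(J_hat)
```

With a discount factor of 0.5, the two trajectories have discounted returns of 1.75 and 2.0, so the estimate is their average.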

The CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A}_{t},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$ is an extension of the standard MDP that addresses a frequent model variation: the case in which the action space $\mathcal{A}_{t}$ is a function of the state, i.e., $\bm{s}_{t}\mapsto\mathcal{A}_{t}$, either because changes in the environment affect which actions are safe or feasible, or because the action carries a state-dependent cost that the formulation requires to stay below a threshold. This occurs in physical systems in which the boundary conditions, the state, and the laws of physics determine what is feasible, what would lead to unsafe operation, and how expensive a given agent action is. In a nutshell, what differentiates the various instances of CMDPs from a conventional MDP is the class of constraints that characterize the action space as a function of the system dynamics, together with the specific engineering problem and context that define those constraints. In this review, we define the CMDP for power system problems as:

$$\max_{\pi_{\theta}\in\Pi_{S}}\mathcal{J}_{R}^{\pi_{\theta}}\quad\text{s.t.}\quad\bm{a}_{t}\text{ is feasible} \tag{2}$$

where “$\bm{a}_{t}$ is feasible” means not only that $\bm{a}_{t}$ lies within its upper and lower limits, but also that the resulting $\bm{s}_{t}$ falls within specified feasible sets. In power systems, the upper and lower bounds on $\bm{a}_{t}$ correspond to the control ranges of controllable devices, such as the power output of SGs, RESs, and ESSs, as well as the temperature setpoints of HVAC systems; they can typically be enforced by simply restricting the action space of the RL agent. Requiring $\bm{s}_{t}$ to fall within specified feasible sets means that the state adheres to safe and stable operation constraints, such as bound constraints on voltages, line flows, and building temperatures, as well as stability constraints on voltages, frequency, and rotor angles. Due to the highly non-linear and non-convex nature of power systems, obtaining a feasible $\bm{a}_{t}$ that guarantees a feasible $\bm{s}_{t}$ is challenging. This is also the main challenge in training safe RL policies.
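The simpler half of feasibility, keeping the action inside device limits, can be sketched as a component-wise clip of the raw policy output. This is only an illustrative sketch: the device ordering, units, and limit values below are hypothetical, and clipping alone does not ensure that the resulting state is feasible.

```python
def clip_action(action, lower, upper):
    """Clip each action component to its device limits [lower, upper]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]


# Hypothetical action: [SG output (MW), BESS power (MW), HVAC setpoint (deg C)].
raw_action = [120.0, -8.0, 18.0]
lower = [20.0, -5.0, 20.0]
upper = [100.0, 5.0, 26.0]

bounded_action = clip_action(raw_action, lower, upper)
print(bounded_action)
```

State-dependent requirements such as voltage or line-flow limits need one of the dedicated safe RL mechanisms, because they depend on how the system responds to the action.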

II-B Constraints

II-B1 Instantaneous Constraints

Instantaneous constraints are prevalent in power systems. For instance, in the optimal power generation dispatch of power systems, we encounter constraints such as power flow, dynamic limitations associated with BESSs, voltage magnitude bounds, and power generation limits, as detailed in Section IV-A. Another instance is voltage control, which incorporates additional voltage droop control dynamics and stability constraints, described in Section IV-B. We also explore other examples such as stability control, EV charging control, and building energy management in Section IV. In general, these constraints can be expressed as follows:

$$\begin{aligned}\max_{\pi_{\theta}\in\Pi_{S}}\quad&\mathcal{J}_{R}^{\pi_{\theta}}\\ \text{s.t.}\quad&g_{j}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})=0,\quad j=1,\cdots,m\\ &h_{k}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq 0,\quad k=1,\cdots,n\end{aligned} \tag{3}$$

where the control action must satisfy both the $m$ equality constraints and the $n$ inequality constraints. We include $\bm{s}_{t}$ and $\bm{s}_{t+1}$ in these constraints to represent the time-varying bounds on $\bm{a}_{t}$. The dynamical constraints are likewise expressed through these constraints.
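A per-transition feasibility check in the spirit of (3) can be sketched as follows. The constraint functions here are hypothetical toy examples (a lossless power balance and a 0.95 to 1.05 p.u. voltage band) chosen only for illustration; they are not the models used in the reviewed papers.

```python
def is_feasible(s, a, s_next, eq_constraints, ineq_constraints, tol=1e-6):
    """Check g_j(s, a, s') = 0 and h_k(s, a, s') <= 0 up to a tolerance."""
    eq_ok = all(abs(g(s, a, s_next)) <= tol for g in eq_constraints)
    ineq_ok = all(h(s, a, s_next) <= tol for h in ineq_constraints)
    return eq_ok and ineq_ok


# Hypothetical toy constraints (assumptions for illustration only).
def power_balance(s, a, s_next):
    return sum(a) - s["load_mw"]        # equality: generation equals load


def voltage_upper(s, a, s_next):
    return s_next["v_pu"] - 1.05        # inequality: v <= 1.05 p.u.


def voltage_lower(s, a, s_next):
    return 0.95 - s_next["v_pu"]        # inequality: v >= 0.95 p.u.


state = {"load_mw": 150.0}
action = [100.0, 50.0]
eqs, ineqs = [power_balance], [voltage_upper, voltage_lower]

ok = is_feasible(state, action, {"v_pu": 1.03}, eqs, ineqs)
violated = is_feasible(state, action, {"v_pu": 1.07}, eqs, ineqs)
print(ok, violated)
```

The check must pass at every time step, which is what distinguishes instantaneous constraints from constraints that are only enforced on average.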

II-B2 Cumulative Constraints

Cumulative constraints mandate that the sum or average of a specific cost signal remains within prescribed limits, calculated from the beginning of an event to the present time. Examples include total revenue and network throughput. These constraints are commonly applied in robot locomotion and manipulation, as discussed in [9]. Although several studies have attempted to adapt these constraints to power systems as a more flexible alternative to hard constraints, the application remains limited. For instance, [10] employs a discounted cumulative formulation in (4) to establish safety constraints in the management of distribution networks. In particular, they relax instantaneous constraints, such as voltage bounds, SoC bounds, and power quality, to a discounted cumulative formulation. Similarly, [11, 12] also utilize this approach. However, such constraints may not fully capture all safety requirements, though they do offer a partial enhancement of safety measures, providing some benefit over no constraints at all. The reason these studies do not consider instantaneous constraints is that cumulative relaxation offers a straightforward method to adapt constrained RL techniques, originally developed for robot locomotion and manipulation, to power systems. This approach not only simplifies implementation but also provides methodological insights that could potentially be extended to handle instantaneous constraints in future research.

To make the review more self-contained, we will review three kinds of cumulative constraints. In [13], the constraints for safe RL are divided into cumulative constraints and instantaneous constraints. For cumulative constraints, they are further categorized as discounted cumulative constraints (4), mean valued constraints (5), and probabilistic constraints (6). The discounted cumulative constraint is of the form:

$$\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]\leq\varepsilon_{i} \tag{4}$$

where $\varepsilon_{i}$ is the limit for each cumulative constraint.

The mean valued constraint is of the form:

$$\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\frac{1}{t_{\text{tot}}}\sum_{t=0}^{t_{\text{tot}}-1}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]\leq\varepsilon_{i} \tag{5}$$

where $t_{\text{tot}}$ is the total number of time steps in each trajectory.

The third kind concerns the probability that the cumulative costs violate a constraint [13]. Probabilistic constraints are of the form:

$$\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{P}\left[\sum_{t}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq\varepsilon_{i}\right]\geq\zeta \tag{6}$$

where $\varepsilon_{i}$ is the cumulative cost threshold for each trajectory and $\zeta\in(0,1)$ is the required safety probability.
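The three cumulative forms (4), (5), and (6) can be compared on the same set of sampled cost trajectories. The sketch below uses sample averages in place of the expectation and probability; the function names and cost values are hypothetical, and in (6) the cost threshold is taken to be $\varepsilon_{i}$ and the required probability to be $\zeta$, as written in the equation.

```python
def discounted_cost_estimate(cost_trajs, gamma):
    """Sample estimate of the left-hand side of (4)."""
    totals = [sum((gamma ** t) * c for t, c in enumerate(traj))
              for traj in cost_trajs]
    return sum(totals) / len(totals)


def mean_cost_estimate(cost_trajs):
    """Sample estimate of the left-hand side of (5)."""
    means = [sum(traj) / len(traj) for traj in cost_trajs]
    return sum(means) / len(means)


def satisfaction_probability(cost_trajs, eps):
    """Fraction of trajectories whose total cost stays within eps, as in (6)."""
    within = sum(1 for traj in cost_trajs if sum(traj) <= eps)
    return within / len(cost_trajs)


# Hypothetical per-step constraint costs h_i for four sampled trajectories.
costs = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [2, 2, 0, 0]]

j_disc = discounted_cost_estimate(costs, gamma=0.5)
j_mean = mean_cost_estimate(costs)
p_safe = satisfaction_probability(costs, eps=1.0)

print(j_disc <= 1.0, j_mean <= 0.5, p_safe >= 0.9)
```

In this toy data the discounted and mean-valued constraints hold for the chosen limits while the probabilistic constraint with $\zeta=0.9$ does not, because one trajectory in four exceeds the threshold. This illustrates why an averaged constraint can hide individual unsafe trajectories.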

Here, it is important to emphasize again that in power systems, the majority of constraints must be satisfied at every instant, thus they are commonly implemented as instantaneous constraints. For example, [14] utilizes the expected discounted reward, whereas constraints related to branch power flow and security operations are treated as instantaneous constraints.

II-C Constraints in Power Systems: Overview

In power system applications, the classification of constraints into instantaneous and cumulative constraints is related to the required degree of constraint satisfaction and to the safe RL algorithms used. Typically, bus balance equations, upper and lower power limits of various equipment, ESS capacity constraints, certain voltage amplitude constraints, and some stability constraints are considered hard constraints. Safe RL algorithms capable of ensuring the satisfaction of hard constraints include the projection method (Section III-B), the Lyapunov method (Section III-C), the shielding method (Section III-E), the safety layer method (Section III-F), and the barrier function method (Section III-G). For example, [15] uses the logarithmic barrier function to make $\mathcal{J}_{h_{i}}^{\pi_{\theta}}$ approach infinity when the voltage exceeds its bounds, thereby satisfying hard voltage constraints. Due to discrepancies between models and real systems, the uncertainties of RESs and loads, and algorithmic shortcomings, constraints that are satisfied in theory may still not be guaranteed in actual deployment. Therefore, GP methods (Section III-D) and RRL (Section III-G) have been proposed, using the probabilistic/chance constraint (6). However, their application in power systems remains underexplored. A more common approach is to use constrained game-theoretic RL within RRL [14, 16]. Furthermore, some safe RL algorithms can, by design, only encourage constraint satisfaction while maximizing rewards. Such algorithms include Lagrangian relaxation (Section III-A) and penalty functions. For example, [17] uses the voltage constraint metric $\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\sum_{i\in\mathcal{N}}\max\left\{|\bm{v}_{i,t}-1|-0.05,0\right\}$ and employs Lagrangian relaxation for voltage control, which cannot guarantee absolute adherence to the voltage constraints; the constraint is therefore classified as a soft constraint. For other constraints, such as user satisfaction with EV charging and voltage control at certain nodes, the goal is to approach standard values as closely as possible, making them inherently soft constraints. The different constraints of safe RL are illustrated in Fig. 2.
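The voltage constraint metric quoted from [17] can be sketched as a simple function, reading it as the sum over buses of $\max\{|v_{i,t}-1|-0.05,\,0\}$ with voltages in p.u. The function name and the sample voltage profiles are hypothetical.

```python
def voltage_violation(voltages_pu, nominal=1.0, band=0.05):
    """Total magnitude by which bus voltages leave the [0.95, 1.05] p.u. band."""
    return sum(max(abs(v - nominal) - band, 0.0) for v in voltages_pu)


# Hypothetical bus voltage magnitudes in p.u.
inside_band = [1.00, 1.04, 0.96]
outside_band = [1.00, 1.04, 1.08, 0.93]

print(voltage_violation(inside_band))    # no bus leaves the band
print(voltage_violation(outside_band))   # two buses leave the band
```

Because the metric is zero anywhere inside the band and grows linearly outside it, it works as a penalty signal for Lagrangian relaxation but does not by itself prevent excursions.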

III Safe Reinforcement Learning

Safe RL is often formulated as a CMDP problem, where the objective is to maximize the reward of agents while ensuring that the agents satisfy safety constraints [18, 4]. Safe RL is categorized into different types from various perspectives. This section primarily categorizes these types based on the techniques used to ensure constraint satisfaction and provides detailed introductions of the techniques and benchmarks.

III-A Lagrangian Relaxation / Primal-Dual Method

Lagrangian relaxation, also known as primal-dual method, is the most common technique in safe RL. The key idea of this method is to transform the CMDP problem into an unconstrained dual problem. This is achieved by employing adaptive Lagrange multipliers to penalize constraints [19]:


$$\textbf{Instantaneous:}\quad\min_{\lambda_{i}\geq 0}\max_{\theta}\mathcal{L}(\lambda_{i},\theta)=\min_{\lambda_{i}\geq 0}\max_{\theta}\left[\mathcal{J}_{R}^{\pi_{\theta}}-\sum_{i}\lambda_{i}\cdot h_{i}\right] \tag{7a}$$

$$\textbf{Cumulative:}\quad\min_{\lambda_{i}\geq 0}\max_{\theta}\mathcal{L}(\lambda_{i},\theta)=\min_{\lambda_{i}\geq 0}\max_{\theta}\left[\mathcal{J}_{R}^{\pi_{\theta}}-\sum_{i}\lambda_{i}\cdot\left(\mathcal{J}_{h_{i}}^{\pi_{\theta}}-\varepsilon_{i}\right)\right] \tag{7b}$$

The solution of (7) relies on Danskin’s theorem and convex analysis [20]. Due to its straightforward implementation and compatibility with both on-policy and off-policy methods, Lagrangian relaxation has been integrated with other RL algorithms, fostering the creation of numerous variants, such as DDPG-Lag, PPO-Lag, TRPO-Lag, TD3-Lag, SAC-Lag, MAPPO, RCPO, PDO, TRPO-PID, CPPO-PID, DDPG-PID, TD3-PID, SAC-PID [21, 22, 19, 23].
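The alternating primal and dual updates behind (7b) can be illustrated on a toy scalar problem. The sketch below is an assumption-laden stand-in, not any of the listed algorithms: the reward $\mathcal{J}_{R}(\theta)=-(\theta-2)^{2}$, the cost $\mathcal{J}_{h}(\theta)=\theta$, the limit $\varepsilon=1$, and the step sizes are all hypothetical, and exact gradients replace the sampled policy-gradient estimates used in practice.

```python
def primal_dual(steps=2000, lr_theta=0.1, lr_lambda=0.1, eps=1.0):
    """Gradient ascent on theta and projected gradient descent on lambda for
    L(lambda, theta) = J_R(theta) - lambda * (J_h(theta) - eps)."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: dL/dtheta = dJ_R/dtheta - lambda * dJ_h/dtheta.
        grad_theta = -2.0 * (theta - 2.0) - lam
        theta += lr_theta * grad_theta
        # Dual step: lambda grows while the constraint is violated,
        # and is projected back onto lambda >= 0.
        lam = max(0.0, lam + lr_lambda * (theta - eps))
    return theta, lam


theta_star, lam_star = primal_dual()
print(round(theta_star, 4), round(lam_star, 4))
```

The unconstrained maximizer is $\theta=2$, which violates the limit, so the multiplier rises until the iterate settles on the constraint boundary $\theta=1$ with $\lambda=2$. During the transient the iterate temporarily exceeds the limit, which reflects why this family of methods is described as encouraging rather than guaranteeing constraint satisfaction.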

The Lagrangian relaxation method is the most commonly used approach in power systems, capable of being easily integrated with various algorithms for application across a wide range of domains. Based on instantaneous or hard constraints, [24] utilizes a primal-dual approach to optimize the control of power generation and BESS charging and discharging actions in a multi-stage real-time stochastic dynamic OPF. Additionally, [25] applies constrained SAC to the Volt-VAR control problem by synergistically combining the merits of the maximum-entropy framework, the method of multipliers, a device-decoupled neural network structure, and an ordinal encoding scheme. Furthermore, [26] employs constrained RL for the predictive control of OPF, paired with EV charging control. On the other hand, based on cumulative or soft constraints, [27] approximates the actor gradients by solving the Karush-Kuhn-Tucker conditions of the Lagrangian, instead of constructing reward critic networks and cost critic networks through interactions with the environment. Then, the interior point method is incorporated to derive the parameter updating rule for the DRL agent. Similarly, [28] develops a soft-constraint enforcement method to adaptively encourage the control policy in the safety direction with nonconservative control actions and find decisions with near-zero degrees of constraint violations.

III-B Projection Method / Trust Region Method

The TRM ensures constraint satisfaction at every step and enhances performance by updating the trust region policy gradient and projecting the policy into a safe feasible set during each iteration [29]. Typical projection methods include CPO [9], PCPO [30], FOCOPS [31], CUP [32], and MACPO [22]. Among these, PCPO is implemented through a two-step process: it first conducts a local reward update and then projects the policy back onto the constraint set to address any constraint violations, as depicted in Fig. 3.

In the power system domain, TRMs have also seen widespread application. For instance, [33] introduced a projection-embedded MA-DRL algorithm that smoothly and effectively restricts the DRL agent action space to prevent any violations of physical constraints, thereby achieving decentralized optimal control of distribution grids with a guaranteed 100% safety rate. Additionally, in the area of EV charging problems, [34] utilizes a penalty function to penalize the neural network output if it exceeds the action space and uses a projection operator to avoid incurring a negative reward when no EV is occupying the charging bay. In addition, [35] employs CPO for volt-VAR control to minimize the total operation costs while satisfying the physical operation constraints. However, TRMs, primarily based on TRPO or PPO, are not easily integrated with other RL types and are computationally intensive in high dimensions, limiting their suitability for large-scale safe RL problems [36].

III-C Lyapunov Method

Lyapunov functions, widely used in control engineering for controller design [37], were first applied to safe RL in [38]. The application of the Lyapunov method in power systems is limited because it requires prior knowledge of a Lyapunov function. If the model of environmental dynamics is unknown, identifying a suitable Lyapunov function can be challenging. For example, [39] integrates a Lyapunov function into the structural properties of primary frequency controllers, guaranteeing local asymptotic stability over a large set of states. Additionally, [40] utilizes Lyapunov theory to design the controller that satisfies specific Lipschitz constraints for decentralized inverter-based voltage control. In addition, [41] utilizes a stability-constrained RL method for real-time voltage control in distribution grids, providing a formal voltage stability guarantee using the Lyapunov function.

III-D Gaussian Process Method

GP [42] is widely utilized in numerous approaches to estimate uncertainty and identify unsafe areas. Consequently, assessments based on GP can be incorporated into the learning process to enhance agent safety [43]. GP-based safe RL algorithms include SafeOpt [44] and PILCO [45]. The application of GP method-based safe RL in power systems is limited, meriting further research to adequately address the various uncertainties inherent in power systems. The potential disadvantage of GP methods is their computational complexity and scalability issues, especially as the dimensionality of the problem space increases [36].

III-E Shielding Method

In [46], the shield is introduced for the first time in RL. This shield is computed in advance, based on the safety component of the system specification provided and an abstraction of the dynamics of the agent’s environment. It guarantees safety with minimal interference, implying that the shield limits the agent’s actions as little as necessary, only prohibiting actions that could jeopardize the safe behavior of the system. The shielded RL is shown in Fig. 4.

Shielding is a method that enforces constraint satisfaction, making it highly suitable for power system problems with hard constraints. For instance, in [47], actions that would lead to dangerous states, such as the SoC of BESSs being fully charged or depleted, are substituted by the shielding mechanism with safe actions to maintain system stability. Additionally, [48] combines a correction model adapted from gradient descent with the prediction model as a post-posed shielding mechanism to enforce safe actions in computer room air conditioning unit control problems. In addition, in unit commitment scheduling, [49] utilizes action space clipping to ensure that uncertainty estimates are reasonable and within appropriate bounds obtained from historical data. A potential drawback of the shielding method is the challenge of identifying feasible, safe actions based on infeasible ones, which requires underlying knowledge of the system. This can be difficult for certain complex systems or specific control scenarios [36].
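To make the shielding idea concrete, the sketch below replaces a BESS power command that would push the SoC outside its limits with the closest feasible command. It is a minimal sketch under assumed conventions (positive power charges the battery, a lossless SoC update, and hypothetical limits of 0.1 and 0.9); it is not the mechanism of [47].

```python
def shield_bess_action(soc, p_cmd, dt=1.0, capacity=1.0, soc_min=0.1, soc_max=0.9):
    """Return (safe_power, intervened) for a proposed BESS power command.

    Assumptions of this sketch: positive power charges the battery and
    the SoC evolves as soc + p * dt / capacity (no losses).
    """
    next_soc = soc + p_cmd * dt / capacity
    if soc_min <= next_soc <= soc_max:
        # The proposed action is safe, so the shield does not interfere.
        return p_cmd, False
    # Otherwise substitute the action that lands exactly on the violated bound.
    target_soc = min(max(next_soc, soc_min), soc_max)
    return (target_soc - soc) * capacity / dt, True


print(shield_bess_action(0.5, 0.2))
print(shield_bess_action(0.85, 0.2))
```

The shield only intervenes in the second call, which reflects the minimal-interference property described above.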

III-F Safety Layer Method

Both the safety layer and shielding methods integrate safety into the RL process, but they differ in their implementation: the safety layer acts as an additional check within the RL framework, whereas shielding employs an external system (the shield) that intervenes only when necessary to prevent unsafe actions. The safety layer method, first proposed in [50] for continuous action spaces in RL, emphasizes maintaining zero constraint violations throughout the learning process. It expresses safety constraints as linear functions of the action through a first-order approximation. Assuming that at most one constraint is violated at any time, an analytical solution to the safety layer optimization problem can be directly obtained. The linearization equation and a visualization of the safety layer are shown in (8) and Fig. 5, respectively.

\overline{h}_{i}(s_{t+1})\triangleq h_{i}(s_{t},a_{t})\approx\overline{h}_{i}(s_{t})+g(s_{t};w_{i})^{T}a_{t}

(8)

where $w_{i}$ are the weights of the NN, and $g(s_{t};w_{i})$ denotes the first-order approximation of $h_{i}(s_{t},a_{t})$ with respect to $a_{t}$.

The safety layer method has been widely applied in power systems. For example, in optimal power generation dispatch, [51] proposes a hybrid knowledge-data-driven safety layer to convert unsafe actions into the safety region, which is accelerated by a security-constrained linear projection model. Additionally, in volt-VAR control, [52] adds a safety layer to the policy neural network to enhance operational constraint satisfaction during both the initial exploration phase and the convergence phase. In addition, [53] uses action clipping, reward shaping, and expert demonstrations to ensure safe exploration and accelerate the training process during the online training stage for the service restoration problem. However, the linear approximation in the safety layer might not accurately capture the complexities of the underlying dynamics in highly nonlinear systems, and solving the correction problem at every time step could introduce a significant computational burden. Moreover, the assumption that at most one constraint is violated at a time may not hold in complex environments where multiple safety constraints are concurrently active.
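The analytical correction enabled by the linearization in (8) can be sketched as follows. The sketch assumes that safety means $\overline{h}_{i}(s_{t})+g(s_{t};w_{i})^{T}a_{t}\leq 0$ and that at most one constraint is active, so only the most violated linearized constraint is corrected. The function name and the plain-list inputs are hypothetical; in practice $g$ would come from a trained NN.

```python
def safety_layer(action, h_bar, grads):
    """Project an action onto the most violated linearized constraint.

    action: proposed action (list of floats)
    h_bar:  current constraint values h_bar_i(s_t), safe when <= 0 (assumed)
    grads:  sensitivities g(s_t; w_i), one list per constraint
    """
    best_lam, best_g = 0.0, None
    for h_i, g_i in zip(h_bar, grads):
        g_sq = sum(g * g for g in g_i)
        if g_sq == 0.0:
            continue  # the action cannot influence this constraint
        predicted = h_i + sum(g * a for g, a in zip(g_i, action))
        lam = max(0.0, predicted / g_sq)  # optimal multiplier if i is the only active constraint
        if lam > best_lam:
            best_lam, best_g = lam, g_i
    if best_g is None:
        return list(action)  # no predicted violation, action passes unchanged
    return [a - best_lam * g for a, g in zip(action, best_g)]


print(safety_layer([1.0, 0.0], [-0.5], [[1.0, 0.0]]))
```

When several constraints are violated at once, correcting only the worst one may leave the others violated, which is the limitation noted above.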

III-G Barrier Function Method

The barrier function method adds a barrier function penalty term to the original objective function. When the system state approaches the safety boundary, the magnitude of the constructed barrier term grows without bound, which heavily penalizes the objective and thereby keeps the state within the safe boundary [54]. The most typical barrier function method is IPO, which augments the objective with logarithmic barrier functions, drawing inspiration from the interior-point method [55]:


$\displaystyle\textbf{Instantaneous}:$	$\displaystyle~\max_{\theta}J_{R}^{\pi_{\theta}}+\sum_{i}\frac{1}{t_{i}}\log(-h_{i})$	(9a)
$\displaystyle\textbf{Cumulative}:$	$\displaystyle~\max_{\theta}J_{R}^{\pi_{\theta}}+\sum_{i}\frac{1}{t_{i}}\log(-J_{h_{i}}^{\pi_{\theta}}+\varepsilon_{i})$	(9b)

where $t_{i}$ is a hyperparameter associated with constraint $h_{i}$. An illustration of IPO is shown in Fig. 6.
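The cumulative objective (9b) can be evaluated with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and returning negative infinity for an infeasible policy is an assumption of this sketch rather than a detail stated in the text.

```python
import math


def ipo_objective(j_reward, j_costs, eps, t):
    """Evaluate J_R + sum_i (1 / t_i) * log(eps_i - J_h_i), as in (9b)."""
    total = j_reward
    for j_h, eps_i, t_i in zip(j_costs, eps, t):
        slack = eps_i - j_h
        if slack <= 0.0:
            # The logarithm is undefined outside the feasible region.
            return float("-inf")
        total += math.log(slack) / t_i
    return total


print(ipo_objective(1.0, [0.5], [1.0], [2.0]))
print(ipo_objective(1.0, [0.9], [1.0], [2.0]))
```

A larger $t_{i}$ weakens the barrier away from the boundary, so the augmented objective tracks $J_{R}^{\pi_{\theta}}$ more closely while still diverging as the slack $\varepsilon_{i}-J_{h_{i}}^{\pi_{\theta}}$ vanishes.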

The barrier function method and IPO have been widely applied in power systems to enforce safety constraints. For example, [12] utilizes IPO to ensure the fulfillment of distribution network constraints without the need for designated penalty terms and the associated tuning of penalty factors, or repeatedly solving optimization problems for action rectification. Additionally, [56] uses IPO to facilitate desirable learning behavior towards constraint satisfaction and policy improvement simultaneously during online preventive control for transmission overload relief. In addition, [57] proposes a safe RL method for emergency load shedding in power systems, where the reward function includes a barrier function that approaches negative infinity as the system state approaches safety bounds. However, the accurate formulation and tuning of barrier functions necessitate knowledge of system dynamics, which can be challenging in complex environments.

III-H Robust Reinforcement Learning

One of the challenges in RL is generalization under uncertainties not seen during training. To address this, RRL frameworks have been developed, focusing on enhancing the reliability and robustness of RL agents for the worst-case scenarios [58, 59]. Two notable approaches in this context are chance-constrained RRL and constrained game-theoretic RL. It is important to note that RRL is not universally recognized as a safe RL algorithm in other fields. However, due to the significant uncertainties in power systems, RRL is employed to enhance control robustness and is reviewed here.

III-H1 Chance-constrained RRL

Chance-constrained RRL, in particular, focuses on ensuring that policies perform well under uncertain conditions by incorporating probabilistic constraints into the learning process [60]. In this framework, the goal is not just to maximize expected rewards but to do so while ensuring that the probability of undesirable outcomes (e.g., safety violations) remains below a specified threshold [61]. This is particularly important in scenarios where safety and reliability are critical, such as autonomous driving or robotics [62]. The general form can be expressed as:

	$\displaystyle\max_{\pi}\mathcal{J}_{R}^{\pi_{\theta}}$
	$\displaystyle\text{s.t.}~~\mathbb{P}\left[\min_{i}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq\varepsilon_{i}\right]\geq\zeta,\forall t\in\mathcal{T}$		(10)
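In practice the probability in (10) is rarely available in closed form and is often estimated from sampled scenarios. The sketch below checks a single scalar constraint against the confidence level $\zeta$ with a plain Monte Carlo estimate. Restricting to one constraint, the function names, and the sample values are simplifying assumptions of this sketch.

```python
def satisfaction_rate(h_samples, eps):
    """Fraction of sampled constraint values that satisfy h <= eps."""
    return sum(1 for h in h_samples if h <= eps) / len(h_samples)


def meets_chance_constraint(h_samples, eps, zeta):
    """Empirical check of P[h <= eps] >= zeta for one constraint."""
    return satisfaction_rate(h_samples, eps) >= zeta


# Hypothetical constraint values collected from four sampled scenarios.
samples = [0.1, 0.2, 0.3, 1.5]
print(satisfaction_rate(samples, eps=1.0))
print(meets_chance_constraint(samples, eps=1.0, zeta=0.9))
```

The estimate is only as reliable as the number and representativeness of the sampled scenarios, so a small sample like the one above is useful for illustration only.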

III-H2 Constrained game-theoretic RL

Constrained game-theoretic RL is a framework that models the interaction between the RL agent and its environment as a game, specifically focusing on scenarios where there are constraints that the agent must respect during the learning and decision-making processes [63]. The objective is to maximize the agent’s rewards while minimizing the possible losses or costs, considering the worst-case scenarios posed by adversaries’ actions or environmental uncertainties [64]. The problem can be formulated as a minimax optimization [63]:

	$\displaystyle\min_{\pi_{\theta}^{\text{adv}}}\max_{\pi_{\theta}}$	$\displaystyle~\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},a_{t}^{\text{adv}},s_{t+1})\right]$
	$\displaystyle\text{s.t.}~~h_{i}(s_{t},a_{t},a_{t}^{\text{adv}},s_{t+1})\leq 0,\forall t\in\mathcal{T}$		(11)

One of the key benefits of constrained game-theoretic RL is its ability to handle competitive and cooperative interactions within complex environments, making it suitable for applications ranging from strategic games to cooperative multi-agent scenarios like mobile edge computing [65] and coordination in robotic teams [66].

RRL is applied in power systems to ensure that control strategies remain robust under various uncertainties. For example, [14] utilizes adversarial safe RL to address the model inaccuracy and uncertainty of virtual power plants without relying on an accurate environmental model. Additionally, in the sequential OPF problem, [51] employs a bi-level robust optimization approach to optimize the training loss of the Q network. In addition, in the inverter-based volt-VAR control problem, [16] develops a highly efficient adversarial RL algorithm to train an offline agent that is robust to model mismatches during the offline stage.

III-I Benchmarks

Benchmarks include both benchmark environments and benchmark algorithms. Safety Gym, developed by OpenAI, is the first widely recognized benchmark environment for safe RL. It includes an environment-builder and a suite of pre-configured benchmark environments [21, 67]. Correspondingly, Safety Starter Agents, a benchmark algorithm library, has been developed based on Safety Gym [68]. The supported algorithms in this library include PPO, PPO-Lag, TRPO, TRPO-Lag, SAC, SAC-Lag, and CPO. This package has been tested on macOS Mojave and Ubuntu 16.04 LTS and is likely compatible with most recent Mac and Linux operating systems.

Safety Gymnasium, an update and extension of Safety Gym, has become the mainstream platform [69, 70]. Correspondingly, SafePO, a benchmark repository for safe RL algorithms, has been proposed [71]. SafePO is tested on Linux and potentially supports macOS or Windows, requiring only modifications to the Linux path and sort functions for compatibility.

SafePO further extends the variety of supported safe RL algorithms, as illustrated in Fig. 7.

OmniSafe emerges as the first unified learning framework in the field of safe RL, featuring a highly modular framework that includes a comprehensive collection of algorithms specifically developed for safe RL across various domains. Its versatility comes from an abstracted algorithm structure and a well-designed API, facilitating seamless integration of different components, thereby simplifying extension and customization for developers. Additionally, OmniSafe enhances algorithm learning speeds through process parallelism, supporting both environment-level and agent asynchronous parallel learning. OmniSafe is supported and tested on Linux and also supports M1 and M2 versions of macOS. However, it does not support Windows [72, 73]. The supported safe RL algorithms of OmniSafe are shown in Table I.

TABLE I: Supported Safe RL Algorithms of OmniSafe

Domains	Types	Algorithms Registry
On Policy	Primal-Dual	TRPO-Lag; PPO-Lag; PDO; RCPO
	Convex Optimization	CPO; PCPO; FOCOPS; CUP
	Penalty Function	IPO; P3O
	Primal	OnCRPO
Off Policy	Primal-Dual	DDPG-Lag; TD3-Lag; SAC-Lag
		DDPG-PID; TD3-PID; SAC-PID
Model-based	Online Plan	SafeLOOP; CCEPETS; RCEPETS
	Pessimistic Estimate	CAPPETS
Offline	Q-Learning-Based	BCQ-Lag; C-CRR
	DICE-Based	COptDICE
	ET-MDP	PPO/TRPO-EarlyTerminated
Other MDP	SauteRL	PPOSaute; TRPOSaute
	SimmerRL	PPOSimmer-PID; TRPOSimmer-PID

Overall, Safety Gymnasium is the current mainstream benchmark environment, and OmniSafe has integrated Safety Gymnasium to ensure overall code compatibility. It is important to note that Safety Gymnasium was primarily developed for control in gaming, robotics, autonomous driving, etc., featuring a series of agents such as point, car, dog, and ant. It offers several environments tailored for challenges such as safe navigation, safe velocity, and safe vision, but it is not directly applicable to power system problem formulations. Hence, there is a need to develop corresponding power system control environments based on the environment templates provided by Safety Gymnasium. In terms of benchmark algorithms, OmniSafe offers a more comprehensive set of algorithms but currently does not support Windows due to difficulties with Python library installations. In contrast, SafePO is more easily extended to Windows. Since most professional power system software is developed for Windows, with less support for Linux and macOS, this may limit the application of OmniSafe in model-based environments. However, if surrogate models are used to substitute for physical models in a model-free environment, OmniSafe can be utilized on Linux or macOS.
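To indicate what such a power system environment might look like, the sketch below implements a toy single-generator dispatch environment in plain Python. It mirrors the six-element step return (observation, reward, cost, terminated, truncated, info) that Safety Gymnasium uses to separate the safety cost from the reward, as we understand that convention. The class name, the cost definition, and all parameter values are hypothetical, and the class does not inherit from any Safety Gymnasium template.

```python
class ToyDispatchEnv:
    """Single-generator dispatch toy with a Safety-Gymnasium-style step signature."""

    def __init__(self, load_profile, p_min=0.0, p_max=2.0, tol=0.05, a=0.1, b=1.0):
        self.load_profile = list(load_profile)
        self.p_min, self.p_max, self.tol = p_min, p_max, tol
        self.a, self.b = a, b  # assumed quadratic fuel-cost coefficients
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.load_profile[0]], {}

    def step(self, p_gen):
        load = self.load_profile[self.t]
        # Reward: negative fuel cost of the dispatched power.
        reward = -(self.a * p_gen ** 2 + self.b * p_gen)
        # Cost: 1 if generator limits or power balance (within tol) are violated.
        out_of_limits = not (self.p_min <= p_gen <= self.p_max)
        imbalance = abs(p_gen - load) > self.tol
        cost = 1.0 if (out_of_limits or imbalance) else 0.0
        self.t += 1
        terminated = self.t >= len(self.load_profile)
        next_load = self.load_profile[min(self.t, len(self.load_profile) - 1)]
        return [next_load], reward, cost, terminated, False, {"load": load}


env = ToyDispatchEnv([1.0, 1.2])
obs, info = env.reset()
print(env.step(1.0))
```

A realistic environment would replace the toy balance check with a power flow solver or a surrogate model and expose the violations of (15) or (19) as the cost signal.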

IV Power System Applications of Safe RL

This review synthesizes a broad collection of studies and applications of safe RL in power systems, covering a wide array of domains: optimal power generation dispatch, voltage control, stability control, EV charging control, building energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Safe RL algorithms used in various application domains are presented in Fig. 1. As depicted in Fig. 8, RL-based schemes collect power system measurements, including PMU and AMI readings, and integrate system model knowledge into their policy training. They take action to control power system devices, ensuring safety requirements like feasibility, stability, and robustness are met. The research problem or objective function, constraint, constraint type (cumulative/instantaneous and hard/soft), applied safe constraint techniques, and key features are reviewed to compare studies that use safe RL across various domains.

IV-A Optimal Power Generation Dispatch

TABLE II: Safe RL Applications in Optimal Power Generation Dispatch

Research	Problem/Objective	Constraint	Constraint Type	Safety Constraint Techniques	Key Features
[27]	Minimize the total generation cost	Physical operation constraints	Cum/Soft	Primal-dual method (III-A)	Combines the primal-dual DDPG with the classic SCOPF model. The actor gradients are approximated by solving the Karush-Kuhn-Tucker conditions of the Lagrangian.
[24]	Minimize the fuel costs and power loss from BESSs	Physical constraints	Ins/Hard	Projection (III-B) and primal-dual method (III-A)	A primal-dual approach is introduced to learn optimal constrained DRL policies specifically for predictive control in real-time stochastic dynamic OPF.
[74]	Minimize the total system cost	Physical constraints	Cum/Hard	Safety layer (III-F)	Unsafe actions are projected into the safe action space while constrained zonotope set is used to improve efficiency.
[75]	Minimize the cost of thermal power MESS	Power grid and MESSs constraints	Ins/Hard	Proximal gradient projection (III-B)	MESSs are modeled as CMDP, and a framework is proposed based on a DRL algorithm that considered the discrete-continuous hybrid action space of the MESSs.
[15]	Minimize the total energy cost	Power system constraints	Cum/Hard	Lagrange relaxation (III-A) and logarithmic barrier (III-G)	Function approximation addresses large, continuous state spaces, while a diffusion strategy coordinates actions of DG units and ESSs.
[76]	Minimize the generator fuel cost	Power system constraints	Ins/Hard	Safety layer (III-F)	The proposed method uses physics-driven parameters for easy modification and less conservative, easily re-parameterizable actions.
[77]	Minimize the operating cost	Power system constraints	Ins/Hard	Safety layer (III-F)	To avoid line overload, a safety layer is added by introducing transmission constraints to avoid dangerous actions and tackle sequential security-constrained OPF problem.
[10]	Minimize the total operating cost	Physical constraints of system and devices	Cum/Hard	CPO (III-B)	To optimize both discrete and continuous actions, a stochastic policy based on a joint distribution of mixed random variables is designed and learned through a NN approximator.
[11]	Minimize the total cost of operation of microgrids	Global and local constraints	Cum/Soft	Lagrangian relaxation (III-A) and projection (III-B)	The training process employs the gradient information of operational constraints to ensure that the optimal control policy functions generate safe and feasible decisions.
[78]	Minimize the operational cost	Operation and power balance constraints	Cum/Hard	CPO (III-B) and invalid action masking (III-E)	Invalid action masking is applied to avoid invalid actions, accomplished by replacing the logits of the actions to be masked with a large negative number.
[79]	Minimize the total operational cost	AC-PF constraints	Cum/Hard	CPO (III-B)	Contrary to traditional DRL methods, the proposed method constrains exploration to only those policies that comply with AC-PF constraints.
[28]	Minimize the total operational cost	Gas system and power system constraints	Cum/Soft	Lagrangian relaxation (III-A)	The penalty is adaptively updated based on the extent of constraint violation, facilitating the prediction of near-optimal control actions that achieve near-zero degrees of violation.
[80]	Minimize the operating cost for the whole horizon	Operational constraints	Ins/Hard	MIP formulation	The action-value function, approximated through a DNN, is structured as a MIP formulation, enabling the inclusion of constraints within the action space.
[81]	Optimize the total generation cost	Operational and linguistic stipulation constraints	N.A./Soft	Primal-dual method (III-A)	For the first time, a GPT LLM is integrated into the OPF framework alongside linguistic rules. This novel approach models and quantifies natural language stipulations as objectives and constraints within a primal-dual DRL loop.
[82]	Minimize the total operation cost	Operational constraints	N.A./Soft	Lagrangian relaxation (III-A)	Instead of using the critic network, the deterministic gradient is derived analytically and solved by using interior point method.
[83]	Minimize the total energy cost	Satisfaction of the energy demand	Cum/Soft	Lagrangian relaxation (III-A) and RRL (III-H)	This approach efficiently uses short-horizon forecasts to prevent energy demand failures and reduce costs, surpassing the capabilities of standard safe RL methods.
[12]	Minimize the costs of DGs production and RES curtailment	Constraints of distribution network	Cum/Hard	IPO (III-G)	The generalization of IPO is improved by extracting spatial-temporal features from microgrid operation data, leveraging the advantages of edge-conditioned convolutional networks and long short-term memory networks.
[84]	Multi-energy management	Thermal energy balance	Cum/Hard	Shielding method (III-E)	Decoupling architecture of safety constraint formulations from the RL formulation. Hard-constraint satisfaction without the need to solve a mathematical program.
[85]	Minimize the cost of electricity net, DG and gas	Constraints of the power and gas networks	Ins/Hard	Safety layer (III-F)	By learning a dynamic security assessment rule, a physically-informed safety layer ensures adherence to physical constraints by solving an action correction formulation.
[14]	Minimize the overall operation cost	Branch power flow security constraint	Ins/Soft	Lagrangian relaxation (III-A) and RRL (III-H)	An adversarial safe RL approach is proposed to enhance action safety and robustness against deviations between training and testing environments.
[51]	Minimize the operation cost	Operational constraints	Ins/Hard	Safety layer (III-F), projection (III-B), and RRL (III-H)	A safety layer that blends knowledge and data-driven approaches is created. Also, security constraints and linear projection are combined to improve computational speed.

Cum: Cumulative; Ins: Instantaneous; N.A.: Not applicable or not available.

Optimal power generation dispatch considers various constraints, ranging from simplified versions to security constraints, and includes economic dispatch, DC-OPF, AC-OPF, and SCOPF. The operation of a power system must meet both security and economic requirements. Considering credible contingencies, AC-OPF has been widely used [79, 86]. Most existing methods for solving OPF rely on analytical methods; however, given the inherently large scale of these problems, real-time computation is very challenging. A further variation of OPF is the SCOPF, which requires significantly longer computation times due to the additional security constraints [27]. To accelerate the calculation of SCOPF, methods such as DC-PF approximation [87], convex power flow approximation [88], and convex security constraint approximation [89] have been proposed. However, the accuracy of these methods has been questioned, and they remain time-consuming for large-scale systems. To accelerate computation and achieve better solutions, RL methods have been widely applied. Since traditional RL struggles to handle safety constraints effectively, safe RL has been further applied to address these issues.

The details of the applications of safe RL in optimal power generation dispatch are shown in Table II. Based on Table II, we summarize the foundational framework for implementing safe RL in optimal power generation dispatch using a specific example with SGs, RESs, and BESSs that incorporates strict physics-based constraints such as AC- and DC-PF constraints. If the system encompasses additional power system devices, the presented equations are designed to be readily scalable to accommodate them. Note that the models presented below are examples for illustration, and there are other RL formulations and models for optimal power generation dispatch depending on the specific problem setting. This is also true for other application domains. The state, action, reward, and constraints of optimal power generation dispatch are shown as follows.

IV-A1 AC-PF

AC-PF constraints describe the basic physics of power systems and have been widely considered in optimal power generation dispatch, voltage control, unit commitment, etc.

State

The states include active and reactive loads and voltage:

\bm{s}^{\text{AC}}_{t}\triangleq\left(\bm{v}_{t},\bm{p}^{\text{Load}}_{t},\bm{q}^{\text{Load}}_{t}\right)

(12)

Action

The control actions encompass both active and reactive power generation of SGs, active power generation of RESs, alongside power charging or discharging of BESSs:

\bm{a}^{\text{AC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{q}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right)

(13)

Reward

The reward includes the SG generation cost, the RES curtailment cost, and the BESS cost:


$\displaystyle\max_{\pi_{\theta}\in\Pi_{S}}$	$\displaystyle\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]$	(14a)
$\displaystyle R^{\text{AC}}(\bm{s},\bm{a})$	$\displaystyle=-\left\|\sum_{\forall i\in\mathcal{G}}\left(a^{\text{SG}}_{i}(p^{\text{SG}}_{i,t})^{2}+b^{\text{SG}}_{i}p^{\text{SG}}_{i,t}+c^{\text{SG}}_{i}\right)\right\|$
	$\displaystyle\quad-\sum_{\forall i\in\mathcal{R}}c^{\text{RES}}_{i}\left\|p^{\text{RES}}_{\text{MPPT},i,t}-p^{\text{RES}}_{i,t}\right\|$
	$\displaystyle\quad-\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{dis},i}p^{\text{BESS}}_{\text{dis},i,t}+\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{ch},i}p^{\text{BESS}}_{\text{ch},i,t}$	(14b)
$\displaystyle\bm{s}^{\text{AC}}_{t}$	$\displaystyle=f_{t}(\bm{s}^{\text{AC}}_{t-1},\bm{a}^{\text{AC}}_{t-1}),\quad\bm{a}^{\text{AC}}_{t}\sim\pi(\bm{a}^{\text{AC}}_{t}\mid\bm{s}^{\text{AC}}_{t-1})$	(14c)
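The stage reward (14b) can be computed directly from the dispatched quantities. The following sketch follows the signs of (14b) term by term; the function name, argument layout, and the numerical values in the usage line are hypothetical.

```python
def reward_ac(p_sg, sg_coeffs, p_res, p_res_mppt, c_res, p_dis, c_dis, p_ch, c_ch):
    """Stage reward following the terms and signs of (14b).

    sg_coeffs holds one (a, b, c) cost-coefficient tuple per SG; the
    remaining arguments are per-unit lists of powers and cost factors.
    """
    # Quadratic SG generation cost.
    gen_cost = sum(a * p * p + b * p + c for p, (a, b, c) in zip(p_sg, sg_coeffs))
    # Curtailment: gap between available (MPPT) and dispatched RES power.
    curtailment = sum(k * abs(p_max - p) for k, p_max, p in zip(c_res, p_res_mppt, p_res))
    # BESS terms: discharging enters with a minus sign, charging with a plus sign.
    discharge = sum(k * p for k, p in zip(c_dis, p_dis))
    charge = sum(k * p for k, p in zip(c_ch, p_ch))
    return -abs(gen_cost) - curtailment - discharge + charge


print(reward_ac([1.0], [(1.0, 2.0, 3.0)], [0.5], [1.0], [2.0], [0.5], [1.0], [0.0], [1.0]))
```

With the values above, the generation cost is 6, the curtailment cost is 1, and the discharging cost is 0.5, so the reward is -7.5.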

Constraint

The control actions derived from DRL must adhere to hard physical constraints. AC-PF constraints include bus active and reactive power balance constraints, SG active and reactive power generation constraints, RES active power generation constraints, voltage constraints, and branch apparent power constraints:


	$\displaystyle\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+$
	$\displaystyle\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t}=\Re\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\}$		(15a)
	$\displaystyle\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}-\bm{q}^{\text{Load}}_{t}=\Im\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\}$		(15b)
	$\displaystyle\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}}~~~\underline{\bm{q}}^{\text{SG}}\leq\bm{q}^{\text{SG}}_{t}\leq\overline{\bm{q}}^{\text{SG}}$		(15c)
	$\displaystyle\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}}~~~\underline{\bm{v}}\leq\|{\bm{v}}\|\leq\overline{\bm{v}}~~~\|{s}_{ij}\|\leq\overline{s}_{ij}$		(15d)

where $\mathbf{M}^{\text{SG}}\in\{0,1\}^{N\times G}$, with $G=|\mathcal{G}|$, denotes the binary matrix that maps the generation vector $\bm{p}_{t}^{\text{SG}}\in\mathbb{R}^{|{\cal G}|}$ to the bus space $\mathbb{R}^{N}$:


	$\displaystyle[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=0~{}~{}~{}[\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=0,~{}~{}\forall i\in\mathcal{N}\setminus\mathcal{G}$		(16a)
	$\displaystyle[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=p^{\text{SG}}_{j}~{}~{}~{}[\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=q^{\text{SG}}_{j},~{}~{}\forall i\in\mathcal{G},\forall j\in[G]$		(16b)
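
The mapping in (16) can be illustrated with a minimal Python sketch. The 4-bus layout, the generator locations, the output values, and the helper names `build_incidence` and `map_to_buses` are hypothetical and chosen only for illustration; buses are 0-indexed here.

```python
def build_incidence(num_buses, gen_buses):
    """Return M in {0,1}^(N x G); M[i][j] = 1 if generator j sits at bus i."""
    M = [[0] * len(gen_buses) for _ in range(num_buses)]
    for j, bus in enumerate(gen_buses):
        M[bus][j] = 1
    return M


def map_to_buses(M, values):
    """Compute M @ values: per-bus injections from per-generator outputs."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, values)) for row in M]


# Hypothetical 4-bus system with generators at buses 0 and 2.
M_sg = build_incidence(num_buses=4, gen_buses=[0, 2])
p_sg = [1.5, 0.8]                  # illustrative per-generator active power
p_bus = map_to_buses(M_sg, p_sg)   # zero at buses without a generator, as in (16a)
```

Buses without a generator receive a zero entry, matching (16a), while each generator bus receives the output of the unit located there, matching (16b).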

IV-A2 DC-PF

DC-PF constraints are a linear approximation of the AC-PF constraints and are commonly used in optimal power generation dispatch and electricity market studies.

State

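Voltage magnitudes and reactive power are neglected in DC-PF, so the state consists only of the bus voltage angles $\bm{\vartheta}_{t}$ and the active loads:

\bm{s}^{\text{DC}}_{t}\triangleq\left(\bm{\vartheta}_{t},\bm{p}^{\text{Load}}_{t}\right)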

(17)

Action

The action involves only the generation or consumption of active power.

\bm{a}^{\text{DC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right)

(18)

Reward

The reward is similar to that of the AC-PF formulation (14).

Constraint

The DC-PF constraints are a simplification of the AC-PF constraints, retaining only the active power components and disregarding voltage issues [90].


	$\displaystyle\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+$
	$\displaystyle\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t}=\mathbf{B}\bm{\vartheta}_{t}$		(19a)
	$\displaystyle\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}}~{}~{}~{}\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}}$		(19b)
	$\displaystyle\|{p}_{ij}\|\leq\overline{p}_{ij}$		(19c)
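
As a hedged illustration of (19a) and (19c), the following Python sketch evaluates the right-hand side $\mathbf{B}\bm{\vartheta}_{t}$ and the branch flows for a given angle vector and then lists the overloaded branches. It assumes the usual lossless DC model in which the flow on branch $(i,j)$ is $(\vartheta_{i}-\vartheta_{j})/x_{ij}$ and $\mathbf{B}$ is the corresponding susceptance-weighted Laplacian. The 3-bus network, reactances, angles, and limits are hypothetical values.

```python
def dc_injections_and_flows(num_buses, lines, theta):
    """lines: list of (i, j, x_ij). Returns (B @ theta, {(i, j): p_ij})."""
    injections = [0.0] * num_buses
    flows = {}
    for i, j, x in lines:
        p_ij = (theta[i] - theta[j]) / x  # DC flow on branch i -> j
        flows[(i, j)] = p_ij
        injections[i] += p_ij             # contribution to row i of B @ theta
        injections[j] -= p_ij             # contribution to row j of B @ theta
    return injections, flows


def violated_lines(flows, limits):
    """Branches whose flow magnitude exceeds the limit in (19c)."""
    return [line for line, p in flows.items() if abs(p) > limits[line]]


# Hypothetical 3-bus example (per-unit values, bus 0 as angle reference).
lines = [(0, 1, 0.1), (1, 2, 0.2), (0, 2, 0.2)]
theta = [0.0, -0.05, -0.08]
limits = {(0, 1): 0.6, (1, 2): 0.6, (0, 2): 0.3}

inj, flows = dc_injections_and_flows(3, lines, theta)
overloaded = violated_lines(flows, limits)
```

In a safe RL setting, a check of this kind could serve as the constraint signal returned by the environment; the nodal injections sum to zero because the DC model is lossless.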

IV-A3 BESS Constraints

The BESS constraints comprise the charging and discharging power limits (20a), the SoC bounds (20b), and the SoC dynamics (20c):


	$\displaystyle 0\leq\bm{p}^{\text{BESS}}_{\text{ch},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{ch}}~{}~{}~{}0\leq\bm{p}^{\text{BESS}}_{\text{dis},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{dis}}$		(20a)
	$\displaystyle\underline{\bm{SoC}}^{\text{BESS}}\leq\bm{SoC}^{\text{BESS}}_{t}\leq\overline{\bm{SoC}}^{\text{BESS}}$		(20b)
	$\displaystyle\bm{SoC}^{\text{BESS}}_{t}=\bm{SoC}^{\text{BESS}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{BESS}}}\Big{(}\eta^{\text{BESS}}_{\text{ch}}\bm{p}^{\text{BESS}}_{\text{ch},t}-\frac{\bm{p}^{\text{BESS}}_{\text{dis},t}}{\eta^{\text{BESS}}_{\text{dis}}}\Big{)}$		(20c)
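
The constraints (20a)-(20c) can be sketched in Python for a single battery as follows. The function names, the parameter values, and the unit convention (power in MW, capacity in MWh, $\Delta t$ in hours, SoC as a fraction) are assumptions made for illustration.

```python
def soc_step(soc_prev, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis):
    """SoC update of (20c) for a single battery."""
    return soc_prev + dt / e_cap * (eta_ch * p_ch - p_dis / eta_dis)


def bess_feasible(soc_prev, p_ch, p_dis, *, dt, e_cap, eta_ch, eta_dis,
                  p_ch_max, p_dis_max, soc_min, soc_max):
    """Check the power limits (20a) and the SoC bounds (20b) for an action."""
    power_ok = 0.0 <= p_ch <= p_ch_max and 0.0 <= p_dis <= p_dis_max
    soc_next = soc_step(soc_prev, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis)
    return power_ok and soc_min <= soc_next <= soc_max


# Hypothetical battery: 10 MWh, 5 MW limits, 95% efficiencies, 1 h step.
params = dict(dt=1.0, e_cap=10.0, eta_ch=0.95, eta_dis=0.95,
              p_ch_max=5.0, p_dis_max=5.0, soc_min=0.1, soc_max=0.9)

soc_next = soc_step(0.5, 2.0, 0.0, 1.0, 10.0, 0.95, 0.95)
charge_ok = bess_feasible(0.5, 2.0, 0.0, **params)       # stays within bounds
discharge_ok = bess_feasible(0.15, 0.0, 1.0, **params)   # would drop below soc_min
```

Because the SoC couples consecutive time steps, an action that respects the power limits (20a) can still be infeasible once (20b) is evaluated at the next step, which is what the second call illustrates.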

IV-B Voltage Control

TABLE III: Safe RL Applications in Voltage Control

Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features
[33] | Minimize transmission losses | Voltage and other system constraints | Ins/Hard | Projection layer (III-B) | Through an embedded safe policy projection, it is possible to smoothly and effectively limit the action space, thereby preventing any breach of physical constraints.
[40] | Minimize cost | Voltage constraint | Ins/Hard | Lyapunov stability (III-C) | Each NN controller is designed to satisfy certain Lipschitz constraints by construction, which guarantees that the system remains exponentially stable.
[91] | Minimize transmission loss | Voltage and power flow constraints | Ins/Hard | Finite iteration projection (III-B) | A finite iteration projection algorithm is proposed to guarantee hard constraints by converting a non-convex optimization problem into a finite iteration problem.
[52] | Minimize the cost of network loss and device switching | Voltage and power flow constraints | Cum/Hard | Safety layer (III-F) | A safety layer is added to the policy NN to enhance operational constraint satisfaction in both the initial exploration phase and the convergence phase.
[17] | Minimize total network energy loss | Voltage deviations | Cum/Soft | Primal-dual policy (III-A) | Each zone has a central control agent that embeds GCNs to improve the decision-making capability. The primal-dual method is used to rigorously satisfy voltage safety constraints.
[92] | Minimize active power loss | Voltage violations | Cum/Soft | Lagrangian relaxation (III-A) | A MACSAC RL algorithm is proposed and used to train control agents online, eliminating the need for accurate ADN models.
[47] | Active voltage control | SoC of BESSs | Ins/Hard | Physics-based shielding (III-E) | The physics-shielded MATD3 algorithm is proposed, capable of replacing dangerous actions with safe ones as the BESSs approach dangerous SoC.
[93] | Minimize the ADN power losses and control efforts | Voltage and power grid constraints | Ins/Hard | Safety layer (III-F) | A safety layer is directly integrated on top of the DDPG actor network to forecast changes in constrained states and prevent the violation of operational constraints in ADNs.
[94] | Minimize the network power loss | Nodal voltage constraint | Ins/Hard | Safety projection (III-B) | In the training stage, the safety projection is added to the combined policy to analytically solve an action correction formulation to achieve guaranteed 100% voltage security.
[25] | Minimize the cost of losses and the device switching | Voltage constraint | Ins/Soft | Lagrangian relaxation (III-A) | A safe off-policy DRL, Constrained SAC, is proposed to solve Volt-VAR control problems in a model-free manner.
[95] | Minimize the total control cost | Voltage constraint | Ins/Hard | Safety projection layer (III-B) | By leveraging the underlying grid information, a projection layer is designed to project the reactive power injection into a safe set of nodal voltage magnitudes.
[41] | Minimize the voltage deviation and control cost | Voltage constraint | Ins/Hard | Lyapunov function (III-C) | An explicitly constructed Lyapunov function is utilized to certify stability for all monotone policies without knowledge of the underlying model parameters.
[96] | Minimize the cost of electricity and BESSs maintenance | Voltage constraint and ADN constraints | Cum/Soft | SAC with safety module | A model-free DRL algorithm, integrated with a safety module, is proposed to minimize voltage violations and real power losses, with a design that guarantees no voltage violations occur during the online training.
[35] | Minimize the total operation costs | Physical constraints | Cum/Hard | CPO (III-B) | The voltage control problem is formulated as a CMDP and solved by TRPO and CPO to enable safe exploration.
[16] | Minimize voltage violations and network losses | Voltage bound constraints | Cum/Soft | Penalty function and RRL (III-H) | An adversarial RL algorithm has been developed to train an offline agent that is robust against model mismatches.

Voltage control aims to keep the voltage magnitudes across a power network close to their nominal values or within an acceptable range. For example, Fig. 9 shows the Volt/Var/Watt curves of voltage control [97]. Instead of directly controlling the active and reactive power injections of smart inverters, some researchers have proposed resetting the Volt/Var/Watt curves to control the voltage profiles [98, 99]. The increasing penetration of RESs, such as the large-scale deployment of wind farms in transmission systems and the widespread installation of distributed PVs and EVs in distribution networks, has significantly changed power system behavior. Because distribution networks are typically radial in structure and connect a large number of intermittent and uncertain distributed RESs, voltage management has become more complex and challenging, and voltage violations (either below 0.95 p.u. or above 1.05 p.u.) occur more often [100, 101]. Many existing studies on voltage regulation use physical model-based optimization or control methods and employ convex relaxation techniques, such as second-order cone programming, to simplify the AC-PF constraints so that the problem can be solved efficiently with conventional solvers [33, 25, 102]. The applications of safe RL in voltage control are summarized in Table III. Based on Table III, we take the smart inverters of DGs and BESSs as a representative example to summarize the voltage control problem in the safe RL setting. The state, action, reward, and constraints of voltage control are as follows:

IV-B1 Volt/Var Control with AC-PF Constraints

State

The state variables are represented by PMU measurements, with sensors installed at buses denoted by $\mathcal{N}^{\text{PMU}}$ , or AMI measurements, with sensors installed at buses denoted by $\mathcal{N}^{\text{AMI}}$ . Thus, the state variable $\bm{s}$ is comprehensively defined by:


$\displaystyle\bm{s}^{\text{PMU}}$	$\displaystyle\triangleq\left((v_{i})_{i\in\mathcal{N}^{\text{PMU}}},(i_{i})_{i\in\mathcal{N}^{\text{PMU}}}\right)$	(21a)
$\displaystyle\bm{s}^{\text{AMI}}$	$\displaystyle\triangleq\left(({\|v_{i}\|}^{2})_{i\in\mathcal{N}^{\text{AMI}}},({\|i_{i}\|}^{2})_{i\in\mathcal{N}^{\text{AMI}}},(s_{\text{ap},i})_{i\in\mathcal{N}^{\text{AMI}}}\right)$	(21b)

The system dynamics that depict the environment can be formulated as

\bm{s}^{\text{V}}_{t+1}\triangleq\bm{f}(\bm{s}^{\text{V}}_{t},\bm{a}^{\text{V}}_{t})

(22)

Action

The control actions include regulating the DGs, BESSs, and other components.

\bm{a}^{\text{V}}_{t}\triangleq\left(\bm{p}^{\text{DG}}_{t},\bm{q}^{\text{DG}}_{t},\bm{p}^{\text{BESS}}_{t},\bm{p}^{\text{other}}_{t}\right)

(23)

Reward

The reward penalizes deviations of the voltage magnitudes from the nominal value $v_{\text{ref}}$ (typically 1.0 p.u.):

R^{\text{V}}(\bm{s},\bm{a})=-\|{\bm{v}_{t}-v_{\text{ref}}}\|

(24)

An alternative reward design is a soft penalty that is nonzero only when a voltage leaves the acceptable range $[\underline{v},\overline{v}]$, where $[x]_{+}=\max(x,0)$:

R^{\text{V}}(\bm{s},\bm{a})=-\sum_{i\in\mathcal{N}}\big{(}[{v}_{i}-\overline{v}]_{+}+[\underline{v}-{v}_{i}]_{+}\big{)}

(25)
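
The two reward designs (24) and (25) can be compared with a short Python sketch. The norm in (24) is taken to be the Euclidean norm, which is an assumption since the text does not specify it; the band limits default to the 0.95 and 1.05 p.u. thresholds mentioned above, and the voltage profiles are hypothetical.

```python
import math


def reward_tracking(v, v_ref=1.0):
    """Eq. (24): negative Euclidean distance to the nominal voltage."""
    return -math.sqrt(sum((v_i - v_ref) ** 2 for v_i in v))


def reward_band(v, v_min=0.95, v_max=1.05):
    """Eq. (25): negative sum of band violations; zero inside the band."""
    return -sum(max(v_i - v_max, 0.0) + max(v_min - v_i, 0.0) for v_i in v)


v_inside = [0.97, 1.00, 1.04]   # hypothetical profile within the band
v_outside = [0.93, 1.00, 1.07]  # hypothetical profile with two violations

r_track_inside = reward_tracking(v_inside)   # negative: any deviation is penalized
r_band_inside = reward_band(v_inside)        # zero: no limit is violated
r_band_outside = reward_band(v_outside)      # negative: 0.02 p.u. below and above
```

The tracking reward (24) penalizes every deviation from $v_{\text{ref}}$, whereas the band reward (25) leaves the agent free to choose any profile inside the limits.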

Constraint

The active and reactive power injections of each DG are limited by its apparent power rating:

(\bm{p}^{\text{DG}})^{2}+(\bm{q}^{\text{DG}})^{2}\leq(\bar{\bm{s}}_{\text{ap}}^{\text{DG}})^{2}

(26)

However, [97] points out that the stable operating region is more restrictive than (26). For simplicity, we omit the specific equations. Fig. 9 illustrates the piecewise linear constraints that confine the battery system’s active and reactive power injections to the blue feasible region, whereas solar inverters are restricted to the right half of that region because they can only inject active power, i.e., $p\geq 0$.
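
As a minimal sketch of how the apparent power limit (26) alone could be enforced on a proposed action, the following Python function projects a $(p,q)$ pair onto the feasible disk by radial scaling. It does not model the tighter piecewise linear region of [97]; the function name and the numerical values are hypothetical.

```python
import math


def project_to_rating(p, q, s_max):
    """Euclidean projection of (p, q) onto the disk p^2 + q^2 <= s_max^2."""
    s = math.hypot(p, q)
    if s <= s_max:
        return p, q              # already feasible, left unchanged
    scale = s_max / s            # shrink radially onto the boundary
    return p * scale, q * scale


# Hypothetical inverter with a 2.5 p.u. rating and an infeasible request.
p_safe, q_safe = project_to_rating(3.0, 4.0, 2.5)
```

Radial scaling preserves the power factor of the requested action, which is one reasonable design choice; other projections, for example one that prioritizes active power, are equally possible.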

IV-B2 Volt/Var Control with LinDistFlow Constraints

The LinDistFlow model, a linearized branch flow model, applies to tree-structured distribution networks. The network consists of a node set $\mathcal{N}_{+0}=\{0,1,\cdots,N\}$ and an edge set $\mathcal{E}$. Node 0 is the substation, and $\mathcal{N}=\mathcal{N}_{+0}\setminus\{0\}$ denotes the set of nodes excluding the substation node. Each node $i\in\mathcal{N}$ is associated with an active power injection $p_{i}$ and a reactive power injection $q_{i}$. Let $v_{i}$ be the squared voltage magnitude at node $i$, and let $\bm{p}$, $\bm{q}$, and $\bm{v}$ denote $\{p_{i}\}_{i\in\mathcal{N}}$, $\{q_{i}\}_{i\in\mathcal{N}}$, and $\{v_{i}\}_{i\in\mathcal{N}}$ stacked into vectors. The variables satisfy the following equations, $\forall i\in\mathcal{N}$ ,


$\displaystyle p_{i}$	$\displaystyle=-p_{ji}+\sum_{k:(i,k)\in\mathcal{E}}p_{ik}$	(27a)
$\displaystyle q_{i}$	$\displaystyle=-q_{ji}+\sum_{k:(i,k)\in\mathcal{E}}q_{ik}$	(27b)
$\displaystyle v_{i}$	$\displaystyle=v_{j}-2(r_{ij}p_{ji}+x_{ij}q_{ji})$	(27c)

where $j$ is the parent node of $i$ in the distribution network, and $p_{ji}$ and $q_{ji}$ denote the active and reactive power flows from node $j$ to node $i$. Equation (27c) can be written in vector form:

\bm{v}=\mathbf{r}\bm{p}+\mathbf{x}\bm{q}+v_{0}\mathbf{1}=\mathbf{x}\bm{q}+\bm{v}_{\text{env}}

(28)

where $\bm{v}_{\text{env}}=\mathbf{r}\bm{p}+v_{0}\mathbf{1}$ represents the component that cannot be controlled; $\mathbf{r}=[2r_{ij}]^{N\times N}$ and $\mathbf{x}=[2x_{ij}]^{N\times N}$ are matrices constructed from the line parameters $r_{ij}$ and $x_{ij}$ , respectively.

State

The state under LinDistFlow is also determined by PMU and AMI measurements, as in the AC-PF case (21).

Action

The control action is the reactive power adjustment, which the policy determines from the measured voltages. It is defined by:

\bm{a}^{\text{V}}_{t}=\Delta\bm{q}_{t}\triangleq\bm{q}_{t}-\bm{q}_{t+1}

(29)

The system dynamics are then given by

\bm{v}_{t+1}=\mathbf{r}\bm{p}+\mathbf{x}(\bm{q}_{t}-\bm{a}^{\text{V}}_{t})+v_{0}\mathbf{1}

(30)

where $\bm{p}$ carries no time subscript because the voltage control acts on a fast timescale over which the active power injection is assumed to be constant.
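
The LinDistFlow relations (27) and the transition (29)-(30) can be sketched in Python by sweeping the tree directly instead of forming $\mathbf{r}$ and $\mathbf{x}$. The sketch assumes that nodes are numbered so that every parent has a smaller index than its children; the 3-node feeder, its line parameters, and the injections are hypothetical per-unit values, and $v_{0}=1.0$ is the assumed squared substation voltage.

```python
def lindistflow_voltages(parent, r, x, p, q, v0=1.0):
    """Squared voltage magnitudes from (27) on a radial feeder.

    parent[i] is the parent of node i (parent[0] is unused) with parent[i] < i.
    r[i], x[i] describe the line (parent[i], i); p[i], q[i] are nodal
    injections. Entries at index 0 are ignored.
    """
    n = len(parent)
    p_in = [0.0] * n  # p_ji: active flow from parent j into node i
    q_in = [0.0] * n  # q_ji: reactive flow from parent j into node i
    # Leaves-to-root sweep: (27a)-(27b) rearranged as p_ji = sum_k p_ik - p_i.
    for i in range(n - 1, 0, -1):
        p_in[i] -= p[i]
        q_in[i] -= q[i]
        p_in[parent[i]] += p_in[i]
        q_in[parent[i]] += q_in[i]
    # Root-to-leaves sweep: voltage drop (27c).
    v = [v0] * n
    for i in range(1, n):
        v[i] = v[parent[i]] - 2.0 * (r[i] * p_in[i] + x[i] * q_in[i])
    return v


def step(q, action, parent, r, x, p, v0=1.0):
    """One transition of (29)-(30): q_{t+1} = q_t - a_t, then recompute v."""
    q_next = [q_i - a_i for q_i, a_i in zip(q, action)]
    return q_next, lindistflow_voltages(parent, r, x, p, q_next, v0)


# Hypothetical line feeder 0 - 1 - 2 with loads at nodes 1 and 2.
parent = [0, 0, 1]
r = [0.0, 0.01, 0.02]
x = [0.0, 0.02, 0.04]
p = [0.0, -0.5, -0.5]   # negative injection = load; constant, as assumed in (30)
q = [0.0, 0.0, 0.0]

v_before = lindistflow_voltages(parent, r, x, p, q)
# Action a = q_t - q_{t+1}: a negative entry raises the reactive injection.
q_next, v_after = step(q, [0.0, 0.0, -0.25], parent, r, x, p)
```

In this example, injecting reactive power at the end of the feeder raises the squared voltages at both downstream nodes, which is the effect captured by the term $\mathbf{x}\bm{q}$ in (28).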

Reward

The reward is also designed to keep the voltage close to its nominal value (24) or within its maximum and minimum limits (25).

Constraint

The constraints include maximum and minimum value limits and the stability of the action:


	$\displaystyle~{}\underline{\bm{a}}^{\text{V}}\leq\bm{a}^{\text{V}}_{t}\leq\overline{\bm{a}}^{\text{V}}$		(31a)
	$\displaystyle\bm{a}^{\text{V}}_{t}~{}\text{is stabilizing}$		(31b)

IV-B3 Safe RL for Voltage Control

In recent years, the integration of DERs such as rooftop solar panels and EVs has led to rapid and unpredictable fluctuations in the generation and load profiles of distribution systems, which pose significant challenges for real-time voltage control in distribution grids. RL has emerged as a powerful approach to model-free nonlinear control problems, generating considerable interest in RL-based controllers that optimize the transient performance of voltage control. Safe RL has been effectively applied to ensure adherence to voltage and transient stability constraints.

Looking ahead, the focus is shifting toward distributed voltage regulation. Centralized voltage regulation requires a central controller and is susceptible to single-point failures and heavy communication burdens. Distributed voltage regulation, which requires only the exchange of local information with neighboring units, has therefore attracted considerable research interest as a promising direction for future development [17].

IV-C Stability Control

TABLE IV: Safe RL Applications in Stability Control

Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features
[56] | Preventive control for transmission overload relief | Safety, generation, and network constraints | Cum/Hard | IPO (III-G) | The IPO method’s efficacy is boosted by leveraging spatial-temporal correlations in power grid nodal and edge features.
[57] | Emergency control for under voltage load-shedding | Transient voltage stability | Cum/Hard | Barrier function (III-G) | The safe RL method employs a reward function with a time-dependent barrier function that approaches negative infinity as the system state nears the safety bounds.
[103] | Emergency load-shedding control | Rated capacity, current, voltage and others | Cum/Soft | Lagrangian relaxation (III-A) | Two DRL strategies are designed to tackle intricate power system control challenges in a data-driven manner, aiming to preserve power system stability.
[104] | Transient and steady-state voltage control | Reactive power capacity constraints | Ins/Hard | Lagrangian relaxation (III-A) and barrier function (III-G) | Based on the safe gradient flow framework, the design employs a control barrier function to ensure that given dynamics never leave a safe set.
[105] | Frequency control | Operational constraints | Cum/Soft | Safety model (III-F) | A safety model is proposed comprising two parts: one to check if actions meet safety standards, and another to suggest new actions if they do not.
[106] | Minimize the control cost | Frequency limit | Cum/Hard | Barrier function (III-G) | A novel self-tuning control barrier function is designed to actively compensate the unsafe frequency control strategies under variational safety constraints.
[107] | Primary frequency control | Frequency constraint | Ins/Hard | Gauge map (III-F) | A closed-form gauge map is proposed, which maps NN outputs from unsafe actions to the set of safe actions.