A Review of Safe Reinforcement Learning Methods for Modern Power Systems

Tong Su, Tong Wu, Junbo Zhao, Anna Scaglione, Le Xie
This work is supported by the U.S. Department of Energy Solar Energy Technologies Office under award 37770. Tong Su and Junbo Zhao are with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269, USA (e-mail: tongsu@uconn.edu; junbo@uconn.edu). Tong Wu and Anna Scaglione are with the Department of Electrical and Computer Engineering, Cornell Tech, Cornell University, New York City, NY 10044, USA (e-mail: tw385@cornell.edu; as337@cornell.edu). Le Xie is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA (e-mail: le.xie@tamu.edu).
Abstract

Due to the availability of more comprehensive measurement data in modern power systems, there has been significant interest in developing and applying reinforcement learning (RL) methods for operation and control. Conventional RL training is based on trial-and-error and reward feedback interaction with either a model-based simulated environment or a data-driven and model-free simulation environment. These methods often lead to the exploration of actions in unsafe regions of operation and, after training, the execution of unsafe actions when the RL policies are deployed in real power systems. A large body of literature has proposed safe RL strategies to prevent unsafe training policies. In power systems, safe RL represents a class of RL algorithms that can ensure or promote the safety of power system operations by executing safe actions while optimizing the objective function. While different papers handle the safety constraints differently, the overarching goal of safe RL methods is to determine how to train policies to satisfy safety constraints while maximizing rewards. This paper provides a comprehensive review of safe RL techniques and their applications in different power system operations and control, including optimal power generation dispatch, voltage control, stability control, electric vehicle (EV) charging control, buildings’ energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Additionally, the paper discusses benchmarks, challenges, and future directions for safe RL research in power systems.

Index Terms:
Safe reinforcement learning, machine learning, power system operation, power system control, energy management, optimal power generation dispatch, EV charging, voltage control.

Nomenclature

Notations

$\gamma$

Discount factor $\gamma\in[0,1)$

$\Delta$

Difference operator

$\delta$

Rotor angle

$\epsilon/A$

Inertia parameter of temperature and thermal conductivity of HVAC

$\varepsilon$

Safety constraint bound

$\zeta$

Safety probability ($1-\zeta$ is the risk probability)

$\eta,\eta^{\text{CHP}}_{p/h}$

Efficiency of charging or discharging, electrical/thermal energy efficiency of CHP

$\theta$

Parameters of the policy $\pi_{\theta}$

$\vartheta$

Grid state in the DC-PF approximation

$\bm{\Lambda}^{\text{EV}}_{\text{ch/dis}}$

Charging/selling electricity price of EV

$\bm{\Lambda}^{\text{Ele/Gas/Car}}$

Price of electricity/gas/carbon

$\lambda$

Penalty coefficient or Lagrange multiplier

$\Pi_{S}$

Policy set

$\pi_{\theta}$, $\pi_{\theta}^{\text{adv}}$

Parameterized policy, policy of adversary

$\pi_{\theta}^{k}$, $\pi_{\theta}^{k+\frac{1}{2}}$

Policy at iteration $k$, intermediate policy between iterations $k$ and $k+1$

$\rho_{0}$

$\rho_{0}:\mathcal{S}\rightarrow[0,1]$ is the starting state distribution over $\mathcal{S}$

$\tau$

Trajectory $\tau=(s_{0},a_{0},s_{1},\ldots)$

$\bm{\omega}$

Frequency

$\mathcal{A},\bm{a}$

Action set, action

$a^{\text{SG}}/b^{\text{SG}}/c^{\text{SG}}$

Fuel cost coefficients of SG

$\mathcal{B}/\mathcal{G}/\mathcal{R}$

BESS/SG/RES set

$\mathcal{C},C$

Constraint set $\mathcal{C}=\{(C_{i},\varepsilon_{i})\}^{m}_{i=1}$, constraint cost function $C:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$

$c^{\text{RES/BESS}}$

Cost coefficients of RES/BESS

$\text{ch}/\text{dis}$

Charging/discharging of electrical or thermal energy for ESS

$\mathbb{D}$

Function to extract the vector of diagonal elements from a matrix

$M,L,\frac{1}{R},D$

Inertia constant, load damping coefficient, speed droop response coefficient, $D=\frac{1}{R}+L$ is the combined frequency response coefficient from synchronous generators and load

$\mathbb{E},E,E_{\text{cap}}$

Expectation function, energy associated with devices, energy capacity of ESS

$\mathcal{E}/\mathcal{N}$

Edge/node set

$f,g,h$

State transition dynamics or the model of the environment, equality constraints with a total number of $m$, inequality constraints with a total number of $n$

$G/N$

Cardinality of the set $\mathcal{G}/\mathcal{N}$

$\bm{g}$

Gas input of CHP or GB

$\mathcal{H}/*$

Hermitian/conjugate for a vector or matrix

$\bm{h}$

Thermal energy generation or load vector

$\bm{i}$

Current phasor vector

$\mathcal{J}_{R}^{\pi_{\theta}}$, $\mathcal{J}_{h_{i}}^{\pi_{\theta}}$

Reward performance, constraint cost performance of inequality constraints

$\mathcal{L}$

Lagrangian

$\mathcal{M}$, $\mathcal{M}_{C}$

MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$, CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$

$\mathbb{P},\mathcal{P}$

Probability function, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the transition matrix, where $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ denotes the probability of state transition from $s_{t}$ to $s_{t+1}$ after taking action $a_{t}$

$P^{\text{Load}}_{\text{his/pre}}$

Historical/current net load forecast

$P_{\text{res}}$

Reserve requirement

$\bm{p}/\bm{q}$

Active/reactive power generation or load vector

$\overline{\bm{p}}^{\text{Gen}}_{e}$

Maximum emergency power generation of generator

$\bm{p}^{\text{Bus}}$

Bus power injection

$p_{ij}/q_{ij}/s_{ij}$

Active/reactive/apparent power for branch $ij$

$\bm{p}_{e}/\bm{p}_{m}$

Electrical/mechanical power

$R$

Reward function $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$

$\bm{R}_{\text{up/down}}$

Ramp-up/down rate of generators

$r_{ij}/x_{ij}$

Resistance/reactance of line $ij$

$\mathcal{S},\bm{s}_{\text{ap}},\bm{s}$

State set, apparent power vector, state

$\bm{S}_{\text{up/down}}$

Start-up/shut-down rate of generators

$\mathcal{T},t$

Time step set of trajectory $\tau$, time instant

$\overline{t}_{\text{up}}/\underline{t}_{\text{up}},t_{\text{tot}}$

Maximum/minimum up time of Gens, total time

$T,H,T^{I/O}$

Temperature, humidity, indoor/outdoor temperature

$\bm{u}_{\text{start/shut/com}}$

Startup/shutdown/commitment status of Gens

$\bm{v}/\bm{\phi}$

Voltage phasor/phase vector $\bm{v}_{t}=|\bm{v}|\odot e^{\mathfrak{j}\bm{\phi}}$

$\mathbf{Y}/\mathbf{B}$

Admittance/susceptance matrix

$\overline{\ }/\underline{\ }$

Maximum/minimum values of the variable or vector

Abbreviations

AC/DC

Alternating current/direct current

ADN

Active Distribution Network

AMI

Advanced Metering Infrastructure

(B/M/T)ESS

(Battery/Mobile/Thermal) Energy Storage System

CHP

Combined Heat and Power system

(C)MDP

(Constrained) Markov Decision Process

CPO

Constrained Policy Optimization

CPPO

Constraint-controlled PPO

CS

Charging Station

CUP

Conservative Update Policy

DDPG

Deep Deterministic Policy Gradient

DG

Distributed Generation

DER

Distributed Energy Resource

(D/R)NN

(Deep/Recurrent) Neural Network

DSO

Distribution System Operator

(D/R)RL

(Deep/Robust) Reinforcement Learning

EHP

Electric Heat Pump

EV

Electric Vehicle

FACTS

Flexible AC Transmission System

FOCOPS

First Order Constrained Optimization in Policy Space

GCN

Graph Convolutional Network

GB

Gas Boiler

Gen

Generator

GP

Gaussian Process

GPT

Generative Pre-trained Transformer

HVAC

Heating, Ventilation and Air-Conditioning

ICNN

Input Convex Neural Network

IPO

Interior-point Policy Optimization

Lag

Lagrangian methods

LLM

Large Language Model

MA(C)

Multi-Agent (Constrained)

MIP

Mixed-Integer Programming

MPPT

Maximum Power Point Tracking

PCPO

Projection-based Constrained Policy Optimization

PDO

Primal-Dual Optimization

PILCO

Probabilistic Inference for Learning Control

PMU

Phasor Measurement Unit

PPO

Proximal Policy Optimization

p.u.

per unit

RES

Renewable Energy Source

RCPO

Reward Constrained Policy Optimization

SAC

Soft Actor-Critic

SafePO

Safe Policy Optimization

(SC)(O)PF

(Security Constrained) (Optimal) Power Flow

SG

Synchronous Generator

SoC

State of Charge

TD3

Twin-Delayed Deep Deterministic policy gradient

TL

Thermal Load (such as room heater and water heater)

TR(PO/M)

Trust Region (Policy Optimization/Method)

V2G

Vehicle-to-Grid

V, F

Voltage, Frequency

I Introduction

With the extensive integration of RESs, ESSs, and advanced power electronic devices, modern power systems face increased uncertainty and complexity, which translates into a higher computational burden when modeling the stochastic, non-linear nature of the control and decision problems. However, thanks to the widespread deployment of smart sensors, such as PMUs, along with advanced communication technologies, a vast amount of power system data can be measured and utilized for state estimation and control. As a result, data-driven approaches like RL have emerged as key candidates for the numerical optimization of power system decision and/or control policies [1], which would otherwise be intractable to derive. Conventionally, RL training is based on trial-and-error and reward feedback interaction with a model-based simulated environment [2] or a data-driven model-free simulated environment [3]. Recently, DRL, which embeds NNs as the policy function, has proven expressive enough to solve complicated control tasks. The NN approach also reduces the computation cost of online implementation: once the NNs are trained, they approximate closed-form solutions and produce results quickly. However, nothing prevents the exploration of unsafe ranges during training or the execution of unsafe actions when the trained policies are deployed in real power systems. Therefore, the practical application of RL policies cannot be based on vanilla RL training [4].

In 2015, safe RL was first defined as “the process of learning policies that maximize the expectation of the reward in problems, where it is crucial to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes” [5]. The safe RL literature has since received increasing attention. The methods can be coarsely divided into two categories: the first adds to the reward function a safety factor that penalizes safety violations, and the second modifies the exploration process in the training phase by incorporating mechanisms that yield safe policies [5]. Based on these two approaches, numerous safe RL methods have been proposed, and many have been applied and tailored to power system decision and control problems, such as energy management, optimal power generation dispatch, EV charging, voltage control, and others that this paper covers in Section IV.

Reference [6] is currently the only paper that provides an overview of safe RL applications. However, the field is evolving fast, and we aim to provide first a comprehensive review of safe RL techniques in general, and then a deep dive into their applications in power systems. The main contributions of the paper are as follows:

  1. This paper provides a comprehensive review of safe RL, covering its fundamental concepts, constraint classifications, existing algorithms, and benchmarks. It details the unique features and limitations of each RL algorithm, providing a foundation for future research in the domain of safe RL.

  2. A comprehensive review of the application of safe RL in power systems follows, covering almost all existing papers in this area. It categorizes these papers by application domain, listing each paper’s objectives, constraints, implemented safe RL techniques, environment types, and key features.

  3. We explore the key challenges and future research opportunities for safe RL applications in power systems.

The framework of this paper is shown in Fig. 1. The rest of the paper is organized as follows. Section II introduces the CMDP and constraints. Section III provides a detailed introduction and classification of safe RL. Section IV offers a comprehensive review and comparative analysis of safe RL applications in different fields within the power system. Challenges and outlook are discussed in Section V and finally, Section VI concludes the paper.

Figure 1: The framework of safe RL in power system application.

II Constrained Markov Decision Process

II-A Problem formulation

An MDP is defined by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$ whose elements are, respectively, the state space, the action space, the transition probability, the reward function, the initial state distribution $\rho_{0}$ over $\mathcal{S}$, and the discount factor. When the decision problem fits an MDP, the objective is to determine the policy $\pi$ that maximizes the expected discounted reward $\mathcal{J}_{R}^{\pi_{\theta}}$, i.e. [4, 7, 8]:

$$
\mathcal{J}_{R}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right] \tag{1}
$$

where $\tau\sim\pi$ indicates that the distribution over trajectories depends on the policy $\pi$; similarly, $\bm{s}_{0}\sim\rho_{0}$, $\bm{a}_{t}\sim\pi(\cdot|\bm{s}_{t})$, and $\bm{s}_{t+1}\sim\mathcal{P}(\cdot|\bm{s}_{t},\bm{a}_{t})$. Even if the transition probabilities and reward function are fully known, solving this problem exactly is often intractable. The usual approach is therefore to learn the policy using some parametrization.
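As a minimal sketch of how the expectation in (1) is usually approximated in practice, the snippet below averages discounted returns over a handful of sampled finite trajectories. The function names and the reward lists are illustrative assumptions, not part of the reviewed methods, and truncating the infinite sum to a finite horizon is itself an approximation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards along one finite (truncated) trajectory."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_performance(trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J_R: mean discounted return over sampled trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# Hypothetical reward sequences collected by rolling out a policy twice.
sampled_rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
j_r_estimate = estimate_performance(sampled_rewards, gamma=0.5)
```

With more sampled trajectories the average approaches the expectation in (1); policy-gradient methods differentiate an estimate of this kind with respect to the policy parameters.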

The CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A}_{t},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$ extends the standard MDP to address a frequent model variation: the case in which the action space $\mathcal{A}_{t}$ is a function of the state, i.e. $\bm{s}_{t}\mapsto\mathcal{A}_{t}$, either because changes in the environment affect which actions are safe or feasible, or because the action has a state-dependent cost that the formulation requires to stay below a threshold. This occurs in physical systems in which the boundary conditions, the state, and the laws of physics determine what is feasible, what would lead to unsafe operation, and how expensive a given agent action is. In a nutshell, what differentiates the various instances of CMDPs from a conventional MDP is the class of constraints that characterize the action space as a function of the system dynamics, together with the specific engineering problem and context that define those constraints. In this review, we define the CMDP for power system problems as:

$$
\begin{aligned}
\max_{\pi_{\theta}\in\Pi_{S}}\ & \mathcal{J}_{R}^{\pi_{\theta}} \\
\text{s.t.}\ & \bm{a}_{t}\text{ is feasible}
\end{aligned} \tag{2}
$$

where "$\bm{a}_{t}$ is feasible" means not only that $\bm{a}_{t}$ is constrained within its upper and lower limits, but also that the resulting $\bm{s}_{t}$ falls within specified feasible sets. In power systems, the upper and lower bounds on $\bm{a}_{t}$ correspond to the control ranges of controllable devices, such as the power output of SGs, RESs, and ESSs, or the temperature setpoint of HVAC systems; they can typically be enforced by simply restricting the action space of the RL agent. Requiring $\bm{s}_{t}$ to fall within specified feasible sets means that the state adheres to safe and stable operation constraints, such as boundary constraints on voltages, line flows, and building temperatures, as well as stability constraints on voltages, frequency, and rotor angles. Because power systems are highly non-linear and non-convex, obtaining a feasible $\bm{a}_{t}$ that guarantees a feasible $\bm{s}_{t}$ is challenging. This is also the main challenge in training safe RL.
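The two kinds of feasibility discussed above can be contrasted with a small sketch. Box limits on the action are enforced by clipping before the action is applied, whereas state feasibility can only be checked on the state that results from it. The device names, MW limits, and the 0.95 to 1.05 p.u. voltage band are illustrative assumptions.

```python
from typing import Dict, Sequence, Tuple


def clip_action(action: Dict[str, float],
                limits: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Project each action component onto its [lower, upper] device limit."""
    clipped = {}
    for name, value in action.items():
        lower, upper = limits[name]
        clipped[name] = min(max(value, lower), upper)
    return clipped


def state_is_feasible(voltages_pu: Sequence[float],
                      v_min: float = 0.95, v_max: float = 1.05) -> bool:
    """Check that every bus voltage magnitude lies inside the assumed safe band."""
    return all(v_min <= v <= v_max for v in voltages_pu)


# Hypothetical device limits in MW (illustrative values only).
limits = {"sg_mw": (20.0, 100.0), "bess_mw": (-25.0, 25.0)}
safe_action = clip_action({"sg_mw": 120.0, "bess_mw": -30.0}, limits)
```

Clipping guarantees the action bounds by construction, but it says nothing about the voltages that result from the clipped action; those depend on the non-linear power flow and must be evaluated separately, which is where the difficulty described above arises.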

II-B Constraints

II-B1 Instantaneous Constraints

Instantaneous constraints are prevalent in power systems. For instance, in the optimal power generation dispatch of power systems, we encounter constraints such as power flow, dynamic limitations associated with BESSs, voltage magnitude bounds, and power generation limits, as detailed in Section IV-A. Another instance is voltage control, which incorporates additional voltage droop control dynamics and stability constraints, described in Section IV-B. We also explore other examples such as stability control, EV charging control, and building energy management in Section IV. In general, these constraints can be expressed as follows:

$$
\begin{aligned}
\max_{\pi_{\theta}\in\Pi_{S}}\ & \mathcal{J}_{R}^{\pi_{\theta}} \\
\text{s.t.}\ & g_{j}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})=0,~~j=1,\cdots,m \\
& h_{k}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq 0,~~k=1,\cdots,n
\end{aligned} \tag{3}
$$

where the control action must satisfy both the $m$ equality constraints and the $n$ inequality constraints. We include $\bm{s}_{t}$ and $\bm{s}_{t+1}$ in these constraints to represent the time-varying bounds on $\bm{a}_{t}$. The dynamical constraints are also captured by these constraints.
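To make the roles of $g_{j}$ and $h_{k}$ in (3) concrete, the following sketch evaluates them on a single transition of a toy battery model. The model itself is an assumption made for illustration: the state is the stored energy, the action is a charging power, the equality constraint is the energy balance, and the inequality constraints keep the next state between zero and the capacity. The efficiency, time step, and capacity values are hypothetical.

```python
from typing import List, Tuple


def bess_constraints(e_t: float, p_ch: float, e_next: float,
                     eta: float = 0.9, dt: float = 1.0,
                     e_cap: float = 10.0) -> Tuple[List[float], List[float]]:
    """Return residuals (g, h) for one transition (s_t, a_t, s_{t+1}).

    g must equal 0 (energy balance); every entry of h must be <= 0.
    """
    g = [e_next - e_t - eta * p_ch * dt]   # dynamics written as an equality
    h = [e_next - e_cap, -e_next]          # 0 <= e_next <= e_cap
    return g, h


def is_feasible(g: List[float], h: List[float], tol: float = 1e-6) -> bool:
    """A transition is feasible if all residuals hold within a small tolerance."""
    return all(abs(v) <= tol for v in g) and all(v <= tol for v in h)


ok = is_feasible(*bess_constraints(e_t=5.0, p_ch=2.0, e_next=6.8))
overcharged = is_feasible(*bess_constraints(e_t=9.0, p_ch=2.0, e_next=10.8))
```

Here the second transition obeys the dynamics but exceeds the capacity, so it violates an inequality constraint at that single time step, which is exactly what an instantaneous constraint forbids.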

II-B2 Cumulative Constraints

Cumulative constraints mandate that the sum or average of a specific cost signal remains within prescribed limits, calculated from the beginning of an event to the present time. Examples include total revenue and network throughput. These constraints are commonly applied in robot locomotion and manipulation, as discussed in [9]. Although several studies have attempted to adapt these constraints to power systems as a more flexible alternative to hard constraints, the application remains limited. For instance, [10] employs a discounted cumulative formulation in (4) to establish safety constraints in the management of distribution networks. In particular, they relax instantaneous constraints, such as voltage bounds, SoC bounds, and power quality, to a discounted cumulative formulation. Similarly, [11, 12] also utilize this approach. However, such constraints may not fully capture all safety requirements, though they do offer a partial enhancement of safety measures, providing some benefit over no constraints at all. The reason these studies do not consider instantaneous constraints is that cumulative relaxation offers a straightforward method to adapt constrained RL techniques, originally developed for robot locomotion and manipulation, to power systems. This approach not only simplifies implementation but also provides methodological insights that could potentially be extended to handle instantaneous constraints in future research.

To make the review more self-contained, we will review three kinds of cumulative constraints. In [13], the constraints for safe RL are divided into cumulative constraints and instantaneous constraints. For cumulative constraints, they are further categorized as discounted cumulative constraints (4), mean valued constraints (5), and probabilistic constraints (6). The discounted cumulative constraint is of the form:

$$
\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]\leq\varepsilon_{i} \tag{4}
$$

where εいぷしろんisubscript𝜀𝑖\varepsilon_{i}italic_εいぷしろん start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the limit for each cumulative constraint.

The mean valued constraint is of the form:

$$\mathcal{J}_{h_i}^{\pi_\theta} = \mathbb{E}_{\tau\sim\pi}\left[\frac{1}{t_{\text{tot}}}\sum_{t=0}^{t_{\text{tot}}-1} h_i(\bm{s}_t,\bm{a}_t,\bm{s}_{t+1})\right] \leq \varepsilon_i \tag{5}$$

where $t_{\text{tot}}$ is the total number of time steps in each trajectory.

The second group concerns the probability that the cumulative costs violate a constraint [13]. Probabilistic constraints are of the form:

$$\mathcal{J}_{h_i}^{\pi_\theta} = \mathbb{P}\left[\sum_{t} h_i(\bm{s}_t,\bm{a}_t,\bm{s}_{t+1}) \leq \varepsilon_i\right] \geq \zeta \tag{6}$$

where $\varepsilon_i$ is the cumulative cost threshold for each trajectory and $\zeta \in (0,1)$ is the probability limit.
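The three formulations (4)-(6) can be estimated from sampled trajectories. The following is a minimal sketch under stated assumptions: the per-step cost values, the discount factor, and the limits are hypothetical, the infinite sum in (4) is truncated at the trajectory length, and the expectation and probability are replaced by sample averages.

```python
from typing import Iterable, Sequence


def discounted_cost(costs: Sequence[float], gamma: float) -> float:
    """Discounted sum of per-step costs for one trajectory (inner term of (4))."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def mean_cost(costs: Sequence[float]) -> float:
    """Average per-step cost for one trajectory (inner term of (5))."""
    return sum(costs) / len(costs)


def satisfaction_rate(trajectories: Sequence[Sequence[float]], eps: float) -> float:
    """Fraction of trajectories whose total cost stays within eps (left side of (6))."""
    safe = sum(1 for costs in trajectories if sum(costs) <= eps)
    return safe / len(trajectories)


def average(values: Iterable[float]) -> float:
    """Sample average used in place of the expectation over trajectories."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical per-step cost signals h_i from three sampled trajectories.
trajectories = [
    [0.0, 0.2, 0.0, 0.1],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
gamma, eps, zeta = 0.9, 0.5, 0.6  # assumed discount factor and limits

j_discounted = average(discounted_cost(c, gamma) for c in trajectories)
j_mean = average(mean_cost(c) for c in trajectories)
p_safe = satisfaction_rate(trajectories, eps)

print(f"(4) discounted estimate {j_discounted:.4f}, satisfied: {j_discounted <= eps}")
print(f"(5) mean-valued estimate {j_mean:.4f}, satisfied: {j_mean <= eps}")
print(f"(6) satisfaction rate {p_safe:.4f}, satisfied: {p_safe >= zeta}")
```

In this toy data set the second trajectory violates the per-trajectory threshold, yet both expectation-based constraints hold, which illustrates why cumulative formulations only partially capture safety.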

Here, it is important to emphasize again that in power systems, the majority of constraints must be satisfied at every instant, thus they are commonly implemented as instantaneous constraints. For example, [14] utilizes the expected discounted reward, whereas constraints related to branch power flow and security operations are treated as instantaneous constraints.

II-C Constraints in Power Systems: Overview

In power system applications, the classification of constraints into instantaneous and cumulative constraints is related to the required degree of constraint satisfaction and the safe RL algorithms used. Typically, bus balance equations, upper and lower power limits of various equipment, ESS capacity constraints, certain voltage amplitude constraints, and some stability constraints are considered hard constraints. Safe RL algorithms capable of ensuring the satisfaction of hard constraints include the projection method (Section III-B), the Lyapunov method (Section III-C), the shielding method (Section III-E), the safety layer method (Section III-F), and the barrier function method (Section III-G). For example, [15] uses the logarithmic barrier function to make $\mathcal{J}_{h_i}^{\pi_\theta}$ approach infinity when voltage exceeds bounds, thereby satisfying hard voltage constraints. Due to discrepancies between models and real systems, various uncertainties of RESs and loads, and algorithmic shortcomings, even if constraints are theoretically satisfied, they may not be guaranteed in actual deployment. Therefore, GP methods (Section III-D) and RRL (Section III-H) have been proposed, using the probabilistic/chance constraint (6). However, their application in power systems remains underexplored. A more common approach is to use constrained game-theoretic RL within RRL [14, 16]. Furthermore, by design some safe RL algorithms can only encourage constraint satisfaction while maximizing rewards. Such algorithms include Lagrangian relaxation (Section III-A) and penalty functions.
For example, [17] uses the voltage constraint metric $\mathcal{J}_{h_i}^{\pi_\theta} = \sum_{i\in\mathcal{N}} \max\left\{|\bm{v}_{i,t} - 1| - 0.05,\, 0\right\}$ and employs Lagrangian relaxation for voltage control, which cannot guarantee absolute adherence to voltage constraints, thus classifying it as a soft constraint. For some constraints, instead, such as user satisfaction with EV charging and voltage control at certain nodes, the goal is to approach standard values as closely as possible, making them inherently soft constraints. The illustrations of different constraints of safe RL are shown in Fig. 2.

Figure 2: Illustrations of different constraints of safe RL. (a): Cumulative constraints (4)-(5). (b): Probabilistic constraints (6). (c): Instantaneous constraints and hard constraints. (d): Soft constraint, where the final $\pi_\theta$ may be either safe or unsafe.
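The voltage constraint metric attributed to [17] in Section II-C sums, over all buses, the amount by which the per-unit voltage magnitude leaves the band of 0.05 p.u. around 1 p.u. A minimal sketch follows; the function name and the sample voltages are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def voltage_violation(v_pu: Sequence[float], nominal: float = 1.0, band: float = 0.05) -> float:
    """Sum over buses of max(|v_i - nominal| - band, 0), with voltages in per unit."""
    return sum(max(abs(v - nominal) - band, 0.0) for v in v_pu)


# Hypothetical bus voltages at one time step: two buses lie outside the band.
voltages = [1.00, 1.04, 1.07, 0.93]
print(voltage_violation(voltages))
```

The metric is zero whenever every bus lies strictly inside the band, so it can serve directly as the cost signal penalized by a Lagrange multiplier.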

III Safe Reinforcement Learning

Safe RL is often formulated as a CMDP problem, where the objective is to maximize the reward of agents while ensuring that the agents satisfy safety constraints [18, 4]. Safe RL is categorized into different types from various perspectives. This section primarily categorizes these types based on the techniques used to ensure constraint satisfaction and provides detailed introductions of the techniques and benchmarks.

III-A Lagrangian Relaxation / Primal-Dual Method

Lagrangian relaxation, also known as the primal-dual method, is the most common technique in safe RL. The key idea of this method is to transform the CMDP problem into an unconstrained dual problem. This is achieved by employing adaptive Lagrange multipliers to penalize constraints [19]:

\begin{align}
\textbf{Instantaneous:}\quad & \min_{\lambda_i \geq 0} \max_{\theta} \mathcal{L}(\lambda_i, \theta) = \min_{\lambda_i \geq 0} \max_{\theta} \left[ J_R^{\pi_\theta} - \sum_i \lambda_i \cdot h_i \right] \tag{7a} \\
\textbf{Cumulative:}\quad & \min_{\lambda_i \geq 0} \max_{\theta} \mathcal{L}(\lambda_i, \theta) = \min_{\lambda_i \geq 0} \max_{\theta} \left[ J_R^{\pi_\theta} - \sum_i \lambda_i \cdot \left( J_{h_i}^{\pi_\theta} - \varepsilon_i \right) \right] \tag{7b}
\end{align}

The solution of (7) relies on Danskin’s theorem and convex analysis [20]. Due to its straightforward implementation and compatibility with both on-policy and off-policy methods, Lagrangian relaxation has been integrated with other RL algorithms, fostering the creation of numerous variants, such as DDPG-Lag, PPO-Lag, TRPO-Lag, TD3-Lag, SAC-Lag, MAPPO, RCPO, PDO, TRPO-PID, CPPO-PID, DDPG-PID, TD3-PID, SAC-PID [21, 22, 19, 23].
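The alternating structure of (7b) can be illustrated on a scalar toy problem. The sketch below is not taken from any of the cited algorithms: it assumes a hypothetical reward objective $J_R(\theta) = -(\theta - 2)^2$, a hypothetical cost $J_h(\theta) = \theta$ with limit $\varepsilon = 1$, and plain gradient steps with the multiplier update $\lambda \leftarrow \max(0, \lambda + \beta (J_h - \varepsilon))$.

```python
from typing import Tuple


def primal_dual(steps: int = 5000, lr_theta: float = 0.05, lr_lambda: float = 0.05) -> Tuple[float, float]:
    """Gradient ascent on theta and projected gradient descent on lambda for a toy form of (7b)."""
    theta, lam, eps = 0.0, 0.0, 1.0
    for _ in range(steps):
        # Gradient of L = J_R - lam * (J_h - eps) with respect to theta,
        # for the assumed J_R(theta) = -(theta - 2)^2 and J_h(theta) = theta.
        grad_theta = -2.0 * (theta - 2.0) - lam
        theta += lr_theta * grad_theta  # primal step: maximize L over theta
        # Dual step: minimizing L over lam >= 0 raises lam while the cost exceeds eps.
        lam = max(0.0, lam + lr_lambda * (theta - eps))
    return theta, lam


theta_star, lambda_star = primal_dual()
print(f"theta = {theta_star:.4f}, lambda = {lambda_star:.4f}")
```

The unconstrained maximizer is $\theta = 2$, which violates the limit; the multiplier grows until the iterate settles on the constraint boundary $\theta = 1$, where the stationarity condition gives $\lambda = 2$. During the transient the iterate exceeds the limit, which reflects the soft-constraint character of the method noted in Section II-C.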

The Lagrangian relaxation method is the most commonly used approach in power systems, capable of being easily integrated with various algorithms for application across a wide range of domains. Based on instantaneous or hard constraints, [24] utilizes a primal-dual approach to optimize the control of power generation and BESS charging and discharging actions in a multi-stage real-time stochastic dynamic OPF. Additionally, [25] applies constrained SAC to the Volt-VAR control problem by synergistically combining the merits of the maximum-entropy framework, the method of multipliers, a device-decoupled neural network structure, and an ordinal encoding scheme. Furthermore, [26] employs constrained RL for the predictive control of OPF, paired with EV charging control. On the other hand, based on cumulative or soft constraints, [27] approximates the actor gradients by solving the Karush-Kuhn-Tucker conditions of the Lagrangian, instead of constructing reward critic networks and cost critic networks through interactions with the environment. Then, the interior point method is incorporated to derive the parameter updating rule for the DRL agent. Similarly, [28] develops a soft-constraint enforcement method to adaptively encourage the control policy in the safety direction with nonconservative control actions and find decisions with near-zero degrees of constraint violations.

III-B Projection Method / Trust Region Method

The TRM ensures constraint satisfaction at every step and enhances performance by updating the trust region policy gradient and projecting the policy into a safe feasible set during each iteration [29]. Typical projection methods include CPO [9], PCPO [30], FOCOPS [31], CUP [32], and MACPO [22]. Among these, PCPO is implemented through a two-step process: first, conducting a local reward update, and then projecting the policy back onto the constraint set to address any constraint violations, as depicted in Fig. 3.

Figure 3: Update procedures for PCPO. In step one (red arrow), PCPO follows the reward improvement direction in the trust region (light green). In step two (blue arrow), PCPO projects the policy onto the constraint set (light orange).
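The geometry of the two-step update in Fig. 3 can be sketched in parameter space. The code below is only a geometric illustration under strong simplifying assumptions: the trust region is replaced by a fixed step size, the constraint set is a single hypothetical half-space $\{\theta : a^{T}\theta \leq b\}$, and the projection is Euclidean. It is not the PCPO algorithm of [30].

```python
from typing import List, Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(xi * yi for xi, yi in zip(x, y))


def project_onto_halfspace(theta: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Euclidean projection of theta onto the set {x : a^T x <= b}."""
    violation = dot(a, theta) - b
    if violation <= 0.0:
        return list(theta)  # already feasible, nothing to correct
    scale = violation / dot(a, a)
    return [t - scale * ai for t, ai in zip(theta, a)]


def two_step_update(theta: Sequence[float], reward_grad: Sequence[float],
                    a: Sequence[float], b: float, step: float = 0.5) -> List[float]:
    # Step one: move along the reward improvement direction.
    intermediate = [t + step * g for t, g in zip(theta, reward_grad)]
    # Step two: project the intermediate point back onto the constraint set.
    return project_onto_halfspace(intermediate, a, b)


# Hypothetical parameters, reward gradient, and linear constraint.
new_theta = two_step_update([0.0, 0.0], reward_grad=[2.0, 1.0], a=[1.0, 0.0], b=0.5)
print(new_theta)
```

Here the reward step reaches the point (1.0, 0.5), which violates the constraint on the first coordinate, and the projection moves it back to the boundary while leaving the second coordinate unchanged.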

In the power system domain, TRMs have also seen widespread application. For instance, [33] introduced a projection-embedded MA-DRL algorithm that smoothly and effectively restricts the DRL agent action space to prevent any violations of physical constraints, thereby achieving decentralized optimal control of distribution grids with a guaranteed 100% safety rate. Additionally, in the area of EV charging problems, [34] utilizes a penalty function to penalize the neural network output if it exceeds the action space and uses a projection operator to avoid incurring a negative reward when no EV is occupying the charging bay. In addition, [35] employs CPO for volt-VAR control to minimize the total operation costs while satisfying the physical operation constraints. However, TRMs, primarily based on TRPO or PPO, are not easily integrated with other RL types and are computationally intensive in high dimensions, limiting their suitability for large-scale safe RL problems [36].

III-C Lyapunov Method

Lyapunov functions, widely used in control engineering for controller design [37], were first applied to safe RL in [38]. The application of the Lyapunov method in power systems is limited because it requires prior knowledge of a Lyapunov function, and identifying a suitable one can be challenging when the model of the environmental dynamics is unknown. Nevertheless, several applications exist. For example, [39] integrates a Lyapunov function into the structural properties of primary frequency controllers, guaranteeing local asymptotic stability over a large set of states. Additionally, [40] utilizes Lyapunov theory to design a controller that satisfies specific Lipschitz constraints for decentralized inverter-based voltage control. Furthermore, [41] utilizes a stability-constrained RL method for real-time voltage control in distribution grids, providing a formal voltage stability guarantee using the Lyapunov function.

III-D Gaussian Process Method

GP [42] is widely utilized in numerous approaches to estimate uncertainty and identify unsafe areas. Consequently, assessments based on GP can be incorporated into the learning process to enhance agent safety [43]. GP-based safe RL algorithms include SafeOpt [44] and PILCO [45]. The application of GP method-based safe RL in power systems is limited, meriting further research to adequately address the various uncertainties inherent in power systems. The potential disadvantage of GP methods is their computational complexity and scalability issues, especially as the dimensionality of the problem space increases [36].

III-E Shielding Method

Shielding was first introduced to RL in [46]. The shield is computed in advance, based on the safety component of the system specification provided and an abstraction of the dynamics of the agent’s environment. It guarantees safety with minimal interference, meaning that the shield restricts the agent’s actions as little as necessary, only prohibiting actions that could jeopardize the safe behavior of the system. The shielded RL is shown in Fig. 4.

Figure 4: Shielded RL. Machine learning is applied to control systems in such a way that the correctness of the system’s execution against a given specification is assured during both the learning and controller execution phases, regardless of the convergence speed of the learning process.

Shielding is a method that enforces constraint satisfaction, making it highly suitable for power system problems with hard constraints. For instance, in [47], actions that would lead to dangerous states, such as the SoC of BESSs being fully charged or depleted, are substituted by the shielding mechanism with safe actions to maintain system stability. Additionally, [48] combines a correction model adapted from gradient descent with the prediction model as a post-posed shielding mechanism to enforce safe actions in computer room air conditioning unit control problems. In addition, in unit commitment scheduling, [49] utilizes action space clipping to ensure that uncertainty estimates are reasonable and within appropriate bounds obtained from historical data. A potential drawback of the shielding method is the challenge of identifying feasible, safe actions based on infeasible ones, which requires underlying knowledge of the system. This can be difficult for certain complex systems or specific control scenarios [36].
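The SoC example above can be sketched as a simple post-hoc shield that passes safe actions through unchanged and replaces unsafe ones with the nearest safe action. This is a minimal sketch, not the mechanism of [47]: the lossless SoC model, the battery capacity, the time step, and the SoC bounds are all hypothetical.

```python
def shield_action(soc: float, power_kw: float, capacity_kwh: float = 100.0,
                  dt_h: float = 0.25, soc_min: float = 0.1, soc_max: float = 0.9) -> float:
    """Return a charging power (kW, positive = charging) that keeps the next SoC within bounds.

    Assumes a lossless battery: soc_next = soc + power_kw * dt_h / capacity_kwh.
    """
    max_charge = (soc_max - soc) * capacity_kwh / dt_h      # largest admissible charging power
    max_discharge = (soc - soc_min) * capacity_kwh / dt_h   # largest admissible discharging power
    # Minimal interference: the proposed action is only clipped when it is unsafe.
    return min(max(power_kw, -max_discharge), max_charge)


print(shield_action(soc=0.50, power_kw=30.0))   # safe action, passed through
print(shield_action(soc=0.85, power_kw=50.0))   # would overcharge, replaced
print(shield_action(soc=0.12, power_kw=-40.0))  # would over-discharge, replaced
```

Identifying the safe substitute is trivial here because the SoC dynamics are known and scalar; the drawback noted above arises when no such model is available.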

III-F Safety Layer Method

Both the safety layer and shielding methods integrate safety into the RL process, but they differ in their implementation: the safety layer acts as an additional check within the RL framework, whereas shielding employs an external system (the shield) that intervenes only when necessary to prevent unsafe actions. The safety layer method, first proposed in [50] for continuous action spaces in RL, emphasizes maintaining zero constraint violations throughout the learning process. It expresses safety constraints as linear functions of the action through a first-order approximation. Assuming that at most one constraint is violated at any time, an analytical solution to the safety layer optimization problem can be directly obtained. The linearization equation and visualization of the safety layer are shown in (8) and Fig. 5, respectively.

$$\overline{h}_i(s_{t+1}) \triangleq h_i(s_t, a_t) \approx \overline{h}_i(s_t) + g(s_t; w_i)^{T} a_t \tag{8}$$

where $w_i$ are the weights of the NN, and $g(s_t; w_i)$ denotes the first-order approximation to $h_i(s_t, a_t)$ with respect to $a_t$.

Figure 5: Safety layer. Each safety signal $h_i(s, a)$ is approximated with a linear model with respect to $a$, whose coefficients are features of $s$, extracted with a NN.
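Under the linearization (8) and the single-violated-constraint assumption, the correction amounts to projecting the proposed action onto a half-space, which has a closed form. The sketch below assumes the constraint is expressed as $\overline{h}_i(s_t) + g^{T} a_t \leq 0$ and uses a fixed hypothetical sensitivity vector in place of the NN output $g(s_t; w_i)$.

```python
from typing import List, Sequence


def safety_layer(action: Sequence[float], h_bar: float, g: Sequence[float],
                 limit: float = 0.0) -> List[float]:
    """Closest action (in Euclidean distance) satisfying h_bar + g^T a <= limit."""
    g_dot_a = sum(gi * ai for gi, ai in zip(g, action))
    g_dot_g = sum(gi * gi for gi in g)
    # Multiplier of the single linearized constraint; zero when the action is already safe.
    lam = max(0.0, (h_bar + g_dot_a - limit) / g_dot_g)
    return [ai - lam * gi for ai, gi in zip(action, g)]


# Hypothetical values: current constraint level h_bar and sensitivity g.
proposed = [1.0, 1.0]
corrected = safety_layer(proposed, h_bar=-0.5, g=[1.0, 0.0])
print(corrected)
```

The proposed action gives a predicted constraint value of 0.5, so the layer shifts it along $-g$ until the linearized constraint is exactly active; an action that is already safe is returned unchanged.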

The safety layer method has been widely applied in power systems. For example, in optimal power generation dispatch, [51] proposes a hybrid knowledge-data-driven safety layer to convert unsafe actions into the safety region, which is accelerated by a security-constrained linear projection model. Additionally, in volt-VAR control, [52] adds a safety layer to the policy neural network to enhance operational constraint satisfaction during both the initial exploration phase and the convergence phase. Furthermore, [53] uses action clipping, reward shaping, and expert demonstrations to ensure safe exploration and accelerate training during the online training stage to assist service restoration. However, the linear approximation in the safety layer might not accurately capture the underlying dynamics of highly nonlinear systems, and invoking the layer at every time step could introduce a significant computational burden. Moreover, the assumption that at most one constraint is violated at a time may not hold in complex environments where multiple safety constraints are concurrently active.

III-G Barrier Function Method

The barrier function method adds a barrier function penalty term to the original objective function. When the system state approaches the safety boundary, the magnitude of the constructed barrier term grows without bound (in the maximization form (9), the logarithmic term tends to negative infinity), thereby ensuring that the state remains within the safe boundary [54]. The most typical barrier function method is IPO, which augments the objective with logarithmic barrier functions, drawing inspiration from the interior-point method [55]:

\begin{align}
\textbf{Instantaneous:}\quad & \max_{\theta} J_R^{\pi_\theta} + \sum_i \frac{1}{t_i} \log(-h_i) \tag{9a} \\
\textbf{Cumulative:}\quad & \max_{\theta} J_R^{\pi_\theta} + \sum_i \frac{1}{t_i} \log\left(-J_{h_i}^{\pi_\theta} + \varepsilon_i\right) \tag{9b}
\end{align}

where $t_i$ is a hyperparameter for $h_i$. The illustration of IPO is shown in Fig. 6.

Figure 6: Barrier function. The solid red line represents the logarithmic barrier function $\log(-J_h^{\pi_\theta} + \varepsilon)/t$, which is a differentiable approximation of the indicator function $I(x)$.
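The behavior of the logarithmic barrier term in (9b) can be checked numerically. The sketch below uses a hypothetical limit $\varepsilon = 1$ and several values of the hyperparameter $t$; it treats a violated constraint as a barrier value of negative infinity, which is an assumption made here so that the function is defined everywhere.

```python
import math


def log_barrier(j_h: float, eps: float, t: float) -> float:
    """Barrier term log(eps - J_h) / t from (9b); -inf once the constraint is violated."""
    slack = eps - j_h
    if slack <= 0.0:
        return float("-inf")
    return math.log(slack) / t


# Rows approach the boundary J_h -> eps; columns increase the hyperparameter t.
for j_h in (0.0, 0.9, 0.999):
    values = [round(log_barrier(j_h, eps=1.0, t=t), 4) for t in (1.0, 20.0, 100.0)]
    print(j_h, values)
```

For a fixed cost level inside the feasible region, a larger $t$ shrinks the barrier term toward zero, so the barrier increasingly resembles an indicator that is zero when the constraint holds and negative infinity otherwise.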

Barrier function methods, and IPO in particular, have been widely applied in power systems to enforce safety constraints. For example, [12] utilizes IPO to ensure the fulfillment of distribution network constraints without the need for designated penalty terms and the associated tuning of penalty factors, or repeatedly solving optimization problems for action rectification. Additionally, [56] uses IPO to facilitate desirable learning behavior towards constraint satisfaction and policy improvement simultaneously during online preventive control for transmission overload relief. Furthermore, [57] proposes a safe RL method for emergency load shedding in power systems, where the reward function includes a barrier function that approaches negative infinity as the system state approaches safety bounds. However, the accurate formulation and tuning of barrier functions necessitate knowledge of system dynamics, which can be challenging in complex environments.

III-H Robust Reinforcement Learning

One of the challenges in RL is generalization under uncertainties not seen during training. To address this, RRL frameworks have been developed, focusing on enhancing the reliability and robustness of RL agents for the worst-case scenarios [58, 59]. Two notable approaches in this context are chance-constrained RRL and constrained game-theoretic RL. It is important to note that RRL is not universally recognized as a safe RL algorithm in other fields. However, due to the significant uncertainties in power systems, RRL is employed to enhance control robustness and is reviewed here.

III-H1 Chance-constrained RRL

Chance-constrained RRL, in particular, focuses on ensuring that policies perform well under uncertain conditions by incorporating probabilistic constraints into the learning process [60]. In this framework, the goal is not just to maximize expected rewards but to do so while ensuring that the probability of undesirable outcomes (e.g., safety violations) remains below a specified threshold [61]. This is particularly important in scenarios where safety and reliability are critical, such as autonomous driving or robotics [62]. The general form can be expressed as:

\begin{align}
\max_{\pi}\quad & \mathcal{J}_R^{\pi_\theta} \tag{10} \\
\text{s.t.}\quad & \mathbb{P}\left[\min_i h_i(\bm{s}_t,\bm{a}_t,\bm{s}_{t+1}) \leq \varepsilon_i\right] \geq \zeta, \quad \forall t \in \mathcal{T} \notag
\end{align}

III-H2 Constrained game-theoretic RL

Constrained game-theoretic RL is a framework that models the interaction between the RL agent and its environment as a game, specifically focusing on scenarios where there are constraints that the agent must respect during the learning and decision-making processes [63]. The objective is to maximize the agent’s rewards while minimizing the possible losses or costs, considering the worst-case scenarios posed by adversaries’ actions or environmental uncertainties [64]. It can be written as a minimax optimization problem [63]:

\begin{align}
\min_{\pi_\theta^{\text{adv}}} \max_{\pi_\theta}\quad & \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} R(s_t, a_t, a_t^{\text{adv}}, s_{t+1})\right] \tag{11} \\
\text{s.t.}\quad & h_i(s_t, a_t, a_t^{\text{adv}}, s_{t+1}) \leq 0, \quad \forall t \in \mathcal{T} \notag
\end{align}
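The worst-case reasoning behind (11) can be illustrated with a one-step game over finite action sets. The sketch below is not an RL algorithm and is not drawn from [63]: it computes the agent-side robust (maximin) choice, in which the agent commits first and the adversary responds, so the order of the two optimizations differs from (11) and the two values coincide only when a saddle point exists. The setpoints, load errors, reward, and line-limit cost are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def robust_action(actions: Sequence[float], disturbances: Sequence[float],
                  reward: Callable[[float, float], float],
                  cost: Callable[[float, float], float]) -> Tuple[Optional[float], float]:
    """Best worst-case reward among actions that satisfy cost <= 0 for every adversary move."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        # Discard actions that violate the constraint under any disturbance.
        if any(cost(a, d) > 0.0 for d in disturbances):
            continue
        worst_case = min(reward(a, d) for d in disturbances)
        if worst_case > best_value:
            best_action, best_value = a, worst_case
    return best_action, best_value


def tracking_reward(a: float, d: float) -> float:
    return -((a + d - 0.6) ** 2)  # hypothetical: track a net injection of 0.6 p.u.


def line_cost(a: float, d: float) -> float:
    return a + d - 1.0  # hypothetical: net injection must not exceed 1.0 p.u.


setpoints = [0.0, 0.5, 1.0]
load_errors = [-0.2, 0.0, 0.2]
best_action, best_value = robust_action(setpoints, load_errors, tracking_reward, line_cost)
print(best_action, round(best_value, 4))
```

The largest setpoint is excluded because one disturbance pushes it over the limit, and among the remaining setpoints the middle one has the least damaging worst case.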

One of the key benefits of constrained game-theoretic RL is its ability to handle competitive and cooperative interactions within complex environments, making it suitable for applications ranging from strategic games to cooperative multi-agent scenarios like mobile edge computing [65] and coordination in robotic teams [66].

RRL is applied in power systems to ensure that control strategies remain robust under various uncertainties. For example, [14] utilizes adversarial safe RL to address the model inaccuracy and uncertainty of virtual power plants without relying on an accurate environmental model. Additionally, in the sequential OPF problem, [51] employs a bi-level robust optimization approach to optimize the training loss of the Q network. In addition, in the inverter-based volt-VAR control problem, [16] develops a highly efficient adversarial RL algorithm to train an offline agent that is robust to model mismatches during the offline stage.

III-I Benchmarks

Benchmarks include both benchmark environments and benchmark algorithms. Safety Gym, developed by OpenAI, is the first widely recognized safe benchmark environment. It includes an environment-builder and a suite of pre-configured benchmark environments [21, 67]. Correspondingly, Safety Starter Agents, a benchmark algorithm library, has been developed based on Safety Gym [68]. The supported algorithms in this library include PPO, PPO-Lag, TRPO, TRPO-Lag, SAC, SAC-Lag, and CPO. This package has been tested on Mac OS Mojave and Ubuntu 16.04 LTS and is likely compatible with most recent Mac and Linux operating systems.

Safety Gymnasium, an update and extension of Safety Gym, has become the current mainstream platform [69, 70]. Correspondingly, a benchmark repository for safe RL algorithms, named SafePO, has been proposed [71]. SafePO is tested on the Linux platform and potentially supports Mac or Windows, requiring only modifications to the Linux path and sort functions for compatibility.

SafePO further extends the variety of supported safe RL algorithms, as illustrated in Fig. 7.

Figure 7: Supported safe RL algorithms of SafePO.

OmniSafe emerges as the first unified learning framework in the field of safe RL, featuring a highly modular framework that includes a comprehensive collection of algorithms specifically developed for safe RL across various domains. Its versatility comes from an abstracted algorithm structure and a well-designed API, facilitating seamless integration of different components, thereby simplifying extension and customization for developers. Additionally, OmniSafe enhances algorithm learning speeds through process parallelism, supporting both environment-level and agent asynchronous parallel learning. OmniSafe is supported and tested on Linux and also supports M1 and M2 versions of macOS. However, it does not support Windows [72, 73]. The supported safe RL algorithms of OmniSafe are shown in Table I.

TABLE I: Supported Safe RL Algorithms of OmniSafe
| Domains | Types | Algorithms Registry |
|---|---|---|
| On Policy | Primal-Dual | TRPO-Lag; PPO-Lag; PDO; RCPO |
| | Convex Optimization | CPO; PCPO; FOCOPS; CUP |
| | Penalty Function | IPO; P3O |
| | Primal | OnCRPO |
| Off Policy | Primal-Dual | DDPG-Lag; TD3-Lag; SAC-Lag; DDPG-PID; TD3-PID; SAC-PID |
| Model-based | Online Plan | SafeLOOP; CCEPETS; RCEPETS |
| | Pessimistic Estimate | CAPPETS |
| Offline | Q-Learning-Based | BCQ-Lag; C-CRR |
| | DICE-Based | COptDICE |
| Other MDP | ET-MDP | PPO/TRPO-EarlyTerminated |
| | SauteRL | PPOSaute; TRPOSaute |
| | SimmerRL | PPOSimmer-PID; TRPOSimmer-PID |

Overall, Safety Gymnasium is the current mainstream benchmark environment, and OmniSafe has integrated Safety Gymnasium to ensure code compatibility. Note that Safety Gymnasium was primarily developed for control tasks in gaming, robotics, and autonomous driving, and features agents such as point, car, dog, and ant. It offers environments tailored to challenges such as safe navigation, safe velocity, and safe vision, but these are not directly applicable to power system problem formulations. Hence, corresponding power system control environments need to be developed on top of the environment templates provided by Safety Gymnasium. In terms of benchmark algorithms, OmniSafe offers a more comprehensive set of algorithms but currently does not support Windows due to difficulties with Python library installations. In contrast, SafePO is more easily extended to Windows. Since most professional power system software is developed for Windows, with less support for Linux and macOS, this may limit the application of OmniSafe in model-based environments. However, if surrogate models replace the physical models in a model-free environment, OmniSafe can be used on Linux or macOS.
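As a rough illustration of what such a power system environment might look like, the sketch below mimics the step convention documented for Safety Gymnasium, in which `step` returns a cost signal separately from the reward. The class name `ToyDispatchEnv`, the single-generator, single-load setup, and all numerical values are hypothetical and chosen only for illustration. A real environment would build on the templates provided by Safety Gymnasium and call a power flow solver.

```python
import random


class ToyDispatchEnv:
    """Hypothetical single-bus dispatch environment with a separate cost signal.

    One generator serves one load. The reward is the negative quadratic
    generation cost; the cost signal is the absolute power imbalance, which
    a safe RL algorithm would try to keep below a threshold.
    """

    def __init__(self, horizon=24, p_min=0.0, p_max=2.0, seed=0):
        self.horizon = horizon
        self.p_min, self.p_max = p_min, p_max
        self.rng = random.Random(seed)
        self.t = 0
        self.load = 1.0

    def _sample_load(self):
        # Illustrative per-unit load profile; not taken from any data set.
        return 1.0 + 0.3 * self.rng.uniform(-1.0, 1.0)

    def reset(self):
        self.t = 0
        self.load = self._sample_load()
        return [self.load], {}

    def step(self, action):
        # Clip the requested set-point to the generator limits.
        p_gen = min(max(float(action), self.p_min), self.p_max)
        reward = -(0.1 * p_gen ** 2 + 1.0 * p_gen)  # negative generation cost
        cost = abs(p_gen - self.load)               # safety cost: imbalance
        self.t += 1
        self.load = self._sample_load()
        terminated = False
        truncated = self.t >= self.horizon
        return [self.load], reward, cost, terminated, truncated, {}


env = ToyDispatchEnv(horizon=3)
obs, info = env.reset()
episode_cost = 0.0
done = False
while not done:
    # Naive policy: set the generator output equal to the observed load.
    obs, reward, cost, terminated, truncated, info = env.step(obs[0])
    episode_cost += cost
    done = terminated or truncated
print("cumulative cost:", episode_cost)
```

Keeping the cost separate from the reward is what allows constrained algorithms such as the Lagrangian variants in Table I to treat safety as a constraint rather than as a fixed penalty folded into the objective.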

Figure 8: RL schemes for the safe control and decision-making in power systems.

IV Power System Applications of Safe RL

This review synthesizes a broad collection of studies and applications of safe RL in power systems, covering a wide array of domains: optimal power generation dispatch, voltage control, stability control, EV charging control, building energy management, electricity market, system restoration, and unit commitment and reserve scheduling. The safe RL algorithms used in these application domains are presented in Fig. 1. As depicted in Fig. 8, RL-based schemes collect power system measurements, including PMU and AMI readings, and integrate system model knowledge into their policy training. They take actions to control power system devices while ensuring that safety requirements such as feasibility, stability, and robustness are met. For each domain, we review the research problem or objective function, the constraints, the constraint type (cumulative/instantaneous and hard/soft), the applied safe constraint techniques, and the key features in order to compare the different studies that use safe RL.

IV-A Optimal Power Generation Dispatch

TABLE II: Safe RL Applications in Optimal Power Generation Dispatch

| Ref. | Research Problem/Objective | Constraint | Constraint Type | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [27] | Minimize the total generation cost | Physical operation constraints | Cum/Soft | Primal-dual method (III-A) | Combines the primal-dual DDPG with the classic SCOPF model. The actor gradients are approximated by solving the Karush-Kuhn-Tucker conditions of the Lagrangian. |
| [24] | Minimize the fuel costs and power loss from BESSs | Physical constraints | Ins/Hard | Projection (III-B) and primal-dual method (III-A) | A primal-dual approach is introduced to learn optimal constrained DRL policies specifically for predictive control in real-time stochastic dynamic OPF. |
| [74] | Minimize the total system cost | Physical constraints | Cum/Hard | Safety layer (III-F) | Unsafe actions are projected into the safe action space while constrained zonotope set is used to improve efficiency. |
| [75] | Minimize the cost of thermal power MESS | Power grid and MESSs constraints | Ins/Hard | Proximal gradient projection (III-B) | MESSs are modeled as CMDP, and a framework is proposed based on a DRL algorithm that considered the discrete-continuous hybrid action space of the MESSs. |
| [15] | Minimize the total energy cost | Power system constraints | Cum/Hard | Lagrange relaxation (III-A) and logarithmic barrier (III-G) | Function approximation addresses large, continuous state spaces, while a diffusion strategy coordinates actions of DG units and ESSs. |
| [76] | Minimize the generator fuel cost | Power system constraints | Ins/Hard | Safety layer (III-F) | The proposed method uses physics-driven parameters for easy modification and less conservative, easily re-parameterizable actions. |
| [77] | Minimize the operating cost | Power system constraints | Ins/Hard | Safety layer (III-F) | To avoid line overload, a safety layer is added by introducing transmission constraints to avoid dangerous actions and tackle sequential security-constrained OPF problem. |
| [10] | Minimize the total operating cost | Physical constraints of system and devices | Cum/Hard | CPO (III-B) | To optimize both discrete and continuous actions, a stochastic policy based on a joint distribution of mixed random variables is designed and learned through a NN approximator. |
| [11] | Minimize the total cost of operation of microgrids | Global and local constraints | Cum/Soft | Lagrangian relaxation (III-A) and projection (III-B) | The training process employs the gradient information of operational constraints to ensure that the optimal control policy functions generate safe and feasible decisions. |
| [78] | Minimize the operational cost | Operation and power balance constraints | Cum/Hard | CPO (III-B) and invalid action masking (III-E) | Invalid action masking is applied to avoid invalid actions, accomplished by replacing the logits of the actions to be masked with a large negative number. |
| [79] | Minimize the total operational cost | AC-PF constraints | Cum/Hard | CPO (III-B) | Contrary to traditional DRL methods, the proposed method constrains exploration to only those policies that comply with AC-PF constraints. |
| [28] | Minimize the total operational cost | Gas system and power system constraints | Cum/Soft | Lagrangian relaxation (III-A) | The penalty is adaptively updated based on the extent of constraint violation, facilitating the prediction of near-optimal control actions that achieve near-zero degrees of violation. |
| [80] | Minimize the operating cost for the whole horizon | Operational constraints | Ins/Hard | MIP formulation | The action-value function, approximated through a DNN, is structured as a MIP formulation, enabling the inclusion of constraints within the action space. |
| [81] | Optimize the total generation cost | Operational and linguistic stipulation constraints | N.A./Soft | Primal-dual method (III-A) | For the first time, a GPT LLM is integrated into the OPF framework alongside linguistic rules. This novel approach models and quantifies natural language stipulations as objectives and constraints within a primal-dual DRL loop. |
| [82] | Minimize the total operation cost | Operational constraints | N.A./Soft | Lagrangian relaxation (III-A) | Instead of using the critic network, the deterministic gradient is derived analytically and solved by using interior point method. |
| [83] | Minimize the total energy cost | Satisfaction of the energy demand | Cum/Soft | Lagrangian relaxation (III-A) and RRL (III-H) | This approach efficiently uses short-horizon forecasts to prevent energy demand failures and reduce costs, surpassing the capabilities of standard safe RL methods. |
| [12] | Minimize the costs of DGs production and RES curtailment | Constraints of distribution network | Cum/Hard | IPO (III-G) | The generalization of IPO is improved by extracting spatial-temporal features from microgrid operation data, leveraging the advantages of edge-conditioned convolutional networks and long short-term memory networks. |
| [84] | Multi-energy management | Thermal energy balance | Cum/Hard | Shielding method (III-E) | Decoupling architecture of safety constraint formulations from the RL formulation. Hard-constraint satisfaction without the need to solve a mathematical program. |
| [85] | Minimize the cost of electricity net, DG and gas | Constraints of the power and gas networks | Ins/Hard | Safety layer (III-F) | By learning a dynamic security assessment rule, a physically-informed safety layer ensures adherence to physical constraints by solving an action correction formulation. |
| [14] | Minimize the overall operation cost | Branch power flow security constraint | Ins/Soft | Lagrangian relaxation (III-A) and RRL (III-H) | An adversarial safe RL approach is proposed to enhance action safety and robustness against deviations between training and testing environments. |
| [51] | Minimize the operation cost | Operational constraints | Ins/Hard | Safety layer (III-F), projection (III-B), and RRL (III-H) | A safety layer that blends knowledge and data-driven approaches is created. Also, security constraints and linear projection are combined to improve computational speed. |

  • Cum: Cumulative; Ins: Instantaneous; N.A.: Not applicable or not available.

Optimal power generation dispatch covers a family of problems that differ in the constraints they consider, ranging from simplified formulations to security-constrained ones: economic dispatch, DC-OPF, AC-OPF, and SCOPF. The operation of a power system must meet both security and economic requirements. Considering credible contingencies, AC-OPF has been widely used [79, 86]. Most existing methods for solving OPF rely on analytical methods; however, given the inherently large scale of these problems, real-time computation is very challenging. Another variation of OPF is SCOPF, which requires significantly longer computation times due to the additional security constraints [27]. To accelerate the calculation of SCOPF, methods such as DC-PF approximation [87], convex power flow approximation [88], and convex security constraint approximation [89] have been proposed. However, the accuracy of these methods has been questioned, and they remain time-consuming for large-scale systems. To accelerate computation and achieve better solutions, RL methods have been widely applied. Since traditional RL struggles to handle safety constraints effectively, safe RL has been further applied to address these issues.

The details of the applications of safe RL in optimal power generation dispatch are shown in Table II. Based on Table II, we summarize the foundational framework for implementing safe RL in optimal power generation dispatch using a specific example with SGs, RESs, and BESSs that incorporates strict physics-based constraints such as AC- and DC-PF constraints. If the system includes additional power system devices, the presented equations can be readily extended to accommodate them. Note that the models presented below are examples for illustration, and other RL formulations and models for optimal power generation dispatch exist depending on the specific problem setting. This is also true for the other application domains. The state, action, reward, and constraints of optimal power generation dispatch are as follows.

IV-A1 AC-PF

AC-PF constraints describe the basic physics of power systems and have been widely considered in optimal power generation dispatch, voltage control, unit commitment, etc.

State

The states include active and reactive loads and voltage:

\begin{equation}
\bm{s}^{\text{AC}}_{t}\triangleq\left(\bm{v}_{t},\bm{p}^{\text{Load}}_{t},\bm{q}^{\text{Load}}_{t}\right) \tag{12}
\end{equation}
Action

The control actions comprise the active and reactive power generation of SGs, the active power generation of RESs, and the charging and discharging power of BESSs:

\begin{equation}
\bm{a}^{\text{AC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{q}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right) \tag{13}
\end{equation}
Reward

The reward penalizes the SG generation cost, the RES curtailment cost, and the BESS cost:

\begin{align}
\max_{\pi_{\theta}\in\Pi_{S}}\quad & \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right] \tag{14a}\\
R^{\text{AC}}(\bm{s},\bm{a}) &= -\left|\sum_{\forall i\in\mathcal{G}}\left(a^{\text{SG}}_{i}(p^{\text{SG}}_{i,t})^{2}+b^{\text{SG}}_{i}p^{\text{SG}}_{i,t}+c^{\text{SG}}_{i}\right)\right| \notag\\
&\quad -\sum_{\forall i\in\mathcal{R}}c^{\text{RES}}_{i}\left|p^{\text{RES}}_{\text{MPPT},i,t}-p^{\text{RES}}_{i,t}\right| \notag\\
&\quad -\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{dis},i}p^{\text{BESS}}_{\text{dis},i,t}+\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{ch},i}p^{\text{BESS}}_{\text{ch},i,t} \tag{14b}\\
\bm{s}^{\text{AC}}_{t} &= f_{t}(\bm{s}^{\text{AC}}_{t-1},\bm{a}^{\text{AC}}_{t-1}),\quad \bm{a}^{\text{AC}}_{t}\sim\pi(\bm{a}^{\text{AC}}_{t}\mid\bm{s}^{\text{AC}}_{t-1}) \tag{14c}
\end{align}
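To make the reward in (14b) and the discounted objective in (14a) concrete, the following minimal sketch evaluates both for plain Python lists. The function names `reward_ac` and `discounted_return` and all numerical values are hypothetical; the signs of the BESS terms are kept exactly as written in (14b).

```python
def reward_ac(p_sg, cost_coeffs, p_res, p_res_mppt, c_res,
              p_dis, p_ch, c_dis, c_ch):
    """Single-step reward following (14b); each argument is a per-device list."""
    # SG generation cost: sum of a_i * p^2 + b_i * p + c_i over all SGs.
    sg_cost = sum(a * p ** 2 + b * p + c
                  for p, (a, b, c) in zip(p_sg, cost_coeffs))
    # RES curtailment cost: weighted deviation from the MPPT output.
    res_cost = sum(c * abs(p_mppt - p)
                   for c, p_mppt, p in zip(c_res, p_res_mppt, p_res))
    # BESS terms, with the signs as they appear in (14b).
    dis_term = sum(c * p for c, p in zip(c_dis, p_dis))
    ch_term = sum(c * p for c, p in zip(c_ch, p_ch))
    return -abs(sg_cost) - res_cost - dis_term + ch_term


def discounted_return(rewards, gamma):
    """Finite-horizon version of the discounted sum inside (14a)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical system: one SG, one RES unit, one BESS (values in per unit).
r = reward_ac(p_sg=[2.0], cost_coeffs=[(0.5, 2.0, 1.0)],
              p_res=[2.5], p_res_mppt=[3.0], c_res=[1.0],
              p_dis=[1.0], p_ch=[0.0], c_dis=[2.0], c_ch=[1.0])
g = discounted_return([r, r], gamma=0.5)
print(r, g)
```

In this example the SG cost is 7.0, the curtailment cost is 0.5, and the discharging term is 2.0, so the single-step reward is -9.5.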
Constraint

The control actions derived from DRL must adhere to hard physical constraints. AC-PF constraints include bus active and reactive power balance constraints, SG active and reactive power generation constraints, RES active power generation constraints, voltage constraints, and branch apparent power constraints:

\begin{align}
\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t} &= \Re\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\} \tag{15a}\\
\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}-\bm{q}^{\text{Load}}_{t} &= \Im\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\} \tag{15b}\\
\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}},\quad &\underline{\bm{q}}^{\text{SG}}\leq\bm{q}^{\text{SG}}_{t}\leq\overline{\bm{q}}^{\text{SG}} \tag{15c}\\
\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}},\quad &\underline{\bm{v}}\leq|\bm{v}|\leq\overline{\bm{v}},\quad |s_{ij}|\leq\overline{s}_{ij} \tag{15d}
\end{align}

where $\mathbf{M}^{\text{SG}}$ denotes the matrix in $\{0,1\}^{N\times G}$ that maps the generation vector $\bm{p}_{t}^{\text{SG}}\in\mathbb{R}^{|\mathcal{G}|}$ to $\mathbb{R}^{N}$:

\begin{align}
[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=0,\quad [\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=0,\quad &\forall i\in\mathcal{N}\setminus\mathcal{G} \tag{16a}\\
[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=p^{\text{SG}}_{j},\quad [\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=q^{\text{SG}}_{j},\quad &\forall i\in\mathcal{G},\ \forall j\in[G] \tag{16b}
\end{align}
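The mapping in (16) and the right-hand side of the power balance equations (15a) and (15b) can be sketched with plain Python lists and complex numbers. The helper names, the 0-indexed bus numbering, and the small test networks below are assumptions made for illustration only. The function `complex_injections` uses the identity that the $i$-th diagonal entry of $\bm{v}\bm{v}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}}$ equals $v_i$ times the conjugate of the $i$-th entry of $\mathbf{Y}\bm{v}$.

```python
def build_mapping(n_bus, device_buses):
    """Return the n_bus x n_dev 0/1 matrix with M[i][j] = 1 if device j is at bus i."""
    M = [[0] * len(device_buses) for _ in range(n_bus)]
    for j, bus in enumerate(device_buses):
        M[bus][j] = 1
    return M


def matvec(M, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in M]


def complex_injections(v, Y):
    """Diagonal of v v^H Y^H, i.e. S_i = v_i * conj((Y v)_i)."""
    injections = []
    for v_i, row in zip(v, Y):
        current = sum(y_ij * v_j for y_ij, v_j in zip(row, v))
        injections.append(v_i * current.conjugate())
    return injections


# Hypothetical 3-bus system (0-indexed) with SGs at buses 0 and 2.
M_sg = build_mapping(3, [0, 2])
p_bus = matvec(M_sg, [1.5, 0.7])  # entry for bus 1 is zero, as in (16a)

# Hypothetical 2-bus network with a single line of admittance y.
y = 1 - 5j
Y = [[y, -y], [-y, y]]
s_flat = complex_injections([1 + 0j, 1 + 0j], Y)          # no voltage difference
s_loaded = complex_injections([1 + 0j, 0.98 - 0.02j], Y)  # power flows 0 -> 1
p_inj = [s.real for s in s_loaded]  # right-hand side of (15a)
q_inj = [s.imag for s in s_loaded]  # right-hand side of (15b)
print(p_bus, p_inj, q_inj)
```

With a flat voltage profile the injections vanish, and in the loaded case the active injections sum to the line loss. A safety check for (15a) would compare `p_inj` with the net bus injection assembled from the mapped generation and load vectors.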

IV-A2 DC-PF

DC-PF constraints are a linear approximation of the AC-PF constraints and are commonly used in optimal power generation dispatch and electricity market applications.

State

Voltage magnitudes and reactive power are neglected in DC-PF, so the state consists of the bus voltage angles and the active loads:

\begin{equation}
\bm{s}^{\text{DC}}_{t}\triangleq\left(\bm{\vartheta}_{t},\bm{p}^{\text{Load}}_{t}\right) \tag{17}
\end{equation}
Action

The action involves only the generation or consumption of active power:

\begin{equation}
\bm{a}^{\text{DC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right) \tag{18}
\end{equation}
Reward

The reward is similar to that of the AC-PF case (14).

Constraint

The DC-PF constraints are a simplification of the AC-PF constraints, retaining only the active power components and disregarding voltage issues [90]:

\begin{align}
\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t} &= \mathbf{B}\bm{\vartheta}_{t} \tag{19a}\\
\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}},\quad &\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}} \tag{19b}\\
|p_{ij}| &\leq \overline{p}_{ij} \tag{19c}
\end{align}
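A minimal sketch of how (19a) and (19c) might be evaluated for a given angle vector is shown below. It assumes the standard DC-PF convention in which $\mathbf{B}$ is built from line susceptances and the flow on line $(i,j)$ is $b_{ij}(\vartheta_i-\vartheta_j)$. The function names, the 3-bus network, and the line limits are hypothetical.

```python
def build_b_matrix(n_bus, lines):
    """Assemble B from a list of (from_bus, to_bus, susceptance) tuples."""
    B = [[0.0] * n_bus for _ in range(n_bus)]
    for i, j, b in lines:
        B[i][i] += b
        B[j][j] += b
        B[i][j] -= b
        B[j][i] -= b
    return B


def dc_injections(B, theta):
    """Right-hand side of (19a): net active injection B * theta at each bus."""
    return [sum(b * th for b, th in zip(row, theta)) for row in B]


def line_flows(lines, theta):
    """Active power flow on each line under the DC approximation."""
    return [b * (theta[i] - theta[j]) for i, j, b in lines]


def flow_violations(flows, limits):
    """Amount by which each line exceeds its limit in (19c); zero if satisfied."""
    return [max(0.0, abs(f) - lim) for f, lim in zip(flows, limits)]


# Hypothetical 3-bus triangle network with equal susceptances (per unit).
lines = [(0, 1, 8.0), (1, 2, 8.0), (0, 2, 8.0)]
theta = [0.0, -0.0625, -0.125]  # bus voltage angles in radians
B = build_b_matrix(3, lines)
p_inj = dc_injections(B, theta)
flows = line_flows(lines, theta)
violations = flow_violations(flows, limits=[1.0, 1.0, 0.75])
print(p_inj, flows, violations)
```

Here bus 0 injects 1.5 p.u. and bus 2 withdraws the same amount, while line 0-2 exceeds its limit by 0.25 p.u. In a safe RL setting, such a violation could serve as the cost signal or trigger an action correction.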

IV-A3 BESS Constraints

The BESS constraints include charging and discharging power limits, SoC limits, and the SoC dynamics:

\begin{align}
0\leq\bm{p}^{\text{BESS}}_{\text{ch},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{ch}},\quad &0\leq\bm{p}^{\text{BESS}}_{\text{dis},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{dis}} \tag{20a}\\
\underline{\bm{SoC}}^{\text{BESS}}\leq\bm{SoC}^{\text{BESS}}_{t} &\leq\overline{\bm{SoC}}^{\text{BESS}} \tag{20b}\\
\bm{SoC}^{\text{BESS}}_{t} &= \bm{SoC}^{\text{BESS}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{BESS}}}\Big(\eta^{\text{BESS}}_{\text{ch}}\bm{p}^{\text{BESS}}_{\text{ch},t}-\frac{\bm{p}^{\text{BESS}}_{\text{dis},t}}{\eta^{\text{BESS}}_{\text{dis}}}\Big) \tag{20c}
\end{align}
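The constraints in (20) lend themselves to a simple action correction. The sketch below applies the SoC dynamics of (20c) to a single BESS and clips a proposed charge or discharge power so that (20a) and (20b) hold at the next step. This clipping rule is an illustrative assumption, not a method taken from the reviewed papers, and the function names and parameter values are hypothetical.

```python
def soc_update(soc_prev, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis):
    """One step of the SoC dynamics in (20c) for a single BESS."""
    return soc_prev + dt / e_cap * (eta_ch * p_ch - p_dis / eta_dis)


def clip_bess_action(soc_prev, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis,
                     p_ch_max, p_dis_max, soc_min, soc_max):
    """Clip charge and discharge power so that (20a) and (20b) are satisfied."""
    # Largest charging power that keeps the next SoC at or below soc_max.
    ch_headroom = (soc_max - soc_prev) * e_cap / (dt * eta_ch)
    # Largest discharging power that keeps the next SoC at or above soc_min.
    dis_headroom = (soc_prev - soc_min) * e_cap * eta_dis / dt
    # Clipping each power separately is conservative but sufficient.
    p_ch_safe = min(max(p_ch, 0.0), p_ch_max, max(ch_headroom, 0.0))
    p_dis_safe = min(max(p_dis, 0.0), p_dis_max, max(dis_headroom, 0.0))
    return p_ch_safe, p_dis_safe


# Hypothetical 4 MWh battery that is almost full; the agent asks for 1 MW charge.
params = dict(dt=1.0, e_cap=4.0, eta_ch=0.95, eta_dis=0.95)
p_ch_safe, p_dis_safe = clip_bess_action(
    soc_prev=0.85, p_ch=1.0, p_dis=0.0,
    p_ch_max=1.0, p_dis_max=1.0, soc_min=0.1, soc_max=0.9, **params)
soc_next = soc_update(0.85, p_ch_safe, p_dis_safe, **params)
print(p_ch_safe, soc_next)
```

In this example the requested 1 MW of charging is reduced to roughly 0.21 MW, which brings the SoC to its upper limit of 0.9 without exceeding it.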

IV-B Voltage Control

TABLE III: Safe RL Applications in Voltage Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [33] | Minimize transmission losses | Voltage and other system constraints | Ins/Hard | Projection layer (III-B) | Through an embedded safe policy projection, it is possible to smoothly and effectively limit the action space, thereby preventing any breach of physical constraints. |
| [40] | Minimize cost | Voltage constraint | Ins/Hard | Lyapunov stability (III-C) | Ensuring that each NN controller satisfies certain Lipschitz constraints to inherently meet these constraints, thus guaranteeing the system maintains exponential stability. |
| [91] | Minimize transmission loss | Voltage and power flow constraints | Ins/Hard | Finite iteration projection (III-B) | A finite iteration projection algorithm is proposed to guarantee hard constraints by converting a non-convex optimization problem into a finite iteration problem. |
| [52] | Minimize the cost of network loss and device switching | Voltage and power flow constraints | Cum/Hard | Safety layer (III-F) | A safety layer is added to the policy NN to enhance operational constraint satisfaction for both the initial exploration phase and the convergence phase. |
| [17] | Minimize total network energy loss | Voltage deviations | Cum/Soft | Primal-dual policy (III-A) | Each zone has a central control agent that embeds GCNs to improve the decision-making capability. The primal-dual method is used to rigorously satisfy voltage safety constraints. |
| [92] | Minimize active power loss | Voltage violations | Cum/Soft | Lagrangian relaxation (III-A) | A MACSAC RL algorithm is proposed, which is utilized to train control agents online, eliminating the need for accurate ADN models. |
| [47] | Active voltage control | SoC of BESSs | Ins/Hard | Physics-based shielding (III-E) | The physics-shielded MATD3 algorithm is proposed, capable of replacing dangerous actions with safe ones as the BESSs approach dangerous SoC. |
| [93] | Minimize the ADN power losses and control efforts | Voltage and power grid constraints | Ins/Hard | Safety layer (III-F) | A safety layer is directly integrated on top of the DDPG actor network to forecast changes in constrained states and prevent the violation of operational constraints in ADNs. |
| [94] | Minimize the network power loss | Nodal voltage constraint | Ins/Hard | Safety projection (III-B) | In the training stage, the safety projection is added to the combined policy to analytically solve an action correction formulation to achieve guaranteed 100% voltage security. |
| [25] | Minimize the cost of losses and the device switching | Voltage constraint | Ins/Soft | Lagrangian relaxation (III-A) | A safe off-policy DRL, Constrained SAC, is proposed to solve Volt-VAR control problems in a model-free manner. |
| [95] | Minimize the total control cost | Voltage constraint | Ins/Hard | Safety projection layer (III-B) | By leveraging the underlying grid information, a projection layer is designed to project the reactive power injection into a safe set of nodal voltage magnitudes. |
| [41] | Minimize the voltage deviation and control cost | Voltage constraint | Ins/Hard | Lyapunov function (III-C) | An explicitly constructed Lyapunov function is utilized to certify stability for all monotone policies without knowledge of the underlying model parameters. |
| [96] | Minimize the cost of electricity and BESSs maintenance | Voltage constraint and ADN constraints | Cum/Soft | SAC with safety module | A model-free DRL algorithm, integrated with a safety module, is proposed to minimize voltage violations and real power losses, with a design that guarantees no voltage violations occur during the online training. |
| [35] | Minimize the total operation costs | Physical constraints | Cum/Hard | CPO (III-B) | The voltage control problem is formulated as a CMDP and solved by TRPO and CPO to enable safe exploration. |
| [16] | Minimize voltage violations and network losses | Voltage bound constraints | Cum/Soft | Penalty function and RRL (III-H) | An adversarial RL algorithm has been developed to train an offline agent that is robust against model mismatches. |

Voltage control aims to keep voltage magnitudes across the power network close to their nominal values or within an acceptable range. For example, Fig. 9 shows the Volt/Var/Watt curves used for voltage control [97]. Instead of directly controlling the active and reactive power injections of smart inverters, some researchers have proposed resetting the Volt/Var/Watt curves to control the voltage profiles [98, 99]. The increasing penetration of RESs, such as the large-scale deployment of wind farms in transmission systems and the widespread installation of distributed PVs and EVs in distribution networks, has significantly changed power system behavior. Because distribution networks are typically radial or distributed in structure and connect a large number of intermittent and uncertain distributed RESs, voltage management has become more complex and challenging, often leading to voltage violations (either below 0.95 p.u. or above 1.05 p.u.) [100, 101]. Many current studies on voltage regulation use physical model-based optimization or control methods that employ convex relaxation techniques, such as second-order cone programming, to simplify the AC-PF constraints so that the problem can be solved efficiently with conventional solvers [33, 25, 102]. The applications of safe RL to voltage control are summarized in Table III. Based on Table III, we take the smart inverters of DGs and BESSs as a representative example to summarize the voltage control problem associated with safe RL. The state, action, reward, and constraints of voltage control are as follows:

Figure 9: The left figure shows the Volt/Var/Watt curves, and the right figure shows the feasible region of the inverter for any set of parameters $\bm{\beta}_{i}$, where the two regions in blue correspond to the charging and discharging modes indicated by $\eta$. It should be noted that for solar panels $\eta_{t}^{(i)}=0$; hence, the left region of the inverter is inactive.

IV-B1 Volt/Var Control with AC-PF Constraints

State

The state variables are represented by PMU measurements, with sensors installed at buses denoted by $\mathcal{N}^{\text{PMU}}$, or AMI measurements, with sensors installed at buses denoted by $\mathcal{N}^{\text{AMI}}$. Thus, the state variable $\bm{s}$ is comprehensively defined by:

$$\bm{s}^{\text{PMU}}\triangleq\left((v_{i})_{i\in\mathcal{N}^{\text{PMU}}},(i_{i})_{i\in\mathcal{N}^{\text{PMU}}}\right) \tag{21a}$$
$$\bm{s}^{\text{AMI}}\triangleq\left(({|v_{i}|}^{2})_{i\in\mathcal{N}^{\text{AMI}}},({|i_{i}|}^{2})_{i\in\mathcal{N}^{\text{AMI}}},(s_{ap,i})_{i\in\mathcal{N}^{\text{AMI}}}\right) \tag{21b}$$

The system dynamics that depict the environment can be formulated as

$$\bm{s}^{\text{V}}_{t+1}\triangleq\bm{f}(\bm{s}^{\text{V}}_{t},\bm{a}^{\text{V}}_{t}) \tag{22}$$
Action

The control actions include regulating the DGs, BESSs, and other components.

$$\bm{a}^{\text{V}}_{t}\triangleq\left(\bm{p}^{\text{DG}}_{t},\bm{q}^{\text{DG}}_{t},\bm{p}^{\text{BESS}}_{t},\bm{p}^{\text{other}}_{t}\right) \tag{23}$$
Reward

The reward is designed to keep the voltage magnitudes close to the nominal value $v_{\text{ref}}$ (typically 1.0 p.u.):

$$R^{\text{V}}(\bm{s},\bm{a})=-\|\bm{v}_{t}-v_{\text{ref}}\| \tag{24}$$

Another kind of reward design is a soft mechanism based on an acceptable range:

$$R^{\text{V}}(\bm{s},\bm{a})=-\sum_{i\in\mathcal{N}}\big([v_{i}-\overline{v}]_{+}+[\underline{v}-v_{i}]_{+}\big) \tag{25}$$
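The two reward designs (24) and (25) can be compared with a small Python sketch. The norm in (24) is not specified in the text, so the Euclidean norm is assumed here; the default band of 0.95 to 1.05 p.u. follows the violation thresholds quoted earlier in this section, and the function names are hypothetical.

```python
import math


def reward_tracking(v, v_ref=1.0):
    """Reward (24): negative norm of the deviation from v_ref.

    Assumption: the unspecified norm is taken to be the Euclidean norm.
    """
    return -math.sqrt(sum((vi - v_ref) ** 2 for vi in v))


def reward_band(v, v_min=0.95, v_max=1.05):
    """Reward (25): negative sum of violations outside [v_min, v_max]."""
    return -sum(max(vi - v_max, 0.0) + max(v_min - vi, 0.0) for vi in v)


# Illustrative voltage profile in p.u. (hypothetical values).
v_profile = [1.0, 1.04, 1.08, 0.93]
r_track = reward_tracking(v_profile)
r_band = reward_band(v_profile)
```

The tracking reward penalizes every deviation from the nominal value, whereas the band reward is zero whenever all voltages lie inside the acceptable range and only penalizes the excess beyond the limits.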
Constraint

The constraint for the active and reactive power injections of DGs is given by:

$$(\bm{p}^{\text{DG}})^{2}+(\bm{q}^{\text{DG}})^{2}\leq(\bar{\bm{s}}_{\text{ap}}^{\text{DG}})^{2} \tag{26}$$

However, [97] points out that the stability regions are more constrained than in Equation (26). For simplicity, we omit the specific equations. Figure 9 illustrates the piecewise linear equations that constrain the battery system's active and reactive power injections within the blue feasible region, while solar panel inverters operate only in the right region because they can only inject active power, i.e., $p\geq 0$.
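As a minimal illustration of how an action can be corrected to respect the apparent-power limit (26), the Python sketch below rescales an infeasible $(p, q)$ pair radially onto the capacity circle, which is the Euclidean projection onto a disc centered at the origin. It covers only the circular constraint (26), not the tighter piecewise linear region of [97], and the function name and numbers are hypothetical.

```python
import math


def project_to_capacity(p, q, s_max):
    """Project (p, q) onto the disc p^2 + q^2 <= s_max^2 of (26).

    Feasible points are returned unchanged; infeasible points are scaled
    radially onto the boundary, preserving the power factor angle.
    """
    s = math.hypot(p, q)
    if s <= s_max:
        return p, q
    scale = s_max / s
    return p * scale, q * scale


# Hypothetical inverter rated at 5 kVA; the raw action requests 10 kVA.
p_safe, q_safe = project_to_capacity(6.0, 8.0, s_max=5.0)
```

Radial scaling is only one possible choice; a practical inverter may instead prioritize active or reactive power when the limit binds.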

IV-B2 Volt/Var Control with LinDistFlow Constraints

The LinDistFlow linearized branch flow model is applied to a tree-structured distribution network. The system consists of a set of nodes $\mathcal{N}_{+0}=\{0,1,\cdots,N\}$ and an edge set $\mathcal{E}$. Node 0 is the substation, and $\mathcal{N}=\mathcal{N}_{+0}\setminus\{0\}$ denotes the set of nodes excluding the substation node. Each node $i\in\mathcal{N}$ is associated with an active power injection $p_{i}$ and a reactive power injection $q_{i}$. Let $v_{i}$ be the squared voltage magnitude, and let $\bm{p}$, $\bm{q}$, and $\bm{v}$ denote $\{p_{i},q_{i},v_{i}\}_{i\in\mathcal{N}}$ stacked into vectors. The variables satisfy the following equations, $\forall i\in\mathcal{N}$,

$$p_{i}=-p_{ji}+\sum_{k:(i,k)\in\mathcal{E}}p_{ik} \tag{27a}$$
$$q_{i}=-q_{ji}+\sum_{k:(i,k)\in\mathcal{E}}q_{ik} \tag{27b}$$
$$v_{i}=v_{j}-2(r_{ij}p_{ji}+x_{ij}q_{ji}) \tag{27c}$$

where $j$ is the parent node of $i$ in the distribution network. Equation (27c) can be written in vector form:

$$\bm{v}=\mathbf{r}\bm{p}+\mathbf{x}\bm{q}+v_{0}\mathbf{1}=\mathbf{x}\bm{q}+\bm{v}_{\text{env}} \tag{28}$$

where $\bm{v}_{\text{env}}=\mathbf{r}\bm{p}+v_{0}\mathbf{1}$ represents the component that cannot be controlled; $\mathbf{r}=[2r_{ij}]^{N\times N}$ and $\mathbf{x}=[2x_{ij}]^{N\times N}$ are matrices defined by the parameters $r_{ij}$ and $x_{ij}$, respectively.
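To make the recursion (27a)-(27c) concrete, the following Python sketch solves it on a small tree with a backward sweep for the branch flows and a forward sweep for the squared voltages. The data layout (a `parent` list with `parent[i] < i`, per-line parameters indexed by the child node) and all numerical values are assumptions chosen for illustration.

```python
def lindistflow_voltages(parent, r, x, p, q, v0=1.0):
    """Solve (27a)-(27c) on a tree rooted at the substation (node 0).

    parent[i]: parent of node i (entry 0 is ignored).
    r[i], x[i]: resistance and reactance of the line (parent[i], i).
    p[i], q[i]: nodal injections (entry 0 is ignored).
    Assumes nodes are numbered so that parent[i] < i.
    Returns the squared voltage magnitudes of all nodes.
    """
    n = len(parent)
    p_in, q_in = [0.0] * n, [0.0] * n      # flow from parent into node i
    p_out, q_out = [0.0] * n, [0.0] * n    # total flow from node i to children
    # Backward sweep, (27a)-(27b): p_ji = sum_k p_ik - p_i.
    for i in range(n - 1, 0, -1):
        p_in[i] = p_out[i] - p[i]
        q_in[i] = q_out[i] - q[i]
        p_out[parent[i]] += p_in[i]
        q_out[parent[i]] += q_in[i]
    # Forward sweep, (27c): voltage drop along each line.
    v = [v0] * n
    for i in range(1, n):
        v[i] = v[parent[i]] - 2.0 * (r[i] * p_in[i] + x[i] * q_in[i])
    return v


# Hypothetical 3-node feeder 0 - 1 - 2 with loads (negative injections), in p.u.
parent = [0, 0, 1]
r_line = [0.0, 0.01, 0.02]
x_line = [0.0, 0.02, 0.04]
p_inj = [0.0, -0.5, -1.0]
q_inj = [0.0, -0.2, -0.4]
v_sq = lindistflow_voltages(parent, r_line, x_line, p_inj, q_inj)
```

Because every step is linear in the injections, perturbing `q_inj` changes `v_sq` linearly, which is the property that the vector form (28) expresses.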

State

The state of LinDistFlow is also determined by PMU and AMI measurements, similar to the AC-PF (21).

Action

The control action is a mapping from the voltage to the reactive power, which is defined by:

$$\bm{a}^{\text{V}}_{t}=\Delta\bm{q}_{t}\triangleq\bm{q}_{t}-\bm{q}_{t+1} \tag{29}$$

The system dynamics can be given as

$$\bm{v}_{t+1}=\mathbf{r}\bm{p}+\mathbf{x}(\bm{q}_{t}-\bm{a}^{\text{V}}_{t})+v_{0}\mathbf{1} \tag{30}$$

where $\bm{p}$ has no time subscript because the control acts on a fast time scale, over which the active power injection is assumed to be constant.
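As a brief check on how (30) follows from the preceding definitions: by (29), the next reactive power set point is $\bm{q}_{t+1}=\bm{q}_{t}-\bm{a}^{\text{V}}_{t}$, and evaluating (28) at time $t+1$ with constant $\bm{p}$ gives

$$\bm{v}_{t+1}=\mathbf{r}\bm{p}+\mathbf{x}\bm{q}_{t+1}+v_{0}\mathbf{1}=\mathbf{r}\bm{p}+\mathbf{x}(\bm{q}_{t}-\bm{a}^{\text{V}}_{t})+v_{0}\mathbf{1}=\bm{v}_{t}-\mathbf{x}\bm{a}^{\text{V}}_{t}.$$

The last equality uses (28) at time $t$, i.e., $\bm{v}_{t}=\mathbf{r}\bm{p}+\mathbf{x}\bm{q}_{t}+v_{0}\mathbf{1}$. Under this linear model, each action therefore shifts the voltage profile by $-\mathbf{x}\bm{a}^{\text{V}}_{t}$.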

Reward

The reward is also designed to keep the voltage close to its nominal value (24) or within its maximum and minimum limits (25).

Constraint

The constraints include maximum and minimum value limits and the stability of the action:

$$\underline{\bm{a}}^{\text{V}}\leq\bm{a}^{\text{V}}_{t}\leq\overline{\bm{a}}^{\text{V}} \tag{31a}$$
$$\bm{a}^{\text{V}}_{t}~\text{is stabilizing} \tag{31b}$$
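The interplay between the incremental action (29), the dynamics (30), and the box limit (31a) can be sketched in Python as follows. The sketch uses the equivalent update $\bm{v}_{t+1}=\bm{v}_{t}-\mathbf{x}\bm{a}^{\text{V}}_{t}$, which follows from (28)-(30) when $\bm{p}$ is constant. The clipped proportional policy stands in for a learned policy, and the gain, the matrix `x`, and the limits are hypothetical values. The code does not certify the stability condition (31b); it only shows a rollout that happens to converge for this data.

```python
def clip_action(a, a_min, a_max):
    """Enforce the box limits (31a) element-wise."""
    return [min(max(ai, lo), hi) for ai, lo, hi in zip(a, a_min, a_max)]


def voltage_step(v, a, x):
    """v_{t+1} = v_t - x a_t, i.e. (30) with constant p rewritten via (28)."""
    n = len(v)
    return [v[i] - sum(x[i][k] * a[k] for k in range(n)) for i in range(n)]


def rollout(v0, x, gain, a_min, a_max, v_ref=1.0, steps=200):
    """Roll out a clipped proportional policy a_t = gain * (v_t - v_ref)."""
    v = list(v0)
    actions = []
    for _ in range(steps):
        a = clip_action([gain * (vi - v_ref) for vi in v], a_min, a_max)
        actions.append(a)
        v = voltage_step(v, a, x)
    return v, actions


# Illustrative two-bus data (hypothetical values, per unit).
x = [[0.04, 0.04],
     [0.04, 0.12]]
v_final, actions = rollout(v0=[1.06, 1.10], x=x, gain=5.0,
                           a_min=[-0.2, -0.2], a_max=[0.2, 0.2])
```

In this example the first actions saturate at the upper limit, after which the unsaturated closed loop is $\bm{v}_{t+1}-v_{\text{ref}}\mathbf{1}=(\mathbf{I}-5\mathbf{x})(\bm{v}_{t}-v_{\text{ref}}\mathbf{1})$ and the voltages settle at the reference.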

IV-B3 Safe RL for Voltage Control

In recent years, the integration of DERs such as rooftop solar panels and EVs has led to rapid and unpredictable fluctuations in the generation and load profiles of distribution systems. These fluctuations pose significant challenges for real-time voltage control in distribution grids. RL has emerged as a powerful approach for model-free nonlinear control problems, generating considerable interest in RL-based controllers that optimize the transient performance of voltage control. Safe RL has been applied to ensure adherence to voltage and transient stability constraints.

Looking ahead, the focus is shifting toward distributed voltage regulation, driven by the limitations of centralized voltage regulation, which requires a central controller and is susceptible to single-point failures and heavy communication burdens. Distributed voltage regulation, which only requires the exchange of local information with neighboring units, has therefore attracted considerable research interest as a promising direction for future development [17].

IV-C Stability Control

TABLE IV: Safe RL Applications in Stability Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [56] | Preventive control for transmission overload relief | Safety, generation, and network constraints | Cum/Hard | IPO (III-G) | The IPO method's efficacy is boosted by leveraging spatial-temporal correlations in power grid nodal and edge features. |
| [57] | Emergency control for under voltage load-shedding | Transient voltage stability | Cum/Hard | Barrier function (III-G) | The safe RL method employs a reward function with a time-dependent barrier function that approaches negative infinity as the system state nears the safety bounds. |
| [103] | Emergency load-shedding control | Rated capacity, current, voltage and others | Cum/Soft | Lagrangian relaxation (III-A) | Two DRL strategies are designed to tackle intricate power system control challenges in a data-driven manner, aiming to preserve power system stability. |
| [104] | Transient and steady-state voltage control | Reactive power capacity constraints | Ins/Hard | Lagrangian relaxation (III-A) and barrier function (III-G) | Based on the safe gradient flow framework, the design employs a control barrier function to ensure that given dynamics never leave a safe set. |
| [105] | Frequency control | Operational constraints | Cum/Soft | Safety model (III-F) | A safety model is proposed comprising two parts: one to check if actions meet safety standards, and another to suggest new actions if they don't. |
| [106] | Minimize the control cost | Frequency limit | Cum/Hard | Barrier function (III-G) | A novel self-tuning control barrier function is designed to actively compensate the unsafe frequency control strategies under variational safety constraints. |
| [107] | Primary frequency control | Frequency constraint | Ins/Hard | Gauge map (III-F) | A closed-form gauge map is proposed, which maps NN outputs from unsafe actions to the set of safe actions. |