Field Deployment of Multi-Agent Reinforcement Learning Based Variable Speed Limit Controllers

Yuhang Zhang^†, Zhiyao Zhang^†, Marcos Quiñones-Grueiro^†, William Barbour^†, Clay Weston^‡, Gautam Biswas^†, Daniel Work^†

^†Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN 37212, USA. ^‡Southwest Research Institute, San Antonio, TX 78238, USA.

The authors are grateful to Caliper for technical support on the TransModeler micro-simulation software used in this work. The authors would like to thank the Tennessee Department of Transportation (TDOT), Southwest Research Institute (SwRI), Arcadis, and Stantec, who assisted in this deployment, and Caleb Van Geffen and Josh Scherer from Vanderbilt University for their development work on AI-DSS. The contents of this report reflect the views of the authors, who are responsible for the facts and accuracy of the information presented herein. This work is supported by a grant from the U.S. Department of Transportation Grant Number 693JJ22140000Z44ATNREG3202. This material is based upon work supported by the National Science Foundation under Grant No. CNS-2135579 (Work). The U.S. Government assumes no liability for the contents or use thereof.

Abstract

This article presents the first field deployment of a multi-agent reinforcement learning (MARL) based variable speed limit (VSL) control system on the I-24 freeway near Nashville, Tennessee. We describe how we train MARL agents in a traffic simulator and directly deploy the simulation-based policy on a 17-mile stretch of Interstate 24 with 67 VSL controllers. We use invalid action masking and several safety guards to ensure that the posted speed limits satisfy the real-world constraints from the traffic management center and the Tennessee Department of Transportation. From its launch through April 2024, the system has made approximately 10,000,000 decisions affecting 8,000,000 trips. Analysis of the controller shows that the MARL policy determines the posted speed limit up to 98% of the time without intervention from the safety guards. Time-space diagrams of traffic speed and control commands illustrate how the algorithm behaves during rush hour. Finally, we quantify the domain mismatch between the simulation and real-world data and demonstrate the robustness of the MARL policy to this mismatch.

I Introduction

Infrastructure-based traffic control systems have been pivotal in managing road traffic long before the advent of vehicle-based control technologies. These systems, including traffic signal control, ramp metering, and variable speed limit (VSL), form the backbone of efforts to streamline traffic flow and enhance safety on roadways. VSL is unique in that it controls the mainline freeway flow by altering speed limits in response to real-time traffic conditions, thus aiming to reduce congestion and accidents [1, 2].

Historically, most deployed VSL control systems have been rule-based [3, 4]. These systems dynamically adjust speed limits based on predefined traffic characteristics, such as flow and density thresholds. The simplicity of rule-based systems contributes to their widespread adoption, as they do not require complex computational resources or extensive training data. However, such simplicity can be a limitation; they may not adapt to unforeseen traffic scenarios or optimize for multiple conflicting objectives, such as minimizing travel time while maximizing safety.

Figure 1: The MARL-based VSL control system on I-24 Westbound. The figure shows four consecutive gantries from a driver’s perspective when approaching a congestion tail. As drivers proceed, they encounter progressively reduced speed limits of 60, 50, 40, and 30 mph displayed on successive gantries, alerting them to the upcoming slow-down pattern.

Reinforcement learning (RL) is an emerging approach for decision and control in a variety of applications, ranging from strategic game playing and industrial robotics to complex decision-making [5]. Within the realm of traffic control, the ability of RL to learn and adapt from interaction with an environment makes it potentially promising for managing the dynamic and often unpredictable nature of road traffic.

Earlier studies have applied RL to VSL control in simulated environments, demonstrating its potential to outperform traditional methods by adapting to evolving traffic conditions and optimizing for multiple objectives simultaneously [6]. These results are promising, yet the transition from simulated environments to real-world applications is unexplored. This gap represents a critical barrier: while simulation offers a controlled setting to fine-tune algorithms, real-world traffic presents additional complexities such as varying driver behavior, diverse vehicle types, and unpredictable weather conditions, all of which can affect the performance of RL-based strategies. Consequently, real deployments can offer further insights about the potential of RL to work in operational traffic management systems.

The main contribution of this work is to describe and provide a preliminary assessment of the first field deployment of a multi-agent reinforcement learning (MARL) based VSL control system encompassing 67 VSL gantries on a 17-mile (each direction) segment of Interstate-24 (I-24) near Nashville, Tennessee, USA (Figure 1). Figure 2 overviews the deployment pipeline of our MARL-based VSL controllers. Specifically, our contributions are as follows:

• We train a MARL-based policy in a simulated environment where homogeneous agents are exposed to diverse traffic scenarios. The optimal policy, once derived, is subsequently tested in a different simulated environment with variable system parameters to assess its robustness and adaptability.
• We refine the optimal policy derived from simulation by incorporating invalid action masking and several safety guards designed to meet real-world constraints.
• We deploy the MARL-based VSL control algorithm in the field. Evaluation results indicate that the MARL-based policy autonomously makes up to 98% of the final decisions without any intervention from the safety guards.

Since its initial deployment on March 8, 2024, the MARL-based VSL control system has operated continuously, making decisions at 30-second intervals, 24 hours a day. It has generated over 10 million control decisions, impacting more than 8 million trips through the corridor. It continues to operate today.

The remainder of the article is organized as follows. Section II reviews related work on VSL field deployments and RL-based controller design. Section III presents our process for training in simulation and deploying on the live I-24 VSL system. Section IV describes the setup of I-24, where the VSL controllers are deployed. Section V provides preliminary results of the deployment. Section VI concludes the paper and outlines future directions.

II Related Work

II-A VSL Field Deployments

VSL systems were first proposed and deployed in the 1960s. Since then, various VSL control systems have been implemented across Europe, Australia, New Zealand, and North America [1, 7, 4, 8]. These deployments have demonstrated benefits in enhancing traffic safety and homogenizing traffic. For instance, a study in Belgium observed an 18% reduction in injury crashes and a 20% reduction in rear-end collisions following the implementation of VSL [9].

Most VSL systems employ rule-based control algorithms, where speed limits are dynamically adjusted based on predefined thresholds related to traffic characteristics. Due to its simplicity, this approach has been widely adopted in numerous deployments [3, 10]. Although model-based control algorithms have been proposed [11, 12], only a few have undergone empirical validation in real-world settings. Notably, the SPECIALIST algorithm, which is based on shock wave theory, demonstrated in simulations the capability to reduce travel times. Subsequently, it was implemented on a 14 km segment of the A12 freeway in the Netherlands, resolving shock waves in nearly 80% of cases when activated [13, 14]. Another instance is the implementation of a model predictive control (MPC) based VSL algorithm on Whitemud Drive in Edmonton, Canada, where preliminary results indicated improved average travel speeds [15].

II-B RL-based VSL Control

Over the past decade, RL-based VSL control algorithm design has gained significant attention within the traffic community [6] due to its ability to manage complex dynamic systems. The authors in [16] proposed a Q-learning based algorithm to enhance traffic efficiency, training a single VSL controller in a simulation setting. A lane-dependent VSL approach was explored in [17], where the authors evaluated an actor-critic based algorithm with various reward designs, including travel time, safety, and pollution.

In recent years, MARL has proven to be an effective approach for control in multi-agent systems. The study [18] developed a cooperative VSL control system using distributed RL within a vehicle-to-infrastructure (V2I) environment to optimize freeway traffic mobility and safety, significantly reducing total travel time and speed variances between freeway segments, which indicates a lower risk of rear-end collisions. In [19], the authors employed the MADDPG algorithm across four VSL controllers in a network with consecutive bottlenecks, designing a reward function to maintain bottleneck density below critical levels to avoid capacity drops and enhance traffic flow. Moreover, our previous work [20] introduced MARVEL, a MARL framework designed for large-scale VSL control across extensive freeway corridors. The policy derived from the MARVEL framework exhibited superior traffic mobility performance compared to baseline algorithms and demonstrated generalizability across varying traffic networks, demands, and compliance rates.

The above-mentioned methods have demonstrated remarkable performance in traffic simulators; however, none have been deployed on real freeway systems to validate their effectiveness.

II-C RL-based Field Experiments in Transportation

Most existing RL-based field experiments have focused on connected and automated vehicles (CAVs) due to their notable potential to stabilize traffic flow and reduce energy consumption [21]. Jang et al. [22] conducted zero-shot policy transfer experiments on a scaled testbed, finding that a policy with noise injected into the state and action space could achieve a 5% reduction in travel time in a roundabout scenario, compared to a policy without noise injection. Chalaki et al. [23] extended this method by integrating adversarial learning during training, demonstrating that the adversarially trained policy outperforms the Gaussian noise injection approach.

Lichtlé et al. [24] developed a pipeline that bypasses the tedious calibration of simulators by using real-world trajectory data to directly learn controllers. They successfully deployed their controller on actual vehicles in freeway traffic, highlighting its potential for energy savings. In a landmark study in November 2022, Jang et al. [25] deployed RL-based controllers on 100 vehicles driving on I-24 in Nashville, marking the largest field test of automated vehicles aimed at smoothing traffic flow.

III Methods

In this section, we briefly review how we formulate VSL control as a MARL problem and train the MARL policy in a microscopic traffic simulator, as discussed in our previous works [26, 20]. To guarantee that the real-world constraints from the transportation agency are met, we then apply invalid action masking to the MARL policy and introduce safety guards that override certain actions. Lastly, we present the final implemented control algorithm step by step.

III-A Problem Formulation

We consider a large-scale VSL control system in which multiple VSL controllers are placed at nearly even spacing along a long freeway segment. We formulate this as a cooperative MARL problem, which can be modeled as a Markov game defined by the tuple $\langle\{\mathcal{S}^{i}\}_{i\in\{1,\dots,n\}},\{\mathcal{A}^{i}\}_{i\in\{1,\dots,n\}},\{\mathcal{R}^{i}\}_{i\in\{1,\dots,n\}},P,n,\gamma\rangle$ for a total of $n$ agents, where $\mathcal{S}^{i}$ denotes the local state space for agent $i$, $\mathcal{A}^{i}$ denotes the action space for agent $i$, $\mathcal{R}^{i}$: $\{\mathcal{S}^{i}\}_{i\in\{1,\dots,n\}}\times\{\mathcal{A}^{i}\}_{i\in\{1,\dots,n\}}\times\{\mathcal{S}^{i}\}_{i\in\{1,\dots,n\}}\rightarrow\mathbb{R}$ denotes the reward for agent $i$, $P$: $\{\mathcal{S}^{i}\}_{i\in\{1,\dots,n\}}\times\{\mathcal{A}^{i}\}_{i\in\{1,\dots,n\}}\times\{\mathcal{S}^{i}\}_{i\in\{1,\dots,n\}}\rightarrow[0,1]$ denotes the transition probability of the environment from a given state to the next state, and $\gamma$ denotes the discount factor. The goal of MARL for each agent is to learn a policy that maximizes its own cumulative discounted reward:

$$J^{i}(\theta_{1},\dots,\theta_{n})=\mathbb{E}_{S_{t},A_{t}}\left[\sum_{t=0}^{T}\gamma^{t}r_{t}^{i}\right], \tag{1}$$

where $S_{t}$ denotes the global state concatenating all local states at time $t$ , $A_{t}=(a_{t}^{1},\dots,a_{t}^{n})$ denotes the joint action of all agents at time $t$ and $r_{t}^{i}$ the reward of agent $i$ at time $t$ . For the MARL components, we adopt the following system:

Agent: each VSL controller on a highway gantry is represented by an agent. To improve the scalability of the system, we consider the agents as homogeneous and they share the same parameters.

State Space: $s^{i}_{t}=\langle a_{t}^{i-1},\nu_{t}^{i},o_{t}^{i},\nu_{t}^{i+1},o_{t}^{i+1}\rangle$, where $a_{t}^{i-1}$ is the closest downstream agent’s intended action at time $t$, and $\nu_{t}^{i},o_{t}^{i},\nu_{t}^{i+1},o_{t}^{i+1}$ are the average traffic speed and the average traffic occupancy measured by the traffic sensors assigned to agent $i$ and to the closest upstream agent $i+1$, respectively. All input features are normalized to $[0,1]$. For $i=1$ (the most downstream agent), we set $a_{t}^{i-1}$ to the default maximum speed limit.

Action Space: $A^{i}=\langle 30,40,50,60,70\rangle$ mph, the set of speed limit values that satisfy field deployment requirements.

Reward: the reward function comprises three terms: adaptability, safety, and mobility. The adaptability term penalizes an agent for posting high speed limits when traffic is congested, which helps the agent identify the congestion state. The safety term encourages the agents to coordinate with each other to generate a slow-down speed profile upstream of the congestion tail. The mobility term encourages the agents to post a higher speed limit when traffic conditions allow. The reward function for agent $i$ at time $t$ is:

$$r_{t}^{i}=w_{a}r_{t}^{i,a}+w_{s}r_{t}^{i,s}+w_{m}r_{t}^{i,m} \tag{2}$$

where $r_{t}^{i,a},r_{t}^{i,s},r_{t}^{i,m}$ represent the adaptability, safety, and mobility terms, respectively, and $w_{a},w_{s},w_{m}$ represent the corresponding coefficients. For more details on the structure and design of the reward function, please refer to our previous work [20].
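The formulation above can be illustrated with a minimal Python sketch that assembles the local state, combines the three reward terms as in (2), and accumulates the discounted sum inside (1). The normalization constants, the default weights, and all function names are assumptions made for illustration only; the individual reward terms are defined in [20] and are treated here as given inputs.

```python
from typing import List, Sequence

# Assumed normalization constants; the text only states that features lie in [0, 1].
SPEED_SCALE_MPH = 70.0
OCC_SCALE_PCT = 100.0


def _unit(x: float) -> float:
    """Clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def build_state(a_down_mph: float, v_i: float, o_i: float,
                v_up: float, o_up: float) -> List[float]:
    """Local state <a^{i-1}, nu^i, o^i, nu^{i+1}, o^{i+1}>, each scaled to [0, 1]."""
    return [
        _unit(a_down_mph / SPEED_SCALE_MPH),
        _unit(v_i / SPEED_SCALE_MPH),
        _unit(o_i / OCC_SCALE_PCT),
        _unit(v_up / SPEED_SCALE_MPH),
        _unit(o_up / OCC_SCALE_PCT),
    ]


def combined_reward(r_a: float, r_s: float, r_m: float,
                    w_a: float = 1.0, w_s: float = 1.0, w_m: float = 1.0) -> float:
    """Weighted sum of adaptability, safety, and mobility terms, as in Eq. (2).

    The default weights are placeholders, not the values used in the deployment.
    """
    return w_a * r_a + w_s * r_s + w_m * r_m


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over one episode, the quantity inside Eq. (1)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

Because the agents are homogeneous and share parameters, the same `build_state` layout would apply to every gantry; only the sensor readings and the downstream action differ.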

III-B Training and Testing in Simulation

We use the microscopic simulation software TransModeler for all simulations used to train our VSL controllers. TransModeler allows driver compliance with the regulatory VSL system to be modeled. We set the compliance rate to 5% for the training scenarios, as we expect the compliance rate on the freeway to be relatively low.

We train our policy using the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm [27]. The training scenario is a seven-mile long freeway segment with four lanes on I-24 westbound in Nashville, USA. We implement eight VSL controllers at half-mile intervals upstream of an on-ramp merging area aimed at learning a cooperative policy with varying traffic conditions. A traffic sensor is co-located with each VSL controller to capture the traffic characteristics, with data updated every minute.

To induce traffic congestion, we set a single two-lane on-ramp merging area with a flow around 1000 veh/lane/hr. The simulation spans two hours, during which the mainstream inflow is initially set at 1850 veh/lane/hr for the first hour to induce congestion. For the second hour, this rate is reduced to half to alleviate the congestion. These variations in traffic speed and traffic density create a variety of traffic conditions, offering the homogeneous agents an extensive range of scenarios to navigate.

We test the learned policy on a 17-mile segment of I-24 in simulation, which replicates half of the targeted field network. We focus on the westbound traffic, encompassing 34 VSL controllers with one traffic sensor placed 0 to 0.2 miles downstream of each VSL controller, replicating the real conditions on I-24. We consider three testing scenarios that include multiple congestion events and various compliance rates. Our previous results in [20] demonstrate that the learned policy scales to a greater number of VSL controllers and generalizes to new environments with traffic settings different from the training scenario. Traffic under the control of the learned policy exhibits superior mobility performance compared to a state-of-the-practice control algorithm that was initially deployed on I-24, while maintaining a lower speed variation to improve safety.

III-C Real-World Constraints

In this section, we detail the real-world constraints pertinent to the intended deployment of the VSL control algorithm, along with our proposed solutions to ensure that our final control algorithm meets these criteria.

III-C1 Maximum Step-Down Constraint

The Manual on Uniform Traffic Control Devices (MUTCD) specifies a maximum permissible speed limit differential of 10 miles per hour (mph) between each pair of adjacent VSL controllers that are part of a group indicating a slow-down traffic pattern [28]. For instance, in the direction of traffic flow, a sequence of speed limits set at $[70,60,50]$ mph complies with the regulation but $[70,50,30]$ mph does not. Our safety reward term is designed to promote satisfaction of this constraint, but the learned policy may still violate it during testing [20].

To ensure adherence to this constraint, we implement a technique known as invalid action masking (IAM) [29]. This technique introduces a masking layer following the output of the policy network during the testing and deployment period, which effectively removes invalid actions. It thereby restricts the sampling process to the subset of valid actions, ensuring compliance with the specified speed limit differential. We define the invalid action set of agent $i$ at time $t$ according to the following equation:

$$I=\{a \mid a>a_{t}^{i-1}+a_{\text{diff}}\} \tag{3}$$

where $a_{t}^{i-1}$ is the closest downstream agent’s intended action at time $t$ , $a_{\text{diff}}$ is the maximum permissible speed limit differential for slowing down, which is $10$ mph.
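As a minimal sketch of the masking in (3), the snippet below removes invalid actions before normalizing the policy output into sampling probabilities. It assumes a discrete policy head that emits one logit per speed limit and a softmax over the valid logits; the function names are hypothetical and the deployed masking layer may differ in detail.

```python
import math
from typing import List, Sequence

ACTIONS_MPH = [30, 40, 50, 60, 70]  # action space from Section III-A
A_DIFF_MPH = 10                     # maximum permissible step-down differential


def valid_action_mask(a_downstream_mph: int) -> List[bool]:
    """True for actions outside the invalid set I = {a | a > a^{i-1} + a_diff}."""
    return [a <= a_downstream_mph + A_DIFF_MPH for a in ACTIONS_MPH]


def masked_probabilities(logits: Sequence[float], a_downstream_mph: int) -> List[float]:
    """Softmax restricted to valid actions; invalid actions get probability 0.

    Assumes one logit per entry of ACTIONS_MPH (an assumption about the policy head).
    """
    mask = valid_action_mask(a_downstream_mph)
    top = max(l for l, ok in zip(logits, mask) if ok)  # subtract max for stability
    weights = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]
```

Since the lowest action is 30 mph and the downstream action is never below 30 mph, at least the 30 and 40 mph actions always remain valid, so the normalization is well defined.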

III-C2 Speed-Matching Constraint

As an operating requirement from the transportation authority, the posted speed limits should not significantly deviate from actual traffic speeds. This requirement allows the speed limits to be easily explained to motorists, even if it prevents more exotic wave dissipation designs from being implemented.

Proposed Safety Guard: To align with this requirement, we implemented a mapping function applied to certain outputs generated by the learned policy. This function is defined as follows:

$$V=\begin{cases}\text{clip}(30,\min(a_{t}^{i-1}+a_{\text{diff}},f(\nu_{t}^{i})),70)&\text{if } a_{t}^{i}=30\\ \text{clip}(30,f(\nu_{t}^{i}),70)&\text{if } a_{t}^{i}=70 \text{ and } o_{t}^{i}\geq o_{\text{thred}}\end{cases} \tag{4}$$

where $\text{clip}(a,\cdot,b)$ is a clip function with minimum bound $a$ and maximum bound $b$, $f(\cdot)$ maps its input to the nearest multiple of 10 that is greater than the input, and $o_{\text{thred}}$ is the occupancy threshold used to determine whether to apply this mapping when an agent selects 70 mph.
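A minimal Python sketch of the speed-matching guard in (4) follows. It assumes that the measured speed is given in mph, that policy outputs other than 30 and 70 mph pass through unchanged, and that an input that is already a multiple of 10 maps to the next multiple (a literal reading of "greater than the input"). The threshold value is not stated in the text, so `o_thred` is left as a required parameter; all function names are hypothetical.

```python
import math

A_DIFF_MPH = 10  # maximum permissible step-down differential


def round_up_10(v: float) -> int:
    """f(.) in Eq. (4): next multiple of 10 strictly above v (boundary handling assumed)."""
    return 10 * (math.floor(v / 10) + 1)


def clip(lo: float, x: float, hi: float) -> float:
    """Clip x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def speed_match(a: int, a_down: int, v: float, o: float, o_thred: float) -> float:
    """Apply Eq. (4) to the policy output a for one gantry.

    a_down is the downstream agent's intended action, v the measured speed (mph),
    and o the measured occupancy on the same scale as o_thred.
    """
    if a == 30:
        return clip(30, min(a_down + A_DIFF_MPH, round_up_10(v)), 70)
    if a == 70 and o >= o_thred:
        return clip(30, round_up_10(v), 70)
    return a  # assumed pass-through for all other cases
```

In the first case the `min` with `a_down + A_DIFF_MPH` keeps the adjusted value within the step-down limit relative to the downstream gantry; in the second case the adjusted value can only be at or below 70 mph.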

III-C3 Maximum Speed-Limit Constraint

The maximum allowable speed limit on a freeway segment is determined by several factors, including geometric design and safety considerations. Consequently, maximum speed limits may vary across different segments. Currently, the majority of segments within the targeted field network are subject to a maximum speed limit of 70 mph, while others are capped at 65 or 55 mph.

Proposed Safety Guard: To ensure adherence to this constraint while maintaining a homogeneous MARL setting for scalability, we apply a clip function so that the posted speed limit stays within the allowable range. Specifically, for any generated speed limit $V$, we apply the following equation:

$$V^{\prime}=\min(V,V_{\text{max}}) \tag{5}$$

where $V_{\text{max}}$ is the allowable maximum speed limit and $V^{\prime}$ is the clipped speed limit that satisfies this constraint.
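The correction in (5) is applied gantry by gantry, which can be sketched in a few lines of Python. The function name and the example cap values are illustrative assumptions; only the 70, 65, and 55 mph caps are mentioned in the text.

```python
from typing import List, Sequence


def apply_max_speed_limit(speeds: Sequence[float], v_max: Sequence[float]) -> List[float]:
    """Eq. (5): cap each generated speed limit at its segment's allowable maximum."""
    if len(speeds) != len(v_max):
        raise ValueError("one cap is required per gantry")
    return [min(v, cap) for v, cap in zip(speeds, v_max)]


# Illustrative caps for four gantries (not the actual corridor layout).
capped = apply_max_speed_limit([70, 70, 60, 50], [70, 65, 55, 55])
```

Keeping the per-segment cap outside the policy lets all agents share one action space, which is what preserves the homogeneous setting.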

III-C4 Debounce Constraint

A bounce is defined as a spatial sequence of speed limits at the same timestamp where all intermediate speed limits are higher than both the first and last speed limits in the sequence, which are referred to as boundary speed limits. The order of a bounce is defined as the number of intermediate speed limits in the bounce sequence. For instance, within the direction of traffic flow, a sequence of $[30,60,50]$ is a bounce with order $1$ while a sequence of $[30,60,50,40]$ is a bounce with order $2$ . As per local design requirements, the deployed algorithm should not generate any bounce with order $1$ .

Proposed Safety Guard: To align with this requirement, we iterate over all intended speed limits and identify every bounce of order $1$. We apply the following equation to override the intermediate speed limit of each identified bounce:

$$V^{\prime\prime}=\min(V_{d}^{\prime},V_{u}^{\prime}) \tag{6}$$

where $V_{d}^{\prime},V_{u}^{\prime}$ are the two boundary speed limits and $V^{\prime\prime}$ is the corrected speed limit that satisfies the debounce constraint.
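The debounce guard can be sketched as a single pass over the spatial sequence of speed limits. This sketch assumes that bounces are identified on the input sequence and then overridden according to (6), without re-scanning for bounces that a correction might newly create; the text does not specify whether corrections cascade. The function name is hypothetical.

```python
from typing import List, Sequence


def debounce_order_1(limits: Sequence[float]) -> List[float]:
    """Override the middle value of every order-1 bounce with Eq. (6).

    An order-1 bounce is three consecutive gantries whose middle speed limit is
    strictly higher than both neighbors. Detection uses the input sequence only
    (an assumption), so corrections do not affect one another.
    """
    corrected = list(limits)
    for j in range(1, len(limits) - 1):
        if limits[j] > limits[j - 1] and limits[j] > limits[j + 1]:
            corrected[j] = min(limits[j - 1], limits[j + 1])
    return corrected
```

For the example from the text, `debounce_order_1([30, 60, 50])` lowers the intermediate 60 mph to 30 mph, the smaller of the two boundary values.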

III-D Algorithm Integration

In this section, we explain the general pipeline of the deployed algorithm, from data preprocessing to the final outputs. The architecture of the deployed algorithm for a set of gantries in one direction of travel is shown in Figure 3. This algorithm has four steps as follows:

• Step 1: Process all sensor data to interpolate missing values and to determine the critical downstream sensor for each VSL controller. This critical sensor provides the state inputs in Step 2.
• Step 2: For each VSL controller, evaluate the MARL policy with the state inputs described in Section III-A. With invalid action masking, the output of the policy network satisfies the maximum step-down constraint. This output then passes through the speed-matching module for any necessary adjustments, and the updated output becomes part of the state inputs of the next upstream VSL controller. The VSL controllers are processed in order, starting with the most downstream gantry; the output of this step is a set of initial speed limits that are corrected in later steps. This step is responsible for satisfying the maximum step-down and speed-matching constraints.
• Step 3: Process all VSL controllers (starting from the most downstream gantry) to make maximum speed limit corrections according to (5). This step is responsible for satisfying the maximum speed limit constraint.
• Step 4: Process all VSL controllers again (starting from the most downstream gantry) to identify any violations of the debounce constraint, and correct them with the debounce logic in (6) to generate the final speed limits to be posted. This step is responsible for satisfying the debounce constraint.
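Steps 2 to 4 above can be sketched end to end as follows. This is a simplified illustration under several assumptions, not the deployed AI-DSS code: Step 1 is taken as already done (inputs are cleaned readings per gantry), speeds are in mph, the policy is any callable that picks one valid action, the most upstream gantry reuses its own sensor as its "upstream" reading, and exact multiples of 10 round up to the next multiple. `toy_policy` is a placeholder rule standing in for the trained network.

```python
import math
from typing import Callable, List, Sequence

ACTIONS_MPH = [30, 40, 50, 60, 70]
A_DIFF_MPH = 10


def round_up_10(v: float) -> int:
    """f(.) in Eq. (4): next multiple of 10 strictly above v (boundary handling assumed)."""
    return 10 * (math.floor(v / 10) + 1)


def clip(lo: float, x: float, hi: float) -> float:
    return max(lo, min(x, hi))


def control_step(speeds: Sequence[float], occs: Sequence[float],
                 v_max: Sequence[float], policy: Callable[..., int],
                 o_thred: float, default_max: int = 70) -> List[float]:
    """One control cycle for one direction; index 0 is the most downstream gantry."""
    n = len(speeds)
    # Step 2: masked policy evaluation plus speed matching, downstream first.
    initial: List[float] = []
    a_down = default_max  # the most downstream agent sees the default maximum
    for i in range(n):
        up = min(i + 1, n - 1)  # assumption: last gantry reuses its own sensor
        valid = [a for a in ACTIONS_MPH if a <= a_down + A_DIFF_MPH]  # Eq. (3)
        a = policy(a_down, speeds[i], occs[i], speeds[up], occs[up], valid)
        if a == 30:  # Eq. (4), first case
            a = clip(30, min(a_down + A_DIFF_MPH, round_up_10(speeds[i])), 70)
        elif a == 70 and occs[i] >= o_thred:  # Eq. (4), second case
            a = clip(30, round_up_10(speeds[i]), 70)
        initial.append(a)
        a_down = a  # the adjusted output feeds the next upstream agent
    # Step 3: per-segment maximum speed limit, Eq. (5).
    capped = [min(v, cap) for v, cap in zip(initial, v_max)]
    # Step 4: order-1 debounce, Eq. (6), detected on the capped sequence.
    final = list(capped)
    for j in range(1, n - 1):
        if capped[j] > capped[j - 1] and capped[j] > capped[j + 1]:
            final[j] = min(capped[j - 1], capped[j + 1])
    return final


def toy_policy(a_down, v, o, v_up, o_up, valid):
    """Placeholder for the trained network: highest valid limit near the measured speed."""
    near = [a for a in valid if a <= round_up_10(v)]
    return max(near) if near else min(valid)


limits = control_step(
    speeds=[65, 62, 25, 28, 66],
    occs=[0.1, 0.1, 0.5, 0.5, 0.1],
    v_max=[55, 55, 70, 70, 70],
    policy=toy_policy,
    o_thred=0.3,
)
```

In this toy run the two most downstream gantries are capped at 55 mph by Step 3, while the gantry upstream of the slow traffic is held to 40 mph by the action mask, producing a gradual step-down for approaching drivers.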

IV Experimental Setup

In this section, we provide a detailed overview of the deployment, which is known as the I-24 SMART Corridor. We also describe the software infrastructure, which is known as the Artificial Intelligence Decision Support System (AI-DSS). The AI-DSS supports the implementation of our MARL-based VSL control algorithm, amongst other decision support functionalities not described in this work.

IV-A I-24 SMART Corridor

The I-24 SMART Corridor is the first Integrated Corridor Management (ICM) project in Tennessee, and it includes a set of strategies to manage traffic on freeways and arterials between downtown Nashville and the city of Murfreesboro. The freeway segment experiences an Annual Average Daily Traffic (AADT) in excess of 160,000 vehicles, with peak hours marked by significant congestion and frequent stop-and-go patterns [30].

To improve traffic safety and travel time reliability, the I-24 SMART Corridor integrates multiple Active Traffic Management (ATM) strategies, including VSL, a lane control system, and arterial signal integration. Currently, the I-24 SMART Corridor has 34 VSL gantries deployed on I-24 westbound and 33 VSL gantries on I-24 eastbound, spanning 17 miles from mile marker 53 to mile marker 70. In this area, 60 Radar Detection System (RDS) sensors have been installed or upgraded to monitor traffic performance and provide state inputs to our MARL-based control algorithm at a 30-second interval. This interval is shorter than the one-minute update assumed in training, because the data rates changed as the system evolved. Figure 4 displays a map of the VSL deployment segment.

IV-B AI-DSS

To implement MARL-based controllers into the production active traffic management software (SmartWay CS) used in the regional Traffic Management Center (TMC), we created a software stack known as the AI-DSS [31]. Figure 6 presents the workflow of the communications between AI-DSS and the TMC. The TMC operator monitors the corridor conditions and records relevant incident information in SmartWay CS. An API in SmartWay CS allows bidirectional communications with the AI-DSS over the TCP/IP protocol using websockets. Based on the real-time traffic information from SmartWay CS, the AI-DSS implements the MARL-based control algorithm and provides the speed limits to be posted back to SmartWay CS. SmartWay CS verifies that the speed limits do not violate any constraints, and posts the speed limits to the gantries on the roadway.

The AI-DSS is implemented in Python because of its extensive library support for multi-processing, websocket connectivity, database logging, and the execution of MARL-based policies. Currently, five separate environments are designated for the AI-DSS: development, testing, production mirror at Vanderbilt, demo, and production at the Tennessee Department of Transportation (TDOT). The development environment is used for debugging with real-time data. After comprehensive testing of the AI-DSS with the MARL-based VSL control algorithm, it is deployed to the TDOT production environment for real-time traffic control.

V Results

In this section, we present the MARL-based VSL control algorithm deployment results. First, we show the control algorithm behavior from a random morning peak hour. Next, we display the effective control time of the MARL-based policy (with IAM) and the safety guards in the algorithm. Finally, we quantify the domain mismatch between the simulation and real-world observations and demonstrate the robustness of the learned policy.

V-A Algorithm Behavior

Figure 5 (a) shows the time-space diagram of the average traffic speed during the morning peak hour on I-24 Westbound on Monday, April 22, 2024. The x-axis represents time and the y-axis represents mile markers of the 17-mile segment of I-24, where traffic moves upward along the y-axis toward downtown Nashville. With colors denoting the traffic speed recorded by RDS sensors, Figure 5 (a) exhibits a typical morning rush hour congestion pattern on the selected I-24 segment, with the first congestion wave occurring at 5:30 AM.

Figure 5 (b) displays the time-space diagram for the 34 VSL gantries on I-24 Westbound, which are controlled by the MARL-based VSL algorithm described in Section III at 30-second intervals, with the same time and space ranges as in Figure 5 (a). Note that the 6 consecutive VSLs closest to downtown Nashville have smaller maximum speed limits than the rest of the VSLs, as determined by TDOT to improve traffic safety. To take a closer look at the role of the MARL-based policy (with IAM) in our control algorithm, Figure 5 (c) presents the same diagram as Figure 5 (b) but with all safety guard interventions masked in white.

The behavior of the algorithm can be described in terms of three traffic regimes: congestion, free-flow, and transition. In Figure 5 (a), the congestion regime appears as the dark red area, the free-flow regime as the dark green area, and the transition regime as the alternating yellow, orange, and light green areas. The MARL-based policy (with IAM) identifies the congestion and free-flow regimes most of the time, owing to the adaptability and mobility reward terms and the informative state space design. We divide the transition regime into three categories: free-flow to congestion (F-C), congestion to congestion (C-C), and congestion to free-flow (C-F). Comparing Figure 5 (c) with Figure 5 (a), we observe that the MARL-based policy (with IAM) generates a smooth slow-down speed profile for F-C, thanks to the safety reward term and invalid action masking. However, the Speed-Matching and Debounce safety guards are still needed for C-C and C-F to satisfy the authority's requirements. Finally, the white region at the top of Figure 5 (c) corresponds to the 6 VSLs with smaller maximum speed limits, for which the Maximum Speed Limit Correction safety guard is triggered.

Finally, to better understand how the deployed algorithm behaves from a driver’s perspective, we generate the trajectories of 3 simulated vehicles according to the RDS speed data from Figure 5 (a). Figure 5 (d) shows the time series of travel speed and the corresponding speed limits for each simulated vehicle. Starting from 6 AM, Vehicle 1 encounters multiple stop-and-go waves along its journey. Meanwhile, the VSL informs Vehicle 1 of the incoming slow-down or speed-up patterns in advance most of the time, as can be seen from the time lag between the blue-dashed line and the orange line in Figure 5 (d). With a later starting time and a longer travel time, Vehicle 2 and Vehicle 3 encounter a standstill congestion pattern, during which the VSL acts in advance to provide a slow-down warning signal to the vehicles, aiming to prevent sudden braking or at least to inform upstream drivers of the downstream traffic condition.
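One way to generate such a simulated trajectory is to step a virtual vehicle forward through the gridded speed field, moving it at the speed of whichever time-space cell it currently occupies and recording the posted limit in that cell. The sketch below is illustrative only and is not the authors' implementation: the function name, the grid layout (`speed_mph[i][j]` for time bin `i` and space cell `j`), the convention that position increases in the direction of travel, the resampling of posted limits onto the same grid, and the minimum-speed floor are all assumptions introduced here.

```python
from bisect import bisect_right


def virtual_trajectory(speed_mph, limit_mph, mile_edges, t_step_s,
                       start_s, dt_s=1.0, min_speed_mph=1.0):
    """Trace a virtual vehicle through a time-space speed grid.

    speed_mph[i][j] -- measured speed in time bin i, space cell j (assumed layout)
    limit_mph[i][j] -- posted speed limit on the same grid (assumed resampling)
    mile_edges      -- increasing cell boundaries in miles, len = n_cells + 1
    t_step_s        -- width of one time bin in seconds
    Returns a list of (time_s, position_mi, speed_mph, limit_mph) samples.
    """
    t, x = float(start_s), float(mile_edges[0])
    end_s = len(speed_mph) * t_step_s
    samples = []
    while x < mile_edges[-1] and t < end_s:
        i = int(t // t_step_s)                 # current time bin
        j = bisect_right(mile_edges, x) - 1    # current space cell
        # Floor the speed so the vehicle cannot stall forever in a zero-speed cell.
        v = max(speed_mph[i][j], min_speed_mph)
        samples.append((t, x, v, limit_mph[i][j]))
        x += v * dt_s / 3600.0                 # mph * seconds -> miles
        t += dt_s
    return samples
```

Plotting the third and fourth elements of each sample against time gives a speed and speed-limit time series of the kind shown in Figure 5 (d); calling the function with different `start_s` values yields vehicles with different departure times.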

TABLE I: The daily effectiveness percentage (AVG $\pm$ STD) of MARL-Policy with IAM (Policy), Speed-Matching (SM), Maximum Speed Limit Correction (MSLC), and Debounce (DB). Note that “I” and “E” refer to including and excluding gantries with a custom max speed limit, “WB” and “EB” refer to “Westbound” and “Eastbound”, and “PH” refers to “Peak Hour”.

	Dataset	Policy (%)	SM (%)	MSLC (%)	DB (%)
I	I-24 WB	81.3 $\pm$ 0.8	1.8 $\pm$ 1.1	16.1 $\pm$ 1.1	0.8 $\pm$ 0.5
	I-24 WB PH	78.3 $\pm$ 2.1	7.4 $\pm$ 1.3	10.5 $\pm$ 2.1	3.8 $\pm$ 0.5
	I-24 EB	87.3 $\pm$ 0.9	2.8 $\pm$ 1.6	9.6 $\pm$ 1.2	0.3 $\pm$ 0.2
	I-24 EB PH	84.3 $\pm$ 2.9	12.8 $\pm$ 2.7	1.2 $\pm$ 1.2	1.7 $\pm$ 0.6
E	I-24 WB	98.4 $\pm$ 1.0	1.3 $\pm$ 0.8	0	0.3 $\pm$ 0.2
	I-24 WB PH	93.1 $\pm$ 1.5	5.1 $\pm$ 1.1	0	1.8 $\pm$ 0.4
	I-24 EB	97.5 $\pm$ 1.7	2.2 $\pm$ 1.5	0	0.3 $\pm$ 0.2
	I-24 EB PH	86.6 $\pm$ 3.7	11.8 $\pm$ 3.2	0	1.6 $\pm$ 0.7

V-B Control Effectiveness Analysis

Given a dataset collected from March 8, 2024 to April 24, 2024 with 8,923,106 decisions for 67 gantries, Table I reports the share of time during which the MARL-based policy with IAM (Policy) is implemented directly, the share of time during which Speed-Matching (SM) is used to correct the Policy, and the share of time during which Maximum Speed Limit Correction (MSLC) and Debounce (DB) are used for final adjustments.

On average, the Policy is in control 81.3% of the time on I-24 Westbound and 87.3% of the time on I-24 Eastbound each day, across all 67 VSL gantries. To further understand the situation during peak hours, we analyzed the morning peak hour on I-24 Westbound and the afternoon peak hour on I-24 Eastbound, during which the share of time controlled by the Policy drops slightly, to 78.3% and 84.3%, respectively. It is worth noting that the gantries with a customized max speed limit trigger MSLC constantly when traffic is in free flow. We therefore remove the 10 gantries with a custom max speed limit, i.e., 6 from Westbound and 4 from Eastbound, and show the results in the bottom part of Table I. Among the remaining 57 gantries, which share the same max speed limit of 70 mph, the Policy is in control 93.1% of the time during the westbound morning peak hour and 86.6% of the time during the eastbound afternoon peak hour.
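The AVG $\pm$ STD entries of Table I can be read as per-day shares aggregated across days. The following is a minimal sketch of that aggregation, not the authors' analysis code. It assumes a hypothetical decision log in which every decision is attributed to exactly one component (consistent with each row of Table I summing to 100%), and it uses the population standard deviation, since the source does not state which estimator was used.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

COMPONENTS = ("Policy", "SM", "MSLC", "DB")


def daily_effectiveness(records):
    """Aggregate per-decision labels into daily shares (AVG, STD) in percent.

    records -- iterable of (day, component) pairs, one per VSL decision,
               where component names the module that set the final output
               (hypothetical log format).
    """
    per_day = defaultdict(Counter)
    for day, component in records:
        per_day[day][component] += 1

    shares = {c: [] for c in COMPONENTS}
    for counts in per_day.values():
        total = sum(counts.values())
        for c in COMPONENTS:
            # Share of that day's decisions attributed to component c.
            shares[c].append(100.0 * counts[c] / total)

    # Mean and spread of the daily shares across all days in the log.
    return {c: (mean(v), pstdev(v)) for c, v in shares.items()}
```

Filtering `records` before the call, for example to peak-hour timestamps or to gantries without a custom max speed limit, would produce the other rows of the table under the same assumptions.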

V-C Domain Mismatch Quantification

One of the challenges that impede the successful transfer of simulation-based RL policies to real-world deployment is domain mismatch, where the real-world observations may not overlap with the ones seen in simulation. To quantify this mismatch, we calculate the Wasserstein Distance [32] between observation samples from simulation and from the real world. In detail, we conduct 10 random experiments in the simulation testing environment described in Section III. We then randomly sample 1000 observation data points from each experiment to generate 10 datasets on the simulation side, i.e., sim1 to sim10. Similarly, we sample 1000 observations of the I-24 Westbound peak hour from each of 10 randomly selected days to generate 10 real-world datasets.

Figure 7 presents the heatmap of the Wasserstein Distance between every pair of the aforementioned 20 datasets. The heatmap is symmetric and demonstrates three facts. First, the observation distributions of different simulation experiments are very close to each other, as can be seen from the bottom right part of Figure 7. Second, the observation distribution of the real world shifts from day to day, as seen from the top left of Figure 7, indicating that the traffic pattern varies across days. Third, the distance between real-world and simulation observations is larger than that within the real-world datasets or within the simulation datasets. Despite this domain mismatch, the learned MARL-based policy (with IAM) demonstrates robust performance, as can be seen from Table I.
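To illustrate the kind of pairwise comparison behind such a heatmap, the sketch below computes the 1-Wasserstein distance between equally sized samples of a single scalar feature, for which the distance reduces to the mean absolute difference between the sorted samples. This is a simplification introduced here: the paper cites the POT library [32], and its observations are not necessarily one-dimensional, so the function names and the scalar-feature restriction are assumptions rather than the authors' procedure.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples.

    With uniform weights and equal sample sizes, the optimal transport plan
    matches the k-th smallest value of xs to the k-th smallest value of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def pairwise_distances(datasets):
    """Distance for every ordered pair of named datasets (a symmetric matrix).

    datasets -- dict mapping a dataset name (e.g. "sim1", "real1") to a list
                of scalar observations (hypothetical naming).
    """
    names = list(datasets)
    return {(a, b): wasserstein_1d(datasets[a], datasets[b])
            for a in names for b in names}
```

Arranging the returned values in a grid ordered by dataset name gives a symmetric matrix with a zero diagonal, in which blocks of small values indicate groups of datasets with similar observation distributions.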

VI Conclusions

This work describes the first field deployment of a MARL-based VSL control system, operating on the I-24 freeway in Nashville, TN, where it continues to run today. The preliminary results demonstrate that it is possible to deploy a simulation-based MARL policy in the real world with safety guards. The safety guards are needed but are active for only a small portion of the time compared to the RL policy, demonstrating the potential for further RL-based deployments on infrastructure systems. As the system continues to run, we expect to be able to provide more datasets and analysis of the performance of the algorithm with respect to traditional safety and performance measures on the corridor.

References

[1] X.-Y. Lu and S. E. Shladover, “Review of variable speed limits and advisories: Theory, algorithms, and practice,” Transportation Research Record, vol. 2423, no. 1, pp. 15–23, 2014.
[2] B. Khondaker and L. Kattan, “Variable speed limit: an overview,” Transportation Letters, vol. 7, no. 5, pp. 264–278, 2015.
[3] L. Elefteriadou, S. S. Washburn, Y. Yin, V. Modi, C. Letter, et al., “Variable speed limit (VSL)-best management practice,” tech. rep., University of Florida. Transportation Research Center, 2012.
[4] M. Robinson et al., “Examples of variable speed limit applications: Speed management workshop,” 2000.
[5] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[6] K. Kušić, E. Ivanjko, M. Gregurić, and M. Miletić, “An overview of reinforcement learning methods for variable speed limit control,” Applied Sciences, vol. 10, no. 14, p. 4917, 2020.
[7] C. Han, J. Luk, V. Pyta, and P. Cairney, Best practice for variable speed limits: Literature review. No. AP-R342/09, 2009.
[8] M. Papageorgiou, E. Kosmatopoulos, and I. Papamichail, “Effects of variable speed limits on motorway traffic flow,” Transportation Research Record, vol. 2047, no. 1, pp. 37–48, 2008.
[9] E. De Pauw, S. Daniels, L. Franckx, and I. Mayeres, “Safety effects of dynamic speed limits on motorways,” Accident Analysis & Prevention, vol. 114, pp. 83–89, 2018.
[10] Y. Zhang, M. Quinones-Grueiro, W. Barbour, C. Weston, G. Biswas, and D. Work, “Quantifying the impact of driver compliance on the effectiveness of variable speed limits and lane control systems,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pp. 3638–3644, IEEE, 2022.
[11] R. C. Carlson, I. Papamichail, M. Papageorgiou, and A. Messmer, “Optimal mainstream traffic flow control of large-scale motorway networks,” Transportation Research Part C: Emerging Technologies, vol. 18, no. 2, pp. 193–212, 2010.
[12] M. Yu and W. Fan, “Optimal variable speed limit control at a lane drop bottleneck: Genetic algorithm approach,” Journal of Computing in Civil Engineering, vol. 32, no. 6, p. 04018049, 2018.
[13] A. Hegyi, S. P. Hoogendoorn, M. Schreuder, H. Stoelhorst, and F. Viti, “Specialist: A dynamic speed limit control algorithm based on shock wave theory,” in 2008 11th International IEEE Conference on Intelligent Transportation Systems, pp. 827–832, IEEE, 2008.
[14] A. Hegyi and S. P. Hoogendoorn, “Dynamic speed limit control to resolve shock waves on freeways-field test results of the specialist algorithm,” in 13th International IEEE Conference on Intelligent Transportation Systems, pp. 519–524, IEEE, 2010.
[15] X. Wang, M. Seraj, Y. Bie, T. Z. Qiu, and L. Niu, “Implementation of variable speed limits: Preliminary test on Whitemud Drive, Edmonton, Canada,” Journal of Transportation Engineering, vol. 142, no. 12, p. 05016007, 2016.
[16] Z. Li, P. Liu, C. Xu, H. Duan, and W. Wang, “Reinforcement learning-based variable speed limit control strategy to reduce traffic congestion at freeway recurrent bottlenecks,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 11, pp. 3204–3217, 2017.
[17] Y. Wu, H. Tan, L. Qin, and B. Ran, “Differential variable speed limits control for freeway recurrent bottlenecks via deep actor-critic algorithm,” Transportation Research Part C: Emerging Technologies, vol. 117, p. 102649, 2020.
[18] C. Wang, J. Zhang, L. Xu, L. Li, and B. Ran, “A new solution for freeway congestion: Cooperative speed limit control using distributed reinforcement learning,” IEEE Access, vol. 7, pp. 41947–41957, 2019.
[19] S. Zheng, M. Li, Z. Ke, Z. Li, et al., “Coordinated variable speed limit control for consecutive bottlenecks on freeways using multiagent reinforcement learning,” Journal of Advanced Transportation, vol. 2023, 2023.
[20] Y. Zhang, M. Quinones-Grueiro, Z. Zhang, Y. Wang, W. Barbour, G. Biswas, and D. Work, “MARVEL: Multi-agent reinforcement-learning for large-scale variable speed limits,” arXiv preprint arXiv:2310.12359, 2023.
[21] R. E. Stern, S. Cui, M. L. Delle Monache, R. Bhadani, M. Bunting, M. Churchill, N. Hamilton, H. Pohlmann, F. Wu, B. Piccoli, et al., “Dissipation of stop-and-go waves via control of autonomous vehicles: Field experiments,” Transportation Research Part C: Emerging Technologies, vol. 89, pp. 205–221, 2018.
[22] K. Jang, E. Vinitsky, B. Chalaki, B. Remer, L. Beaver, A. A. Malikopoulos, and A. Bayen, “Simulation to scaled city: zero-shot policy transfer for traffic control via autonomous vehicles,” in Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, pp. 291–300, 2019.
[23] B. Chalaki, L. E. Beaver, B. Remer, K. Jang, E. Vinitsky, A. M. Bayen, and A. A. Malikopoulos, “Zero-shot autonomous vehicle policy transfer: From simulation to real-world via adversarial learning,” in 2020 IEEE 16th International Conference on Control & Automation (ICCA), pp. 35–40, IEEE, 2020.
[24] N. Lichtlé, E. Vinitsky, M. Nice, B. Seibold, D. Work, and A. M. Bayen, “Deploying traffic smoothing cruise controllers learned from trajectory data,” in 2022 International Conference on Robotics and Automation (ICRA), pp. 2884–2890, IEEE, 2022.
[25] K. Jang, N. Lichtlé, E. Vinitsky, A. Shah, M. Bunting, M. Nice, B. Piccoli, B. Seibold, D. B. Work, M. L. D. Monache, et al., “Reinforcement learning based oscillation dampening: Scaling up single-agent RL algorithms to a 100 AV highway field operational test,” arXiv preprint arXiv:2402.17050, 2024.
[26] Y. Zhang, M. Quinones-Grueiro, W. Barbour, Z. Zhang, J. Scherer, G. Biswas, and D. Work, “Cooperative multi-agent reinforcement learning for large scale variable speed limit control,” in 2023 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 149–156, 2023.
[27] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of PPO in cooperative multi-agent games,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[28] U.S. Department of Transportation, Federal Highway Administration, Manual on Uniform Traffic Control Devices, 2009.
[29] S. Huang and S. Ontañón, “A closer look at invalid action masking in policy gradient algorithms,” arXiv preprint arXiv:2006.14171, 2020.
[30] D. Gloudemans, Y. Wang, J. Ji, G. Zachar, W. Barbour, E. Hall, M. Cebelak, L. Smith, and D. B. Work, “I-24 MOTION: An instrument for freeway traffic science,” Transportation Research Part C: Emerging Technologies, vol. 155, p. 104311, 2023.
[31] C. M. Van Geffen et al., “System architecture for AI-enabled corridor management,” 2022.
[32] R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer, “POT: Python optimal transport,” Journal of Machine Learning Research, vol. 22, no. 78, pp. 1–8, 2021.