- The paper introduces CoordLight, leveraging Queueing Dynamic State Encoding (QDSE) and Neighbor-aware Policy Optimization (NAPO) to enhance decentralized traffic control.
- It employs an attention-based spatio-temporal neural architecture to coordinate intersection agents, significantly reducing travel time and variance across urban networks.
- Empirical results on real-world benchmarks show over 6% improvement in performance and demonstrate robust scalability even under sensor noise.
Decentralized Coordination for Large-Scale Traffic Signal Control via Multi-Agent Reinforcement Learning: The CoordLight Framework
Introduction and Motivation
Efficient network-wide adaptive traffic signal control (ATSC) remains a critical bottleneck for sustainable urban mobility. While Multi-Agent Reinforcement Learning (MARL) has enabled the deployment of decentralized policies for traffic networks, effective coordination among agents and robust local traffic state inference persist as unresolved challenges, especially under the constraints of partial observability. The paper "CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control" (2603.24366) introduces CoordLight, an end-to-end learning-based framework that addresses two principal gaps: (1) constructing fine-grained, predictive intersection state representations, and (2) enabling effective, scalable coordination among adjacent traffic signal agents.
Architectural Overview
CoordLight comprises two principal innovations. First, Queueing Dynamic State Encoding (QDSE) provides a comprehensive, lane-level representation incorporating both current and prospective traffic dynamics at each intersection. Second, Neighbor-aware Policy Optimization (NAPO) augments independent actor-critic RL with attention-driven learning over spatial and temporal agent-action dependencies. The system is instantiated via an attention-based spatio-temporal neural architecture, facilitating the extraction and credit assignment of critical neighbor interactions essential for network-scale traffic optimization.
Figure 1: Overall learning architecture of CoordLight, illustrating the integration of QDSE and NAPO for decentralized agent coordination.
Traffic signal control is formalized as a decentralized partially observable Markov decision process (Dec-POMDP), where each intersection is an agent with limited local observations and communication with immediate neighbors. The action space is phase selection for a predefined duration, and the reward is a regionally coordinated negative queue length sum, coupling each agent's objectives with those of neighbors via overlapping incoming and outgoing lanes. Because each agent's reward depends on queues it shares with its neighbors, this structure aligns individual intersection efficiency with global traffic metrics.
Figure 2: Intersection operation example: eight-phase signalization with current activation illustrated, reflecting the action space.
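The regionally coordinated reward described above can be illustrated with a minimal sketch. It assumes the reward of an agent is the unweighted negative sum of queue lengths over its own lanes and those of its immediate neighbors; the paper's exact lane set and any weighting may differ. The names `regional_reward`, `queues`, and `neighbors` are hypothetical.

```python
from typing import Dict, List


def regional_reward(
    agent: str,
    queues: Dict[str, List[int]],
    neighbors: Dict[str, List[str]],
) -> float:
    """Negative total queue length over the agent and its immediate neighbors.

    Assumption: the region is the agent plus adjacent intersections, with
    equal weight on every lane; the paper's exact lane set may differ.
    """
    region = [agent] + neighbors.get(agent, [])
    return -float(sum(sum(queues[i]) for i in region))


# Toy corridor A - B - C with per-lane queue lengths (vehicles).
queues = {"A": [2, 0, 1], "B": [4, 3, 0], "C": [0, 0, 5]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

rewards = {i: regional_reward(i, queues, neighbors) for i in queues}
print(rewards)
```

In this toy example the middle intersection B receives the lowest reward because its region spans all three intersections, which shows how congestion at a neighbor enters an agent's own objective.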
Queueing Dynamic State Encoding (QDSE)
The QDSE representation encodes six lane-level features: queue lengths, entering and leaving vehicle counts, moving vehicle estimations, and two forward-looking signals of impending congestion, namely leading vehicle distances and following platoons. This composite feature vector lets agents reason about anticipated as well as present traffic states, a critical aspect of predictive congestion mitigation.
Figure 3: Lane-level QDSE features for a prototypical incoming lane, including queue lengths, dynamic vehicle counts, and lead vehicle projections.
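As a rough sketch of what a lane-level encoding of this kind might look like in code, the snippet below stores the six quantities named above in a record and flattens them into one observation vector per intersection. The field names, units, ordering, and the absence of normalization are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class LaneFeatures:
    # Six per-lane quantities named in the text; field names are hypothetical.
    queue_length: float
    entering_count: float
    leaving_count: float
    moving_count: float
    lead_vehicle_distance: float
    following_platoon: float


def encode_intersection(lanes: List[LaneFeatures]) -> List[float]:
    """Flatten per-lane features into one observation vector (lane-major order)."""
    obs: List[float] = []
    for lane in lanes:
        obs.extend(astuple(lane))
    return obs


# Two incoming lanes with made-up values.
lanes = [
    LaneFeatures(3, 1, 2, 4, 12.5, 2),
    LaneFeatures(0, 0, 1, 1, 40.0, 0),
]
obs = encode_intersection(lanes)
print(len(obs), obs)
```

Keeping the features in a named record rather than a bare list makes it harder to mix up the ordering when the vector is later fed to a network.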
QDSE supports robust operation under sensor noise, as demonstrated by consistent performance in simulation experiments with injected Gaussian perturbations. The mild performance degradation (<2.5% increase in travel time at high noise) supports the practicality of this representation for real-world deployments.
Figure 4: QDSE robustness analysis: Average travel time under varying levels of sensor noise on Jinan and Hangzhou datasets.
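A noise-injection test of this kind could be set up along the following lines. This is a sketch only: it assumes additive, zero-mean, independent Gaussian noise on each feature and clips the result at zero, since counts and distances cannot be negative. Neither the clipping nor the helper name `add_sensor_noise` comes from the paper.

```python
import random
from typing import List, Optional


def add_sensor_noise(
    features: List[float],
    sigma: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add zero-mean Gaussian noise to each feature and clip at zero.

    Assumptions: noise is additive and i.i.d. per feature, and physical
    quantities such as counts and distances cannot go negative.
    """
    rng = rng or random.Random()
    return [max(0.0, x + rng.gauss(0.0, sigma)) for x in features]


clean = [3.0, 1.0, 2.0, 4.0, 12.5, 2.0]
noisy = add_sensor_noise(clean, sigma=0.5, rng=random.Random(0))  # seeded for repeatability
unchanged = add_sensor_noise(clean, sigma=0.0)  # zero noise leaves features as-is
print(noisy)
```

Sweeping `sigma` over several levels and re-running the evaluation at each one would give a robustness curve of the sort the figure reports.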
Neighbor-Aware Policy Optimization (NAPO)
NAPO generalizes decentralized PPO by integrating learnable attention vectors α and β, which respectively weight neighbor state and action contributions to both actor and critic computations. The actor network employs a multi-head attention-based spatial aggregation unit followed by a GRU-based temporal aggregator, yielding policy decisions conditioned on a learned abstraction of spatial-temporal neighborhood states. The critic network is similarly enhanced, including a state-action decoder to condition value estimates on neighbors' historical state-action sequences, accelerating and stabilizing credit assignment and advantage estimation.
Figure 5: Architecture details of the neighbor-aware actor-critic networks: (a) attention-based spatio-temporal actor, (b) privileged critic with state-action decoding.
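To make the idea of attention-weighted neighbor aggregation concrete, here is a deliberately simplified sketch: single-head scaled dot-product attention over neighbor state vectors, without learned projections. It is not the paper's architecture. Multiple heads, the GRU temporal aggregator, the action-side weights, and the way α and β are actually parameterized are all omitted, and the names `attend` and `softmax` are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, neighbor_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Single-head scaled dot-product attention over neighbor states.

    Returns (weights, aggregated_state). Learned projections, multiple heads,
    and temporal aggregation are omitted for brevity.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, s)) / scale for s in neighbor_states]
    weights = softmax(scores)
    dim = len(neighbor_states[0])
    aggregated = [
        sum(w * s[d] for w, s in zip(weights, neighbor_states)) for d in range(dim)
    ]
    return weights, aggregated


ego = [1.0, 0.0]
neighbor_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, context = attend(ego, neighbor_states)
print(weights, context)
```

Neighbors whose state resembles the ego intersection's query receive larger weights, so the aggregated context emphasizes the interactions most relevant to the current decision.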
Empirical Analysis
Traffic Scenarios and Benchmarks
CoordLight is evaluated in CityFlow-based simulations of three large, real-world urban traffic networks: Jinan (3×4), Hangzhou (4×4), and New York (7×28), covering up to 196 intersections. Diverse traffic demand profiles are examined to test scalability and robustness. Baselines include advanced max-pressure, graph-attention (CoLight), and recent decentralized MARL methods (DenseLight, SocialLight).
Figure 6: CityFlow simulation maps (Jinan, Hangzhou, and New York) used for large-scale experimental evaluation.
CoordLight exhibits consistent, significant reductions in average travel time compared to all baselines. For example, on the Jinan datasets, average travel time drops below 200s only for CoordLight; New York results display a ~6–9% advantage over SocialLight, which is statistically significant across all traffic scenarios (p-value < 10⁻⁸ with Bonferroni correction).
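For readers unfamiliar with the two quantities quoted above, the sketch below shows how a relative travel-time improvement and a Bonferroni-corrected significance decision are typically computed. The numbers are illustrative only and are not results from the paper; the significance level of 0.05 and the function names are assumptions.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject the null for each test iff p < alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage reduction in travel time relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline


# Illustrative numbers only, not results from the paper.
p_values = [1e-9, 0.02, 0.004]
significant = bonferroni(p_values)
gain = relative_improvement(baseline=1000.0, ours=925.0)
print(significant, gain)
```

With three comparisons the per-test threshold drops to roughly 0.0167, so the middle p-value of 0.02 no longer counts as significant even though it is below 0.05.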
Figure 7: Intersection-level average travel time and variance (lower is better) for CoordLight vs. three strong MARL baselines in high-demand city datasets.
CoordLight also demonstrates lower intersection-level mean and variance of travel times, indicating more equitable and stable coordination, which is key for system-level reliability in heterogeneous or non-stationary traffic.
Ablation Studies
Systematic component ablation highlights the impact of each architectural contribution:
Implications and Future Directions
CoordLight's demonstrated improvement in network-wide traffic metrics underlines the centrality of precise state representations and neighbor-sensitive optimization in large-scale decentralized control. Noteworthy is the demonstrated scalability and statistical significance of improvements over prior art in highly non-stationary, partially-observable domains. The attention mechanisms facilitate targeted coordination, reducing unnecessary inter-agent communication and computation.
The framework's flexibility supports extension to heterogeneous network topologies, asynchronous signal settings, dynamic action spaces (e.g., phase duration control), and imperfect real-world sensing. Moreover, QDSE and NAPO can be integrated into newer hierarchical, meta-learning, or continual learning MARL paradigms for urban-scale adaptive traffic control. Handling priorities (e.g., emergency vehicles), accident-induced structural variations, or robust training under stochastic dynamics constitute promising research avenues.
Conclusion
CoordLight advances decentralized MARL for ATSC via principled, fine-grained state encoding and neighbor-aware optimization, achieving state-of-the-art performance on heterogeneous large-city traffic benchmarks with high sample efficiency and robust coordination. This work provides a solid methodological foundation for scalable, adaptive, and reliable real-world intelligent traffic management systems.