Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science

Published 11 May 2026 in cs.AR and cs.LG | (2605.10612v1)

Abstract: Graph neural networks are increasingly adopted in trigger systems for collider experiments, where strict latency and throughput constraints render deployment on embedded platforms challenging. As detectors move towards higher granularity, the number of inputs per inference increases and FPGA-only solutions face resource bottlenecks. This work presents an end-to-end demonstrator for the real-time deployment of a dynamic Graph Neural Network for the Belle II electromagnetic calorimeter hardware trigger on the AMD Versal VCK190, leveraging both FPGA fabric and AI Engine tiles. We develop a Python-based semi-automated design flow covering operator fusion, partitioning, mapping, spatial parallelization, and kernel-level optimization. Our design achieves a throughput of 2.94 million events per second at an end-to-end latency of 7.15 microseconds. Compared to the FPGA-only baseline, this represents a 53% throughput improvement while reducing DSP utilization from 99% to 19% at 29% AI Engine tile utilization. To validate the deployment, an interactive visualization pipeline enables real-time monitoring of inference results on the physical demonstrator.

Authors (6)

Summary

The paper details a heterogeneous deployment architecture that combines FPGA fabric and AI Engine (AIE) tiles to meet sub-10 μs latency and high throughput requirements.
It introduces a Python-based semi-automated design flow incorporating operator fusion, hardware-aware partitioning, and kernel-level optimizations for spatial parallelism.
Empirical results show a 53% throughput improvement and significant DSP resource reduction, establishing a scalable template for future low-latency GNN workloads.

Real-Time GNN Inference on Heterogeneous SoCs for High-Rate Event Selection

Background and Motivation

Deploying graph neural networks (GNNs) in hardware trigger systems for particle physics experiments imposes strict real-time performance constraints. High-granularity detectors and increased data rates drive requirements for sub-10 μs end-to-end latency and throughput on the order of tens of MHz, with minimal tolerance for variability and strict in-order event matching. Traditional FPGA-based solutions face growing resource bottlenecks as detector complexity increases, which makes them inadequate for scalable online event selection.

This work addresses these limitations by proposing and demonstrating a heterogeneous deployment architecture that leverages both the programmable logic (FPGA fabric) and the AI Engine (AIE) tiles on AMD Versal SoCs. The case study centers on the Belle II electromagnetic calorimeter (ECL) trigger, anticipating future upgrades that will increase per-event input dimensionality by several orders of magnitude.

Contributions and Methods

A Python-based semi-automated design flow is introduced for mapping pretrained, quantized GNN models (here, CaloClusterNet) onto heterogeneous hardware resources. The workflow comprises five stages: operator fusion, hardware-aware partitioning, mapping to reusable hardware templates, spatial parallelization, and kernel-level fine-tuning.

Operator Fusion: Linear layers followed by ReLU activations are merged into single operators, as are parallel branches, to minimize dataflow graph complexity and mitigate memory buffer limitations on AIE tiles.
Partitioning: Compute operators are greedily assigned to AIE tiles whenever feasible, exploiting their superior performance-per-area for regular operations (Linear, Dense, ReLU, Concat), while graph-specific, data-dependent operators (GravNetConv, Condensation Point Selection) remain on FPGA fabric.
Mapping and Legalization: Operator kernels are instantiated from an open-source C++ library for AIEs and high-level synthesis (HLS) templates for FPGAs. Intermediate representations enforce tensor layout consistency.
Spatial Parallelization: Chains of fully spatially separable operators are replicated to maximize pipeline concurrency; an exhaustive search finds the minimal replication required to meet the throughput target.
Kernel-Level Optimization: The authors replace standard loop pipelining with loop flattening in AIE kernels, trading increased code size for improved microsecond-scale turnaround, crucial for low-latency operation.
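The first stages of this flow can be illustrated with a toy model. The sketch below is not the authors' tool: the `Op` class, the per-operator `interval_us` cost, the one-tile-per-operator rule, and the example operator list are all hypothetical, and every operator is treated as spatially separable, which is a simplification. It only shows the shape of the three decisions: fuse Linear+ReLU pairs, greedily place regular operators on AIE tiles, and search for the smallest replication factor that sustains a target event rate.

```python
from dataclasses import dataclass

# Operator kinds assumed to suit AIE tiles (regular, data-independent compute).
AIE_FRIENDLY = {"Linear", "Dense", "ReLU", "Concat", "Linear+ReLU"}


@dataclass
class Op:
    name: str
    kind: str           # e.g. "Linear", "ReLU", "GravNetConv"
    interval_us: float  # hypothetical time one instance needs per event


def fuse_linear_relu(ops):
    """Merge every Linear that is directly followed by a ReLU into one operator."""
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i].kind == "Linear" and ops[i + 1].kind == "ReLU":
            a, b = ops[i], ops[i + 1]
            # Toy cost model (assumption): the fused operator costs the sum of both parts.
            fused.append(Op(f"{a.name}+{b.name}", "Linear+ReLU", a.interval_us + b.interval_us))
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


def partition(ops, aie_tiles_available):
    """Greedily place AIE-friendly operators on AIE tiles while tiles remain.

    Assumption: each operator occupies exactly one tile.
    """
    placement, used = {}, 0
    for op in ops:
        if op.kind in AIE_FRIENDLY and used < aie_tiles_available:
            placement[op.name] = "AIE"
            used += 1
        else:
            placement[op.name] = "FPGA"
    return placement


def minimal_replication(ops, target_events_per_us, max_copies=64):
    """Try 1, 2, 3, ... copies of each operator until it sustains the target rate."""
    factors = {}
    for op in ops:
        for copies in range(1, max_copies + 1):
            if copies / op.interval_us >= target_events_per_us:
                factors[op.name] = copies
                break
        else:
            raise ValueError(f"{op.name} cannot reach the target within {max_copies} copies")
    return factors


# Hypothetical four-operator chain; the intervals are made-up numbers.
ops = [
    Op("fc1", "Linear", 0.3),
    Op("act1", "ReLU", 0.1),
    Op("gravnet", "GravNetConv", 0.5),
    Op("fc2", "Linear", 0.2),
]
fused = fuse_linear_relu(ops)
placement = partition(fused, aie_tiles_available=8)
replication = minimal_replication(fused, target_events_per_us=2.94)
print(placement)
print(replication)
```

In this toy model the graph-specific `gravnet` operator stays on the FPGA because its kind is not in `AIE_FRIENDLY`, mirroring the split described above. A real flow would also have to account for buffer sizes, routing, and tile adjacency, none of which are modeled here.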

System Implementation

The demonstrator architecture implemented on the AMD Versal VCK190 integrates three hardware partitions: an Arm CPU hosts supervisory control and a visualization server, while the data path alternates between FPGA and AIE compute units. DMA and memory interfaces isolate data movement from computation, maintaining a decoupled pipeline amenable to real-time streaming.

Performance and correctness are validated across software simulation (QKeras), hardware emulation, and on-device measurements, ensuring strict agreement and reproducibility.
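One way to picture such a strict-agreement check is to compare the integer codes of fixed-point outputs from two stages element by element. The sketch below is illustrative only: the helper names, the 16-bit format with 8 fractional bits, and the sample values are assumptions, not details taken from the paper.

```python
def to_fixed(value, total_bits=16, frac_bits=8):
    """Convert a float to a signed fixed-point integer code, saturating at the limits.

    Uses Python's round(), which rounds ties to the nearest even integer;
    real hardware may use a different rounding mode.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, round(value * scale)))


def first_mismatch(reference, candidate):
    """Return the index of the first differing integer code, or None if bit-exact."""
    if len(reference) != len(candidate):
        raise ValueError("output lengths differ")
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if ref != cand:
            return index
    return None


# Hypothetical outputs: a software reference and a device read-back.
software = [to_fixed(v) for v in (0.5, -1.25, 3.0)]
device = [128, -320, 768]
print(first_mismatch(software, device))
```

Comparing integer codes rather than floats avoids tolerance thresholds: any difference between the stages, however small, shows up as a mismatch.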

Performance Evaluation

Three design increments are analyzed:

Design 1: Basic partitioning, minimal optimization.
Design 2: Operator fusion and spatial parallelization.
Design 3: Additional kernel-level optimization.

Empirical evaluation demonstrates that, after optimization (Design 3), the heterogeneous solution achieves:

Throughput: 2.94 million events per second (53% increase over the FPGA-only baseline).
End-to-end Latency: 7.15 μs (18% increase over the FPGA-only baseline).
Resource Utilization: Digital signal processor (DSP) utilization drops from 99% (FPGA-only) to 19% (FPGA+AIE, Design 3), with 29% of AIE tiles allocated, relieving the critical DSP bottleneck.
Precision Tradeoffs: Strategic assignment of 16-bit and 8-bit quantization across the data path maintains inference accuracy at system boundaries.
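The reported figures imply a few derived quantities that the summary does not state. The arithmetic below is a back-of-the-envelope sketch: it assumes both percentages are relative to the FPGA-only baseline (new = baseline × (1 + p)) and that the pipeline runs at steady state, so that Little's law (events in flight = throughput × latency) applies. The variable names are illustrative.

```python
# Reported values for the optimized heterogeneous design (Design 3).
throughput_hetero = 2.94e6   # events per second
latency_hetero_us = 7.15     # end-to-end latency in microseconds
throughput_gain = 0.53       # +53% versus the FPGA-only baseline
latency_increase = 0.18      # +18% versus the FPGA-only baseline

# Implied FPGA-only baseline, assuming new = baseline * (1 + p).
baseline_throughput = throughput_hetero / (1 + throughput_gain)   # about 1.92e6 events/s
baseline_latency_us = latency_hetero_us / (1 + latency_increase)  # about 6.06 us

# Average spacing between accepted events at the reported throughput.
event_interval_ns = 1e9 / throughput_hetero                       # about 340 ns

# Little's law: average number of events inside the pipeline at once.
events_in_flight = throughput_hetero * latency_hetero_us * 1e-6   # about 21

print(f"baseline throughput: {baseline_throughput / 1e6:.2f} M events/s")
print(f"baseline latency:    {baseline_latency_us:.2f} us")
print(f"event interval:      {event_interval_ns:.0f} ns")
print(f"events in flight:    {events_in_flight:.1f}")
```

Under these assumptions, roughly 21 events are in flight at once, which is why throughput can rise by 53% even though the latency of each individual event grows.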

Implications and Future Directions

The presented design flow and heterogeneous deployment provide a scalable and resource-efficient pathway for the implementation of real-time GNN inference in high-throughput scientific instrumentation. The integration of spatial parallelization, operator fusion, and kernel-level optimization on AIEs establishes a new baseline for throughput under tight area constraints. Open-source kernel templates and public device images enhance reproducibility and serve as a reference for future system designers targeting AI-accelerated SoCs.

Key implications include:

Scalability: The approach generalizes to future detector upgrades and other low-latency streaming applications where structured sparsity and large input cardinality prevail.
Hardware-Software Co-Design: The fusion of quantization-aware training, hardware-partitioned dataflow graphs, and low-level microarchitectural tuning underscores the importance of joint optimization.
Platform Generalization: Although demonstrated on the AMD Versal VCK190, the methodology is applicable to a broader class of AI-accelerated reconfigurable technologies.

Possible extensions include adaptive runtime reconfiguration, dynamic resource allocation for varying event topologies, and support for more complex or evolving GNN architectures (e.g., attention-based models) as AI Engine capabilities mature. The integration of monitoring and interactive visualization paves the way for robust real-time system diagnostics and transparent operation in production deployments.

Conclusion

This work establishes a robust, high-throughput methodology for online GNN event selection in large-scale scientific experiments, demonstrating significant throughput gains and resource savings by leveraging heterogeneous SoC architectures. The semi-automated deployment flow and optimized GNN kernels position this approach as a practical template for future low-latency scientific AI workloads on emerging hardware platforms.
