- The paper details a heterogeneous deployment architecture using FPGA and AIE tiles to meet sub-10 μs latency and high throughput requirements.
- It introduces a Python-based semi-automated design flow incorporating operator fusion, hardware-aware partitioning, and kernel-level optimizations for spatial parallelism.
- Empirical results show a 53% throughput improvement and significant DSP resource reduction, establishing a scalable template for future low-latency GNN workloads.
Real-Time GNN Inference on Heterogeneous SoCs for High-Rate Event Selection
Background and Motivation
The deployment of graph neural networks (GNNs) in hardware trigger systems for particle physics experiments imposes strict real-time performance constraints. High-granularity detectors and increased data rates drive requirements for sub-10 μs end-to-end latency and throughput on the order of tens of MHz, with minimal tolerance for variability and strict in-order event matching. As detector complexity grows, traditional FPGA-only solutions increasingly run into resource bottlenecks that limit their scalability for online event selection.
This work addresses these limitations by proposing and demonstrating a heterogeneous deployment architecture that leverages both the programmable logic (FPGA fabric) and the AI Engine (AIE) tiles on AMD Versal SoCs. The case study centers on the Belle II electromagnetic calorimeter (ECL) trigger, anticipating future upgrades that will increase per-event input dimensionality by several orders of magnitude.
Contributions and Methods
A Python-based, semi-automated design flow is introduced for mapping pretrained, quantized GNN models, specifically CaloClusterNet, onto heterogeneous hardware resources. The workflow has five key stages: operator fusion, hardware-aware partitioning, mapping to reusable hardware templates, spatial parallelization, and kernel-level fine-tuning.
- Operator Fusion: Recurring pairs of linear and ReLU layers are merged, as are parallel branches, to reduce dataflow graph complexity and ease memory buffer limitations on AIE tiles.
- Partitioning: Compute operators are greedily assigned to AIE tiles whenever feasible, exploiting their superior performance-per-area for regular operations (Linear, Dense, ReLU, Concat), while graph-specific, data-dependent operators (GravNetConv, Condensation Point Selection) remain on FPGA fabric.
- Mapping and Legalization: Operator kernels are instantiated from an open-source C++ library for AIEs and high-level synthesis (HLS) templates for FPGAs. Intermediate representations enforce tensor layout consistency.
- Spatial Parallelization: Chains of fully spatially separable operators are replicated to increase pipeline concurrency; an exhaustive search finds the minimal replication factor that meets the throughput target.
- Kernel-Level Optimization: The authors replace standard loop pipelining with loop flattening in AIE kernels, trading increased code size for improved microsecond-scale turnaround, crucial for low-latency operation.
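The fusion, partitioning, and replication stages above can be illustrated with a minimal sketch. This is not the authors' tool: the `Op` class, the per-event `interval_us` cost model, the tile budget, and all numbers are hypothetical, and the cost of a fused kernel is assumed to equal that of its Linear part.

```python
from dataclasses import dataclass

# Operator kinds the text lists as regular (AIE-friendly), plus the fused kind
# introduced by this sketch. Anything else stays on the FPGA fabric.
AIE_FRIENDLY = {"Linear", "Dense", "ReLU", "Concat", "LinearReLU"}


@dataclass
class Op:
    name: str
    kind: str
    interval_us: float  # hypothetical time one instance needs per event


def fuse_linear_relu(ops):
    """Merge each Linear that is directly followed by a ReLU into one node."""
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i].kind == "Linear" and ops[i + 1].kind == "ReLU":
            a, b = ops[i], ops[i + 1]
            # Assumption: the ReLU is applied in the Linear kernel's epilogue for free.
            fused.append(Op(f"{a.name}+{b.name}", "LinearReLU", a.interval_us))
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


def partition(ops, aie_tiles_free):
    """Greedily place AIE-friendly operators on AIE tiles while tiles remain."""
    placement = {}
    for op in ops:
        if op.kind in AIE_FRIENDLY and aie_tiles_free > 0:
            placement[op.name] = "AIE"
            aie_tiles_free -= 1
        else:
            placement[op.name] = "FPGA"
    return placement


def min_replication(chain, target_events_per_us, max_replicas=64):
    """Smallest number of chain copies whose combined rate meets the target."""
    bottleneck = max(op.interval_us for op in chain)  # slowest operator limits the chain
    for replicas in range(1, max_replicas + 1):
        if replicas / bottleneck >= target_events_per_us:
            return replicas
    return None  # target not reachable within max_replicas


# Hypothetical five-operator model; the intervals are made up for illustration.
ops = [
    Op("lin0", "Linear", 0.5),
    Op("relu0", "ReLU", 0.1),
    Op("lin1", "Linear", 0.25),
    Op("relu1", "ReLU", 0.1),
    Op("gravnet0", "GravNetConv", 1.0),
]
fused = fuse_linear_relu(ops)
placement = partition(fused, aie_tiles_free=8)
aie_chain = [op for op in fused if placement[op.name] == "AIE"]
replicas = min_replication(aie_chain, target_events_per_us=3.0)
print(placement, replicas)
```

The linear scan over replica counts mirrors the exhaustive search described above in the simplest possible form; a real flow would also have to account for tile memory and routing limits, which this sketch ignores.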
System Implementation
The demonstrator architecture implemented on the AMD Versal VCK190 integrates three hardware partitions: an Arm CPU hosts supervisory control and a visualization server, while the data path alternates between FPGA and AIE compute units. DMA and memory interfaces isolate data movement from computation, maintaining a decoupled pipeline amenable to real-time streaming.
Performance and correctness are validated across software simulation (QKeras), hardware emulation, and on-device measurements, ensuring strict agreement and reproducibility.
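A cross-stage check of this kind can be sketched as follows, assuming that "strict agreement" means bit-exact equality of the quantized integer outputs of every stage. The stage names, the helper functions, and the sample values are hypothetical and not taken from the paper.

```python
def first_mismatch(reference, candidate):
    """Return (event, index, ref_value, cand_value) of the first difference, or None."""
    if len(reference) != len(candidate):
        raise ValueError("stages produced a different number of events")
    for event, (ref_out, cand_out) in enumerate(zip(reference, candidate)):
        if len(ref_out) != len(cand_out):
            raise ValueError(f"event {event}: output lengths differ")
        for index, (ref_value, cand_value) in enumerate(zip(ref_out, cand_out)):
            if ref_value != cand_value:
                return (event, index, ref_value, cand_value)
    return None


def check_agreement(stages):
    """Compare every stage against the first one; map stage name to first mismatch."""
    names = list(stages)
    reference = stages[names[0]]
    return {name: first_mismatch(reference, stages[name]) for name in names[1:]}


# Hypothetical quantized outputs for two events from three validation stages.
stages = {
    "software": [[12, -3, 7], [0, 5, -1]],
    "emulation": [[12, -3, 7], [0, 5, -1]],
    "device": [[12, -3, 7], [0, 4, -1]],  # deliberately injected off-by-one
}
report = check_agreement(stages)
print(report)
```

Reporting the first mismatching event and index, rather than a single pass/fail flag, makes it easier to trace a disagreement back to a specific operator or quantization boundary.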
Three design increments are analyzed:
- Design 1: Basic partitioning with minimal optimization.
- Design 2: Operator fusion and spatial parallelization.
- Design 3: Additional kernel-level optimization.
Empirical evaluation demonstrates that, after optimization (Design 3), the heterogeneous solution achieves:
- Throughput: 2.94 million events per second (53% increase over the FPGA-only baseline).
- End-to-end Latency: 7.15 μs (18% higher than the FPGA-only baseline, but still within the sub-10 μs budget).
- Resource Utilization: Digital signal processor (DSP) utilization drops from 99% (FPGA-only) to 19% (FPGA+AIE, Design 3), with 29% of AIE tiles allocated, relieving the critical resource bottleneck.
- Precision Tradeoffs: Strategic assignment of 16-bit and 8-bit quantization across the data path maintains inference accuracy at system boundaries.
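The reported figures can be cross-checked with some simple arithmetic. The sketch below assumes both percentages are relative to the FPGA-only baseline, so the baseline values it prints are implied estimates, not numbers stated in the source. It also applies Little's law to estimate how many events are in the pipeline at once.

```python
# Figures quoted in the text for Design 3.
throughput_meps = 2.94    # million events per second
latency_us = 7.15         # end-to-end latency in microseconds
throughput_gain = 0.53    # relative increase over the FPGA-only baseline
latency_increase = 0.18   # relative increase over the FPGA-only baseline

# Implied FPGA-only baseline (assumption: percentages are relative to it).
baseline_throughput_meps = throughput_meps / (1 + throughput_gain)
baseline_latency_us = latency_us / (1 + latency_increase)

# Average spacing between accepted events at the reported throughput.
mean_event_spacing_us = 1 / throughput_meps

# Little's law: events in flight = throughput x latency.
# Units: (1e6 events/s) x (1e-6 s) = events.
events_in_flight = throughput_meps * latency_us

print(f"implied baseline throughput: {baseline_throughput_meps:.2f} M events/s")
print(f"implied baseline latency:    {baseline_latency_us:.2f} us")
print(f"mean event spacing:          {mean_event_spacing_us:.3f} us")
print(f"events in flight:            {events_in_flight:.1f}")
```

With roughly 21 events in flight at a time, the design has to be deeply pipelined, which is consistent with the decoupled streaming architecture described earlier.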
Implications and Future Directions
The presented design flow and heterogeneous deployment provide a scalable and resource-efficient pathway for the implementation of real-time GNN inference in high-throughput scientific instrumentation. The integration of spatial parallelization, operator fusion, and kernel-level optimization on AIEs establishes a new baseline for throughput under tight area constraints. Open-source kernel templates and public device images enhance reproducibility and serve as a reference for future system designers targeting AI-accelerated SoCs.
Key implications include:
- Scalability: The approach generalizes to future detector upgrades and other low-latency streaming applications where structured sparsity and large input cardinality prevail.
- Hardware-Software Co-Design: The fusion of quantization-aware training, hardware-partitioned dataflow graphs, and low-level microarchitectural tuning underscores the importance of joint optimization.
- Platform Generalization: Although demonstrated on the AMD Versal VCK190, the methodology is applicable to a broader class of AI-accelerated reconfigurable technologies.
Possible extensions include adaptive runtime reconfiguration, dynamic resource allocation for varying event topologies, and support for more complex or evolving GNN architectures (e.g., attention-based models) as AI Engine capabilities mature. The integrated monitoring and interactive visualization also provide a basis for real-time system diagnostics and transparent operation in production deployments.
Conclusion
This work establishes a robust, high-throughput methodology for online GNN event selection in large-scale scientific experiments, demonstrating significant throughput gains and resource savings by leveraging heterogeneous SoC architectures. The semi-automated deployment flow and optimized GNN kernels position this approach as a practical template for future low-latency scientific AI workloads on emerging hardware platforms.