Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain

Published 24 Apr 2026 in cs.CV | (2604.22507v1)

Abstract: Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (https://www.mrt.kit.edu/railbench). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces RAIL-BENCH, the first benchmark suite tailored for railway perception using domain-specific datasets and metrics.
It evaluates critical tasks such as rail track detection, object recognition, vegetation segmentation, tracking, and odometry with innovative metrics like LineAP.
Empirical results expose domain-shift challenges and underscore the need for in-domain model fine-tuning for robust automated railway operation.

Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): Establishing a Comprehensive Benchmark Suite for Railway Perception

Introduction and Motivation

Automated train operation demands reliable visual perception systems adapted to the complex and distinct operational settings of contemporary railways. Existing research and progress in camera-based perception for railway automation have been limited by the absence of standardized, domain-specific benchmarks and large representative datasets. While the automotive field benefits from benchmark suites such as KITTI, Waymo, and nuScenes, no equivalent has existed for the railway sector. "Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain" (2604.22507) directly addresses this methodological deficit by introducing the first suite for environmental perception in railway environments.

Benchmark Structure and Core Challenges

RAIL-BENCH encompasses five critical perception challenges, reflecting the operational demands unique to railway automation:

Rail Track Detection: Identifying and geometrically localizing rail segments using polylines, to address the highly elongated structure of the rails.
Object Detection: Detecting domain-specific object classes such as trains, signals, poles, road vehicles, bicycles, and persons.
Vegetation Segmentation: Segmenting high- and low-growing vegetation, with a key emphasis on detecting encroachment into safety-relevant clearance gauges.
Multi-Object Tracking: Temporally associating detections of objects—primarily persons—in crowded and cluttered station scenes.
Monocular Visual Odometry: Robust estimation of camera or ego-pose under constraints unique to guided railway kinematics and topology.

The benchmark provides carefully curated train/test splits, extensive annotation protocols, and robust scenario diversity encompassing urban, rural, high-speed, and staged scenes.

Figure 1: Annotated examples from the Object+Rail dataset, highlighting the diversity of railway environments and annotated entity classes.

Figure 2: Class distribution in the RAIL-BENCH Object+Rail dataset (excluding iscrowd/ignore instances), showing data balance across the object categories.

Figure 3: Example images from the specialized RAIL-BENCH Vegetation and Tracking datasets, illustrating the segmentation and tracking tasks.

Dataset Design, Acquisition, and Annotation Protocols

The data sources for RAIL-BENCH integrate public and proprietary video datasets (e.g., OSDaR23, OSDaR26), augmented by additional recordings and consented web-derived footage. Manual sampling maximizes environmental and operational heterogeneity. The annotation processes exploit established tools (CVAT) and systematically address the geometric and semantic ambiguities inherent in the railway domain, with policies for occlusions, crowd labeling, and ignore-region specification.

For rail detection, both left and right rail heads are annotated as extended polylines. Object detection follows standard axis-aligned bounding boxes augmented with iscrowd and occlusion metadata. Vegetation segmentation combines model-based bootstrapping (SegFormer pretrained and finetuned) with meticulous manual refinement to enforce consistent labeling of high/low vegetation, including track-encroaching cases.

Evaluation Metrics and the LineAP Proposal

A salient theoretical contribution is the introduction of LineAP, a novel segment-based average precision metric designed to overcome geometric and grouping limitations in prior line detection metrics such as TuSimple Acc, CULane F1, and ChamferAP.

LineAP decomposes every predicted and ground-truth rail polyline into short, fixed-length segments. Matchings are established using dual constraints: geometric proximity (relative pixel distance) and angular alignment (orientation difference), followed by bipartite graph-based optimal assignment per confidence score. This mechanism explicitly supports partial-detection validation, penalizes over-segmentation, and permits detailed error diagnostics.

Figure 4: Schematic depiction of the LineAP segment matching process, with geometry and orientation thresholds governing matches.

LineAP is applied alongside classic AP/ChamferAP, yielding a multi-metric characterization that disambiguates failures in geometric localization from errors in rail instance grouping.

Figure 5: Qualitative analysis comparing LineAP with traditional metrics (TuSimple, CULane, ChamferAP); color-coding exposes metric-specific disagreement in segment correctness.

Object detection is evaluated via COCO-style mAP over multiple IoU thresholds and difficulty stratifications (by visibility and occlusion). Vegetation segmentation utilizes IoU (Jaccard Index); tracking uses HOTA; and odometry is assessed using Umeyama-aligned APE/RPE.

Empirical Results and Analysis

The paper benchmarks state-of-the-art architectures on Object and Rail Track challenges:

Object Detection: YOLOv11L models (pretrained on COCO, and further finetuned on RAIL-BENCH) and YOLOv8L-World (zero-shot inference with prompt conditioning) exhibit significant domain-shift effects. Models trained or finetuned on RAIL-BENCH data substantially outperform general-purpose or foundation models, even on generic classes such as "person" and "road vehicle," indicating that the railway visual domain is sufficiently distinct to necessitate in-domain training.
Rail Track Detection: PINet and YOLinO architectures are compared. YOLinO, optimized for local segment detection, consistently excels on LineAP due to its segment-centric architecture—producing numerous locally accurate segments that do not necessarily correspond to globally continuous rails. PINet delivers better performance on ChamferAP, reflecting its instance-focused prediction style. Fine-tuning on the RAIL-BENCH dataset yields consistent performance accruals for both models, except in case-specific metrics.

The dual-metric (LineAP + ChamferAP) reporting demonstrates that geometric correctness and grouping ability must be evaluated separately for actionable insight on model limitations.

Implications, Limitations, and Future Directions

RAIL-BENCH constitutes a critical infrastructure for empirical perception research in railway automation, providing:

Standardization: Enables reproducible, fair comparisons across algorithms, closing the methodological gap with automotive benchmarks.
Detailed Diagnostics: By employing tailored metrics (LineAP, dual-objective), the suite exposes nuanced model failure modes, supporting targeted architectural advances.
Domain-Generalization Insights: The observed necessity of domain-specific fine-tuning (despite strong foundation models) has direct implications for foundation model adaptation strategies in low-resource, specialized perception tasks.

Practically, the benchmark lays the foundation for reproducible railway perception systems, necessary for the progression to higher autonomy levels in non-segregated, open railway infrastructure.

Theoretically, the introduction of LineAP offers a transferable metric framework applicable to polyline detection in other structured environments, such as road lane detection, with specific advantages in metric explainability and alignment with operational requirements.

Future work should expand the suite's data diversity—e.g., covering more adverse weather, night-time scenarios, or rare safety-critical events—and extend to additional perception tasks (e.g., signal aspect recognition, multi-modal fusion). The modular design facilitates seamless integration of future challenges as the field evolves.

Conclusion

RAIL-BENCH provides, for the first time, a rigorous, reproducible testbed for camera-based perception in the railway domain. By integrating scenario-diverse datasets, tailored annotation protocols, and robust evaluation metrics (notably LineAP), it establishes a methodological foundation for accelerated algorithmic progress and fair empirical comparison in railway perception. Its open leaderboard-style infrastructure and metric innovations are positioned to significantly shape future research trajectories and practical deployments in automated railway operation.

Markdown Report Issue