Attention-ResUNet for Automated Fetal Head Segmentation

Published 20 Apr 2026 in cs.CV and cs.LG | (2604.18148v1)

Abstract: Automated fetal head segmentation in ultrasound images is critical for accurate biometric measurements in prenatal care. While existing deep learning approaches have achieved reasonable performance, they struggle with low contrast, noise, and complex anatomical boundaries, all of which are inherent to ultrasound imaging. This paper presents Attention-ResUNet, a novel architecture that combines residual learning with multi-scale attention mechanisms to improve fetal head segmentation. Our approach integrates attention gates at four decoder levels to focus selectively on anatomically relevant regions while suppressing background noise, complemented by residual connections that facilitate gradient flow and feature reuse. Extensive evaluation on the HC18 Challenge dataset (n = 200) demonstrates that Attention-ResUNet achieves a mean Dice score of 99.30 +/- 0.14%. It significantly outperforms five baseline architectures: ResUNet (99.26%), Attention U-Net (98.79%), Swin U-Net (98.60%), Standard U-Net (98.58%), and U-Net++ (97.46%). Statistical analysis confirms highly significant improvements (p < 0.001), with effect sizes ranging from 0.230 to 13.159 (Cohen's d). Saliency map analysis reveals that our architecture produces highly concentrated, anatomically consistent activation patterns, indicating enhanced interpretability, which is crucial for clinical deployment. The proposed method establishes new state-of-the-art performance for automated fetal head segmentation while maintaining computational efficiency, with 14.7M parameters and a 45 GFLOPs inference cost. Code repository: https://github.com/Ammar-ss


Authors (2)

Summary

The paper introduces an innovative Attention-ResUNet architecture that integrates residual learning with multi-scale attention for enhanced fetal head segmentation.
It achieves a mean Dice coefficient of 99.3%, outperforming standard U-Net variants and demonstrating statistically significant improvements.
The study examines model interpretability with Grad-CAM saliency analysis, which supports the case for deployment in clinical settings.

Attention-ResUNet for Automated Fetal Head Segmentation: Summary and Analysis

Introduction

Accurate segmentation of the fetal head in ultrasound (US) imagery is indispensable for reliable prenatal biometric analysis. Persistent challenges in clinical ultrasonography—such as pronounced speckle noise, low tissue contrast, and ambiguous anatomical boundaries—undermine segmentation accuracy. Canonical architectures like U-Net established the blueprint for biomedical segmentation but remain susceptible to gradient attenuation in deep layers and exhibit indiscriminate feature selection. The presented work proposes the Attention-ResUNet, an architecture that combines residual learning with hierarchical multi-scale attention mechanisms, engineered for robust, interpretable fetal head segmentation in challenging US conditions (2604.18148).

Architectural Innovations

The Attention-ResUNet integrates multi-scale attention gates at four consecutive decoder levels (operating on feature maps with 64, 128, 256, and 512 channels) atop an identity-mapping residual U-Net backbone. This configuration provides both spatially selective gating of salient anatomical structures and uninterrupted gradient flow throughout the network, addressing feature localization and training stability together.

Residual blocks reformulate layers into additive mappings $F(x) + x$, facilitating deep network optimization. Attention gates receive features from both the encoder and current decoder levels, compute spatial attention coefficients via parameterized convolutions, and modulate encoder features, effectively focusing on structurally relevant regions while suppressing background artifacts.
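The two mechanisms can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the additive gate form (sigmoid of a pointwise convolution over a ReLU of summed projections) is assumed from the original Attention U-Net formulation, $F$ is reduced to a single pointwise convolution with ReLU, and the helper names, channel counts, and random weights are all hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """Pointwise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def residual_block(x, weight, bias):
    """Additive mapping F(x) + x, with F simplified to ReLU(conv1x1(x))."""
    return np.maximum(conv1x1(x, weight, bias), 0.0) + x

def attention_gate(skip, gate, w_x, b_x, w_g, b_g, w_psi, b_psi):
    """Additive attention gate (assumed form, not confirmed by the source).

    skip: encoder features (C, H, W).
    gate: decoder features (C_g, H, W), assumed already resampled
          to the spatial size of skip.
    """
    q = np.maximum(conv1x1(skip, w_x, b_x) + conv1x1(gate, w_g, b_g), 0.0)
    alpha = 1.0 / (1.0 + np.exp(-conv1x1(q, w_psi, b_psi)))  # shape (1, H, W)
    return skip * alpha, alpha  # alpha broadcasts over the channel axis

# Toy shapes and random weights, purely for illustration.
rng = np.random.default_rng(0)
skip = rng.standard_normal((4, 8, 8))
gate = rng.standard_normal((6, 8, 8))
inter = 3  # hypothetical intermediate channel count
gated, alpha = attention_gate(
    skip, gate,
    rng.standard_normal((inter, 4)), np.zeros(inter),
    rng.standard_normal((inter, 6)), np.zeros(inter),
    rng.standard_normal((1, inter)), np.zeros(1),
)
out = residual_block(gated, 0.1 * rng.standard_normal((4, 4)), np.zeros(4))
```

Because each attention coefficient lies between 0 and 1, the gate can only attenuate encoder features, never amplify them, which is one way to read "suppressing background artifacts".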

Figure 1: (a) ROC curve, (b) Precision vs Recall curve, and (c) Confusion Matrix for the Attention ResUNet Architecture.

The union of these mechanisms in the proposed U-shaped network allows robust multiscale feature aggregation, with each skip connection passing through an attention gate for context-aware refinement. A final pointwise convolution and sigmoid activation yield the segmentation probability map.

Experimental Setup

Evaluation employed the HC18 Challenge dataset ($n=200$ for validation). Preprocessing included intensity normalization and random geometric and photometric augmentation to emulate real-world clinical image variability. Primary metrics included Dice Similarity Coefficient (DSC), Intersection over Union (IoU), precision, recall, F1-score, Hausdorff Distance (HD), and Average Symmetric Surface Distance (ASD). Training used Adam (lr = $1\times 10^{-4}$), batch normalization, and deterministic settings for reproducibility. Baseline comparisons encompassed ResUNet, Attention U-Net, Swin U-Net, standard U-Net, and U-Net++.
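For reference, the two overlap metrics can be computed as in the sketch below. It uses the standard definitions on binary masks. The function name, the toy masks, and the convention of scoring two empty masks as 1.0 are assumptions, not details taken from the paper.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as equally sized nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    p_sum = sum(1 for a in p if a)
    t_sum = sum(1 for b in t if b)
    if p_sum + t_sum == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0
    union = p_sum + t_sum - inter
    return 2.0 * inter / (p_sum + t_sum), inter / union

# Toy masks (illustrative only): the prediction misses one foreground pixel.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
target = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0]]
dice, iou = dice_and_iou(pred, target)  # dice = 10/11, iou = 5/6
```

For a single image the two metrics are tied by IoU = Dice / (2 - Dice), so they rank individual predictions identically. The relation does not carry over exactly to dataset means.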

Quantitative Performance

Attention-ResUNet achieved a mean Dice of $99.30 \pm 0.14\%$, surpassing all baselines, including ResUNet ($99.26\%$), Attention U-Net ($98.79\%$), Swin U-Net ($98.60\%$), standard U-Net ($98.58\%$), and U-Net++ ($97.46\%$), with all improvements statistically significant ($p < 0.001$; Cohen's $d$ effect sizes from $0.230$ to $13.159$). The AUC-ROC was 0.896 and AUC-PR was 0.998, evidencing superior sensitivity, specificity, and precision-recall trade-offs. The confusion matrix highlighted a negligible rate of false negatives.

Boundary metrics further validated the architecture’s efficacy: median HD was 38.0 pixels and median ASD was 13.2 pixels, the lowest values among the compared architectures.
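The sketch below shows how the two boundary metrics can be computed from boundary pixel coordinates. The source does not state which ASD variant was used. The version here pools both directed distance sets before averaging, which is one common convention and an assumption on our part. The function names and toy coordinates are hypothetical.

```python
import math

def directed_distances(a, b):
    """For each point in a, the Euclidean distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]

def hausdorff_and_asd(a, b):
    """Symmetric Hausdorff distance and average symmetric surface distance.

    a, b: non-empty lists of (row, col) boundary coordinates in pixels.
    """
    d_ab = directed_distances(a, b)
    d_ba = directed_distances(b, a)
    hd = max(max(d_ab), max(d_ba))  # worst-case boundary error
    asd = (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))  # mean boundary error
    return hd, asd

# Toy boundaries (illustrative only).
pred_boundary = [(0, 0), (0, 3)]
true_boundary = [(0, 0), (4, 3)]
hd, asd = hausdorff_and_asd(pred_boundary, true_boundary)  # hd = 4.0, asd = 1.75
```

HD reports the single largest deviation and is therefore sensitive to outliers, while ASD averages over the whole contour. That difference is why the two are usually reported together.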

Figure 2: Hausdorff Distance (left) and ASD (right) distributions.

Figure 3: Hausdorff and ASD Distance correlation analysis.

Statistical robustness was systematically established using paired t-tests and confidence intervals; the narrow interquartile ranges in Dice and IoU further reflected high segmentation consistency.
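A minimal sketch of the two statistics named in this section follows, using only the Python standard library. The source does not say which Cohen's $d$ variant was used, so the pooled-standard-deviation form here is an assumption. The input lists are toy values chosen for easy arithmetic, not scores from the paper.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic of a paired t-test on per-image scores a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Toy per-image scores (illustrative only, not from the paper).
model_a = [2.0, 4.0, 6.0]
model_b = [1.0, 2.0, 3.0]
t_stat = paired_t_statistic(model_a, model_b)  # 2 * sqrt(3)
d = cohens_d(model_a, model_b)                 # 2 / sqrt(2.5)
```

The p-value follows from the t distribution with $n - 1$ degrees of freedom. In practice `scipy.stats.ttest_rel` returns both the statistic and the p-value in one call.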

Figure 4: Statistical significance against the proposed model.

Interpretability via Saliency Analysis

Model interpretability was systematically assessed using Grad-CAM-based saliency mapping and comparative visualization across candidate architectures.

Figure 5: Saliency map with diffuse activation patterns for ResUNet

Figure 6: Saliency map showing precise focus on target region via Attention U-Net

Figure 7: Saliency map showing enhanced spatial precision via Attention ResUNet

ResUNet displayed diffuse activations spanning irrelevant regions. Attention U-Net sharply concentrated activations but occasionally lost structural completeness in challenging cases. Attention-ResUNet consistently yielded focused, anatomically congruent activations, exemplified by the highest spatial concentration index among the compared architectures. This indicates that multi-scale attention in conjunction with residual connections materially enhances clinical interpretability and trust.

Computational Considerations

Despite the added architectural complexity, computational efficiency was maintained: 14.7M parameters and 45 GFLOPs per slice, with 23 ms inference on RTX 3080 hardware. This supports deployment in time-critical clinical workflows.
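The reported figures translate into throughput with simple arithmetic, sketched below. The calculation assumes that "45 GFLOPs" counts floating-point operations per forward pass and that the 23 ms latency is for a single slice with no batching. Both are our reading of the numbers, not statements from the paper.

```python
latency_s = 23e-3        # reported inference time per slice
flops_per_slice = 45e9   # reported cost, read as operations per forward pass

frames_per_second = 1.0 / latency_s                   # about 43.5 slices per second
effective_flops_per_s = flops_per_slice / latency_s   # about 1.96e12 operations per second
```

At roughly 43 slices per second, this would exceed typical real-time video rates of 25 to 30 frames per second, provided the assumptions above hold.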

Limitations and Outlook

Although state-of-the-art performance was demonstrated on HC18, cross-manufacturer variability and broader clinical acquisition diversity remain untested. Degraded performance was noted under substantial acoustic shadowing (~3–5% of scans) and on non-standard head planes, with Dice dropping to 97–98%. Future work should prioritize extension to volumetric 3D US, clinical multi-center validation, uncertainty quantification for error-awareness, and domain adaptation for cross-device generalization.

Conclusion

Attention-ResUNet delivers measurable performance gains in automated fetal head segmentation, leveraging the synergy of residual learning and hierarchical attention for robust and interpretable prediction. Both the statistical evidence and the saliency analysis support its potential for clinical integration, provided that prospective validation on broader populations and imaging conditions is carried out first.

The demonstrated paradigm—residual backbone supporting multi-scale attention refinement—presents a robust template for further research in medical segmentation and other challenging domains where fine-grained, trustworthy localization is essential.
