Learning ECG Image Representations via Dual Physiological-Aware Alignments

Published 2 Apr 2026 in cs.LG | (2604.01526v1)

Abstract: Electrocardiograms (ECGs) are among the most widely used diagnostic tools for cardiovascular diseases, and a large amount of ECG data worldwide appears only in image form. However, most existing automated ECG analysis methods rely on access to raw signal recordings, limiting their applicability in real-world and resource-constrained settings. In this paper, we present ECG-Scan, a self-supervised framework for learning clinically generalized representations from ECG images through dual physiological-aware alignments: 1) Our approach optimizes image representation learning using multimodal contrastive alignment between image and gold-standard signal-text modalities. 2) We further integrate domain knowledge via soft-lead constraints, regularizing the reconstruction process and improving signal lead inter-consistency. Extensive benchmarking across multiple datasets and downstream tasks demonstrates that our image-based model achieves superior performance compared to existing image baselines and notably narrows the gap between ECG image and signal analysis. These results highlight the potential of self-supervised image modeling to unlock large-scale legacy ECG data and broaden access to automated cardiovascular diagnostics.

Authors (6)

Summary

The paper presents ECG-Scan as a multimodal self-supervised framework that aligns ECG images with signal and text encoders using dual physiological-aware alignment for improved diagnostic performance.
It employs a Gramian-based contrastive loss and soft lead consistency to enforce geometric and physiological constraints, enhancing both signal reconstruction and diagnostic separability.
Results show that ECG-Scan achieves competitive linear probing and zero-shot AUC scores, demonstrating robustness to domain shifts and potential for automating legacy ECG image analysis.

Motivation and Context

Automated ECG analysis has undergone significant advances, primarily driven by self-supervised and multimodal learning approaches applied to digital ECG signals. However, real-world clinical practice and resource-constrained environments are dominated by paper- and image-based ECG records, with raw digital signals often unavailable (Figure 1). This modality gap restricts current foundation models to a subset of cases and creates barriers for large-scale retrospective analysis and broader diagnostic accessibility.

Figure 1: Typical workflows of ECG acquisition, archival, and remote review, emphasizing the prevalence and importance of image-based ECGs, especially in low-resource settings.

Recent vision-language models and digitization pipelines have begun leveraging ECG images for downstream tasks, but generic vision encoders and handcrafted conversion steps are poorly suited to the spatiotemporal structure and physiological semantics unique to ECGs. Clinical ECG interpretation relies not only on overall visual appearance but also on the precise waveform morphology and inter-lead relationships encoded in the spatial layout of ECG images.

Multimodal Framework Overview

The proposed ECG-Scan framework unifies three modalities—ECG images, signals, and clinical text—during self-supervised pretraining (Figure 2). The core architecture consists of:

A CLIP-based ECG image encoder, adapted via LoRA for strong feature extraction.
Frozen signal and text encoders (D-BETA and Bio-Med-CPT respectively) as teacher models for physiologically grounded and diagnostic guidance.
A Transformer-based signal decoder for reconstructing full 12-lead ECG signals from image representations.

ECG-Scan aligns image representations with physiologically rich signal and semantic text features, yielding clinically meaningful embeddings in both latent and reconstructed signal spaces.
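
The LoRA adaptation of the image encoder listed above can be illustrated with the standard low-rank update, where a frozen weight matrix W is augmented by a trainable product B·A scaled by alpha / r. This is a minimal sketch in plain Python, not the paper's implementation: the rank, the scaling factor, the layers being adapted, and the function names `matvec` and `lora_forward` are all assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x).

    W: frozen pretrained weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), usually initialized to zero
    """
    rank = len(A)
    base = matvec(W, x)                # frozen path
    delta = matvec(B, matvec(A, x))    # low-rank trainable path
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


# Toy values (hypothetical): a 2x3 frozen layer with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B_init = [[0.0], [0.0]]      # zero init: the adapted layer equals the frozen layer
B_trained = [[1.0], [2.0]]
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_init, x, alpha=2.0))
print(lora_forward(W, A, B_trained, x, alpha=2.0))
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained CLIP behavior and only the small A and B matrices are updated.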

Figure 2: ECG-Scan’s multimodal architecture aligning image, signal, and text representations via dual physiological-aware alignment strategies.

Dual Physiological-Aware Alignment Strategy

1. Gramian-Based Contrastive Alignment

ECG-Scan introduces a Gramian-based three-way contrastive loss that supplements standard image-text contrastive learning. The loss enforces higher-order geometric alignment across image, signal, and text representations, regularizing the embeddings so that image-derived features approach the physiological and semantic structure of the signal and text teacher modalities. Empirically, this improves diagnostic separability and yields zero-shot image-text performance that approaches signal-based counterparts.
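
The summary does not give the Gramian loss in closed form, so the following is only a sketch of the general idea under stated assumptions: the three unit-normalized embeddings span a parallelotope whose volume is the square root of the determinant of their Gram matrix, and a small volume means the three modalities point in nearly the same direction. The InfoNCE-style wrapper, the temperature value, and the function names are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_volume(image, signal, text):
    """Volume spanned by three unit-normalized embeddings: sqrt(det(G))."""
    vs = [normalize(image), normalize(signal), normalize(text)]
    g = [[dot(a, b) for b in vs] for a in vs]  # 3x3 Gram matrix
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    return math.sqrt(max(det, 0.0))  # clamp tiny negative rounding errors


def gram_contrastive_loss(images, signals, texts, temperature=0.1):
    """Assumed InfoNCE form: the matching (signal, text) pair should give
    the smallest volume among all candidate pairs in the batch."""
    n = len(images)
    total = 0.0
    for i in range(n):
        logits = [-gram_volume(images[i], signals[j], texts[j]) / temperature
                  for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy batch (hypothetical 3-d embeddings) with well-aligned triples.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
signals = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0]]
texts = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]
aligned_loss = gram_contrastive_loss(images, signals, texts)
shuffled_loss = gram_contrastive_loss(images, signals[::-1], texts[::-1])
```

In this toy batch the loss is lower when each image is paired with its own signal and text than when the teacher embeddings are swapped, which is the behavior a three-way alignment objective is meant to reward.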

2. Soft Lead Consistency Alignment

Signal reconstruction fidelity is further regularized using soft-lead physiological constraints based on Einthoven’s and Goldberger’s laws (Figure 3). Rather than enforcing strict equality, these soft constraints gently penalize violations of the electrophysiological relationships among limb leads, which improves robustness to real-world noise and incomplete signals. The approach avoids over-constraining the model, stabilizes training, and keeps reconstructions close to plausible cardiac morphology.
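
One way such a soft constraint could look is sketched below. The limb-lead relations in the comments are the standard Einthoven and Goldberger identities; the tolerance band, the squared hinge, and the names `derive_limb_leads` and `soft_lead_penalty` are assumptions for illustration, since the summary does not specify how the penalty is formed.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Build a physiologically consistent set of limb leads from I and II."""
    return {
        "I": list(lead_i),
        "II": list(lead_ii),
        "III": [b - a for a, b in zip(lead_i, lead_ii)],           # III = II - I
        "aVR": [-(a + b) / 2.0 for a, b in zip(lead_i, lead_ii)],  # aVR = -(I + II) / 2
        "aVL": [a - b / 2.0 for a, b in zip(lead_i, lead_ii)],     # aVL = I - II / 2
        "aVF": [b - a / 2.0 for a, b in zip(lead_i, lead_ii)],     # aVF = II - I / 2
    }


def soft_lead_penalty(leads, tolerance=0.05):
    """Mean squared violation of the limb-lead identities beyond a tolerance.

    leads: dict mapping lead name to a list of samples (same length, in mV).
    Residuals smaller than `tolerance` (a hypothetical value) cost nothing,
    which is what makes the constraint soft rather than a strict equality.
    """
    n = len(leads["I"])
    total = 0.0
    for t in range(n):
        i, ii, iii = leads["I"][t], leads["II"][t], leads["III"][t]
        residuals = (
            ii - (i + iii),                    # Einthoven
            leads["aVR"][t] + (i + ii) / 2.0,  # Goldberger, aVR
            leads["aVL"][t] - (i - ii / 2.0),  # Goldberger, aVL
            leads["aVF"][t] - (ii - i / 2.0),  # Goldberger, aVF
        )
        for r in residuals:
            excess = max(abs(r) - tolerance, 0.0)
            total += excess * excess
    return total / (4 * n)


consistent = derive_limb_leads([0.1, 0.5, -0.2], [0.3, 1.0, 0.1])
violating = dict(consistent)
violating["aVF"] = [v + 0.5 for v in consistent["aVF"]]  # inject a 0.5 mV offset
```

A reconstruction that satisfies the identities, or deviates from them by less than the tolerance, incurs no penalty, while the offset aVF lead above is penalized in proportion to the squared excess.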

Figure 3: Einthoven’s Law and Goldberger’s equations encoding limb lead relationships, used for physiological regularization of signal reconstruction.
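
For reference, the standard limb-lead identities usually written under these two names are given here in their textbook form; the summary does not show how the paper itself notates them.

```latex
\begin{aligned}
\mathrm{II}  &= \mathrm{I} + \mathrm{III} && \text{(Einthoven)} \\
\mathrm{aVR} &= -\tfrac{1}{2}\,(\mathrm{I} + \mathrm{II}) && \text{(Goldberger)} \\
\mathrm{aVL} &= \mathrm{I} - \tfrac{1}{2}\,\mathrm{II} \\
\mathrm{aVF} &= \mathrm{II} - \tfrac{1}{2}\,\mathrm{I}
\end{aligned}
```

Summing the three augmented leads gives aVR + aVL + aVF = 0, so only two of the six limb leads are linearly independent. This redundancy is what makes inter-lead consistency a useful check on a reconstructed signal.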

Data Synthesis, Pretraining, and Evaluation Protocols

To circumvent the scarcity of paired image-signal-text datasets, ECG-Scan synthetically renders realistic ECG images from the large-scale MIMIC-IV-ECG signal-text corpus. During pretraining, images are augmented on the fly to mimic acquisition variations (Figure 4), which improves robustness and increases the diversity of the training images. Downstream evaluation covers linear probing and zero-shot classification across the PTB-XL, CSN, CPSC2018, and CODE-test datasets. Model checkpoints and infrastructure follow standardized protocols for reproducibility.
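
The summary does not list which augmentations are applied, so the sketch below only illustrates the on-the-fly pattern with three hypothetical photometric perturbations (brightness shift, contrast scaling, additive noise). The parameter ranges and the `augment` function are assumptions, and the image is a plain list of rows to keep the example self-contained.

```python
import random


def augment(image, rng, max_brightness=0.1, contrast_range=(0.8, 1.2), noise_std=0.02):
    """Return a randomly perturbed copy of a grayscale image.

    image: list of rows, each a list of pixel intensities in [0, 1].
    rng:   a random.Random instance, so augmentation is reproducible per seed.
    """
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = rng.uniform(*contrast_range)
    augmented = []
    for row in image:
        new_row = []
        for pixel in row:
            value = (pixel - 0.5) * contrast + 0.5 + brightness
            value += rng.gauss(0.0, noise_std)         # sensor-like noise
            new_row.append(min(max(value, 0.0), 1.0))  # keep a valid intensity
        augmented.append(new_row)
    return augmented


# A fresh draw per training step yields a different view of the same rendering.
rendered = [[0.5] * 8 for _ in range(4)]  # placeholder for a rendered ECG image
view_a = augment(rendered, random.Random(0))
view_b = augment(rendered, random.Random(1))
```

Because the perturbation is sampled each time the image is loaded, the encoder rarely sees exactly the same pixels twice, which is the usual motivation for dynamic rather than precomputed augmentation.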

Figure 4: Examples of stochastic image augmentations applied during pretraining, simulating real-world variability and improving robustness.

Numerical Results and Comparative Analysis

Linear Probing

ECG-Scan outperforms generic image-only and image-to-signal baselines by 3–12% absolute AUC in most supervised settings, and narrows the gap with signal-based foundation models to within 1–4% under full supervision. Notably, although the images show only 2.5 s of temporal context for most leads, ECG-Scan in some settings surpasses strong signal models trained on full 10 s signals.
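
Linear probing, as used here, means training only a linear classifier on top of frozen embeddings. The following is a minimal sketch with logistic regression fitted by gradient descent in plain Python; the toy features, learning rate, and epoch count are hypothetical, and a real evaluation would use the encoder's actual embeddings and a standard library optimizer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe; the encoder itself is never updated."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y  # gradient of the log loss
            for k in range(dim):
                grad_w[k] += error * x[k]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy frozen embeddings (hypothetical) for a single binary diagnostic label.
features = [[2.0, 0.0], [1.5, 0.2], [-1.0, 0.1], [-2.0, -0.3]]
labels = [1, 1, 0, 0]
weights, bias = train_linear_probe(features, labels)
probabilities = [predict_proba(weights, bias, x) for x in features]
```

Since only the linear layer is trained, probe accuracy reflects how linearly separable the diagnostic classes already are in the pretrained representation.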

Figure 5: Impact of lead duration on linear probing performance, showing strong robustness of ECG-Scan for image-derived representations.
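
To make the 2.5 s figure concrete, the sketch below assumes the common 3x4 printout, in which each of four columns shows a consecutive quarter of a 10 s recording. The summary does not state which layout the paper renders, so the layout table, the 500 Hz sampling rate, and the helper names are assumptions.

```python
# Hypothetical 3x4 layout: rows are printed strips, columns are time quarters.
LAYOUT_3X4 = [
    ["I", "aVR", "V1", "V4"],
    ["II", "aVL", "V2", "V5"],
    ["III", "aVF", "V3", "V6"],
]


def visible_windows(total_seconds=10.0):
    """Map each lead to the (start, end) time in seconds shown on the image."""
    n_cols = len(LAYOUT_3X4[0])
    col_seconds = total_seconds / n_cols
    windows = {}
    for row in LAYOUT_3X4:
        for col, lead in enumerate(row):
            windows[lead] = (col * col_seconds, (col + 1) * col_seconds)
    return windows


def visible_segment(samples, window, sampling_rate):
    """Slice the part of a full-length lead that the image actually displays."""
    start = int(round(window[0] * sampling_rate))
    end = int(round(window[1] * sampling_rate))
    return samples[start:end]


windows = visible_windows()
full_lead = list(range(5000))  # 10 s at an assumed 500 Hz
v1_segment = visible_segment(full_lead, windows["V1"], sampling_rate=500)
```

Under this layout an image-based model sees a quarter of the samples per lead that a signal model sees, and the visible quarter differs from column to column.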

Zero-Shot Diagnostics

ECG-Scan achieves an average zero-shot AUC of 75.8%, outperforming all image-to-signal conversion baselines and nearly matching signal-text foundation models (MELP, D-BETA). On CODE-test, ECG-Scan’s zero-shot image-text classification surpasses both medical residents and medical students (AUC 94.78–93.61%).
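
Zero-shot classification with aligned image and text encoders is typically done by scoring an image embedding against the text embedding of each diagnostic prompt, and the reported AUC can be computed from those scores without choosing a threshold. The sketch below assumes cosine similarity as the score; the embeddings, class names, and function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(image_embedding, class_text_embeddings):
    """Score each class by similarity between the image and its text prompt."""
    return {label: cosine(image_embedding, emb)
            for label, emb in class_text_embeddings.items()}


def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy embeddings (hypothetical) for one image and two class prompts.
class_text_embeddings = {
    "atrial fibrillation": [0.9, 0.1],
    "normal ECG": [0.0, 1.0],
}
scores = zero_shot_scores([1.0, 0.0], class_text_embeddings)

# AUC for one class over four images, given their scores and true labels.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

No labeled training data is used at any point; the class definitions enter only through the text prompts, which is what allows evaluation on diagnoses the model was never explicitly trained to predict.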

Domain Shift

Under cross-dataset distribution shift, ECG-Scan maintains an average AUC of 80%, slightly exceeding the highest-performing signal foundation model (MELP, 79.6%). This indicates that the learned image representations are robust to distribution shift and transfer well across datasets.

Ablations and Modality Sensitivity

Disabling either the Gramian or the soft-lead alignment degrades performance by 2–9%, so both components of the dual alignment strategy are needed for strong diagnostic separation. The approach is compatible with alternate encoders, though CLIP and Bio-Med-CPT perform best.

Theoretical and Practical Implications

ECG-Scan substantiates the claim that self-supervised multimodal pretraining can extract physiologically faithful and diagnostically useful ECG representations from images. The dual alignment strategy, informed by domain physiology and geometric relational losses, generalizes beyond handcrafted digitization and outperforms generic visual encoders for downstream classification, report generation, and domain adaptation tasks.

Practically, this unlocks massive global archives of legacy ECG images for automated analysis and supports equitable diagnostic access in regions lacking digital signal infrastructure. Theoretically, it motivates further exploration of physiological priors and geometric multimodal losses as general principles for medical imaging foundation models. The approach lays the groundwork for future ECG vision-language models, multimodal report generation, and interpretable image-based diagnostics.

Conclusion

ECG-Scan presents the first multimodal self-supervised framework for learning robust ECG image representations via dual physiological-aware alignment. Comprehensive evaluation demonstrates performance competitive with signal-based foundation models, strong zero-shot diagnostic capabilities, and resilience to domain shift. Future avenues include scaling pretraining to larger heterogeneous ECG image archives and validating under real clinical imaging conditions. The methodology paves the way for democratized access to automated cardiovascular diagnostics and physiologically informed multimodal representation learning.