Over a century ago, Willem Einthoven laid the foundation for the 12-lead ECG, which remains a cornerstone of daily clinical practice. Since then, researchers and clinicians have identified numerous ECG features critical for diagnosing cardiovascular diseases (CVDs) such as arrhythmias, coronary artery disease, heart failure, and valvular heart disease. With the advancement of computational power and the rise of machine learning (ML) approaches in many areas of daily life, there is growing interest in leveraging ML to extract novel features from ECGs for more precise and reliable disease detection. While AI-powered arrhythmia detection (e.g., atrial fibrillation) is already integrated into consumer devices such as smartwatches, broader AI-based CVD detection from the ECG remains under development.
To address this need, Hasan et al. present a sophisticated method to classify simple ECG printouts into four categories: abnormal heartbeat, myocardial infarction, history of myocardial infarction, and normal heartbeat.1 They employed a combination of machine learning and deep learning techniques for both feature extraction and classification, achieving an impressive accuracy of up to 99.29%. Notably, this high performance was achieved despite the known limitations of digitized ECG printouts, where printer quality and the digitization process can introduce significant variation into interval measurements.2,3
As in several other highly technical classification studies by ML specialists focusing on clinical diagnosis from ECGs, the reported accuracy is remarkably high. This raises the question: does this performance reflect genuine diagnostic ability, or might it originate elsewhere?
As in other studies on ML-based disease classification using ECGs, the group commendably emphasizes model reporting and reproducibility by describing all steps of modelling and data processing. They present all ML performance metrics that characterize discrimination and classification, including the area under the receiver operating characteristic curve (ROC-AUC), precision, recall, accuracy, and F1-score, alongside calibration plots. However, for a trustworthy implementation of AI in cardiology, essential quality standards, such as rigorous model validation, remain unaddressed.4,5 A sketch of what such an external validation could look like follows the list below. Furthermore, a comprehensive external validation could have highlighted key limitations of the training datasets:
- The clinical significance of the defined classes warrants scrutiny. While identifying patients with myocardial infarction (MI) or a history of MI is clinically relevant, classifying patients as having abnormal or normal heartbeats offers limited clinical utility in a study aimed at cardiovascular disease prediction. A clear clinical use case for the AI-based model, as called for by van Royen et al.,5 is lacking.
- Upon inspection of the raw dataset, it becomes evident that the myocardial infarction patients predominantly present with clear STEMI patterns, which can be identified without machine learning approaches. However, the distinction between patients with STEMI and those with a history of MI remains unclear and insufficiently specified, with several patients in both groups showing STEMI features. The seemingly strong discrimination between the two groups could therefore result from technical differences in the ECG acquisition systems, such as a low-pass filter setting of 100 Hz in the 2020 dataset compared with 25 Hz in the earlier datasets (the second sketch below reproduces this effect), as well as differences in the ECG lead arrangement in the figures. Furthermore, patients categorized in the abnormal heartbeat group appear to be predominantly characterized by tachycardic rhythms, further limiting the diagnostic value of this class.
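To make the external validation we advocate concrete, the following is a minimal sketch in Python. The fitted classifier `model` and the external dataset `X_ext`/`y_ext` are illustrative assumptions and do not refer to the pipeline of Hasan et al.; the sketch merely shows the discrimination and calibration metrics named above being recomputed on independent data.

```python
# Minimal sketch of an external validation step. `model`, `X_ext`, and
# `y_ext` are assumed: a fitted multiclass classifier and an independent
# dataset (different site, device, and period) never touched during training.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def external_validation_report(model, X_ext, y_ext):
    """Print discrimination and calibration metrics on held-out external data."""
    proba = model.predict_proba(X_ext)      # predicted class probabilities
    y_pred = proba.argmax(axis=1)           # hard class labels

    print("accuracy :", accuracy_score(y_ext, y_pred))
    print("precision:", precision_score(y_ext, y_pred, average="macro"))
    print("recall   :", recall_score(y_ext, y_pred, average="macro"))
    print("F1-score :", f1_score(y_ext, y_pred, average="macro"))
    # One-vs-rest ROC-AUC handles the four-class setting
    print("ROC-AUC  :", roc_auc_score(y_ext, proba, multi_class="ovr"))

    # Calibration per class: observed event rate vs. mean predicted probability
    for k in range(proba.shape[1]):
        observed, predicted = calibration_curve(
            (np.asarray(y_ext) == k).astype(int), proba[:, k], n_bins=10)
        print(f"class {k} calibration (predicted vs. observed):",
              list(zip(predicted.round(2), observed.round(2))))
```

The decisive point is less the metric calls than the provenance of `X_ext`: reporting these numbers on data from a different acquisition system and time period is what would expose shortcut learning of the kind discussed above.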
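The filter-related confound is also easy to reproduce. The second sketch below, using scipy on a synthetic ECG-like trace (the sampling rate, filter order, and the signal itself are illustrative assumptions; only the two cutoff values come from the datasets in question), shows how a 25 Hz low-pass cutoff blunts the QRS complex relative to a 100 Hz cutoff, creating a systematic morphological difference that tracks the dataset rather than the disease.

```python
# Sketch: the same trace filtered at the two low-pass cutoffs discussed
# above (25 Hz vs. 100 Hz). The synthetic signal and parameters are
# illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 500                                  # assumed sampling rate in Hz
t = np.arange(0, 2, 1 / fs)               # two seconds of signal
# Crude ECG surrogate: narrow Gaussian "QRS" spikes plus baseline wander
qrs = sum(np.exp(-((t - beat) ** 2) / (2 * 0.01 ** 2)) for beat in (0.5, 1.5))
ecg = qrs + 0.1 * np.sin(2 * np.pi * 0.3 * t)

def lowpass(x, cutoff_hz, fs, order=4):
    """Zero-phase Butterworth low-pass filter."""
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, x)

ecg_25 = lowpass(ecg, 25, fs)             # setting of the earlier datasets
ecg_100 = lowpass(ecg, 100, fs)           # setting of the 2020 dataset

# The 25 Hz version loses high-frequency QRS content: peaks are lower and
# wider, a dataset-dependent difference a classifier can latch onto.
print(f"QRS peak after 25 Hz cutoff : {ecg_25.max():.3f}")
print(f"QRS peak after 100 Hz cutoff: {ecg_100.max():.3f}")
```

If two diagnostic classes were recorded predominantly under different filter settings, a model can separate them from this artefact alone, without learning anything about the underlying pathology.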
This study was selected as a representative example to emphasize the importance of clearly defining data sources, their quality, and their characteristics, as well as of thoroughly validating the model, for a trustworthy implementation of AI in clinical practice. Adherence to standardized definitions and recommendations for the development of ML models in cardiology, as proposed for instance by van Smeden et al.,6 could help address this challenge. Further studies on this topic, along with feedback from stakeholders and recommendations from professional societies, are warranted to reduce the number of published studies with limited clinical value.