Dataset and Readers
A development dataset was collected from Shenzhen Children’s Hospital (SZCH) between January 2021 and April 2024. It comprised 1518 cases with a gestational age (GA) of 35.00 weeks (95% confidence interval [CI]: 25.14–40.86). The development dataset was split into training and testing subsets at an 80:20 ratio for model validation.
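A minimal sketch of such an 80:20 case-level split is shown below, assuming stratification by diagnostic label; the identifiers, label counts, and random seed are placeholders, not the authors’ actual implementation.

```python
# Minimal sketch of the 80:20 development split (assumed to be case-level
# and stratified by label; these details are not stated in the text).
from sklearn.model_selection import train_test_split

case_ids = [f"case_{i:04d}" for i in range(1518)]  # 1518 development cases
labels = [0] * 1200 + [1] * 318                    # placeholder lesion labels

train_ids, test_ids, y_train, y_test = train_test_split(
    case_ids, labels,
    test_size=0.20,      # 80:20 ratio as stated above
    stratify=labels,     # keep lesion prevalence comparable across subsets
    random_state=42,     # arbitrary seed for reproducibility
)
print(len(train_ids), len(test_ids))  # -> 1214 304
```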
An internal test dataset of 199 cases (GA 36.14 weeks; 95% CI: 27.23–41.00) was additionally collected at SZCH from January to June 2023. An external test dataset of 356 cases (GA 37.00 weeks; 95% CI: 26.57–40.86) was collected between October 2023 and April 2024 from three other centers: Guangzhou Panyu District Maternal and Child Health Care Hospital (GZMCH), Sichuan Provincial Maternity and Child Health Care Hospital (SCMCH), and Changsha Hospital for Maternal and Child Health Care (CSMCH).
Each case included four to six cranial ultrasound (CUS) standard views: the anterior horn view (AHV), third ventricle view (TVV), and body view (BV) in the coronal plane, and the midsagittal view (MSV), right parasagittal view (RPSV), and left parasagittal view (LPSV) in the sagittal plane. Each case also included two CUS videos: a coronal sweep and a sagittal sweep.
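For concreteness, one case could be represented as in the sketch below; the class, field names, and view keys are hypothetical conveniences for illustration, not structures taken from the paper.

```python
# Hypothetical per-case container mirroring the views and sweeps above.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np

STANDARD_VIEWS = ["AHV", "TVV", "BV", "MSV", "RPSV", "LPSV"]

@dataclass
class CUSCase:
    case_id: str
    gestational_age_weeks: float
    views: Dict[str, np.ndarray] = field(default_factory=dict)  # 4-6 per case
    coronal_sweep: Optional[str] = None   # path to the coronal sweep video
    sagittal_sweep: Optional[str] = None  # path to the sagittal sweep video

    def missing_views(self) -> List[str]:
        """Standard views not acquired for this case (up to two may be absent)."""
        return [v for v in STANDARD_VIEWS if v not in self.views]
```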
Table 1 summarizes the demographic characteristics of the included infants, and Supplementary Figure 1 shows the distribution of diseases, GA groups, and probe types across the datasets. Imaging details, including transducer frequency, manufacturer, model name, and depth, are provided in Supplementary Tables 1 and 2.
NCLS Workflow Overview
The Neonatal Cerebral Lesions Screening (NCLS) system operates as a two-stage workflow. Stage 1 detects anatomical structures and uses the detection results to extract standard views from the CUS videos. Stage 2 performs the diagnostic task, classifying the severity of cerebral lesions in the extracted views. The NCLS was trained and validated with a five-fold cross-validation strategy on the development dataset, and its performance was assessed on the internal and external test datasets.
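The five-fold scheme can be sketched as follows. Only the cross-validation scaffolding mirrors the text: `LogisticRegression` on random features is merely a stand-in for the actual NCLS stage-2 classifier and image data, which are not published here.

```python
# Five-fold cross-validation scaffolding; model and features are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1518, 64))    # placeholder features for 1518 cases
y = rng.integers(0, 2, size=1518)  # placeholder binary severity labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")
print(aucs, aucs.mean())  # per-fold AUC and the cross-validated average
```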
Twenty-five radiologists with varying levels of experience in CUS diagnosis were recruited: junior radiologists with approximately 1–2 years of clinical experience, mid-level radiologists with 3–7 years, and senior radiologists with more than 8 years, including authoritative experts with over 15 years. The senior radiologists annotated the development dataset, evaluating each case meticulously. Each CUS image in the development set was diagnosed independently as normal, intraventricular hemorrhage (IVH), ependymal cyst, ventricular dilation, hydrocephalus, or periventricular leukomalacia (PVL). In contrast, the diagnostic labels of the internal and external test sets followed the clinical workflow and were derived from the CUS videos.
Comparative Study: NCLS vs. Radiologists
The performance of the NCLS was compared with that of nine junior and eleven mid-level radiologists on the internal and external test datasets. As shown in Table 2, the NCLS achieved a sensitivity of 0.875 (95% CI: 0.687–1.000) and a specificity of 0.934 (95% CI: 0.897–0.967) on the internal test dataset, with an area under the receiver operating characteristic curve (AUC) of 0.982 (95% CI: 0.958–0.997) and an F1-score of 0.667.
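These metrics can be reproduced from per-case predictions as follows; the toy labels, scores, and 0.5 threshold are illustrative, not the study’s data or operating point.

```python
# Computing sensitivity, specificity, AUC, and F1 from toy predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true  = np.array([1, 1, 0, 0, 0, 1, 0, 0])                  # ground truth
y_score = np.array([0.9, 0.8, 0.2, 0.4, 0.1, 0.3, 0.6, 0.2])  # model scores
y_pred  = (y_score >= 0.5).astype(int)                        # toy threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true-positive rate (recall)
specificity = tn / (tn + fp)  # true-negative rate
auc = roc_auc_score(y_true, y_score)
f1  = f1_score(y_true, y_pred)
print(sensitivity, specificity, auc, f1)
```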
In contrast, junior radiologists showed an average sensitivity of 0.875 (95% CI: 0.810–0.924) but a notably lower specificity of 0.851 (95% CI: 0.833–0.868), indicating a tendency toward false positives. Mid-level radiologists achieved a sensitivity of 0.813 (95% CI: 0.747–0.867) and a specificity of 0.986 (95% CI: 0.980–0.991). Fleiss’ kappa indicated strong agreement within the mid-level group, whereas the junior radiologists showed low inter-rater agreement.
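Fleiss’ kappa can be computed with statsmodels as sketched below; the ratings matrix is a toy example (cases x raters), not the study’s data.

```python
# Fleiss' kappa from a toy (cases x raters) matrix of categorical ratings.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([    # 6 cases, 5 raters; 0 = normal, 1 = lesion (toy)
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
])
table, _ = aggregate_raters(ratings)  # per-case counts for each category
print(fleiss_kappa(table))            # 1.0 = perfect agreement; ~0 = chance
```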
On the external test dataset, the NCLS achieved a sensitivity of 0.962 (95% CI: 0.869–1.000) and a specificity of 0.927 (95% CI: 0.899–0.953). Its diagnostic performance remained consistent and superior relative to the variability observed among the radiologists.
AI Enhanced Performance of Junior Radiologists
The effect of NCLS assistance on the diagnostic accuracy of junior radiologists was assessed next. In a sequential study, one month after their initial evaluations, the junior radiologists reassessed the test cases with the support of the NCLS. As shown in Table 2, both sensitivity and specificity improved, with sensitivity rising by an average of 9.72% on the internal test set and 19.24% on the external test set.
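The text does not state whether these gains are absolute percentage points or relative changes; the sketch below computes both from hypothetical per-reader sensitivities.

```python
# Averaged per-reader sensitivity gain, absolute and relative (toy values).
import numpy as np

sens_before = np.array([0.80, 0.85, 0.90])  # unaided sensitivities (toy)
sens_after  = np.array([0.90, 0.92, 0.95])  # AI-assisted sensitivities (toy)

abs_gain_pp  = 100 * (sens_after - sens_before)                # points
rel_gain_pct = 100 * (sens_after - sens_before) / sens_before  # relative %
print(abs_gain_pp.mean(), rel_gain_pct.mean())
```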
With NCLS assistance, many misdiagnosed cases were corrected, particularly severe ones, improving overall accuracy without compromising specificity. Most junior radiologists showed notable improvements in their diagnostic metrics, suggesting that AI assistance can be a valuable aid in clinical practice.
The Randomized Trial: AI vs. Radiologists
To measure the diagnostic performance of the NCLS while reducing bias, a blinded, randomized trial involving nine junior radiologists was conducted. The diagnostic material, generated either by the AI system or by the junior radiologists, was randomized before presentation so that reviewers could not identify its source.
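One plausible way to implement such blinding is sketched below: pool the AI-generated and radiologist-generated materials, shuffle them, and keep the source key aside for later unblinding. This scheme is our assumption of a reasonable implementation, not the paper’s exact protocol.

```python
# Sketch of blinded, randomized presentation of diagnostic material.
import random

ai_reports     = [("AI", f"report_ai_{i}") for i in range(50)]  # toy items
junior_reports = [("JR", f"report_jr_{i}") for i in range(50)]

pool = ai_reports + junior_reports
random.seed(0)        # fixed seed for reproducibility of the toy example
random.shuffle(pool)  # randomized presentation order

blinded    = [content for _, content in pool]         # shown to reviewers
source_key = {content: src for src, content in pool}  # kept for unblinding
```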
Diagnoses in the AI subgroup required substantially fewer amendments than those of the junior radiologists, with a revision rate of 5.5% versus 20.7%. Diagnosis was also notably faster with the AI-generated material, underscoring the potential efficiency gains of AI in clinical workflows.
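Since only the rates are reported here, the counts below are invented to reproduce 5.5% and 20.7%; a two-proportion z-test is one reasonable way to compare such rates, though the paper’s actual statistical test may differ.

```python
# Comparing revision rates with a two-proportion z-test (toy counts).
from statsmodels.stats.proportion import proportions_ztest

revisions = [11, 41]    # revised diagnoses: AI subgroup, junior subgroup
totals    = [200, 198]  # diagnoses reviewed per subgroup (11/200 = 5.5%,
                        # 41/198 ~= 20.7%)
stat, pvalue = proportions_ztest(count=revisions, nobs=totals)
print(f"z = {stat:.2f}, p = {pvalue:.4f}")
```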
NCLS Evaluation Using Blind Sweeping Data
The NCLS was further validated on blind-sweep data acquired under real-world constraints, in which operators could not monitor the ultrasound screen. The extracted standard views were independently reviewed by senior radiologists in qualitative assessments and met clinical diagnostic standards.
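A natural way to extract a standard view from such a sweep is to keep the frame that the stage-1 detector scores highest; the helper below is a hypothetical sketch, with a dummy scoring function standing in for the detector.

```python
# Hedged sketch: pick the best frame of a blind sweep by detector score.
import numpy as np

def select_standard_view(frames, score_fn):
    """Return (best_frame, score), where score_fn stands in for the
    stage-1 detector's confidence that the required structures are visible."""
    scores = [score_fn(f) for f in frames]
    best = int(np.argmax(scores))
    return frames[best], scores[best]

# Toy usage: 100 random "frames", mean intensity as a dummy score.
frames = [np.random.rand(128, 128) for _ in range(100)]
best_frame, best_score = select_standard_view(frames, lambda f: f.mean())
```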
In this setting, the NCLS achieved strong predictive performance, suggesting that it can serve as a practical tool where experienced radiologists are scarce and holds promise for broader application in underserved or resource-limited environments.
Overall, this study underscores the potential of integrating AI systems into neonatology, paving the way for improved diagnostic outcomes and patient care.