Study Design and Participants
This retrospective cohort study drew on records from the Second Affiliated Hospital of Shaanxi University of Chinese Medicine. The initial dataset comprised 26,540 pregnancy-related records, covering both deliveries and threatened abortion cases among women aged 20 to 40, dated from January 5, 2022, to March 29, 2024. After the exclusion criteria were applied, the final cohort consisted of 3,253 women.
The protocol for this study received ethical approval from the hospital’s Ethical Committee (Approval Number: LW2024004-1, dated April 23, 2024). The approval process followed the ethical guidelines of the latest version of the Declaration of Helsinki. Because the study was retrospective and participant information was anonymized, individual informed consent was not required.
This research conforms to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines, pertinent to observational studies, as well as the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines. The STROBE checklist can be found in Supplementary File 1, whereas the TRIPOD checklist is available in Supplementary File 2.
During the data screening phase, the research team employed a rigorous step-by-step strategy to safeguard data quality. An algorithm was initially utilized to batch screen patient data containing considerable amounts of missing or zero values. Following this, medical experts manually verified values that deviated significantly from the normal range, further refining the data by excluding abnormal entries.
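As a rough illustration of the batch-screening step, the sketch below drops records whose share of missing or zero values exceeds a cutoff. The 30% cutoff, the field names, and the function names are assumptions made for this example; the study does not report the threshold or the algorithm it used.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def missing_or_zero_fraction(record: Record) -> float:
    """Share of fields in one record that are missing (None) or exactly zero."""
    if not record:
        return 1.0
    flagged = sum(1 for value in record.values() if value is None or value == 0)
    return flagged / len(record)


def batch_screen(records: List[Record], max_fraction: float = 0.3) -> List[Record]:
    """Keep records whose missing-or-zero share is at most max_fraction.

    The default cutoff is hypothetical, not taken from the study.
    """
    return [r for r in records if missing_or_zero_fraction(r) <= max_fraction]


# Toy records with made-up field names and values
records = [
    {"wbc": 7.1, "rbc": 4.2, "hgb": 128.0, "plt": 230.0},
    {"wbc": None, "rbc": 0.0, "hgb": 131.0, "plt": 0.0},
]
kept = batch_screen(records)
```

Records that pass this automatic filter would still go to the manual expert review described above.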
The method adopted for processing patient data prioritized the most relevant information. For example, if a patient had both negative (healthy examination) and positive (threatened abortion) records, only the first-visit positive record was retained. When only one type of record was available, the first-visit record was retained. For patients diagnosed with threatened abortion, only examination results from the first trimester were used.
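The record-selection rule can be sketched as follows. This is a minimal illustration, not the study's code: the `Visit` type, its field names, and the function name are hypothetical, and the first-trimester restriction is assumed to have been applied before this step.

```python
from datetime import date
from typing import Dict, List, NamedTuple


class Visit(NamedTuple):
    patient_id: str
    visit_date: date
    threatened_abortion: bool  # True = positive record, False = healthy examination


def select_record_per_patient(visits: List[Visit]) -> Dict[str, Visit]:
    """Pick one record per patient.

    If a patient has any positive record, keep the earliest positive one;
    otherwise keep the earliest record of the only type available.
    """
    by_patient: Dict[str, List[Visit]] = {}
    for visit in visits:
        by_patient.setdefault(visit.patient_id, []).append(visit)

    selected: Dict[str, Visit] = {}
    for patient_id, records in by_patient.items():
        positives = [v for v in records if v.threatened_abortion]
        pool = positives if positives else records
        selected[patient_id] = min(pool, key=lambda v: v.visit_date)
    return selected


# Made-up visits for two hypothetical patients
visits = [
    Visit("A", date(2022, 3, 1), False),
    Visit("A", date(2022, 3, 15), True),
    Visit("A", date(2022, 4, 2), True),
    Visit("B", date(2022, 6, 1), False),
    Visit("B", date(2022, 5, 1), False),
]
chosen = select_record_per_patient(visits)
```

Patient A has both record types, so the earliest positive visit is kept; patient B has only healthy examinations, so the earliest of those is kept.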
Outlier data and specific instances, such as multiple pregnancies with at least three births, patients diagnosed with missed abortion, damaged eggs, or those undertaking planned termination of pregnancy, were systematically excluded. Additionally, patients with major ailments like severe coronary artery disease, stroke, malignant diseases, or those lost to follow-up were excluded from the analysis. Ultimately, the study enrolled 1,764 pregnant women with threatened abortion and 1,489 healthy pregnant women.
Inclusion and Exclusion Criteria
The study outlined clear inclusion criteria:
- Diagnosis of intrauterine pregnancy confirmed via clinical ultrasonography.
- The threatened abortion group consisted of women experiencing pregnancy-related vaginal bleeding, while the healthy pregnancy group included those without such complications.
- Participants were required to be aged between 20 and 40 years.
Conversely, the exclusion criteria encompassed:
- Multiple pregnancies with three or more births.
- Diagnoses of missed abortion, egg damage, or planned pregnancy termination.
- The presence of significant medical conditions such as severe coronary artery disease, stroke, or malignant diseases.
- Loss to follow-up.
Data Partitioning and Machine Learning Algorithms
Following data preprocessing and screening, the dataset was split into training, validation, and test sets at the commonly used ratio of 8:1:1. This partitioning allocated 2,602 cases to the training set and 325 cases to each of the validation and test sets. Eight machine learning algorithms were then applied to the data to produce the final predictions.
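A minimal sketch of an 8:1:1 split over shuffled record indices is shown below. The seed, the function name, and the decision to give any remainder to the test set are assumptions of this example; the study does not describe how the split was randomized or whether it was stratified by outcome.

```python
import random
from typing import List, Tuple


def split_8_1_1(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices and cut them into training, validation, and test parts.

    Training and validation sizes are floored to 80% and 10% of n_samples;
    whatever is left over goes to the test part, so its size can differ by
    a record or two from the validation part.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = n_samples * 8 // 10
    n_val = n_samples // 10

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_8_1_1(3253)
```

With integer flooring, the three parts always cover every record exactly once, which is easy to verify by comparing the combined indices with the original range.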
Data Collection and Preprocessing
We gathered blood test data from all participants to construct the dataset for analysis. At admission, routine blood tests were performed on automated analyzers, including the Mindray Automatic Blood Cell Analyzer BC-7500, Abbott CELL-DYN series, and Sysmex Europe-XN series blood cell analyzers.
To prevent differences in units and scale among the routine blood indicators from distorting the analysis, Z-score normalization was applied. This normalization makes the indicators comparable and stabilizes model training. It uses the following formula:
\[
z = \frac{x - \mu}{\sigma}
\]
Here, \( z \) represents the normalized value, \( x \) is the raw data point, \( \mu \) denotes the dataset mean, and \( \sigma \) signifies the standard deviation. This step is crucial in enabling fair comparisons across diverse data points, enhancing the model’s resilience to variability.
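The normalization can be sketched in a few lines. Two details here are assumptions, since the text does not specify them: the mean and standard deviation are estimated on the training data only and then reused for new data, and the population standard deviation is used. The function names and the sample values are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_zscore(train_values: List[float]) -> Tuple[float, float]:
    """Estimate the mean and population standard deviation of one indicator."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        raise ValueError("Standard deviation is zero; the indicator is constant.")
    return mu, sigma


def apply_zscore(values: List[float], mu: float, sigma: float) -> List[float]:
    """Apply z = (x - mu) / sigma to every value."""
    return [(x - mu) / sigma for x in values]


# Made-up values for a single indicator
train_values = [110.0, 120.0, 130.0, 140.0]
mu, sigma = fit_zscore(train_values)
z_train = apply_zscore(train_values, mu, sigma)

# A new value is scaled with the training statistics, not its own
z_new = apply_zscore([125.0], mu, sigma)
```

After the transformation, the training values have mean 0 and standard deviation 1, and a new value equal to the training mean maps to 0.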
Model Construction and Optimization
For developing predictive models aimed at the early screening of threatened abortion, eight ML models were selected, each renowned for its specific strengths in addressing distinct predictive challenges. These models include:
- Logistic Regression (LR): Primarily utilized for binary classification tasks, especially where interpretability is pivotal. LR is widely adopted in various clinical studies, making it a trustworthy choice.
- Deep Neural Networks (DNN): Recognized for their capacity to decipher intricate patterns through multiple interconnected layers, DNNs excel in identifying complex relationships within large datasets.
- Support Vector Machine (SVM): Effectively used for classification in high-dimensional spaces, SVMs find optimum hyperplanes separating different classes, particularly suitable for datasets with numerous features.
- Extreme Gradient Boosting (XGB): An advanced tree-based learning algorithm that amalgamates several weak learners, XGB is lauded for its performance, accuracy, and scalability.
- Gradient Boosting Machine (GBM): This model enhances predictive accuracy through sequential learning, correcting the errors of preceding models iteratively.
- Decision Trees (DT): A supervised learning algorithm that divides data into subsets based on decision rules, facilitating efficient prediction and classification.
- Naive Bayes (NB): A simplistic yet effective model ideal for text classification tasks, NB operates under the premise of predictor independence, ensuring computational efficiency.
- Random Forests (RF): An ensemble learning technique that constructs multiple decision trees, merging their predictions for greater accuracy and stability, especially in handling complex datasets.
All models were trained in Python (version 3.7.0) under a common training procedure to ensure uniformity. The data were partitioned into training, validation, and test sets, and the same standardization procedure was applied to each.
Hyperparameter optimization was performed with RandomizedSearchCV, a more efficient alternative to grid search, particularly in high-dimensional parameter spaces. The search space covered model-specific hyperparameters, such as learning rates, hidden layer sizes, and activation functions.
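The following sketch shows how a randomized search might be set up for one of the eight models, logistic regression, using scikit-learn. Everything specific in it is an assumption for illustration: the synthetic data stand in for the 24-attribute dataset, and the search range for `C`, the number of sampled configurations, the scoring metric, and the use of tenfold cross-validation inside the search are not taken from the study.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 24 features (not the study's data)
X, y = make_classification(n_samples=300, n_features=24, random_state=0)

# Scaling sits inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical search space: regularization strength on a log scale
param_distributions = {
    "logisticregression__C": loguniform(1e-3, 1e2),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled configurations (assumed)
    cv=10,              # tenfold cross-validation (assumed for the search)
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Unlike a grid search, the number of configurations tried is fixed by `n_iter` regardless of how many hyperparameters are in the search space, which is what makes the approach practical for models with many settings.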
Model Evaluation
To evaluate model performance and guard against overfitting, the study employed tenfold cross-validation. From the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), four performance metrics were calculated: sensitivity, specificity, F1 score, and overall accuracy. In the F1 score, precision is TP / (TP + FP) and recall is identical to sensitivity.
\[
\text{Sensitivity} = \frac{TP}{TP + FN}
\]
\[
\text{Specificity} = \frac{TN}{TN + FP}
\]
\[
\text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\]
\[
\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}
\]
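The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The counts in the example are made up, and returning 0.0 when a denominator is zero is a convention chosen for this example rather than something stated in the study.

```python
from typing import Dict


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, F1 score, and accuracy from confusion counts."""
    sensitivity = safe_ratio(tp, tp + fn)  # also called recall
    specificity = safe_ratio(tn, tn + fp)
    precision = safe_ratio(tp, tp + fp)
    f1 = safe_ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = safe_ratio(tp + tn, tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only, not results from the study
metrics = classification_metrics(tp=80, tn=70, fp=30, fn=20)
```

For these counts the sensitivity is 0.8, the specificity is 0.7, and the accuracy is 0.75; the F1 score equals 2TP / (2TP + FP + FN) = 160 / 210.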
SHAP Model Explanation
SHAP (SHapley Additive exPlanations), founded on Shapley values, is a post-hoc interpretation tool that quantifies each feature's contribution to a prediction. Its attributions satisfy three properties: local accuracy, missingness (a feature that is absent receives no attribution), and consistency. SHAP identifies influential features by evaluating their impact on the model's predictions, thereby improving the model’s interpretability.
In mathematical terms, the model’s output can be defined as:
\[
f(x) = \phi_0 + \sum_{i=1}^{N} \phi_i \cdot x_i
\]
In this equation, \( \phi_0 \) represents the baseline prediction, \( N \) is the number of features, \( x_i \) denotes the \(i\)-th feature (in the SHAP formulation, an indicator equal to 1 when the feature is present in the explained instance), and \( \phi_i \) signifies its SHAP value, illustrating its effect on the model’s output.
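To make the additive decomposition concrete, the sketch below computes exact Shapley values for a tiny hand-written model by enumerating all feature subsets. It is an illustration under stated assumptions, not the SHAP library and not the study's model: `toy_model` is hypothetical, and absent features are replaced by a single background value, whereas SHAP typically averages over a background dataset. The example checks local accuracy, meaning the baseline plus the contributions equals the prediction for the explained instance.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence, Set


def exact_shapley(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    background: Sequence[float],
) -> List[float]:
    """Exact Shapley values of model(x) relative to model(background)."""
    n = len(x)

    def value(present: Set[int]) -> float:
        # Present features take the observed value; absent ones the background value
        z = [x[i] if i in present else background[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                contribution += weight * (value(coalition | {i}) - value(coalition))
        phi.append(contribution)
    return phi


def toy_model(z: Sequence[float]) -> float:
    # Hypothetical score with an interaction between features 0 and 2
    return 0.5 * z[0] - 0.2 * z[1] + 0.1 * z[0] * z[2]


x = [2.0, 1.0, 3.0]
background = [0.0, 0.0, 0.0]

phi0 = toy_model(background)  # baseline prediction
phi = exact_shapley(toy_model, x, background)
prediction = toy_model(x)
```

The interaction term is shared equally between features 0 and 2, which is how Shapley values distribute credit for joint effects. Exact enumeration grows exponentially with the number of features, so it is only practical for small examples like this one.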
Statistical Analysis
The statistical analysis was executed using IBM SPSS Statistics 26. The data retrieved included basic information and routine blood test results from the hospital information system. Patients diagnosed with threatened abortion were classified as the positive sample group, while healthy pregnant women constituted the negative sample group, leading to a dataset featuring 24 attributes.
To compare hematological parameter differences between groups, the study employed independent sample t-tests for features conforming to a normal distribution. For those that did not meet these criteria, the Mann-Whitney U test was utilized. In addition, the Bootstrap method was employed to estimate confidence intervals, thereby augmenting the reliability of the statistical findings.
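As one possible reading of the bootstrap step, the sketch below builds a percentile confidence interval for the difference in group means of a single indicator. The study does not state which statistic was bootstrapped, how many resamples were drawn, or which interval type was used, so the choice of statistic, the 2,000 resamples, the percentile method, and the sample values are all assumptions.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci_mean_diff(
    group_a: List[float],
    group_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up values for one indicator in two groups
group_a = [128.0, 131.0, 127.0, 133.0, 130.0, 129.0, 132.0, 126.0]
group_b = [119.0, 122.0, 118.0, 121.0, 120.0, 123.0, 117.0, 120.0]

observed_diff = mean(group_a) - mean(group_b)
ci_lower, ci_upper = bootstrap_ci_mean_diff(group_a, group_b)
```

An interval that excludes zero, as it does for these clearly separated toy groups, points in the same direction as a significant t-test or Mann-Whitney U result while also conveying the size of the difference.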