Understanding the Role of Data Sources and Social Determinants of Health in Maternal Outcomes
Data Sources
To improve maternal health outcomes, researchers often turn to comprehensive databases that capture critical patient information. This study used discharge notes from the MIMIC-III and MIMIC-IV critical care databases, both of which are publicly accessible. MIMIC-III, introduced in 2016, served as the primary dataset for model training and internal testing, while MIMIC-IV, released later, was used for external model evaluation.
For MIMIC-III, the focus was on discharge notes of female patients diagnosed with pregnancy-related conditions as classified by ICD-9 codes. These codes distinguished normal pregnancies (codes 650–659) from those marked by complications (codes 630–639, 660–669). To ensure the relevance and depth of the data, two inclusion criteria were applied: each note had to contain a non-empty "Social History" section, and for patients with multiple pregnancies only the discharge summary of the latest pregnancy was retained. Similar criteria were applied for the evaluation phase using MIMIC-IV.
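The inclusion criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names are hypothetical, ICD-9 codes are assumed to be stored as strings (with or without a decimal point), and section headers are assumed to sit on their own line and end with a colon. Selecting the latest pregnancy per patient is left out.

```python
import re
from typing import Optional

# ICD-9 category ranges as stated in the text.
NORMAL_RANGES = [(650, 659)]
COMPLICATION_RANGES = [(630, 639), (660, 669)]

# Assumption: a section header is a line such as "Family History:".
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z /]*:$")


def icd9_category(code: str) -> Optional[int]:
    """Return the 3-digit category of an ICD-9 code, or None for E/V codes."""
    head = code.strip().replace(".", "")[:3]
    return int(head) if len(head) == 3 and head.isdigit() else None


def pregnancy_group(code: str) -> Optional[str]:
    """Map a code to 'normal', 'complicated', or None if not in the stated ranges."""
    category = icd9_category(code)
    if category is None:
        return None
    if any(lo <= category <= hi for lo, hi in NORMAL_RANGES):
        return "normal"
    if any(lo <= category <= hi for lo, hi in COMPLICATION_RANGES):
        return "complicated"
    return None


def social_history(note: str) -> str:
    """Collect the text between 'Social History:' and the next section header."""
    collected = []
    inside = False
    for line in note.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("social history:"):
            inside = True
            rest = stripped[len("social history:"):].strip()
            if rest:
                collected.append(rest)
            continue
        if inside and HEADER_RE.match(stripped):
            break  # reached the next section
        if inside and stripped:
            collected.append(stripped)
    return " ".join(collected)


def meets_inclusion(note: str, codes: list) -> bool:
    """True if any code is pregnancy-related and the Social History is non-empty."""
    has_pregnancy_code = any(pregnancy_group(c) is not None for c in codes)
    return has_pregnancy_code and bool(social_history(note))
```

A note whose "Social History" header is immediately followed by the next header yields an empty string and is therefore excluded.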
Selection of Social Determinants of Health Factors
Social Determinants of Health (SDoH) encompass various factors that influence health outcomes and can serve as powerful predictors of maternal health. In this study, the focus was narrowed down to three pivotal SDoH components: social support, occupation, and substance use. These were chosen based on several compelling reasons.
Strong Association with Maternal Health
Research has documented that these components are closely associated with maternal health outcomes. For instance, adequate social support has been linked to lower risks of preterm birth and better maternal mental health, while a lack of support can heighten vulnerability to adverse pregnancy outcomes. Occupational factors, including employment status and work-related stress, have profound implications for pregnancy outcomes. Meanwhile, substance use remains a well-established risk factor, contributing to both immediate and long-term adverse outcomes for mothers and infants.
Explicit Mention in Clinical Documentation
Another reason for selecting these SDoH components stems from their frequent appearance in the "Social History" sections of clinical notes. This frequent mention not only underscores their importance but also ensures that sufficient data is available for analysis. Furthermore, these factors often face underrepresentation or inconsistent coding within structured Electronic Health Record (EHR) fields, highlighting the necessity for focused research in this area.
Clinical Relevance and Underrepresentation
While these components are significantly associated with health outcomes, they are often underrepresented in structured EHR data, making them ideal candidates for extraction from free-text clinical notes and further analysis.
Annotation Protocol
The annotation of MIMIC-III notes was conducted meticulously by a single annotator, adhering to a clear set of criteria for each selected SDoH factor:
- Social Support: Labeled present if the social history explicitly mentioned living arrangements or strong familial support. Labeled absent if the note mentioned homelessness or made no mention of social support.
- Occupation: Marked present for any employment descriptions, while absent for unemployment or lack of mention.
- Substance Use: Coded as present for any current or past use of alcohol, tobacco, or drugs, whereas denial regarding substance use led to an absent label.
In the evaluation using MIMIC-IV, three annotators collaborated to ensure robustness. Each note was examined by two annotators, and any discrepancies in annotations were resolved through consensus discussions, ensuring consistency and quality of the data.
Preprocessing of Clinical Notes
Preprocessing clinical notes was crucial for effective feature extraction. The text underwent a series of transformations:
- Tokenization: Clinical notes were broken down into words and subwords using spaCy.
- Removal of Stopwords: Common English words that did not add value were removed to minimize noise.
- Text Standardization: This involved lowercasing and punctuation removal to maintain uniformity.
- Negation Handling: Phrases such as “denies smoking” were flagged using predefined rules to avoid misclassification.
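The four steps above can be sketched as a single function. This is an illustrative sketch under stated assumptions: a regular-expression tokenizer stands in for spaCy so the example has no dependencies, the stopword list, negation cues, and three-token negation window are hypothetical, and the order of the steps is a choice made here rather than one taken from the study.

```python
import re

# Hypothetical, deliberately tiny lists for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "with", "is", "her", "she", "in"}
NEGATION_CUES = {"denies", "no", "without", "never"}
NEGATION_WINDOW = 3  # assumed number of following tokens a cue applies to


def preprocess(text: str) -> list:
    """Lowercase, tokenize, drop stopwords and punctuation, and flag negated tokens."""
    tokens_out = []
    # Split into rough sentences so a negation cue cannot leak into the next one.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        # Keeping only alphanumeric runs tokenizes and removes punctuation at once.
        tokens = re.findall(r"[a-z0-9]+", sentence)
        remaining = 0  # how many upcoming tokens are still inside a negation scope
        for tok in tokens:
            if tok in NEGATION_CUES:
                remaining = NEGATION_WINDOW
                continue  # the cue itself is dropped; its effect is kept as a prefix
            if tok in STOPWORDS:
                continue
            if remaining > 0:
                tokens_out.append("NOT_" + tok)
                remaining -= 1
            else:
                tokens_out.append(tok)
    return tokens_out
```

With this scheme, a downstream classifier sees "NOT_smoking" and "smoking" as different tokens, which is one simple way to keep a denial from being read as a positive mention.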
Model Development, Internal Testing, and External Evaluation
The study adopted a two-phase model development strategy, training and internally testing on MIMIC-III and then evaluating externally on MIMIC-IV, to support the generalizability of the findings. It used three distinct approaches to extract SDoH information: rule-based methods, Word2Vec embeddings, and Clinical BERT models.
Rule-based Approach
The rule-based approach utilized the KeywordProcessor from the FlashText library to identify keywords and phrases associated with social support, occupation, and substance use. This method provided a straightforward and computationally efficient extraction of SDoH mentions; however, it had limitations in identifying nuanced contextual variations.
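A minimal sketch of keyword-based extraction follows. To keep the example dependency-free it uses the standard-library re module in place of FlashText's KeywordProcessor, and the keyword lists and function names are hypothetical, not the ones used in the study.

```python
import re

# Hypothetical keyword lists; the study's actual lists are not given in the text.
KEYWORDS = {
    "social_support": ["lives with", "married", "supportive family"],
    "occupation": ["works as", "employed", "teacher", "nurse"],
    "substance_use": ["tobacco", "alcohol", "smokes", "cocaine"],
}


def build_matchers(keywords: dict) -> dict:
    """Compile one case-insensitive, whole-word pattern per SDoH factor."""
    matchers = {}
    for label, phrases in keywords.items():
        # Longest phrases first so multi-word keywords win over their prefixes.
        ordered = sorted(phrases, key=len, reverse=True)
        alternation = "|".join(re.escape(p) for p in ordered)
        matchers[label] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return matchers


MATCHERS = build_matchers(KEYWORDS)


def rule_based_labels(social_history: str) -> dict:
    """Return, for each SDoH factor, whether any of its keywords appears."""
    return {label: bool(m.search(social_history)) for label, m in MATCHERS.items()}
```

Applied to "Denies tobacco.", this matcher still reports substance use as present, which illustrates the contextual limitation noted above: keyword matching alone does not see negation.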
Word2Vec Approach
Utilizing pre-trained word embeddings through Word2Vec, this approach captured semantic relationships between words in clinical notes. Word embeddings were generated from a vast corpus of clinical text, which facilitated the modeling of complex relationships between words linked to social support, occupation, and substance use. Machine learning classifiers like Random Forest, Support Vector Classifier, and Decision Trees were employed for training.
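Classifiers such as Random Forest need one fixed-length vector per note, so the word vectors have to be combined somehow. The text does not say how this was done; averaging the vectors of a note's tokens is one common choice and is assumed here. The function name and the two-dimensional toy vectors are hypothetical and are not real Word2Vec output.

```python
def mean_embedding(tokens: list, vectors: dict, dim: int) -> list:
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy embeddings for illustration only.
toy_vectors = {"lives": [1.0, 0.0], "husband": [0.0, 1.0]}
note_vector = mean_embedding(["lives", "husband", "unknownword"], toy_vectors, dim=2)
```

Out-of-vocabulary tokens are simply skipped, so the resulting vector reflects only the words the embedding model knows.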
Clinical BERT Approach
The Clinical BERT approach, a specially designed transformer model for clinical text, was leveraged to extract embeddings from discharge notes in MIMIC-III. By employing the AutoModel and AutoTokenizer classes from the transformers library, this method delivered rich semantic representations of clinical notes, capturing their contextual nuances.
Evaluation Metrics
The performance of each model was assessed using several evaluation metrics: accuracy, precision, recall, and F1-score. Accuracy measured the overall correctness of predictions, while precision and recall provided insights into the models’ effectiveness in identifying actual instances of SDoH. The F1-score served as a balanced measure, ensuring that both precision and recall were adequately evaluated.
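For a binary label such as "substance use present", the four metrics follow directly from the counts of true and false positives and negatives. The sketch below assumes labels are encoded as 1 (present) and 0 (absent); the function name is hypothetical, and undefined ratios are reported as 0.0 by convention.

```python
def binary_metrics(y_true: list, y_pred: list) -> dict:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of real positives, how many were found
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Because accuracy can look high when one label dominates, precision, recall, and F1 are the more informative figures for rarely mentioned SDoH factors.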
Association between SDoH and Pregnancy Complications
The relationship between extracted SDoH factors and pregnancy complications was quantified through statistical analyses, including logistic regression and chi-square tests on the MIMIC-IV cohort. This methodology examined the log-odds of experiencing complications as influenced by the SDoH predictors, thereby shedding light on the integral role these determinants play in maternal health outcomes.
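For a single binary SDoH factor, both analyses can be illustrated on a 2×2 table. The sketch below computes the Pearson chi-square statistic (without continuity correction) and the log odds ratio, which equals the coefficient a logistic regression with that one binary predictor would estimate. The function names are hypothetical and the counts are invented purely for illustration; they are not results from the study.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square test for a 2x2 table, returning (statistic, p-value).

    Table layout (an assumption of this sketch):
        a = factor present, complication     b = factor present, no complication
        c = factor absent,  complication     d = factor absent,  no complication
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must contain at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def log_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Log odds of complication with the factor present versus absent."""
    return math.log((a * d) / (b * c))


# Invented counts, for illustration only.
chi2, p_value = chi_square_2x2(30, 70, 15, 85)
beta = log_odds_ratio(30, 70, 15, 85)
```

A positive log odds ratio indicates higher odds of complications when the factor is present; adjusting for several SDoH factors at once requires fitting a full logistic regression rather than reading values off a single table.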
Incorporating these data sources and analytical methods allows researchers to draw deeper insights into the social factors impacting maternal health, enhancing the potential for improved care and interventions.