Thursday, December 4, 2025

Evaluating Machine Learning Models for Predicting Hepatitis

Understanding Medical Data: Insights from Study Variables and Machine Learning Models

Exploring the Demographics and Characteristics of the Study Population

Table 2 presents a comprehensive breakdown of the demographic variables examined in this study, detailing both categorical and continuous data. For categorical variables, it includes the frequency and percentage of individuals within each category, which helps contextualize the study population’s characteristics.

Categorical Variables

Starting with sex, the study population comprises 139 males (89.7%) and 16 females (10.3%). With so few women in the cohort, any comparison of hepatitis outcomes between the sexes rests on a small sample and should be read with caution. Regarding steroid use, 76 individuals (49.0%) reported using steroids, while 79 individuals (51.0%) did not. How steroid use relates to liver health and outcome may warrant further examination.

Fatigue is another critical variable, with 130 individuals (83.9%) reporting the condition. Fatigue is a common symptom of liver disease, suggesting a potential area for further research. Liver enlargement was noted in 24 individuals (15.5%); the remaining 131 (84.5%) did not have it.

Ascites, a common complication of liver disease, affected 51 individuals (32.9%), while 104 (67.1%) were unaffected. Histology showed findings consistent with liver damage in 85 individuals (54.8%), emphasizing the need to consider both clinical symptoms and laboratory findings during diagnosis.
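The percentages quoted above can be checked directly from the counts. The following is a minimal sketch: the cohort size of 155 is not stated outright in the text and is taken here as the sum of the male and female counts, and the dictionary keys are labels chosen for illustration.

```python
# Counts quoted from Table 2 in the text; each variable is treated as binary.
# TOTAL is an assumption: 139 males + 16 females.
TOTAL = 155

counts = {
    "sex: male": 139,
    "steroid use: yes": 76,
    "fatigue: yes": 130,
    "liver enlargement: yes": 24,
    "ascites: yes": 51,
    "histology: positive": 85,
}


def percentage(count: int, total: int = TOTAL) -> float:
    """Share of the cohort in a category, rounded to one decimal place."""
    return round(100 * count / total, 1)


# For each variable: (count, %, complement count, complement %).
table = {
    name: (n, percentage(n), TOTAL - n, percentage(TOTAL - n))
    for name, n in counts.items()
}

for name, (n, pct, rest, rest_pct) in table.items():
    print(f"{name:<24} {n:>4} ({pct:>4}%)   other: {rest:>4} ({rest_pct:>4}%)")
```

Every recomputed percentage matches the figure given in the text, which suggests the categorical counts are internally consistent.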

Continuous Variables

The continuous variables listed in Table 2 include age, bilirubin levels, alkaline phosphatase, aspartate transaminase, albumin, and pro-time (prothrombin time). Frequencies and percentages do not apply to continuous measurements, which are normally summarized by statistics such as the mean and standard deviation; those summaries are not reproduced here. Nevertheless, these measurements allow for richer statistical analyses and provide a more granular perspective on liver health.

Connecting Variables with Clinical Outcomes

Moving to Table 3, the focus shifts to the relationship between these variables and patient outcomes, recorded as a death/survival variable (D, L) coded 1 for death and 2 for survival. The table cross-tabulates categorical variables such as sex, steroid use, and spleen palpability against outcome, showing how each is associated with survival.

Gender and Steroid Use Insights

Among male participants, 32 individuals died while 107 survived. All 16 female participants survived, although so small a group cannot by itself establish a sex difference in hepatitis outcomes. Among the 76 individuals who used steroids, 20 died (26.3%) and 56 survived; among the 79 who did not, 12 died (15.2%) and 67 survived. Mortality was therefore higher in the steroid group, but this is an unadjusted comparison: it does not show that steroid use reduces survival, since sicker patients may have been more likely to receive steroids.

Examining Additional Conditions

The table also explores the impact of conditions such as spleen palpability and ascites on survival outcomes. Individuals with a palpable spleen (12 deaths and 18 survivors, 40.0% mortality) fared worse than those without (20 deaths and 105 survivors, 16.0% mortality). These observations suggest that clinical signs of advanced liver disease are associated with poorer survival.
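The group comparisons drawn from Table 3 can be made concrete by turning the counts into death rates and risk ratios. This is a minimal sketch using only the counts quoted in the text; the `Group` class and `risk_ratio` helper are names introduced here for illustration, and the ratios are crude (unadjusted) figures, not results reported by the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """Outcome counts for one level of a categorical variable."""
    label: str
    died: int
    survived: int

    @property
    def total(self) -> int:
        return self.died + self.survived

    @property
    def death_rate(self) -> float:
        return self.died / self.total


def risk_ratio(exposed: Group, reference: Group) -> float:
    """Crude risk ratio: death rate in `exposed` divided by that in `reference`."""
    return exposed.death_rate / reference.death_rate


# Counts as quoted from Table 3 in the text.
steroid_yes = Group("steroids", died=20, survived=56)
steroid_no = Group("no steroids", died=12, survived=67)
spleen_yes = Group("palpable spleen", died=12, survived=18)
spleen_no = Group("spleen not palpable", died=20, survived=105)

for exposed, reference in [(steroid_yes, steroid_no), (spleen_yes, spleen_no)]:
    print(
        f"{exposed.label}: {exposed.death_rate:.1%} vs "
        f"{reference.label}: {reference.death_rate:.1%} "
        f"(risk ratio {risk_ratio(exposed, reference):.2f})"
    )
```

A risk ratio above 1 only says that deaths were more frequent in the first group. With 32 deaths in total and no adjustment for disease severity, it should not be read as a causal effect.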

Feature Importance in Hepatitis Outcomes

Table 4 focuses on the findings from the Boruta algorithm, which assesses feature importance in relation to hepatitis outcomes. By combining data-driven insights with clinical expertise, the analysis sheds light on which variables are crucial to understanding hepatitis severity and progression, particularly concerning liver cirrhosis.

Key Variables Identified

Among the most significant features are ascites, varices, bilirubin, age, spiders, and alkaline phosphatase, all receiving high importance scores (above 0.85). These factors indicate advanced liver dysfunction. For instance, the presence of ascites and varices is often associated with severe liver damage and portal hypertension, making them essential markers for prognosis. Higher bilirubin levels likewise reflect impaired liver function, as does a longer prothrombin time, although the algorithm rated the latter only as tentative.

Moderate and Low Importance Features

Features categorized as tentative included prothrombin time and antiviral treatment. Though these are clinically relevant, their statistical contribution was inconsistent, possibly due to redundancy with stronger predictors. Albumin illustrates the same problem: it is critical for assessing liver function, yet its predictive signal may be diluted by external factors. On the other hand, variables such as aspartate transaminase (AST), fatigue, and histology did not demonstrate significant relevance in the algorithm, indicating a need for careful selection of features for predictive modeling.
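The core idea behind Boruta's confirmed, tentative, and unimportant labels is a comparison against "shadow" features, which are shuffled copies of the real ones. The sketch below illustrates that comparison only. It is a simplification under stated assumptions: it swaps Boruta's random-forest importance for an absolute Pearson correlation, it runs on synthetic data rather than the study's data, and it reports raw hit counts, whereas Boruta applies a statistical test to those counts before labeling a feature. The function names are introduced here for illustration.

```python
import random


def pearson_abs(xs, ys):
    """Absolute Pearson correlation, a stand-in importance score (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / (vx * vy) ** 0.5)


def shadow_hits(features, target, n_iter=30, seed=0):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shuffling breaks any link to the target, so shadow scores
        # show how large an importance can get by chance alone.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(pearson_abs(shadow, target))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if pearson_abs(values, target) > threshold:
                hits[name] += 1
    return hits


# Synthetic data for illustration only: one informative feature, one pure noise.
data_rng = random.Random(42)
target = [data_rng.randint(0, 1) for _ in range(200)]
features = {
    "signal": [y + data_rng.gauss(0, 0.3) for y in target],
    "noise": [data_rng.gauss(0, 1) for _ in target],
}

hits = shadow_hits(features, target)
print(hits)
```

A feature that wins in nearly every iteration corresponds to a confirmed feature, one that rarely wins to an unimportant one, and one in between to a tentative one. This is why a clinically meaningful variable that overlaps with stronger predictors can end up tentative.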

Evaluating Predictive Models

Table 5 lays out the performance evaluation metrics for various machine learning classifiers, assessing their effectiveness in predicting hepatitis outcomes. Metrics include accuracy, precision, sensitivity, specificity, and F1 score, along with corresponding 95% confidence intervals.
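Each of these metrics is a proportion, so a 95% confidence interval can be attached to it. The text does not say which interval method the study used; the sketch below assumes the Wilson score interval, a common choice for proportions, and the counts in the example are hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: the same 85% proportion observed on two sample sizes.
for correct, n in [(85, 100), (17, 20)]:
    low, high = wilson_interval(correct, n)
    print(f"{correct}/{n}: {correct / n:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The same point estimate carries a much wider interval on the smaller sample, which matters when a metric such as specificity is computed from only a handful of negative cases.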

Logistic Regression and Support Vector Machine

Logistic Regression (LR) stands out as a reliable baseline model with robust accuracy (85.00%) and precision (94.03%). Its interpretability makes it a strong choice for clinical applications, despite moderate specificity (55.56%). Support Vector Machine (SVM) provides excellent sensitivity (89.71%) and precision (91.04%) but suffers from limited specificity (50.00%), making it valuable in contexts prioritizing the identification of positive cases.

Random Forest: The Top Performer

Random Forest (RF) emerges as the best-performing model on several metrics, with an accuracy of 92.42%, a precision of 96.77%, and a sensitivity of 95.24%, so it detects true positives effectively. Its specificity of 33.33%, however, means that two out of three negatives are misclassified as positives. This trade-off makes RF valuable mainly when high recall matters more than avoiding false positives.
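To see how these figures relate to one another, the metrics can be derived from a confusion matrix. The counts below (60 true positives, 2 false positives, 3 false negatives, 1 true negative) are not reported in the text. They are an inferred set of small integers that happens to reproduce all four Random Forest percentages, and any multiple of them would fit equally well, so treat them as illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Inferred, illustrative counts (an assumption, not taken from Table 5).
rf = classification_metrics(tp=60, fp=2, fn=3, tn=1)

for name, value in rf.items():
    print(f"{name:<12} {100 * value:6.2f}%")
```

If the true counts resemble these, accuracy and precision are dominated by the large positive class, while specificity rests on very few negative cases. That would explain how a model can score above 92% accuracy and still misclassify most negatives.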

K-Nearest Neighbors and Neural Networks

K-Nearest Neighbors (KNN) scores high on specificity (87.76%) but shows lower sensitivity (78.95%), which suits scenarios where minimizing false positives is essential. Artificial Neural Networks (ANN) deliver a well-rounded performance with decent accuracy and sensitivity, without standing out on any single metric.

AdaBoost and XGBoost Performance

AdaBoost shines with the highest specificity (95.65%) but struggles with sensitivity (50.00%), limiting its usefulness in scenarios where missing hepatitis cases is unacceptable. Conversely, XGBoost offers a balanced approach with reasonable metrics across the board, making it a versatile option when both sensitivity and specificity are necessary.
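One way to compare models whose sensitivity and specificity pull in opposite directions is balanced accuracy, the mean of the two. The sketch below applies it to the four models for which the text quotes both figures. Balanced accuracy is not among the metrics listed for Table 5, so the resulting values are derived here, not reported by the study.

```python
# (sensitivity %, specificity %) as quoted in the text.
reported = {
    "SVM": (89.71, 50.00),
    "Random Forest": (95.24, 33.33),
    "KNN": (78.95, 87.76),
    "AdaBoost": (50.00, 95.65),
}


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity, in the same units as the inputs."""
    return (sensitivity + specificity) / 2


ranking = sorted(
    reported,
    key=lambda model: balanced_accuracy(*reported[model]),
    reverse=True,
)

for model in ranking:
    sens, spec = reported[model]
    print(f"{model:<14} sens {sens:5.2f}  spec {spec:5.2f}  "
          f"balanced {balanced_accuracy(sens, spec):5.2f}")
```

On this measure KNN ranks first and Random Forest last among the four, the reverse of what raw accuracy suggests. Which ordering matters depends on whether missed cases or false alarms are costlier in the intended clinical setting.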

ROC Curve Insights

The ROC curve depicted in Fig. 5 illustrates how each model distinguishes between positive and negative cases. Among the leading models, Random Forest demonstrates the highest sensitivity at 0.95, while XGBoost balances strong sensitivity (0.78) with high specificity (0.88). This visual representation reinforces the earlier findings regarding model performance, indicating where each model stands on the spectrum of predicting hepatitis effectively.
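A single sensitivity and specificity pair is one point on a model's ROC curve; the full curve comes from sweeping the decision threshold over the model's predicted scores. The sketch below shows that construction and the area under the curve (AUC) on a small set of hypothetical labels and scores, which are not taken from the study.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs for each score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical data: 1 = positive case, scores = predicted probabilities.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

points = roc_points(labels, scores)
print(points)
print(f"AUC = {auc(points):.4f}")
```

Lowering the threshold moves a model up and to the right along its curve, trading specificity for sensitivity. This is why the comparison in Fig. 5 says more than any single operating point.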

In summary, the study’s findings underscore the intricate relationships between various medical variables and survival outcomes in hepatitis, while the machine learning analyses provide a powerful tool for improving predictive capabilities in clinical settings. Understanding these relationships could be pivotal for enhancing treatment outcomes and patient management in liver disease cases.
