Friday, October 24, 2025

Revolutionizing Breast Cancer Detection with Machine Learning and Explainable AI


Dataset Description: UCTH Breast Cancer Dataset

The UCTH Breast Cancer Dataset is a key resource for machine learning work aimed at improving breast cancer diagnosis. Published on the Mendeley Data repository in 2023, it is derived from observations of 213 patients at the University of Calabar Teaching Hospital in Nigeria, collected over a span of two years.

Dataset Features

The dataset comprises nine variables (eight predictors and one target), each describing an aspect of the patient’s condition:

  1. Age: Continuous variable indicating the age of patients.
  2. Menopause Status: Categorical variable identifying whether a patient has reached menopause.
  3. Tumor Size: Continuous variable detailing the size of the tumor.
  4. Involved Nodes: Categorical variable indicating the number of lymph nodes affected.
  5. Area of Breast Affected: Categorical variable describing the specific region of the breast involved.
  6. Metastasis: Categorical variable representing whether cancer has spread to other parts of the body.
  7. Quadrant Affected: Categorical variable showing which quadrant of the breast is affected by the tumor.
  8. Previous History of Cancer: Categorical variable indicating if the patient has had cancer in the past.
  9. Diagnosis Result: Categorical target variable, coded as ‘0’ for benign and ‘1’ for malignant diagnoses.

These features matter because they inform decisions about treatment and prognosis. Table 2 of the dataset documentation explains each of them in full.


Statistical Preprocessing

Before modeling, the dataset was analyzed statistically in Jamovi, a user-friendly statistics package. The analysis first computed descriptive statistics for the continuous variables, summarized in Table 3. Violin plots in Figure 2 illustrate the distributions of age and tumor size.

The analysis indicates that malignant breast tumors are more common among older women and that larger tumors are more often malignant. Statistical significance was assessed using t-tests, which found both tumor size and age to be significant, with p-values below 0.001 as indicated in Table 4.
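
As a rough illustration of this kind of test, the sketch below runs an independent-samples t-test with SciPy. The study used Jamovi, so this is not the authors' procedure. The two samples are synthetic stand-ins for tumor size, and the choice of Welch's variant (`equal_var=False`) is an assumption.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for tumor size in each group (NOT the UCTH data).
rng = np.random.default_rng(0)
benign_size = rng.normal(loc=2.5, scale=1.0, size=100)
malignant_size = rng.normal(loc=5.5, scale=1.5, size=100)

# Independent-samples t-test; equal_var=False selects Welch's variant,
# which does not assume equal variances (an assumption of this sketch).
result = stats.ttest_ind(malignant_size, benign_size, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

# Compare against the p < 0.001 level mentioned in the text.
significant = p_value < 0.001
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, significant: {significant}")
```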

Categorical variables received similar attention and are visualized as bar plots in Figure 3. These plots suggest that pre-menopausal women tend to experience less severe forms of the disease. The presence of metastasis is a significant factor, especially when the cancer spreads to the axillary nodes, and a previous history of cancer increases the likelihood of an adverse diagnosis. A Chi-square test complemented this analysis, identifying menopause status, involved nodes, breast quadrant, and metastasis as the key categorical features, based on the results shown in Table 5.
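
A Chi-square test of independence on one categorical feature can be sketched as follows. The contingency table is invented for illustration and does not reproduce the counts behind Table 5; the row and column meanings are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (made-up counts):
# rows    = metastasis (no, yes)
# columns = diagnosis  (benign, malignant)
table = np.array([
    [90, 30],
    [5, 75],
])

# Returns the test statistic, p-value, degrees of freedom, and the
# counts expected if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
```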


Data Preprocessing

Data preprocessing is essential for refining raw data into a usable format for analysis. To enhance the dataset’s quality, several steps were followed:

  1. Data Shuffling: This step minimizes bias by randomizing the order of data entries.
  2. Missing Values: A total of 13 null (‘NaN’) values were identified and removed to maintain consistency.
  3. Label Encoding: Categorical variables were converted to numerical values, allowing the machine learning algorithms to interpret them effectively.
  4. Data Scaling: Max-Abs scaling divided each feature by its maximum absolute value, so that all values fall within the range of -1 to 1 and features with large magnitudes do not skew the analysis.
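
The four steps above can be sketched with pandas and scikit-learn. This is a minimal illustration on a made-up frame, not the study's code. The column names are hypothetical, and dropping whole rows that contain a missing value is an assumption about how the nulls were removed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler

# Tiny invented frame; column names are hypothetical, not the dataset's.
df = pd.DataFrame({
    "age": [45, 61, np.nan, 38, 52, 70],
    "tumor_size": [2.0, 6.5, 3.0, 1.5, np.nan, 8.0],
    "menopause": ["pre", "post", "post", "pre", "post", "post"],
    "metastasis": ["no", "yes", "no", "no", "yes", "yes"],
    "diagnosis": [0, 1, 0, 0, 1, 1],
})

# 1. Shuffle the rows (fixed seed so the sketch is reproducible).
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# 2. Remove rows containing missing values (assumed row-wise removal).
df = df.dropna().reset_index(drop=True)

# 3. Label-encode the categorical columns as integers.
for col in ["menopause", "metastasis"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 4. Max-Abs scaling: divide each feature by its maximum absolute value.
features = df.drop(columns="diagnosis")
scaled = pd.DataFrame(
    MaxAbsScaler().fit_transform(features), columns=features.columns
)
print(scaled)
```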

To identify significant features, both Mutual Information and Pearson’s Correlation were utilized. The correlation matrix, represented visually in Figure 4, highlights that involved nodes, tumor size, metastasis, and age have strong correlations with the diagnosis result. Mutual Information assessed the dependency between features and the target variable, with results depicted in Figure 5.
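
Both measures are available in common Python libraries. The sketch below computes them on synthetic data in which one feature is built to depend on the target and the other is pure noise. The feature names and data are illustrative assumptions and do not reproduce Figures 4 and 5.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data (NOT the UCTH data): "tumor_size" depends on the
# target by construction, "noise" does not.
rng = np.random.default_rng(0)
n = 300
diagnosis = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "tumor_size": diagnosis * 3.0 + rng.normal(3.0, 1.0, size=n),
    "noise": rng.normal(0.0, 1.0, size=n),
})

# Pearson correlation of each feature with the target.
pearson = X.corrwith(pd.Series(diagnosis, name="diagnosis"))

# Mutual information between each feature and the target.
mi = pd.Series(
    mutual_info_classif(X, diagnosis, random_state=0), index=X.columns
)

print(pd.DataFrame({"pearson": pearson, "mutual_info": mi}))
```

Pearson's correlation captures only linear association, while mutual information also responds to non-linear dependence, which is one reason to report both.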

To address class imbalance, Borderline-SMOTE was applied. It generates synthetic minority-class samples near the class boundary, balancing the data at 50% each for benign and malignant cases. The dataset was then split into training and testing subsets in a 70:30 ratio.
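
The sketch below illustrates the oversample-then-split order described above. To stay dependency-light it implements plain SMOTE-style interpolation rather than the borderline variant, so it is a swapped-in simplification. A library implementation of the borderline variant exists as `BorderlineSMOTE` in the imbalanced-learn package. The helper name, the synthetic data, and the stratified split are all assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label, k=5, seed=0):
    """Plain SMOTE-style interpolation for a binary problem.

    Hypothetical helper: unlike Borderline-SMOTE, it samples from all
    minority points, not only those near the class boundary.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y
    # Ask for k + 1 neighbours because each point is its own nearest one.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # Pick a random minority point and one of its k minority neighbours.
    base = rng.integers(0, len(X_min), size=n_needed)
    picked = neighbours[base, rng.integers(0, k, size=n_needed)]
    # Place each synthetic point on the segment between the two.
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal


# Synthetic imbalanced data: 120 samples of class 0, 80 of class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (120, 2)), rng.normal(2, 1, (80, 2))])
y = np.array([0] * 120 + [1] * 80)

X_bal, y_bal = smote_like_oversample(X, y, minority_label=1)

# 70:30 split; stratifying on the label is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.30, stratify=y_bal, random_state=42
)
print(np.bincount(y_bal), len(X_train), len(X_test))
```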


Machine Learning and Explainable Artificial Intelligence (XAI)

The study engaged multiple machine learning classification techniques, notably a stacking algorithm that combines various classifiers for optimal performance. The classifiers employed include XGBoost, LightGBM, CatBoost, AdaBoost, KNN, Decision Trees, Logistic Regression, and Random Forest.

XGBoost, LightGBM, CatBoost, AdaBoost, and Random Forest are themselves ensembles that combine many weak learners, typically decision trees, whereas KNN, Decision Trees, and Logistic Regression are single models. The stacking approach draws on the strengths of these diverse models by training a meta-learner on their predictions, which can improve overall predictive accuracy and reduce overfitting.

Before model training, hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, to help the models remain effective on unseen data.
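
A stacking ensemble tuned with GridSearchCV might look like the sketch below. To keep it self-contained it uses only scikit-learn estimators (Random Forest, KNN, and a Decision Tree, with Logistic Regression as the meta-learner) and omits the boosting libraries. The data is synthetic, and the parameter grid and the decision to tune inside the stack are assumptions. The study's actual grids and stack layout are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (NOT the UCTH data).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Base learners feed their predictions to a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Illustrative grid; "<name>__<param>" addresses a base learner's parameter.
param_grid = {
    "rf__max_depth": [3, None],
    "knn__n_neighbors": [3, 5],
}

# Exhaustive search over the grid, scored by 5-fold cross-validation.
search = GridSearchCV(stack, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```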

Explainable Artificial Intelligence (XAI) concepts played a crucial role in enhancing the transparency and interpretability of model predictions. A suite of five XAI techniques was employed, namely SHAP, LIME, Eli5, QLattice, and Anchor.

  • SHAP (SHapley Additive exPlanations): Assigns each feature a value indicating its contribution to a prediction, thus providing a clear rationale behind model outputs.
  • LIME (Local Interpretable Model-agnostic Explanations): Perturbs the input slightly and observes how the prediction changes, fitting a simple local model around a single case. This is particularly useful for individual patient scenarios.
  • Eli5: Provides weight explanations for classifiers, revealing how individual features impact model decisions.
  • QLattice: Explores a range of potential models, offering comprehensible interpretations of data relationships while promoting adaptability to new patterns.
  • Anchor: Develops if-then rules that clarify specific decisions, bolstering user confidence in the model’s reasoning processes.
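
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating every feature coalition. It does not use the SHAP library's API, which approximates these values efficiently for real models. The function `risk_score`, its coefficients, and the instance and baseline values are all invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def risk_score(x):
    """Hypothetical toy model, not a clinical score."""
    age, tumor_size, nodes = x
    return 0.01 * age + 0.5 * tumor_size + 2.0 * nodes + 0.1 * tumor_size * nodes


instance = [60.0, 4.0, 1.0]   # the case being explained
baseline = [50.0, 2.0, 0.0]   # reference case
phi = shapley_values(risk_score, instance, baseline)

for name, contribution in zip(["age", "tumor_size", "nodes"], phi):
    print(f"{name}: {contribution:+.2f}")
```

The contributions add up to the difference between the model's output for the instance and for the baseline, which is the property that makes Shapley-based explanations additive.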

Integrating these techniques offers multifaceted insights, fostering reliability and versatility in interpreting model outputs. The unique characteristics of each XAI method are summarized in Table 6, showcasing their contributions to understanding the complexities of breast cancer prediction models.

For a detailed outline of the methodology followed throughout the research, refer to the flow diagram in Figure 8.


This article provides a breakdown of the UCTH Breast Cancer Dataset and the processes used to prepare its data for machine learning, along with insights into the application of XAI techniques in breast cancer diagnosis.
