Thursday, October 23, 2025

Revolutionizing Breast Cancer Diagnosis with Machine Learning


Data Collection in Breast Cancer Research

Introduction to Data Collection

Data collection is the foundation of any analytical investigation, especially in health and clinical studies. This article describes a recent investigation that uses a dataset from the Motamed Cancer Institute, a clinical research center in Tehran, Iran, specializing in breast cancer.

The Breast Cancer Dataset

We began by gathering 300 records from individuals who visited the Motamed Cancer Institute between 2021 and 2024. Each record contains six features (detailed in Table 1), categorized based on specific devices, and is assigned to one of two groups indicating the presence or absence of breast cancer. Roughly 75% of the records indicated the presence of cancer cells.

Features in the Dataset

While the specific model used is not disclosed, the features relevant to breast cancer research typically encompass the following:

  • HER2 Receptor Status: HER2 (Human Epidermal growth factor Receptor 2) is a protein involved in cell growth and division, and it can drive the growth of breast cancer cells. Tumors with high levels of HER2 (HER2-positive) account for about 15-20% of all breast cancer cases and tend to grow aggressively. Targeted therapies such as trastuzumab and lapatinib have significantly improved the prognosis for HER2-positive cancers.

  • Ki-67 Proliferation Rate: Ki-67 serves as both a predictive and prognostic marker in breast cancer. A higher Ki-67 index indicates aggressive tumor behavior and poorer prognosis, while lower levels suggest a more favorable outcome and greater susceptibility to hormone therapy.

  • Estrogen Receptors (ERs): ER-positive tumors grow in response to estrogen. Patients with ER-positive tumors typically have better prognoses because their tumors are sensitive to hormone treatments such as aromatase inhibitors and tamoxifen.

  • Progesterone Receptor (PR): Similar to ERs, PRs also play an important role in hormonal responses in cancer. Positive PR status usually correlates with more favorable outcomes and greater responsiveness to hormone therapy.

  • Neoadjuvant Therapy: This feature indicates whether a patient received chemotherapy before surgery, which was part of treatment for nearly half of the participants.

Summary of Data Characteristics

Descriptive statistics help assess and elucidate the attributes of this dataset, enabling the identification of significant patterns among features (refer to Tables 2 and 3). Visual representations further clarify the distribution of variables, as shown in Figure 6.

Preprocessing Data

In the initial stages of data preparation, records associated with triple-negative breast cancer were excluded. Furthermore, a new column indicating cancer presence was created based on the cancer diagnostic path (CDP) number, where positive outcomes were marked as "1" and negative as "0".
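The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`er`, `pr`, `her2`, `cdp`), the sample records, and the set of CDP numbers treated as positive are all hypothetical, since the article does not give the actual schema or the CDP mapping.

```python
# Hypothetical records: field names and values are illustrative only.
records = [
    {"id": 1, "er": 1, "pr": 1, "her2": 0, "cdp": 3},
    {"id": 2, "er": 0, "pr": 0, "her2": 0, "cdp": 3},  # triple-negative
    {"id": 3, "er": 0, "pr": 0, "her2": 1, "cdp": 1},
    {"id": 4, "er": 1, "pr": 0, "her2": 0, "cdp": 1},
]

# Assumption: the article does not say which CDP numbers mean "positive",
# so this set is a placeholder.
POSITIVE_CDP_CODES = {3}


def is_triple_negative(rec):
    """True when ER, PR and HER2 are all negative."""
    return rec["er"] == 0 and rec["pr"] == 0 and rec["her2"] == 0


def preprocess(rows):
    """Drop triple-negative rows and add a binary 'cancer' column."""
    kept = [dict(r) for r in rows if not is_triple_negative(r)]
    for r in kept:
        r["cancer"] = 1 if r["cdp"] in POSITIVE_CDP_CODES else 0
    return kept


cleaned = preprocess(records)
print([(r["id"], r["cancer"]) for r in cleaned])
```

The function copies each record before labeling it, so the raw data stays unchanged and the step can be rerun safely.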

Mahalanobis Distance Analysis

One important step in identifying outliers was the application of Mahalanobis distance, a statistical measure of how far a data point lies from the center of a distribution, scaled by the covariance matrix so that correlated features and features on different scales are accounted for. Points whose distance is unusually large deviate significantly from the main data distribution and can be flagged as potential outliers.
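As a rough sketch of the idea, the distance of each record from the sample mean can be computed with NumPy. The synthetic data, the planted outlier, and the cutoff of 3.0 are assumptions made for illustration; the article does not state which threshold the study used.

```python
import numpy as np


def mahalanobis_distances(X):
    """Distance of each row of X from the sample mean, scaled by covariance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    # d_i = sqrt((x_i - mu)^T S^-1 (x_i - mu)) for every row i
    squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(squared, 0.0))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # synthetic, well-behaved points
X = np.vstack([X, [[8.0, -8.0]]])      # one planted outlier (row 100)
d = mahalanobis_distances(X)

threshold = 3.0  # illustrative cutoff, not taken from the study
outliers = np.where(d > threshold)[0]
print(outliers)
```

In practice the cutoff is often chosen from a chi-squared quantile with as many degrees of freedom as there are features, but that choice is a modeling decision rather than part of the distance itself.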

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical technique that reduces dimensionality by transforming correlated features into a set of uncorrelated variables called principal components. By retaining the components that explain the most variance, PCA simplifies the data while preserving most of its information. The process consists of standardizing the features, forming their covariance matrix, and performing an eigenvalue decomposition of that matrix to obtain the principal components.

The PCA plot (Figure 7) shows the data distribution before any balancing was applied.
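The three steps named above (standardization, covariance matrix, eigenvalue decomposition) can be sketched directly with NumPy. The function name `pca` and the synthetic three-feature data are assumptions for illustration and do not reproduce the study's features.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    # 1. Standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardized data.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigen-decomposition; eigh is meant for symmetric matrices and
    #    returns eigenvalues in ascending order, so reverse them.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return Z @ components, explained_ratio


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Two strongly correlated features plus one independent feature.
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
scores, ratio = pca(X, n_components=2)
print(scores.shape, ratio)
```

Because two of the three synthetic features are nearly redundant, two components capture almost all of the variance, which is the situation in which PCA is most useful.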

Handling Class Imbalance with SMOTE

Class imbalance is a common challenge in machine learning. Using the Synthetic Minority Over-sampling Technique (SMOTE), our analysis generated synthetic instances of the under-represented class to balance the dataset. Initially, the minority class contained 76 individuals and the majority class 224. After SMOTE was applied, each class contained 179 individuals, as shown in Figure 8.
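The core of SMOTE is interpolation: each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a minimal NumPy version of that idea, not the implementation used in the study. The function name `smote`, the neighbour count `k=5`, and the synthetic class sizes (90 and 30) are assumptions, and the example does not attempt to reproduce the counts reported in the article.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                  # starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


rng = np.random.default_rng(1)
X_major = rng.normal(loc=0.0, size=(90, 2))   # synthetic majority class
X_minor = rng.normal(loc=3.0, size=(30, 2))   # synthetic minority class
X_synth = smote(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_synth])
print(len(X_major), len(X_minor_balanced))
```

Oversampling of this kind is normally applied to the training portion only, so that synthetic points never leak into the data used for evaluation.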

Implementing Machine Learning Algorithms

We applied several supervised machine learning algorithms to the dataset. Each algorithm learns patterns from labeled records in order to predict the outcome for new ones.

Support Vector Machines (SVM)

Support Vector Machines are a versatile supervised learning method that seeks the hyperplane separating data points of different classes with the largest possible margin. Maximizing the margin makes the classifier more robust to small variations in the data. With kernel functions, SVMs can also separate classes that are not linearly separable in the original feature space.
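To make the margin idea concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. It is a simplified stand-in with no kernel, and the hyperparameters (`lam`, `lr`, `epochs`) and the synthetic two-cluster data are assumptions chosen for illustration rather than settings from the study.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean hinge loss; labels must be -1 or +1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0  # points inside the margin or misclassified
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = train_linear_svm(X, y)
pred = np.where(X @ w + b >= 0, 1, -1)
accuracy = (pred == y).mean()
print(w, b, accuracy)
```

Only points with a margin below 1 contribute to the gradient, which mirrors the fact that the final hyperplane is determined by the support vectors rather than by every training point.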

Random Forest

Random Forest is an ensemble learning technique that combines multiple decision trees to improve predictive accuracy and reduce overfitting. Each tree is trained on a random subset of the data, and the forest aggregates the predictions of all trees, which makes the result more reliable than that of any single tree, especially on complex datasets.
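The aggregation idea can be sketched with a deliberately simplified forest in which every tree is a one-split stump. Real Random Forest implementations grow much deeper trees, so this is only an illustration of bootstrap sampling, random feature choice, and majority voting. All function names and the synthetic data are assumptions.

```python
import numpy as np


def fit_stump(X, y, feature):
    """Best single threshold on one feature, chosen by training accuracy."""
    best_acc, best_t, best_pol = -1.0, None, 1
    for t in np.unique(X[:, feature]):
        for polarity in (1, -1):
            pred = np.where(polarity * (X[:, feature] - t) > 0, 1, 0)
            acc = (pred == y).mean()
            if acc > best_acc:
                best_acc, best_t, best_pol = acc, t, polarity
    return feature, best_t, best_pol


def predict_stump(stump, X):
    feature, t, polarity = stump
    return np.where(polarity * (X[:, feature] - t) > 0, 1, 0)


def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        feature = rng.integers(0, X.shape[1])        # random feature choice
        forest.append(fit_stump(X[rows], y[rows], feature))
    return forest


def predict_forest(forest, X):
    """Majority vote over all stumps (n_trees is odd, so no ties)."""
    votes = np.mean([predict_stump(s, X) for s in forest], axis=0)
    return (votes >= 0.5).astype(int)


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

forest = fit_forest(X, y)
accuracy = (predict_forest(forest, X) == y).mean()
print(accuracy)
```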

Logistic Regression

Frequently used for binary classification, logistic regression predicts the probability that a given instance belongs to a particular class. It applies the logistic (sigmoid) function to a linear combination of the input features, which yields a value between 0 and 1 that can be read as a class probability.
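A minimal sketch of this model, fitted by plain gradient descent on the log-loss, is shown below. The learning rate, epoch count, and the one-dimensional synthetic data are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Gradient descent on the mean log-loss; labels must be 0 or 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 100),
                    rng.normal(2.0, 1.0, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)
accuracy = ((probs >= 0.5).astype(int) == y).mean()
print(w, b, accuracy)
```

Thresholding the probability at 0.5 gives a class label, but the threshold can be moved when false negatives and false positives carry different costs.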

Decision Trees

This algorithm recursively partitions the data on feature values so that each resulting subset is as pure as possible, meaning it contains mostly records of a single class. Because every prediction follows an explicit sequence of splits, decision trees are easy to interpret and to explain.
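One common way to measure purity is Gini impurity. The article does not say which criterion the study used, so the choice of Gini here is an assumption, as are the function names and the made-up feature values. The sketch finds the single threshold that yields the purest pair of subsets, which is the step a tree repeats at every node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Threshold on one feature that minimises the weighted Gini impurity."""
    best_t, best_imp = None, gini(labels)
    for t in sorted(set(values))[:-1]:  # splitting at the max leaves one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp


# Made-up feature values and labels, for illustration only.
feature = [5, 10, 12, 30, 35, 40]
label = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(feature, label)
print(threshold, impurity)  # 12 0.0
```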

Evaluation Methods

To assess the effectiveness of our models, we used three evaluation metrics: accuracy, recall, and precision. Together they show not only how often a model is right but also which kinds of errors it makes, which matters when models are applied in clinical settings.

Understanding Evaluation Metrics:

  • Accuracy: The proportion of all predictions, positive and negative, that are correct.
  • Precision: The proportion of predicted positives that are true positives.
  • Recall: The proportion of actual positives that the model correctly identified.
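The three metrics follow directly from the counts of true and false positives and negatives. The short sketch below computes them for a made-up set of ten predictions. The labels are illustrative and are not results from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 4 true positives, 3 true negatives,
# 2 false positives, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

scores = evaluate(y_true, y_pred)
print(" ".join(f"{k}={v:.2f}" for k, v in scores.items()))
# accuracy=0.70 precision=0.67 recall=0.80
```

In a diagnostic setting, recall is often watched most closely, because a false negative corresponds to a missed cancer case.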

Ethical Considerations

All methods used in this study adhered to established guidelines and regulations, receiving approval from the National Ethics Committee of Iran. Informed consent was obtained from all participants, ensuring transparency regarding study objectives and procedures.


This structured exploration emphasizes the critical aspects of data collection, preparation, and analysis in breast cancer research. It highlights the significance of effective data management and the implementation of robust machine learning techniques to yield meaningful insights in medical research.
