Thursday, October 23, 2025

OncoMet: Predicting Oncogenic Pathways and Metastasis in Esophageal Cancer Through Deep Learning of Histopathology Images


Dataset Description for Esophageal and Stomach Carcinoma Study

This article describes a dataset of whole-slide images (WSIs) from patients diagnosed with esophageal carcinoma and stomach adenocarcinoma. It covers the data collection process, cohort characteristics, pre-processing methods, and the analytical framework built on the data.

Data Collection

For this study, we sourced WSIs from 124 patients with esophageal carcinoma (TCGA-ESCA) from The Cancer Genome Atlas (TCGA), downloaded with the GDC Data Transfer Tool. The images were captured at 40x magnification, with average dimensions of about 100,000 × 80,000 pixels, an 8-bit color depth, and a spatial resolution of 0.25 micrometers (µm) per pixel. The WSIs, provided in .svs format, show primary tumors stained with hematoxylin and eosin (H&E).
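To put these figures in perspective, the short calculation below derives the physical extent and uncompressed size implied by the stated averages. It assumes three 8-bit channels (RGB) per pixel, which the description does not state explicitly, and the variable names are purely illustrative.

```python
# Back-of-the-envelope figures for an average slide, using the numbers above.
WIDTH_PX, HEIGHT_PX = 100_000, 80_000   # average slide dimensions in pixels
MICRONS_PER_PIXEL = 0.25                # stated spatial resolution
CHANNELS = 3                            # assumption: 8-bit RGB, one byte per channel

width_mm = WIDTH_PX * MICRONS_PER_PIXEL / 1000
height_mm = HEIGHT_PX * MICRONS_PER_PIXEL / 1000
raw_bytes = WIDTH_PX * HEIGHT_PX * CHANNELS

print(f"Physical extent: {width_mm:.0f} mm x {height_mm:.0f} mm")
print(f"Uncompressed size: {raw_bytes / 1e9:.0f} GB")
```

Under these assumptions a single slide covers about 25 mm × 20 mm of tissue and would occupy roughly 24 GB uncompressed, which is why slides of this kind are processed tile by tile rather than loaded whole.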

The dataset was randomly partitioned into training and test sets at an 80:20 ratio, a split commonly used in such studies. In addition to the esophageal carcinoma dataset, we incorporated diagnostic slides from 20 patients with stomach adenocarcinoma (TCGA-STAD) to validate our methodological framework.

Clinical and Demographic Characteristics

The clinical and demographic characteristics of the TCGA-ESCA and TCGA-STAD cohorts are summarized in Table 1 (not shown here). The esophageal tumors were predominantly located in the lower third (76 cases) and middle third (32 cases) of the esophagus. The stomach tumors were located in the cardia (4 cases), body (4 cases), and pyloric antrum (12 cases).

Molecular Analysis

We also examined somatic mutation and copy number variation (CNV) data from both the TCGA-ESCA and TCGA-STAD cohorts. Our goal was to identify patients at risk for distant metastasis by prioritizing known metastasis-associated biomarkers. Key markers included matrix metalloproteinases (MMPs), vascular endothelial growth factor (VEGF), and E-cadherin (CDH1), all of which play crucial roles in processes such as tissue invasion, angiogenesis, and epithelial-to-mesenchymal transition (EMT).

Using somatic mutation profiles from cBioPortal, we identified recurrently mutated genes, including critical oncogenes and tumor suppressor genes such as TP53, CDKN2A, FAT1, NOTCH1, and PIK3CA. These genes are known to regulate cellular pathways relevant to tumor progression and metastasis. We also examined CNVs, focusing on oncogene amplifications and tumor suppressor deletions, because these structural variations are frequently associated with aggressive tumor phenotypes.

By submitting the complete list of recurrently mutated genes to Ingenuity Pathway Analysis (IPA), we identified the 10 top-ranked pathways commonly altered in carcinoma cases and implicated in the metastatic process. These pathways involve processes such as cell cycle dysregulation and apoptosis evasion, which are crucial for cancer progression.

Dataset Pre-processing

We applied strict filtering criteria to exclude low-quality WSIs from our training and validation sets, retaining 124 of the 154 diagnostic WSIs available on TCGA. The retained slides were again split 80:20 for training and testing, in line with previous studies, while preserving a balanced distribution of metastatic and non-metastatic images.
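One way to keep the class distribution similar on both sides of an 80:20 split is to split each class separately. The sketch below illustrates that idea with the standard library only; it is not the authors' code, the function name `stratified_split` is hypothetical, and the label counts are invented placeholders rather than figures from the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split slide IDs into train/test lists, keeping class proportions similar.

    labels: dict mapping slide ID -> class label (e.g. 1 = metastatic).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for slide_id, label in labels.items():
        by_class[label].append(slide_id)

    train, test = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])      # fixed order before shuffling
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

# Placeholder labels: 124 slides, 30 of them marked metastatic (made-up count).
labels = {f"slide_{i:03d}": int(i < 30) for i in range(124)}
train_ids, test_ids = stratified_split(labels)
```

Splitting per class means a small minority class cannot end up entirely in the training set by chance, which matters for cohorts of this size.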

To address data imbalance, we supplemented our dataset using additional cases from the TCGA-HNSCC cohort, augmenting the minority class representation. This step was critical for developing machine learning models capable of recognizing patterns in underrepresented cases, thus enhancing model robustness and generalizability.

Whole Slide Image Pre-processing

Because WSIs are large and stored as multi-resolution pyramids, they require careful handling. To balance storage needs with information retention, we processed the data at 20x magnification using an image segmentation tool built on the OpenSlide library, which yielded 390,532 tiles.

We extracted non-overlapping tiles of 512 × 512 pixels, which were resized to 224 × 224 pixels during feature extraction. Tiles with more than 90% background coverage were excluded, following established methodological guidelines.
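The tiling and background filter described above can be sketched as follows. This is a dependency-free illustration, not the authors' tool: in practice the pixels would be read through OpenSlide, and the brightness threshold of 220 used to call a pixel "background" is an assumption, as is the choice to drop partial tiles at the slide edges.

```python
def tile_origins(width, height, tile_size=512):
    """Yield top-left (x, y) corners of non-overlapping tiles that fit fully."""
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            yield x, y

def background_fraction(gray_tile, white_threshold=220):
    """Fraction of pixels at or above the brightness threshold.

    gray_tile: 2D list of 0-255 grayscale values.
    white_threshold: assumed cutoff for near-white background pixels.
    """
    pixels = [value for row in gray_tile for value in row]
    return sum(value >= white_threshold for value in pixels) / len(pixels)

def keep_tile(gray_tile, max_background=0.90):
    """Keep a tile unless more than 90% of it is background."""
    return background_fraction(gray_tile) <= max_background

# A 100,000 x 80,000 slide at 40x is about 50,000 x 40,000 pixels at 20x.
n_tiles = sum(1 for _ in tile_origins(50_000, 40_000))
```

For a slide of that size the grid holds 97 × 78 = 7,566 candidate tiles before any background filtering.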

Addressing Color Variability

Histopathology images often show color inconsistencies caused by differences in staining, scanning equipment, and tissue sectioning. To mitigate these variations, we applied Reinhard color normalization. This technique converts an image from RGB to the Lab color space and then adjusts the brightness and color channels so that their mean and standard deviation match those of a standard reference image. This harmonized the color distributions across slides and improved the consistency of the features extracted by our deep learning models.
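The core of Reinhard normalization is a per-channel shift and scale. The sketch below shows only that step and assumes the pixels have already been converted to Lab by some other routine; the function names and the `ref_stats` layout are illustrative, not taken from the study.

```python
from statistics import fmean, pstdev

def match_channel(source, ref_mean, ref_std):
    """Shift and scale one Lab channel to the reference mean and std."""
    src_mean, src_std = fmean(source), pstdev(source)
    if src_std == 0:               # flat channel: only the mean can be matched
        return [ref_mean for _ in source]
    scale = ref_std / src_std
    return [(value - src_mean) * scale + ref_mean for value in source]

def reinhard_normalize(lab_pixels, ref_stats):
    """Apply per-channel statistics matching to a list of (L, a, b) pixels.

    ref_stats: three (mean, std) pairs computed from the reference image.
    """
    channels = list(zip(*lab_pixels))          # -> (all L, all a, all b)
    matched = [match_channel(ch, m, s) for ch, (m, s) in zip(channels, ref_stats)]
    return list(zip(*matched))                 # back to per-pixel tuples
```

After this step each channel of the tile has the reference image's mean and standard deviation; converting back to RGB would be the final stage of a full pipeline.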

OncoMet Algorithm

Overview

The OncoMet algorithm is structured to automate the identification of metastasis and oncogenic signaling pathways within esophageal carcinoma. It encompasses three primary phases:

  1. Feature Extraction: Each tile from the WSIs is passed through a pre-trained ResNet101 network with the top layer removed, yielding a 2048-dimensional feature vector per tile that is stored for further analysis.

  2. Feature Aggregation: The features extracted from each tile are consolidated into a single representation of the entire slide by computing the mean and median of the tile feature vectors.

  3. Classification: After aggregation, we apply several machine learning classifiers, including K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest, to the slide-level features to improve prediction reliability.
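As an illustration of the classification phase, here is a minimal K-Nearest Neighbors sketch in plain Python. It is not the configuration used in the study (a library such as scikit-learn would be the usual choice), the value k=3 is an assumption, and the toy vectors are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "slide vectors" (real ones would be far longer); values are made up.
train_vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
                 (1.0, 0.9), (0.8, 1.1), (1.2, 1.0)]
train_labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(train_vectors, train_labels, query=(0.9, 1.0))
```

The same slide-level vectors could be passed unchanged to an SVM or Random Forest, which is what makes comparing several classifiers on one feature set straightforward.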

Model Setup and Evaluation

Feature extraction uses the ResNet101 model, applied identically to all tiles after color normalization. Each slide is then represented by a single feature vector formed by concatenating the mean and median of its tile features.
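The aggregation step can be sketched as follows, reading "mean and median" as element-wise statistics over a slide's tiles. The function name `aggregate_slide` is hypothetical and the toy tiles use 4 features instead of 2048.

```python
from statistics import fmean, median

def aggregate_slide(tile_features):
    """Concatenate the element-wise mean and median of a slide's tile vectors.

    tile_features: list of equal-length feature vectors, one per tile.
    Returns a vector twice as long as a single tile vector.
    """
    columns = list(zip(*tile_features))        # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    medians = [median(col) for col in columns]
    return means + medians

# Three toy tiles with 4 features each (real tiles would have 2048).
tiles = [[1.0, 0.0, 2.0, 5.0],
         [3.0, 0.0, 2.0, 7.0],
         [8.0, 3.0, 2.0, 6.0]]
slide_vector = aggregate_slide(tiles)
```

Under this reading, 2048-dimensional tile features give a 4096-dimensional slide vector regardless of how many tiles the slide contains, so slides of different sizes become directly comparable.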

We evaluated the models with several quantitative metrics, including accuracy, precision, recall, and F1-score. The Area Under the Curve (AUC) score was central to assessing how well the model distinguishes between metastatic and non-metastatic cases, and the ROC curve provided a visual representation of its ability to detect metastasis in esophageal carcinoma.
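For reference, the metrics named above can be computed from first principles as in the sketch below. The labels and scores are made-up toy values, the 0.5 decision threshold is an assumption, and the AUC is computed through its rank interpretation (the probability that a positive case scores higher than a negative one) rather than by integrating the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = metastatic)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy ground truth and classifier scores (made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike accuracy, precision, recall, and F1, the AUC does not depend on the chosen threshold, which is why it is often reported alongside them for imbalanced problems.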

By describing each element of the dataset and the methodological framework, we aim to give a clear account of the steps involved in analyzing esophageal and stomach carcinoma slides. This structured approach supports the reliability of the findings and their relevance to medical diagnostics and machine learning in oncology.
