Cohort Description
In this multicenter retrospective study of colorectal cancer (CRC), we collected Hematoxylin and Eosin (H&E)-stained slides of formalin-fixed, paraffin-embedded tissue samples from seven CRC patient cohorts. Six of these cohorts came from five Chinese institutions: Anhui Provincial Hospital (APH), Fudan University Shanghai Cancer Center (FUSCC), a real-world cohort from the same institution (FUSCC-RD), Ningbo Pathology Center (NBPC), Shanghai General Hospital (SHPH), and Zhejiang Cancer Hospital (ZCH). Samples were collected from January 2017 through February 2024 and predominantly reflect the Chinese population. The study was approved by the ethics committee of the lead institution, FUSCC (Approval No. 2306276-4), and was conducted in accordance with the Declaration of Helsinki. All participating patients provided written informed consent.
The seventh cohort was derived from the public database of The Cancer Genome Atlas (TCGA), which consists of Western population samples. The microsatellite instability (MSI) status of these cases was determined by combining clinical information from cBioPortal with the corresponding pathology reports from the Genomic Data Commons (GDC) portal. Including both Chinese and Western cohorts allows the generalizability of the Deepath-MSI model to be assessed across populations. In total, the study included 5,070 CRC cases; a detailed description of the dataset is provided in the supplementary material.
The whole slide image (WSI) formats varied across datasets and included SQDC (Shengqiang, Shenzhen, China), KFB (KFBIO, Ningbo, China), MRXS (3D Histech, Budapest, Hungary), and SVS (Leica, Wetzlar, Germany).
Identification of Region of Interest
We used a dedicated preprocessing method to identify regions of interest (ROIs) within the collected WSIs. The WSIs were first divided into smaller tiles, each measuring 256 × 256 μm at a resolution of 32 μm/pixel. Tiles containing more than 50% background or artifacts, including blurry areas and pen markings, were excluded.
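The exclusion rule above can be illustrated with a minimal sketch. It assumes a boolean mask that already marks background and artifact pixels; the function names `tile_coords` and `keep_tiles`, the non-overlapping tiling, and the toy mask are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def tile_coords(height, width, tile):
    """Yield (row, col) top-left corners of non-overlapping full tiles."""
    for r in range(0, height - tile + 1, tile):
        for c in range(0, width - tile + 1, tile):
            yield r, c

def keep_tiles(background_mask, tile, max_background=0.5):
    """Return corners of tiles whose background/artifact fraction is <= max_background.

    background_mask: 2-D boolean array, True where a pixel is background or artifact
    (hypothetical input; how the mask is produced is outside this sketch).
    """
    kept = []
    for r, c in tile_coords(*background_mask.shape, tile):
        frac = background_mask[r:r + tile, c:c + tile].mean()
        if frac <= max_background:  # drop tiles with MORE than 50% background
            kept.append((r, c))
    return kept

# Toy 4x4 mask split into 2x2 tiles: the left half of the image is background.
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
mask[0, 2] = True  # one stray background pixel in the top-right tile
print(keep_tiles(mask, tile=2))  # [(0, 2), (2, 2)]
```

In this toy case the two left-hand tiles are entirely background and are dropped, while the top-right tile (25% background) and the bottom-right tile (0%) are kept.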
Artifacts were detected with a two-stage algorithm: first, pixel clustering was applied to group similar regions within the low-resolution tiles; second, a deep learning model was used to identify artifacts. The nuclear-cytoplasmic ratio was then used as an indicator in the filtering process to designate ROIs. We extracted the corresponding high-resolution tiles (at 0.5 μm/pixel) and annotated MSI status at the patient level. The MSI status of each sample was determined by immunohistochemistry (IHC), polymerase chain reaction (PCR), or next-generation sequencing (NGS). Each case was then classified by a qualified pathologist as either microsatellite instability-high/deficient mismatch repair (MSI-H/dMMR) or microsatellite stable/proficient mismatch repair (MSS/pMMR).
Feature Extractor
To predict MSI status, we developed a Feature-based Multiple Instance Learning (FMIL) system with two components: a feature extractor and an aggregation module. The feature extractor is pretrained with the self-supervised learning framework DINOv2.
A major challenge in histopathological image analysis is the variation in color and intensity among tissue sections, even those from the same institution. These discrepancies can stem from differences in tissue fixation, sample size, section thickness, and staining reagents. To address this, we trained the feature extractor on a large number of image tiles with data augmentation that simulates these varying conditions. This training used no labels.
Each image tile is processed independently by the feature extractor, which transforms it into a compact embedding. The feature extractor's parameters remain fixed throughout both the training and inference phases of the FMIL system, so feature extraction is consistent across all analyses.
Self-Supervised Pretraining
The self-supervised pretraining phase within the FMIL system utilizes a student-teacher architecture featuring momentum-updated teacher weights. This structure enables the teacher network to process global crops of each image, while the student network simultaneously handles both global and local crops using varying augmentation strengths.
Global crops receive milder augmentations, such as random resized cropping and horizontal flipping, whereas local crops receive stronger ones, including color jittering, Gaussian blurring, and solarization. This asymmetry encourages the student model to learn more detailed, localized features.
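The momentum update of the teacher weights mentioned above is commonly written as an exponential moving average of the student weights. The sketch below illustrates that rule on plain NumPy arrays; the dictionary layout, the function name `ema_update`, and the momentum values are assumptions chosen for illustration and are not taken from the study.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """In-place momentum update: teacher <- m * teacher + (1 - m) * student.

    teacher, student: dicts mapping parameter names to float arrays of equal shape.
    The teacher receives no gradients; it only tracks the student.
    """
    for name, t in teacher.items():
        t *= momentum
        t += (1.0 - momentum) * student[name]

# Toy example with a single parameter vector (momentum chosen for readability).
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
ema_update(teacher, student, momentum=0.9)
print(teacher["w"])  # [0.1 0.1 0.1]
```

A momentum close to 1 makes the teacher change slowly, which is what gives the student a stable target.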
The total loss is a weighted combination of a cross-entropy loss and the iBOT loss:
\[
L_{\mathrm{total}} = L_{\mathrm{CE}} + \lambda L_{\mathrm{iBOT}}
\]
Here, \(\lambda\) balances the two losses.
The cross-entropy loss (\(L_{\mathrm{CE}}\)) encourages local-to-global correspondence. It is calculated between the student's outputs for the local crops and the teacher's sharpened, centered outputs for the global crops. The iBOT loss (\(L_{\mathrm{iBOT}}\)) promotes invariance to masking, encouraging the model to learn robust, context-aware representations.
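As a rough illustration of the cross-entropy term and the weighted total, the following NumPy sketch centers and sharpens the teacher outputs, applies a softmax to the student outputs, and averages the cross-entropy over all global/local pairs. The temperatures, the pairing scheme, and the function names are assumptions for illustration; the iBOT term is passed in as a precomputed number rather than implemented.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def local_to_global_ce(student_local, teacher_global, center,
                       t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher (global crops) and student (local crops).

    student_local: (n_local, K) logits; teacher_global: (n_global, K) logits;
    center: (K,) running mean subtracted from the teacher logits (centering).
    A low teacher temperature sharpens the target distribution.
    Temperature values are assumed, not taken from the study.
    """
    p_teacher = softmax((teacher_global - center) / t_teacher)   # (G, K)
    log_p_student = log_softmax(student_local / t_student)       # (L, K)
    # One cross-entropy value per (global, local) pair, then the mean.
    ce = -(p_teacher[:, None, :] * log_p_student[None, :, :]).sum(axis=-1)
    return float(ce.mean())

def total_loss(l_ce, l_ibot, lam=1.0):
    """L_total = L_CE + lambda * L_iBOT."""
    return l_ce + lam * l_ibot

rng = np.random.default_rng(0)
l_ce = local_to_global_ce(rng.normal(size=(6, 16)),   # 6 local crops
                          rng.normal(size=(2, 16)),   # 2 global crops
                          center=np.zeros(16))
print(total_loss(l_ce, l_ibot=0.5, lam=1.0))
```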
MSI Classifier
The MSI classifier uses a Multiple Instance Learning (MIL) approach, an established way of handling WSI data. Unlike conventional tile-level methods, it treats all tiles from a patient's WSI as a single "bag" and does not assume that every tile reflects the patient's MSI status. This makes the classifier more robust to intratumor heterogeneity.
To improve performance, we integrated an attention mechanism within the decoder, which has access to all of the encoded tile information. The mechanism assigns a separate attention weight to each input region, so that tokens that are more informative contribute more to the output at each step.
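One common way to realize attention-weighted aggregation over a bag of tile embeddings is attention-based MIL pooling, sketched below in NumPy. This is offered only as an illustration of the general idea: the scoring function (a `tanh` layer followed by a linear projection), the parameter names `V` and `w`, and the dimensions are assumptions, and the decoder used in the study may differ.

```python
import numpy as np

def attention_pool(H, V, w):
    """Aggregate tile embeddings into one bag embedding with attention weights.

    H: (n_tiles, d) tile embeddings from the frozen feature extractor.
    V: (d, h) and w: (h,) are learnable parameters (hypothetical names).
    Returns the (d,) bag embedding and the (n_tiles,) attention weights.
    """
    scores = np.tanh(H @ V) @ w                 # one scalar score per tile
    scores = scores - scores.max()              # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ H, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # 5 tiles, 8-dimensional embeddings
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
bag, weights = attention_pool(H, V, w)
print(bag.shape, weights.round(3))
```

Because the weights sum to one, the bag embedding is a convex combination of tile embeddings, and the weights themselves indicate which tiles drove the prediction.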
Implementation
The Deepath-MSI model was implemented in PyTorch and trained on an NVIDIA RTX 4090 GPU with 24 GB of memory. To address class imbalance in patient labels during training, we randomly under-sampled tiles from the more abundant class, so that the model was trained on a dataset with equal numbers of tiles from the positive and negative classes. No class balancing was applied when the model was evaluated on the test partition or the external validation sets, so that performance was assessed on the original class distribution.
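Tile-level random under-sampling of the kind described above can be sketched with the standard library alone. The representation of tiles as `(tile_id, label)` pairs, the function name `undersample_tiles`, and the fixed seed are assumptions for illustration.

```python
import random

def undersample_tiles(tiles, seed=0):
    """Randomly drop tiles from the larger class until both classes are equal.

    tiles: list of (tile_id, label) pairs with label 1 (MSI-H/dMMR) or 0 (MSS/pMMR).
    Returns a shuffled, class-balanced list. Intended for training data only.
    """
    rng = random.Random(seed)
    pos = [t for t in tiles if t[1] == 1]
    neg = [t for t in tiles if t[1] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

# Toy training set: 3 positive tiles and 10 negative tiles.
tiles = [(f"p{i}", 1) for i in range(3)] + [(f"n{i}", 0) for i in range(10)]
balanced = undersample_tiles(tiles)
print(len(balanced), sum(label for _, label in balanced))  # 6 3
```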
Statistical Analysis
The statistical endpoints were AUROC, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The MSI score threshold for the Deepath-MSI model was chosen as the value that yields 95% sensitivity in the test set. To establish quality-control criteria for real-world validation, we randomly selected a specific number of tiles per slide in the test set and analyzed predictions only for slides that met the minimum tile threshold. The MSI score threshold and the minimum tile count were then applied to the real-world validation set. Associations between clinicopathological features and model performance were analyzed using Fisher's exact test.
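The threshold selection and the threshold-dependent endpoints can be sketched as follows. The sketch assumes that a case is called MSI-H when its score is at or above the threshold, and it picks the highest threshold that still reaches the target sensitivity; the function names and toy scores are illustrative assumptions, not study data.

```python
import math

def threshold_at_sensitivity(scores, labels, target=0.95):
    """Highest threshold whose sensitivity (score >= threshold) is >= target."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("need at least one positive case")
    k = math.ceil(target * len(pos))  # positives that must be called positive
    return pos[k - 1]

def _ratio(num, den):
    return num / den if den else float("nan")

def confusion_metrics(scores, labels, thr):
    """Sensitivity, specificity, PPV and NPV at a given score threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }

# Toy scores: 4 positive and 3 negative cases, target sensitivity 75%.
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.75]
labels = [1, 1, 1, 1, 0, 0, 0]
thr = threshold_at_sensitivity(scores, labels, target=0.75)
print(thr, confusion_metrics(scores, labels, thr))
```

Fixing the threshold on one dataset and then applying it unchanged to another, as the text describes for the real-world validation set, keeps the reported specificity, PPV, and NPV free of threshold tuning on the evaluation data.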