Sunday, July 20, 2025

Revolutionizing Prostate Cancer Treatment: Uncovering DNA Metabolism Biomarkers with Machine Learning


Data Sources for Gene Expression Analysis in Prostate Cancer

The Gene Expression Omnibus (GEO) hosts a large collection of publicly available datasets for cancer research, including prostate cancer. This study used four of them: GSE55945, GSE10474, GSE46602, and GSE32571. Each captures gene expression differences between prostate cancer specimens and their normal counterparts. In total, the analysis included 36 prostate cancer specimens and a matched set of normal tissue samples. Table 1 summarizes the key characteristics of each dataset.

To test the robustness of the findings, four additional datasets were analyzed: GSE21036, GSE34933, GSE48430, and GSE212215. The aim was to check that the differentially expressed genes related to DNA metabolism were identified consistently across cohorts. Samples were selected for each dataset based on the availability of both tumor and adjacent normal tissue, adequate sample size, and consistent platform annotation. Where a dataset contained more samples than needed, a matched subset was selected to keep metadata and expression values uniform.

Data Preprocessing Using RStudio

All analyses were conducted in RStudio (version 2022.12.0 or as specified) running R version 4.2.2. The workflow relied on R packages for gene expression analysis, including edgeR (v3.38.4), limma (v3.52.4), and Seurat (v4.3.0), along with others for tasks ranging from clustering to visualization.

The gene expression data were preprocessed before any analysis. The Seurat package was used for quality control so that only high-quality cells entered subsequent steps. Cells with fewer than 55 detectable genes or more than 5% mitochondrial gene content were filtered out, as were genes detected in fewer than three cells, to eliminate low-quality or potentially confounding data.
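
As a rough illustration of the cell-level cutoffs described above, the following is a minimal Python sketch rather than the Seurat workflow the study used. The function name `filter_cells`, the nested-dictionary input layout, and the `MT-` prefix for mitochondrial genes are assumptions made for this example.

```python
from typing import Dict

# Hypothetical layout: cell barcode -> {gene symbol -> raw count}
Counts = Dict[str, Dict[str, int]]


def filter_cells(counts: Counts, min_genes: int = 55,
                 max_mito_pct: float = 5.0,
                 mito_prefix: str = "MT-") -> Counts:
    """Keep cells with enough detected genes and low mitochondrial content.

    The defaults mirror the cutoffs quoted in the text; the "MT-" prefix
    is an assumed naming convention for human mitochondrial genes.
    """
    kept = {}
    for cell, genes in counts.items():
        n_detected = sum(1 for c in genes.values() if c > 0)
        total = sum(genes.values())
        mito = sum(c for g, c in genes.items() if g.startswith(mito_prefix))
        mito_pct = 100.0 * mito / total if total else 0.0
        if n_detected >= min_genes and mito_pct <= max_mito_pct:
            kept[cell] = genes
    return kept
```

Calling `filter_cells(counts)` returns only the cells that pass both cutoffs. The thresholds are keyword arguments so they can be adjusted per dataset.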

After filtering, the expression profiles were normalized. The analysis focused on the 1,600 most variable genes, which were subjected to principal component analysis (PCA) to reduce dimensionality while retaining biologically meaningful variation. JackStraw analysis, which in Seurat assesses the statistical significance of principal components, was applied at this stage.
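
The selection of highly variable genes can be sketched as follows. This is a simplification: it ranks genes by plain variance, whereas Seurat's default feature selection uses a variance-stabilizing method, so it should be read as an illustration of the idea only. The function name `top_variable_genes` and the input layout are assumptions.

```python
from statistics import pvariance
from typing import Dict, List


def top_variable_genes(expr: Dict[str, List[float]],
                       n_top: int = 1600) -> List[str]:
    """Return the n_top genes with the largest variance across cells.

    expr maps a gene symbol to its normalized expression values, one per
    cell. Plain population variance is used here as a stand-in for a
    variance-stabilized ranking.
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n_top]
```

The returned gene list would then define the feature set passed to PCA.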

Identification of Differentially Expressed Genes

Clustering was performed with Seurat's FindClusters function at a resolution of 0.5. t-distributed stochastic neighbor embedding (t-SNE) was used to visualize the clusters and reveal distinct expression patterns among the samples. Marker genes for each cluster were identified with the FindAllMarkers function, using thresholds of adjusted P value < 0.01 and |log FC| > 1.3.
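
The marker thresholds can be applied to a results table along these lines. This is a hedged sketch: the row keys `p_val_adj` and `avg_log2FC` are modeled on the column names Seurat v4 reports, and the base of the logarithm is an assumption because the text only says "log FC". The function name `select_markers` is hypothetical.

```python
from typing import Dict, List


def select_markers(rows: List[Dict], max_padj: float = 0.01,
                   min_abs_logfc: float = 1.3) -> List[Dict]:
    """Keep marker rows passing both the adjusted-P and fold-change cutoffs.

    Each row is assumed to carry 'gene', 'cluster', 'p_val_adj', and
    'avg_log2FC' entries (assumed column names, see the lead-in).
    """
    return [
        row for row in rows
        if row["p_val_adj"] < max_padj
        and abs(row["avg_log2FC"]) > min_abs_logfc
    ]
```

Both up- and down-regulated genes pass the filter because the fold-change cutoff is applied to the absolute value.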

The SingleR package was then used to annotate cell types from their expression profiles, giving the identified clusters a biological interpretation.

Construction of Protein-Protein Interaction Networks

To explore the functional implications of the identified DNA metabolism-related genes, we constructed a protein-protein interaction (PPI) network using the STRING database (version 11.5), which catalogs known and predicted interactions drawn from sources including experimental data and co-expression. Interactions with a confidence score ≥0.4 were included in the network. The network was visualized in Cytoscape (v3.9.1).

Using the CytoHubba plugin, we identified the top hub genes in the network with the degree algorithm, which ranks each gene by its number of interaction partners. Gene Ontology (GO) enrichment analysis was then performed with the clusterProfiler R package across the biological process (BP), cellular component (CC), and molecular function (MF) categories.
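
The confidence filter and degree-based hub ranking can be illustrated with a short sketch. It is not the CytoHubba implementation; the function name `hub_genes`, the edge-tuple layout, the 0 to 1 score scale, and the default of ten hubs are assumptions for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hub_genes(edges: Iterable[Tuple[str, str, float]],
              min_score: float = 0.4,
              top_n: int = 10) -> List[Tuple[str, int]]:
    """Rank genes by degree after dropping low-confidence interactions.

    edges holds (gene_a, gene_b, score) tuples with scores on a 0-1 scale.
    Self-loops and duplicate pairs are ignored so each partner counts once.
    """
    degree = Counter()
    seen = set()
    for gene_a, gene_b, score in edges:
        pair = frozenset((gene_a, gene_b))
        if score < min_score or gene_a == gene_b or pair in seen:
            continue
        seen.add(pair)
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree.most_common(top_n)
```

The result is a list of `(gene, degree)` pairs sorted from the most to the least connected gene.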

Consensus Clustering Analysis

To group patients in the TCGA cohort by their DNA metabolism gene expression patterns, we used consensus clustering. Partitioning around medoids (PAM) was run with 1 − Pearson correlation as the distance metric. Over 1,000 resampling iterations, we built a consensus matrix recording how often each pair of samples clustered together, which was used to choose the optimal number of clusters.
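
Two pieces of this procedure, the 1 − Pearson distance and the pairwise co-clustering (consensus) matrix, can be sketched in plain Python. The clustering step itself is omitted, and the function names and the `runs` input layout are assumptions for this illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) if either profile is constant.
    return 1.0 - cov / (norm_x * norm_y)


def consensus_matrix(n_samples: int,
                     runs: List[Dict[int, int]]) -> List[List[float]]:
    """Fraction of resampling runs in which each sample pair co-clustered.

    Each run maps the indices of the samples drawn in that iteration to
    their cluster labels. Pairs never drawn together get a value of 0.0.
    """
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    matrix = []
    for i in range(n_samples):
        row = []
        for j in range(n_samples):
            if i == j:
                row.append(1.0)
            elif sampled[i][j]:
                row.append(together[i][j] / sampled[i][j])
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix
```

Entries near 0 or 1 indicate stable assignments, while intermediate values flag sample pairs whose grouping depends on the resample.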

Differences in biochemical recurrence-free survival (bRFS) among the clusters were assessed, and clinical variables such as Gleason score and PSA level were compared across clusters with chi-square tests. This linked the gene expression patterns to clinical outcomes in prostate cancer.
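
For reference, the chi-square statistic for a cluster-by-category contingency table can be computed as in the sketch below. It returns only the statistic and degrees of freedom, applies no continuity correction, and uses a hypothetical function name; converting the statistic to a P value would need a chi-square distribution function from a statistics library.

```python
from typing import List, Tuple


def chi_square_statistic(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a table.

    Rows might be clusters and columns clinical categories (for example,
    Gleason score groups). All expected counts must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof
```

Note that some statistics packages apply a continuity correction to 2×2 tables by default, so their output can differ from this uncorrected value.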

Gene Set Variation Analysis

Using the Gene Set Variation Analysis (GSVA) package, we computed per-sample enrichment scores for DNA metabolism-related gene sets across the TCGA cohort. Gene sets were sourced from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology. The enrichment scores were then compared across clusters and displayed as heatmaps highlighting the pathways with statistically significant differences.

Weighted Gene Co-expression Network Analysis

To identify DNA metabolism-related genes with potential clinical relevance, we curated a gene list from the Gene Ontology term for DNA metabolic process. Using weighted gene co-expression network analysis (WGCNA), we built a co-expression network from the TCGA prostate cancer expression dataset, restricted to genes in the top 25% by variance.

A critical step was choosing a soft-thresholding power (β) that satisfied the scale-free topology criterion. Network connectivity was measured with a topological overlap matrix, and hierarchical clustering of that matrix identified gene modules. Each module was summarized by its eigengene, which was then correlated with clinical traits such as Gleason score and recurrence status.
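
The role of the soft-thresholding power can be illustrated with a small sketch that raises absolute correlations to the power β and sums them into per-gene connectivity. It assumes an unsigned network, which the text does not specify, and it stops short of the scale-free fit and the topological overlap matrix. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def soft_threshold_adjacency(cor: Matrix, beta: float) -> Matrix:
    """Unsigned adjacency a_ij = |cor_ij| ** beta.

    The diagonal is set to 0 in this sketch so that connectivity
    excludes self-connections.
    """
    n = len(cor)
    return [
        [0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
        for i in range(n)
    ]


def connectivity(adjacency: Matrix) -> List[float]:
    """Sum of adjacency weights for each gene."""
    return [sum(row) for row in adjacency]
```

Larger values of β shrink weak correlations toward zero faster than strong ones, which is what pushes the connectivity distribution toward a scale-free shape.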

Tumor Immune Microenvironment Analysis

Understanding the tumor immune microenvironment (TIME) is important in cancer research. To compare immune characteristics between high-risk and low-risk clusters, we estimated tumor purity, stromal scores, immune scores, and overall ESTIMATE scores. Immune cell infiltration across cell types was assessed with the ssGSEA algorithm and cross-checked with TIMER and CIBERSORT.

We also examined immune modulators across the clusters, including immunostimulators, chemokines, and receptors.

Machine Learning-Based Signature Construction and Validation

To build a reliable prognostic signature, ten machine-learning algorithms were tested in 101 combinations, with the goal of a signature that was both stable and accurate. Univariate Cox regression first identified candidate prognostic DNA metabolism genes, which were then carried forward for model building and validation.

The models were fitted within a leave-one-out cross-validation framework on the TCGA cohort. Each model was evaluated with Harrell's C-index, and the best-performing one was used to stratify patients into high-risk and low-risk groups.
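
Harrell's C-index measures how often a model assigns the higher risk score to the patient who has the event sooner. The sketch below computes it directly from its definition. The function names are hypothetical, and the median cutoff used to split patients into risk groups is an assumption, since the text does not state which cutoff was applied.

```python
from statistics import median
from typing import List, Sequence


def harrell_c_index(times: Sequence[float], events: Sequence[int],
                    risk_scores: Sequence[float]) -> float:
    """Concordance between risk scores and observed event ordering.

    A pair (i, j) is comparable when patient i had an event and
    time_i < time_j. Tied risk scores count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def stratify_by_median(risk_scores: Sequence[float]) -> List[str]:
    """Label patients 'high' or 'low' risk using the median score (assumed cutoff)."""
    cutoff = median(risk_scores)
    return ["high" if score > cutoff else "low" for score in risk_scores]
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect concordance, which is why the index is a natural yardstick for comparing candidate models.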

Evaluation of Clinical Significance and Comparison of Published Signatures

To put the signature in context, we compared it with previously published DNA metabolism gene signatures in prostate cancer. Relevant models were extracted through literature searches, and their performance was benchmarked statistically.

Immunotherapy Response and Drug Sensitivity Assessment

We integrated TCGA somatic mutation data to profile mutations and explore differences in immunotherapy response between the high- and low-risk groups defined by DNA metabolism genes. Using resources such as the IMvigor210 cohort, we assessed survival outcomes to identify potential therapeutic vulnerabilities related to DNA metabolism.

Overall, this multi-dimensional approach combined several analytical techniques and multiple datasets to clarify the role of DNA metabolism in prostate cancer biology.
