Feature Selection and Machine Learning Models Reveal a Core Set of 81 AMD Genes
Understanding the Challenge of High Dimensionality in RNA-seq Data
Transcriptome data can offer a wealth of information about cellular functions, but it often suffers from the “curse of dimensionality.” In RNA-seq experiments, researchers typically profile tens of thousands of genes, yet they are often limited to a small number of subjects. This imbalance can lead to overfitting and unreliable biological insights. To tackle this issue, we designed a sophisticated pipeline aimed at reducing dimensionality and enhancing the efficiency and interpretability of subsequent analyses.
Our study analyzed a dataset comprising 105 controls and 61 advanced age-related macular degeneration (AMD) samples. We applied three feature selection methods, the ANOVA (analysis of variance) F-test, AUC (area under the curve), and the Kruskal-Wallis test, to filter the data down to the features that best separate AMD from control samples. We split the dataset into an 80% training set and a 20% test set so that performance could be evaluated on samples held out from training. After 1,000 iterations of feature selection, we identified a core set of 81 genes, referred to as "ML-genes," that were consistently significant across all three methods.
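The repeated selection step can be sketched as follows. This is a minimal, dependency-free illustration rather than the study's implementation: the exact selection rule is not given here, so the `top_k` cutoff, the `min_freq` stability threshold, the reduced iteration count, and the simulated data are all assumptions. In each iteration the samples are reshuffled, the three statistics are computed on the 80% training portion, and a gene counts as selected only if it ranks in the top `top_k` for all three.

```python
import random

def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def anova_f(cases, controls):
    """One-way ANOVA F statistic for two groups (1 between-group df)."""
    groups = (cases, controls)
    means = [sum(g) / len(g) for g in groups]
    grand = (sum(cases) + sum(controls)) / (len(cases) + len(controls))
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return between / (within / (len(cases) + len(controls) - 2))

def rank_scores(cases, controls):
    """Return (|AUC - 0.5|, Kruskal-Wallis H without tie correction)."""
    n1, n2 = len(cases), len(controls)
    n = n1 + n2
    ranks = average_ranks(cases + controls)
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    auc = (r1 - n1 * (n1 + 1) / 2) / (n1 * n2)
    h = 12 / (n * (n + 1)) * (r1 ** 2 / n1 + r2 ** 2 / n2) - 3 * (n + 1)
    return abs(auc - 0.5), h

def core_genes(expr, labels, n_iter=50, top_k=3, min_freq=0.9, seed=0):
    """Genes in the top_k of all three methods in >= min_freq of iterations.

    top_k and min_freq are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    samples = list(range(len(labels)))
    hits = dict.fromkeys(expr, 0)
    for _ in range(n_iter):
        rng.shuffle(samples)
        train = samples[: int(0.8 * len(samples))]  # 80% training split
        f, auc, kw = {}, {}, {}
        for gene, values in expr.items():
            cases = [values[i] for i in train if labels[i] == 1]
            controls = [values[i] for i in train if labels[i] == 0]
            f[gene] = anova_f(cases, controls)
            auc[gene], kw[gene] = rank_scores(cases, controls)
        tops = [set(sorted(s, key=s.get, reverse=True)[:top_k]) for s in (f, auc, kw)]
        for gene in set.intersection(*tops):
            hits[gene] += 1
    return sorted(g for g, c in hits.items() if c / n_iter >= min_freq)

def toy_data(n_genes=20, n_per_class=20, shift=3.0, seed=1):
    """Simulated expression: only g0 and g1 differ between classes."""
    rng = random.Random(seed)
    labels = [1] * n_per_class + [0] * n_per_class
    expr = {}
    for j in range(n_genes):
        effect = shift if j < 2 else 0.0
        expr[f"g{j}"] = [rng.gauss(effect * y, 1.0) for y in labels]
    return expr, labels

expr, labels = toy_data()
core = core_genes(expr, labels)
print(core)
```

On real data, established library routines for these tests would normally replace the hand-written statistics; they are written out here only to keep the sketch self-contained.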
Machine Learning in AMD Classification
Once we narrowed down our list to these 81 ML-genes, we employed four Machine Learning (ML) models: Neural Network, Logistic Regression, eXtreme Gradient Boosting (XGB), and Random Forest. The data were randomly partitioned into 64% for training, 16% for validation, and 20% for testing. For classification, we chose the decision threshold that maximizes Youden’s J statistic (sensitivity + specificity − 1).
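As a small illustration of threshold selection with Youden’s J, the sketch below scans every observed score as a candidate cutoff and keeps the one with the largest J. The labels and scores are made up, and the function name is hypothetical. Which data split the threshold was tuned on is not stated here.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when its score is >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j

# Made-up example: 1 = AMD, 0 = control; scores are predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
threshold, j = youden_threshold(y_true, scores)
print(threshold, round(j, 3))
```

Here a cutoff of 0.5 captures all three positives while misclassifying one of five negatives, which gives J = 1.0 + 0.8 − 1 = 0.8.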
Among the models tested, XGB performed best, with an area under the ROC curve (AUC-ROC) of 0.80. We therefore used XGB for all subsequent analyses.
Validating the Robustness of the 81 ML-genes
To validate our findings, we compared the performance of the 81 ML-genes against four additional lists:
- Genes within 500 kb of 34 known AMD-GWAS loci.
- High-confidence AMD genes with established connections to the disease.
- Literature-identified genes relevant to macular degeneration.
- A randomized list generated through label shuffling.
The 81 ML-genes outperformed all four comparison lists, which supports their specificity and relevance to AMD and suggests that they may help explain the mechanisms underlying the disease.
Exploring Feature Importance with SHAP Analysis
To understand how each gene contributes to the model’s predictions, we used SHAP (SHapley Additive exPlanations) analysis. This approach computes the contribution of each gene to each sample’s prediction, which makes feature importance easier to interpret. Notably, expression of the gene MOXD1 was high in AMD samples and lower in controls, underscoring its significance in AMD pathology.
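The idea behind per-sample contributions can be shown with an exact Shapley computation on a tiny model. This is a toy sketch, not the analysis used in the study: the three-feature scoring function, the sample, and the baseline are all hypothetical, and "missing" features are simply replaced by baseline values. Tree-based models are usually explained with dedicated, faster algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for one sample.

    Features outside a coalition are set to their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi

# Hypothetical score with one additive gene and two interacting genes.
def toy_model(features):
    return 2 * features[0] + features[1] * features[2]

sample = [3.0, 1.0, 2.0]
baseline = [1.0, 0.0, 0.0]
phi = shapley_values(toy_model, sample, baseline)
print(phi)
```

The contributions sum to the difference between the sample’s prediction and the baseline prediction (8 − 2 = 6 in this example), and the two interacting features split their joint effect equally.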
Unpacking Molecular Heterogeneity in AMD
AMD manifests with diverse clinical presentations, including both dry and wet forms, each with varying rates of disease progression. To investigate whether this heterogeneity extends to the molecular level, we used our XGB model trained on the ML-gene list to predict AMD status across samples. Notably, 74% of control samples were correctly classified compared with only 60% of AMD samples, revealing potential complexities in disease presentation.
After identifying samples that were consistently predicted correctly or incorrectly, we categorized them into “70% right” and “70% wrong” groups. This analysis showed that misclassification was more prevalent among AMD patients, suggesting underlying genetic heterogeneity that warrants further exploration.
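One way to form such groups is to track, for every sample, the fraction of repeated runs in which it was predicted correctly. The sketch below assumes that "70% right" means correct in at least 70% of the runs where the sample was evaluated, and "70% wrong" the reverse. That reading, the function name, and the toy predictions are assumptions, not details from the study.

```python
from collections import defaultdict

def consistency_groups(truth, runs, cutoff=0.7):
    """Group samples by how consistently they are predicted correctly.

    truth: dict sample_id -> true label
    runs:  list of dicts, each mapping sample_id -> predicted label for the
           samples evaluated in that run
    """
    seen = defaultdict(int)
    right = defaultdict(int)
    for run in runs:
        for sample, predicted in run.items():
            seen[sample] += 1
            right[sample] += int(predicted == truth[sample])

    groups = {"right": [], "wrong": [], "mixed": []}
    for sample in sorted(seen):
        frac_right = right[sample] / seen[sample]
        if frac_right >= cutoff:
            groups["right"].append(sample)
        elif 1 - frac_right >= cutoff:
            groups["wrong"].append(sample)
        else:
            groups["mixed"].append(sample)
    return groups

# Toy predictions over 10 runs (1 = AMD, 0 = control).
truth = {"ctrl_1": 0, "amd_1": 1, "amd_2": 1}
runs = [
    {
        "ctrl_1": 0 if r < 5 else 1,  # correct in 5 of 10 runs
        "amd_1": 1 if r < 8 else 0,   # correct in 8 of 10 runs
        "amd_2": 1 if r < 1 else 0,   # correct in 1 of 10 runs
    }
    for r in range(10)
]
groups = consistency_groups(truth, runs)
print(groups)
```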
Gene Co-expression Network Analysis
To identify the biological pathways relevant to AMD, we conducted Weighted Gene Co-expression Network Analysis (WGCNA). Gene Ontology (GO) analysis of the resulting co-expression modules identified several associated with immune response, extracellular matrix (ECM) organization, and the complement pathway, three areas known to contribute to AMD pathology.
Surprisingly, the majority of our identified ML-genes (62 out of 81) clustered within these key modules, which are implicated in AMD’s progression. This suggests that dysregulated immune responses could play a vital role in AMD’s pathogenesis.
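Whether an overlap like this exceeds what chance would produce is commonly checked with a hypergeometric (over-representation) test. The text does not say whether such a test was run, and the background counts it would need are not given, so the sketch below uses small made-up numbers purely to show the calculation.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement from a
    population containing `successes` genes that belong to the modules."""
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(population, draws)

# Made-up numbers: 20 background genes, 5 of them in the modules of interest,
# and a selected list of 4 genes of which 3 fall in those modules.
p_value = hypergeom_upper_tail(k=3, population=20, successes=5, draws=4)
print(round(p_value, 4))
```

A small tail probability indicates that the selected genes fall in the modules more often than a random gene list of the same size would.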
Shared and Unique Gene Signatures Across AMD Stages
AMD is a progressive disease characterized by early, intermediate, and late stages. Our aim was to explore gene signatures across these stages. Utilizing our ML pipeline on early and intermediate AMD samples yielded 57 and 62 genes, respectively, each demonstrating varying predictive power. Early-stage markers performed less effectively, likely due to subtle gene expression changes during these initial stages.
Interestingly, while there was limited overlap in identified genes among the three stages, many were enriched within modules related to immune response and ECM pathways, highlighting a shared biological activation pattern that runs through the disease’s progression.
Insights into Cell Type-Specific Gene Expression
Given the retina’s complex cellular makeup, we sought to understand the impacts of AMD at the cellular level. By deconvoluting our RNA-seq data, we pinpointed significant differences in cell-type distributions between controls and AMD samples. Specifically, there were notable increases in microglial, astrocytic, and Müller glial populations, while rod photoreceptor proportions were diminished, indicating a shift in retinal microenvironment due to AMD pathology.
Further analysis of individual ML-genes revealed a strong expression predominance in astrocytes, microglia, and Müller glia, emphasizing the significance of these cell types in AMD progression.
Genetic Insights from ML-genes
Expression-based analyses often struggle to disentangle cause from consequence. To address this, we examined published AMD-GWAS data to test whether variants in our ML-genes are associated with AMD. Genes from both the late-stage and early-stage analyses harbored significantly associated variants, which suggests a genetic basis for these ML-genes and a possible contribution to AMD susceptibility.
Conclusion
The methods employed in this study reveal that integrating ML techniques with genomic analyses can yield significant insights into the genetic and molecular mechanisms of AMD. The identified ML-genes show promise not only in understanding disease mechanisms but also in developing targeted therapies for effective intervention against AMD.