Feature Selection and Machine Learning Model Reveals a Core Set of 81 AMD Genes

Understanding the Challenge of High Dimensionality in RNA-seq Data

Transcriptome data can offer a wealth of information about cellular functions, but it often suffers from the “curse of dimensionality.” In RNA-seq experiments, researchers typically profile tens of thousands of genes, yet they are often limited to a small number of subjects. This imbalance can lead to overfitting and unreliable biological insights. To tackle this issue, we designed a sophisticated pipeline aimed at reducing dimensionality and enhancing the efficiency and interpretability of subsequent analyses.

Our study analyzed a dataset of 105 control and 61 advanced age-related macular degeneration (AMD) samples. We applied three feature selection methods to filter the data down to the most informative genes: the ANOVA (analysis of variance) F-test, AUC (area under the curve), and the Kruskal-Wallis test. We divided the dataset into an 80% training set and a 20% testing set so that performance could be evaluated on held-out samples. After 1,000 iterations of feature selection, we identified a core set of 81 genes, referred to as "ML-genes," that were consistently significant across all three methods.
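As a rough illustration of this kind of consensus filtering, the sketch below runs the three tests on repeated random 80% training subsets of synthetic data and keeps the genes that pass all three filters in most iterations. The cutoffs (`alpha`, `auc_cut`, `min_freq`), the re-drawn split per iteration, the reduced iteration count, and the synthetic expression matrix are all assumptions made for the example, not values taken from the study.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal, rankdata

def auc_per_gene(X, y):
    """Rank-based AUC of each column of X for separating y == 1 from y == 0."""
    ranks = np.apply_along_axis(rankdata, 0, X)
    n_pos, n_neg = int(y.sum()), int((1 - y).sum())
    u = ranks[y == 1].sum(axis=0) - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def select_once(X, y, alpha=0.05, auc_cut=0.65):
    """Indices of genes that pass all three filters on one training subset."""
    ctrl, case = X[y == 0], X[y == 1]
    n_genes = X.shape[1]
    p_f = np.array([f_oneway(ctrl[:, j], case[:, j]).pvalue for j in range(n_genes)])
    p_kw = np.array([kruskal(ctrl[:, j], case[:, j]).pvalue for j in range(n_genes)])
    auc = auc_per_gene(X, y)
    # Treat AUC symmetrically so that down-regulated genes can also pass.
    keep = (p_f < alpha) & (p_kw < alpha) & (np.abs(auc - 0.5) > auc_cut - 0.5)
    return np.flatnonzero(keep)

def stable_genes(X, y, n_iter=20, train_frac=0.8, min_freq=0.9, seed=0):
    """Genes selected in at least min_freq of the random training subsets."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_iter):
        idx = rng.permutation(len(y))
        train = idx[: int(train_frac * len(y))]
        counts[select_once(X[train], y[train])] += 1
    return np.flatnonzero(counts / n_iter >= min_freq)

# Synthetic stand-in: 105 controls, 61 cases, 200 genes, first 5 truly shifted.
rng = np.random.default_rng(1)
y = np.array([0] * 105 + [1] * 61)
X = rng.normal(size=(166, 200))
X[y == 1, :5] += 1.5
core = stable_genes(X, y)
print(core)
```

Requiring agreement across methods and across resamples trades sensitivity for stability: a gene that looks strong in one split or under one test alone is dropped.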

Machine Learning in AMD Classification

Once we narrowed down our list to these 81 ML-genes, we trained four machine learning (ML) models: Neural Network, Logistic Regression, eXtreme Gradient Boosting (XGB), and Random Forest. The data were randomly partitioned into 64% for training, 16% for validation, and 20% for testing. For classification, we identified the optimal probability threshold using Youden’s J statistic.
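The partitioning and threshold choice described above can be sketched as follows. Youden's J is defined as sensitivity + specificity - 1, and the threshold that maximizes it is taken as the cutoff. The function names and the toy scores are hypothetical, and the split is a plain random permutation rather than whatever stratification the study may have used.

```python
import numpy as np

def split_indices(n, seed=0):
    """Random 64% / 16% / 20% split of n sample indices (train, val, test)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = round(0.20 * n)
    n_val = round(0.16 * n)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

def youden_threshold(y_true, scores):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t                   # predicted positive at this cutoff
        sens = np.mean(pred[y_true == 1])    # true positive rate
        spec = np.mean(~pred[y_true == 0])   # true negative rate
        j = sens + spec - 1
        if j > best_j:                       # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j

train, val, test = split_indices(166)
threshold, j_stat = youden_threshold(
    [0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.6, 0.5, 0.8, 0.9]
)
print(len(train), len(val), len(test), threshold, round(j_stat, 3))
```

Choosing the cutoff on the validation set rather than the test set keeps the reported test performance free of threshold tuning.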

Among the models tested, XGB performed best, with an AUC-ROC of 0.80. We therefore used XGB for all further analyses.

Validating the Robustness of the 81 ML-genes

To validate our findings, we compared the performance of the 81 ML-genes against four additional lists:

Genes within 500 kb of 34 known AMD-GWAS loci.
High-confidence AMD genes with established links to the disease.
Literature-identified genes relevant to macular degeneration.
A randomized list generated through label shuffling.

The 81 ML-genes outperformed all four comparison lists, which supports their specificity and relevance to AMD and suggests that these genes may help explain the mechanisms underlying the disease.

Exploring Feature Importance with SHAP Analysis

To understand how each gene contributes to the model’s predictions, we used SHAP (SHapley Additive exPlanations) analysis. This approach computes the contribution of each gene to each sample’s prediction, which makes feature importance easier to interpret. Notably, MOXD1 was highly expressed in AMD samples and expressed at lower levels in controls, underscoring its potential significance in AMD pathology.
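To make the idea of per-sample contributions concrete, the sketch below computes exact Shapley values for a tiny toy model by enumerating every subset of features. It is a simplification: absent features are replaced with a fixed baseline value, the model is a hypothetical three-input function, and practical SHAP implementations for tree models use much faster algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical model: one additive feature plus one interaction."""
    return 2 * z[0] + z[1] * z[2]

sample = [1.0, 2.0, 3.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, sample, reference)
print(phi)
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that makes these values readable as a per-gene breakdown of a single prediction.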

Unpacking Molecular Heterogeneity in AMD

AMD manifests with diverse clinical presentations, including both dry and wet forms, each with varying disease progression rates. To investigate whether this heterogeneity extends to the molecular level, we employed our XGB model trained on the gene list to predict AMD status across samples. Notably, 74% of control samples were accurately classified compared to only 60% of AMD patients, revealing potential complexities in disease presentation.

After identifying samples that were consistently predicted correctly or incorrectly, we categorized them into “70% right” and “70% wrong” groups. Misclassification was more prevalent among AMD patients, suggesting underlying heterogeneity that warrants further exploration.
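One way to form such groups is sketched below. It assumes that "70% right" means a sample was classified correctly in at least 70% of repeated model runs and "70% wrong" means it was misclassified in at least 70% of them. That reading, the `consistency_groups` helper, and the toy predictions are assumptions for illustration only.

```python
def consistency_groups(y_true, runs, cutoff=0.7):
    """Group sample indices by how often repeated runs predicted them correctly.

    y_true: true labels, one per sample.
    runs:   list of prediction lists, one list per model run.
    """
    groups = {"right": [], "wrong": [], "mixed": []}
    for i, truth in enumerate(y_true):
        frac_correct = sum(run[i] == truth for run in runs) / len(runs)
        if frac_correct >= cutoff:
            groups["right"].append(i)
        elif 1 - frac_correct >= cutoff:
            groups["wrong"].append(i)
        else:
            groups["mixed"].append(i)
    return groups

# Toy example: 3 samples, 10 runs.
labels = [0, 1, 1]
runs = (
    [[0, 1, 1]] * 2    # runs where all three samples are correct
    + [[0, 0, 1]] * 3  # sample 1 wrong, sample 2 correct
    + [[0, 0, 0]] * 5  # samples 1 and 2 both wrong
)
groups = consistency_groups(labels, runs)
print(groups)
```

Samples in the "wrong" group are the interesting ones here: they are not noisy predictions but cases the model reliably places in the other class.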

Gene Co-expression Network Analysis

To identify the biological pathways relevant to AMD, we conducted Weighted Gene Co-expression Network Analysis (WGCNA). Gene Ontology (GO) enrichment analysis of the resulting modules identified several associated with immune response, extracellular matrix organization, and the complement pathway, three areas known to contribute to AMD pathology.

Surprisingly, the majority of our identified ML-genes (62 out of 81) clustered within these key modules, which are implicated in AMD’s progression. This suggests that dysregulated immune responses could play a vital role in AMD’s pathogenesis.
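The core step behind a weighted co-expression network can be sketched in a few lines: gene-gene correlations are raised to a power so that strong correlations dominate and weak ones shrink toward zero. The example below builds an unsigned adjacency of this kind on synthetic data. The power `beta=6` and the two simulated modules are illustrative assumptions, and the sketch omits the later WGCNA steps (topological overlap, clustering into modules).

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency |cor|^beta for a samples x genes matrix."""
    cor = np.corrcoef(expr, rowvar=False)
    adj = np.abs(cor) ** beta
    np.fill_diagonal(adj, 0.0)  # ignore self-connections
    return adj

# Synthetic data: two groups of three genes, each driven by its own factor.
rng = np.random.default_rng(0)
factor_a = rng.normal(size=(100, 1))
factor_b = rng.normal(size=(100, 1))
expr = np.hstack([
    factor_a + 0.3 * rng.normal(size=(100, 3)),
    factor_b + 0.3 * rng.normal(size=(100, 3)),
])

adj = soft_threshold_adjacency(expr)
within = adj[:3, :3].sum() / 6   # mean over the 6 off-diagonal entries
between = adj[:3, 3:].mean()
print(round(within, 3), round(between, 5))
```

Genes driven by the same factor end up strongly connected, while connections between the two groups are close to zero, which is the pattern module detection then exploits.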

Shared and Unique Gene Signatures Across AMD Stages

AMD is a progressive disease characterized by early, intermediate, and late stages. Our aim was to explore gene signatures across these stages. Utilizing our ML pipeline on early and intermediate AMD samples yielded 57 and 62 genes, respectively, each demonstrating varying predictive power. Early-stage markers performed less effectively, likely due to subtle gene expression changes during these initial stages.

Interestingly, although the genes identified for the three stages overlapped only to a limited extent, many were enriched within modules related to immune response and extracellular matrix (ECM) pathways, highlighting a shared pattern of biological activation throughout the disease’s progression.

Insights into Cell Type-Specific Gene Expression

Given the retina’s complex cellular makeup, we sought to understand the impacts of AMD at the cellular level. By deconvoluting our RNA-seq data, we pinpointed significant differences in cell-type distributions between controls and AMD samples. Specifically, there were notable increases in microglial, astrocytic, and Müller glial populations, while rod photoreceptor proportions were diminished, indicating a shift in retinal microenvironment due to AMD pathology.
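Deconvolution treats each bulk expression profile as a mixture of cell-type-specific profiles and solves for the mixing proportions. The sketch below uses non-negative least squares as a stand-in; the deconvolution method actually used in the study is not stated here, and the signature matrix and proportions are synthetic.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_proportions(signature, bulk):
    """Estimate cell-type proportions from one bulk expression profile.

    signature: genes x cell_types matrix of reference expression.
    bulk:      length-genes vector for one bulk sample.
    """
    coef, _ = nnls(signature, bulk)          # non-negative mixing weights
    total = coef.sum()
    return coef / total if total > 0 else coef  # rescale to sum to 1

cell_types = ["rod", "microglia", "astrocyte", "Muller glia"]
rng = np.random.default_rng(0)
signature = rng.uniform(0.0, 10.0, size=(50, len(cell_types)))  # synthetic
true_props = np.array([0.60, 0.20, 0.15, 0.05])
bulk = signature @ true_props                # noiseless synthetic mixture

estimated = estimate_proportions(signature, bulk)
print(dict(zip(cell_types, np.round(estimated, 3))))
```

With real data the recovery is approximate, so differences in estimated proportions between groups are usually assessed statistically across samples rather than read off a single profile.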

Further analysis of individual ML-genes revealed a strong expression predominance in astrocytes, microglia, and Müller glia, emphasizing the significance of these cell types in AMD progression.

Genetic Insights from ML-genes

Genetic association studies often struggle to disentangle cause from consequence. To address this, we examined published AMD-GWAS data and tested whether variants in our ML-genes were associated with AMD. Encouragingly, ML-genes identified for both late and early AMD harbored significantly associated variants, which suggests a genetic basis for these genes and a potential contribution to AMD susceptibility.

Conclusion

The methods employed in this study reveal that integrating ML techniques with genomic analyses can yield significant insights into the genetic and molecular mechanisms of AMD. The identified ML-genes show promise not only in understanding disease mechanisms but also in developing targeted therapies for effective intervention against AMD.


Uncovering Cell-Type Specific Immune Signatures in Macular Degeneration Through Explainable Machine Learning and Transcriptomics
