More than 15 years after the initial sequencing of the human genome, exome and whole-genome sequencing are widely performed for research and clinical applications. Despite much progress, pinpointing the few phenotypically causal variants among the millions of variants in our genomes remains a major challenge. To illustrate the problem, the NCBI dbSNP and ClinVar databases report almost 700 million variants discovered in healthy and diseased humans, but only about 500,000 (<0.1%) are clinically or functionally characterized. For diagnostics, it is critical to identify the causal variants among thousands of variants with no physiological effect.
Many algorithms for predicting the functional impact of variants have been proposed, but they are largely limited to highly conserved positions in protein-coding sequences and do not interpret variants genome-wide. Kircher et al. previously developed a computational method (Combined Annotation Dependent Depletion, CADD) that combines diverse annotations, ranging from large-scale epigenetic experiments and comparisons of genomes across species to gene model annotations. Using a linear model, CADD integrates the available information in a unified framework and quantifies organismal deleteriousness on a whole-genome and variant-specific scale. While CADD has been successfully applied in thousands of disease studies, its best performance is still observed for the interpretation of variants in and around protein-coding genes.
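To make the idea of integrating annotations with a linear model concrete, the following is a minimal sketch, not CADD's actual implementation. It fits a logistic regression by gradient descent on a handful of synthetic variants and uses the resulting linear combination as a deleteriousness score. The annotation names, the toy values and the labels are all assumptions chosen for illustration only.

```python
import math

# Hypothetical annotation names; these are NOT CADD's actual feature set.
FEATURES = ["conservation", "regulatory_signal", "is_coding"]

# Synthetic toy variants (one row per variant) with proxy labels:
# 1 = proxy-deleterious, 0 = proxy-benign. Values are made up for illustration.
train_x = [
    [0.9, 0.8, 1.0],
    [0.8, 0.6, 1.0],
    [0.7, 0.9, 0.0],
    [0.2, 0.1, 0.0],
    [0.1, 0.3, 0.0],
    [0.3, 0.2, 1.0],
]
train_y = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Linear combination of annotations; higher means more likely deleterious."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(rows[0])
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(score(x, w, b)) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


w, b = fit_logistic(train_x, train_y)

# Score two unseen toy variants that differ only in the continuous annotations.
high = score([0.9, 0.9, 1.0], w, b)
low = score([0.1, 0.1, 1.0], w, b)
```

In this sketch each annotation contributes independently and additively to the score, which is the property that makes a linear model easy to train and interpret, and also what prevents it from capturing interactions between annotations.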
The reasons for this weaker performance outside protein-coding regions are manifold, ranging from the lack of domain-specific features (e.g. non-coding sequence species, regulatory sequences, 3D genome architecture) to shortcomings in the actual model (e.g. unmodeled non-linearity, missing feature interactions, mislabeling of the training data). Here, we propose a joint effort between a group that routinely analyses and interprets genetic data from individual patients, families or large research cohorts and a research group developing computational methods for variant prioritization. Together, we aim to significantly improve the current method and to advance the automatic reporting of potentially clinically relevant variants from genetic data. For this purpose, our project has the following aims: (1) Establishing model training for unlabeled or mislabeled data, for example by semi-supervised and iterative learning approaches. (2) Systematically exploring feature interactions and non-linearity, including but not limited to alternative learning approaches like boosted trees and neural networks, feature transformations, or automated selection of interaction terms. (3) Integrating large sets of correlated annotations (e.g. ENCODE, IHEC) through dimensionality reduction, orthogonalization approaches, parallelization and training via subsampling, or hierarchical integration of models. (4) Using additional, genome-wide available measures of sequence constraint, like population variant density and sequence-dependent mutational load, to complement species conservation.
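As an illustration of aim (2), the sketch below shows one simple way to enumerate candidate interaction terms: the feature vector of a variant is extended by the product of every pair of annotations, so that a linear model trained on the extended vector can weight joint effects. This covers only the generation of candidates; the automated selection among them (for example by regularization) is not shown. The function name, the annotation names and the toy values are assumptions made for this example.

```python
from itertools import combinations


def add_pairwise_interactions(names, values):
    """Append the product of every pair of annotations to the feature vector."""
    if len(names) != len(values):
        raise ValueError("names and values must have the same length")
    out_names = list(names)
    out_values = list(values)
    for i, j in combinations(range(len(values)), 2):
        out_names.append(f"{names[i]}*{names[j]}")
        out_values.append(values[i] * values[j])
    return out_names, out_values


# Hypothetical annotation names with toy values for a single variant.
names = ["conservation", "open_chromatin", "is_coding"]
values = [0.5, 2.0, 1.0]
expanded_names, expanded_values = add_pairwise_interactions(names, values)
```

With n annotations this adds n(n-1)/2 product terms, so for large annotation sets the number of candidates grows quadratically, which is one reason a selection step or a model that learns interactions implicitly becomes necessary.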