Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases
Recent advances of information technology in biomedical sciences and other
applied areas have created numerous large diverse data sets with a high
dimensional feature space, which provide us a tremendous amount of information
and new opportunities for improving the quality of human life. At the same time,
the continuous arrival of new data creates great challenges, because researchers
must convert these raw data into scientific knowledge in order to benefit from
them. Association studies of complex diseases using SNP
data have become more and more popular in biomedical research in recent years.
In this paper, we present a review of recent statistical advances and
challenges for analyzing correlated high dimensional SNP data in genomic
association studies for complex diseases. The review includes both general
feature reduction approaches for high dimensional correlated data and more
specific approaches for SNP data, which include unsupervised haplotype
mapping, tag SNP selection, and supervised SNP selection using statistical
testing/scoring, statistical modeling and machine learning methods with an
emphasis on how to identify interacting loci.
Comment: Published at http://dx.doi.org/10.1214/07-SS026 in Statistics
Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Genomic patterns of selection and differentiation in African populations and implications for mapping disease association
The main objective of this thesis is to gain a better understanding of genomic patterns of natural selection and population differentiation in Africa, where there is great genetic diversity, and of the implications for genetic mapping of complex diseases.
I began by studying two neighbouring villages in eastern Sudan that are of different ethnicity, Hausa and Masalit, and that appear to have different susceptibility to malaria and visceral leishmaniasis (VL). Specifically, I investigated patterns of linkage disequilibrium (LD) and haplotypic signals of positive selection in the 5q31 genomic region which contains immune genes that have been implicated in susceptibility to malaria and VL.
In my first analysis, by genotyping 34 single nucleotide polymorphisms (SNPs) in the 5q31 region, I did not find signals of selection or population differentiation between the Hausa and Masalit using available statistical methods. I conceived the idea that patterns of LD might provide a more sensitive test of population differentiation, and I developed an approach for this using permutation analysis. This method revealed differentiation between the Hausa, the Masalit and other African ethnic groups.
To better understand signals of selection, I next studied a region of the genome associated with a known malaria resistance factor, the haemoglobin S (HbS) variant of the HBB gene. By genotyping 26 SNPs in the region of the HBB gene, I observed a haplotype that extended in excess of 1 Mb, despite being at high frequency and spanning several recombinational hotspots. This long haplotype carried the HbS allele but, importantly, it could be readily detected without typing the HbS variant.
Building on this observation, I designed a new method to screen the whole genome for long haplotypes that might be signals of selection, and developed a software programme to implement this method. I validated this method using haplotypic data for the Yoruba generated by the HapMap project and complemented by additional SNP data that I generated on HapMap cell lines, and found that the HbS allele resides on a haplotype that extends to 1.2 Mb, and is at strikingly high frequency compared to other haplotypes of similar length on the same chromosome.
Next I applied this method to a large family-based association study of severe malaria in The Gambia, and identified several novel genomic regions with unusually long haplotypes of high frequency. These included a number of regions that may be associated with resistance to severe malaria, and which merit further investigation
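The permutation idea mentioned in this abstract (testing population differentiation through patterns of LD) can be illustrated with a minimal sketch. The abstract does not specify the thesis's test statistic, so everything below is an assumption made for illustration: LD is approximated by the squared Pearson correlation of unphased genotype dosages (0/1/2), and the hypothetical statistic `ld_difference` is the mean absolute difference in pairwise r^2 between two samples.

```python
import random
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP in this sample: treat LD as zero
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_profile(genotypes):
    """r^2 for every SNP pair; `genotypes` holds one row per individual."""
    n_snps = len(genotypes[0])
    cols = [[row[j] for row in genotypes] for j in range(n_snps)]
    return [r_squared(cols[i], cols[j])
            for i, j in combinations(range(n_snps), 2)]


def ld_difference(pop_a, pop_b):
    """Hypothetical statistic: mean absolute difference in pairwise r^2.

    Assumes both samples are typed at the same SNPs, with at least two SNPs.
    """
    a, b = ld_profile(pop_a), ld_profile(pop_b)
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def permutation_p_value(pop_a, pop_b, n_perm=1000, seed=0):
    """Shuffle individuals between the two groups and recompute the statistic."""
    rng = random.Random(seed)
    observed = ld_difference(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    n_a = len(pop_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ld_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A small p-value from `permutation_p_value` would suggest that the two samples differ in their LD structure more than expected if group labels were exchangeable. Other statistics could be substituted for `ld_difference` without changing the permutation scheme.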
Parsimony-based genetic algorithm for haplotype resolution and block partitioning
This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. A multilocus LD measure (Normalized Entropy Difference) is used as the block identification criterion. The proposed algorithm incorporates missing data as a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm is within the range achieved by the other algorithms. The block structure output by our algorithm in general agrees with the block structure for the same data provided by the other algorithms. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some valuable new features, such as scores for block boundaries and a fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds the haplotype resolution and block partitioning for each cluster.
Statistical perspectives on dependencies between genomic markers
To study the genetic impact on a quantitative trait, molecular markers are used as predictor variables in a statistical model. This habilitation thesis elucidates the challenges that accompany such investigations. First, the usefulness of including different kinds of genetic effects, which can be additive or non-additive, was verified. Second, dependencies between markers caused by their proximity on the genome were studied in populations with family stratification. The resulting covariance matrix deserved special attention because it serves multiple functions in several fields of genomic evaluation.
A Haplotype-Based Permutation Approach in Gene-Based Testing
The soaring cost of health care is the biggest public health issue facing our country today. Development of strategies that improve the delivery of health care by identifying high risk individuals for a disease is a major approach to better utilize limited medical resources. Incorporating genomic data into risk stratification models is an essential component for creating these diagnostic and treatment strategies. Although initially applied to just small subsets of disease, advances in technology are making it economically feasible to utilize a patient's genomic data in a wider range of medical disorders. Current genetic association studies are crucial for identifying which loci to include in these models.
Genome Wide Association Studies (GWAS) are a valuable tool for identifying genetic variants associated with disease. Commonly, each SNP is initially tested independently in a GWAS with a univariate analysis. By combining the effects of multiple alleles, multivariate analysis of GWAS may increase power to detect associations and, thus, identify additional risk loci. We employ a haplotype block analysis within gene boundaries for a newly developed gene-based method, “GeneBlock”. GeneBlock is compared in a power analysis with two previously published permutation algorithms (GWiS and Fisher) and a simulation method (Vegas). All methods are tested in an Alzheimer’s disease GWAS consisting of 1334 cases and 1475 controls. Results from the Alzheimer’s analysis were subsequently compared with haplotype and univariate analysis.
Power analyses show that both GeneBlock and GWiS are more powerful than Vegas and Fisher. A combined approach that selects the lowest p-value from Vegas, GWiS, and GeneBlock has higher power than any individual method, even when controlling for the additional multiple comparisons. Fisher and Vegas identify no significant genes in the Alzheimer’s GWAS, while GWiS and GeneBlock identify four (PRDM16, ARHGEF16, HLA-DRA, TRAF1) and three (C17orf51, MGC29506, SLC23A1), respectively. The combined method is also the most powerful in the real GWAS data; it identified all seven of the above significant genes. Comparing single-SNP, haplotype, and gene-level analyses revealed that only about 1/3 of the top 100 genes are shared, indicating a large variance in results between methods.
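The combined approach described above takes the lowest p-value across several gene-based methods and then accounts for the extra comparisons. The abstract does not say which correction was used, so the following is only a sketch under stated assumptions: the function name `min_p_combined` is hypothetical, and it reports two standard adjustments of the minimum p-value (Šidák, which assumes independent tests, and Bonferroni, which is valid under any dependence but can be conservative).

```python
def min_p_combined(p_values):
    """Adjust the smallest of several per-method p-values for one gene.

    Returns (sidak, bonferroni). Both corrections are illustrative choices,
    not necessarily the adjustment used in the study.
    """
    m = len(p_values)
    p_min = min(p_values)
    sidak = 1.0 - (1.0 - p_min) ** m      # assumes the m tests are independent
    bonferroni = min(1.0, m * p_min)      # valid under arbitrary dependence
    return sidak, bonferroni


# Example: p-values for one gene from three hypothetical methods.
sidak_p, bonferroni_p = min_p_combined([0.01, 0.2, 0.5])
```

Because methods applied to the same genotype data are usually positively correlated, both adjustments tend to be conservative; a permutation-based calibration of the minimum p-value would be a natural alternative.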
Genotype imputation as a genomic strategy for the SA Drakensberger beef breed
Indigenous breeds such as the South African (SA) Drakensberger are economically important genetic resources in local beef production because of their adaptive traits and ability to perform competitively at a commercial level. Genomic selection (GS) is a promising technology to accelerate genetic progress in traits relevant to commercial beef production. A major obstacle in applying this methodology has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNPs). Cost reduction can be achieved by exploiting genotype imputation in GS workflows, that is, by genotyping at lower densities and imputing upwards. The overarching aim of this study was to investigate the practicality of applying imputation in such a workflow, utilizing genotypic data for 1 135 SA Drakensberger animals genotyped for 139 480 SNPs. As a pre-imputation step, the first objective was to elucidate inter- and intra-chromosomal patterns in genomic characteristics that may contribute to variability in achievable imputation accuracy across the genome. Inter-chromosomal differences in the estimated proportion of low minor allele frequency (MAF) SNPs varied from 6.6% for Bos taurus autosome (BTA) 23 to 16.0% for BTA14. Pairwise linkage disequilibrium (LD) between adjacent SNPs ranged from r²=0.11 (BTA28) to 0.17 (BTA14). The largest run of homozygosity (ROH), located on BTA13, was 225.82 kilobases (kb) in length and was identified in 23% of the animals sampled. The estimated ROH-based inbreeding coefficients (FROH) (e.g. FROH>1Mb=0.07, where FROH>1Mb denotes FROH calculated for all ROH longer than 1 megabase pair) indicated sufficient within-breed relatedness to achieve accurate imputation. During the imputation step, the imputation accuracies of several custom-derived lower-density panels, varying in SNP density and SNP selection strategy, were compared.
Imputation accuracy increased as SNP density increased; a genotyping panel consisting of 10 000 SNPs, selected based on a combination of their MAF and LD with neighbouring SNPs, could be used to achieve <3% imputation error on average. At this density of SNPs, a mean correlation coefficient (±standard deviation) between true and imputed SNPs of 0.972±0.024 was achieved in a set of validation animals (n=235). Low-MAF SNPs were imputed with lower accuracy; a difference of 0.071 units was observed between the mean accuracy of imputed SNPs categorized into low-MAF (0.01<MAF≤0.1) versus high-MAF (0.4<MAF<0.5) classes. Post-imputation, the utility of imputed genotypes in genomic breeding value (GEBV) estimation was evaluated by comparing prediction accuracies achieved from the use of true versus imputed SNPs in generating the H-inverse matrix applied in single-step GS. Breeding values were estimated for two growth traits, considering direct and maternal components. Prediction accuracies were improved by using genomic information in addition to traditional pedigree information; the largest improvement (0.026 units increase in accuracy) was observed for maternal birth weight. Marginal differences were observed between GEBV accuracies produced from true (GEBV_TRUE) versus imputed genotypes (GEBV_IMPUTED); for example, a mean±standard deviation accuracy of GEBV_TRUE=0.774±0.056 versus GEBV_IMPUTED=0.773±0.055 was observed for direct birth weight, suggesting that imputation errors had an almost negligible influence. The results presented in this thesis demonstrate the usefulness of imputation as a viable genomic strategy towards low-cost implementation of genomically enhanced prediction of EBVs for a breed such as the SA Drakensberger.
Thesis (PhD)--University of Pretoria, 2020. Animal and Wildlife Sciences. PhD (Animal Science).
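The two accuracy measures quoted in this abstract (imputation error rate and the correlation between true and imputed genotypes) can be computed per animal or per SNP along the following lines. This is a minimal sketch, assuming genotypes are coded as allele counts 0/1/2; the function names are hypothetical, and the thesis may have defined or averaged these quantities differently.

```python
from math import sqrt


def genotype_error_rate(true_calls, imputed_calls):
    """Fraction of genotype calls that differ between true and imputed data."""
    mismatches = sum(t != i for t, i in zip(true_calls, imputed_calls))
    return mismatches / len(true_calls)


def genotype_correlation(true_calls, imputed_calls):
    """Pearson correlation between true and imputed allele counts.

    Returns None when either vector is constant, since the correlation
    is undefined in that case (for example, a monomorphic SNP).
    """
    n = len(true_calls)
    mt = sum(true_calls) / n
    mi = sum(imputed_calls) / n
    stt = sum((t - mt) ** 2 for t in true_calls)
    sii = sum((i - mi) ** 2 for i in imputed_calls)
    if stt == 0 or sii == 0:
        return None
    sti = sum((t - mt) * (i - mi) for t, i in zip(true_calls, imputed_calls))
    return sti / sqrt(stt * sii)


# Example: one validation animal with four masked SNPs, one imputed wrongly.
error = genotype_error_rate([0, 1, 2, 2], [0, 1, 2, 1])
```

The correlation is less sensitive to allele frequency than the raw error rate, which is one reason both measures are commonly reported side by side when comparing low-MAF and high-MAF SNP classes.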
Topics in Signal Processing: applications in genomics and genetics
The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics and on the development of algorithms that solve them efficiently. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms, and a sequential Monte Carlo sampling method is employed to estimate the unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced.
An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure that reflects linkage disequilibrium, a biological principle of population genetics.
Towards Personalized Medicine: Computational Approaches to Support Drug Design and Clinical Decision Making
The future looks bright for a clinical practice that tailors the
therapy with the best efficacy and highest safety to a patient. Substantial
amounts of funding have resulted in technological advances regarding
patient-centered data acquisition --- particularly genetic data. Yet, the
challenge of translating this data into clinical practice remains open.
To support drug target characterization, we developed a global maximum
entropy-based method that predicts protein-protein complexes including the
three-dimensional structure of their interface from sequence data. To further
speed up the drug development process, we present methods to reposition drugs
with established safety profiles to new indications leveraging paths in
cellular interaction networks. We validated both methods on known data,
demonstrating their ability to recapitulate known protein complexes and
drug-indication pairs, respectively.
After studying the extent and characteristics of genetic variation with a
predicted impact on protein function across 60,607 individuals, we showed that
most patients carry variants in drug-related genes. However, for the majority
of variants, their impact on drug efficacy remains unknown. To inform
personalized treatment decisions, it is thus crucial to first collate knowledge
from open data sources about known variant effects and to then close the
knowledge gaps for variants whose effect on drug binding is still not
characterized. Here, we built an automated annotation pipeline for
patient-specific variants whose value we illustrate for a set of patients with
hepatocellular carcinoma. We further developed a molecular modeling protocol to
predict changes in binding affinity in proteins with genetic variants which we
evaluated for several clinically relevant protein kinases.
Overall, we expect that each presented method has the potential to advance
personalized medicine by closing knowledge gaps about protein interactions and
genetic variation in drug-related genes. To reach clinical applicability,
challenges with data availability need to be overcome and prediction
performance should be validated experimentally.
Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases
Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining these data with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants.
Machine learning-based feature selection algorithms have been shown to create effective predictive models for various genotype-phenotype relationships. This work explores the problem of selecting the genetic variant subsets that are most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were shown not only to predict the disease phenotypes effectively, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was further extended so that it could be implemented on parallel computers, helping to ensure that the methods will also scale to NGS data sets.
Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, the models can yield overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype–phenotype relationships and biological insights from genetic data sets.
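The role of nested cross-validation mentioned above can be made concrete with a small sketch. It is not the thesis's pipeline: the filter (absolute correlation with the phenotype), the nearest-centroid classifier, and all function names are assumptions chosen to keep the example dependency-free. The point it illustrates is that feature selection and the choice of how many features to keep happen inside the training data of each outer fold, so the outer test fold never influences the model.

```python
import random


def k_folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy) / (sxx * syy) ** 0.5


def fit(X, y, train, n_features):
    """Filter selection plus nearest-centroid model, using `train` rows only."""
    labels = [y[i] for i in train]
    scores = [abs_corr([X[i][j] for i in train], labels)
              for j in range(len(X[0]))]
    chosen = sorted(range(len(scores)), key=lambda j: -scores[j])[:n_features]
    centroids = {}
    for c in set(labels):
        rows = [X[i] for i in train if y[i] == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in chosen]
    return chosen, centroids


def accuracy(model, X, y, test):
    """Share of `test` rows assigned to the class with the closest centroid."""
    chosen, centroids = model
    correct = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(chosen, cen))
                for c, cen in centroids.items()}
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(test)


def cv_accuracy(X, y, indices, n_features, k):
    """Plain k-fold accuracy, restricted to the rows listed in `indices`."""
    folds = k_folds(indices, k)
    total = 0.0
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        total += accuracy(fit(X, y, train, n_features), X, y, folds[f])
    return total / k


def nested_cv(X, y, candidates=(1, 5, 20), outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates accuracy; inner loop picks the feature count."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    outer = k_folds(indices, outer_k)
    total = 0.0
    for f in range(outer_k):
        train = [i for g in range(outer_k) if g != f for i in outer[g]]
        # Tune on the outer training rows only; the test fold stays unseen.
        best = max(candidates,
                   key=lambda m: cv_accuracy(X, y, train, m, inner_k))
        total += accuracy(fit(X, y, train, best), X, y, outer[f])
    return total / outer_k
```

Here `X` is a list of genotype rows (one per individual) and `y` the matching class labels; the sketch assumes there are enough samples that no fold is empty. Selecting features on the full data set before splitting would leak information from the test folds and typically inflates the estimated accuracy.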