Multi-population GWA mapping via multi-task regularized regression.
MOTIVATION: Population heterogeneity through admixing of different founder populations can produce spurious associations in genome-wide association studies that are linked to the population structure rather than the phenotype. Since samples from the same population generally co-evolve, different populations may or may not share the same genetic underpinnings for the seemingly common phenotype. Our goal is to develop a unified framework for detecting causal genetic markers through a joint association analysis of multiple populations.
RESULTS: Based on a multi-task regression principle, we present a multi-population group lasso algorithm using L1/L2-regularized regression for joint association analysis of multiple populations that are stratified either via population survey or computational estimation. Our algorithm combines information from genetic markers across populations to identify causal markers. It also implicitly accounts for correlations between the genetic markers, thus enabling better control over false positive rates. Joint analysis across populations enables the detection of weak associations common to all populations with greater power than a separate analysis of each population. At the same time, the regression-based framework allows causal alleles that are unique to a subset of the populations to be correctly identified. We demonstrate the effectiveness of our method on HapMap-simulated and lactase persistence datasets, where we significantly outperform state-of-the-art methods, with greater power for detecting weak associations and fewer spurious associations.
AVAILABILITY: Software will be available at http://www.sailing.cs.cmu.edu/.
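The L1/L2 penalty named in the abstract can be illustrated with a minimal sketch of its proximal operator (row-wise group soft-thresholding). This is not the authors' software: the function name, the coefficient values, and the threshold `lam` are all hypothetical, and a full solver would alternate this step with gradient updates on the regression loss. Each row holds one marker's effects across populations, so a marker is kept or dropped jointly for all populations.

```python
import math

def group_soft_threshold(coefs, lam):
    """Proximal operator of lam * sum_j ||coefs[j]||_2 (the L1/L2 penalty).

    coefs[j][k] is the (hypothetical) effect of marker j in population k.
    A row whose Euclidean norm is at most lam is set to zero as a whole;
    otherwise the row is shrunk toward zero by a common factor.
    """
    shrunk = []
    for row in coefs:
        norm = math.sqrt(sum(b * b for b in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        shrunk.append([scale * b for b in row])
    return shrunk

# Illustrative values only: 3 markers (rows) x 2 populations (columns).
beta = [
    [0.30, 0.40],    # moderate effect shared by both populations
    [0.05, -0.05],   # negligible effect in both populations
    [0.00, 0.90],    # effect present in one population only
]
shrunk = group_soft_threshold(beta, lam=0.25)
for marker, row in enumerate(shrunk):
    print(marker, [round(b, 3) for b in row])
```

In this toy input the second marker is removed in both populations at once, while the third keeps its population-specific effect, which mirrors the behaviour the abstract describes for shared and subset-specific associations.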
Mining brain imaging and genetics data via structured sparse learning
Indiana University-Purdue University Indianapolis (IUPUI)
Alzheimer's disease (AD) is a neurodegenerative disorder characterized by gradual loss of brain function, usually preceded by memory impairment. It widely affects Americans over 65 years old and is listed as the 6th leading cause of death. Moreover, unlike in other diseases, loss of brain function during AD progression usually leads to a significant decline in self-care abilities. This places considerable pressure on family members, friends, communities and society as a whole because of time-consuming daily care and high health care expenditures. In the past decade, while deaths attributed to the number one cause, heart disease, have decreased 16 percent, deaths attributed to AD have increased 68 percent. This situation will continue to deteriorate as the population ages over the next several decades.
To prevent such a health care crisis, substantial efforts have been made to help cure, slow or stop the progression of the disease. The massive data generated through these efforts, such as multimodal neuroimaging scans and next-generation sequencing data, provide unprecedented opportunities for researchers to study the disease in depth, with more confidence and precision. While many attempts have been made to apply existing machine learning and statistical models, the correlated structure and high dimensionality of imaging and genetics data are generally ignored or avoided through targeted analysis. The performance of these models in imaging genetics studies is therefore limited and leaves considerable room for improvement.
The primary contribution of this work lies in the development of novel prior knowledge-guided regression and association models, and their application to various neurobiological problems, such as the identification of imaging biomarkers related to cognitive performance and of imaging genetics associations. In summary, this work has achieved the following research goals: (1) exploration of multimodal imaging biomarkers of various cognitive functions using group-guided learning algorithms, (2) development and application of a novel network structure-guided sparse regression model, (3) development and application of a novel network structure-guided sparse multivariate association model, and (4) improvement of computational efficiency through parallelization strategies.
Computational integration of genome-wide observational and functional data in cancer
The emergence of high-throughput technologies is enabling the characterization of cancer genomes at unprecedented resolution and scale. However, such data suffer from the typical limitations of observational studies, which are frequently challenged by their inability to differentiate between causality and correlation. Recently, several datasets of genome-wide functional assays performed on tumor cell lines have become available. Given the ability of these assays to interrogate cancer genomes for the function of each individual gene, these data can provide vital cues to identify causal events and, with them, novel drug targets. Unfortunately, current analytical methods have been unable to overcome the challenges posed by these assays, which include poor signal-to-noise ratio and widespread off-target effects.
Given the largely orthogonal strengths and weaknesses of descriptive analysis of genetic and genomic observational data from cancer genomes and genome-wide functional screening, I hypothesized that integrating the two data types into unified computational models would significantly increase the power of the biological analysis. In this dissertation I use integrative approaches to tackle two crucial problems in cancer research: the identification of driver genes and the discovery of tumor lethalities. I use the resulting methods to study breast cancer, the second most common form of this disease.
The first part of the dissertation focuses on the analysis of regions of copy number alteration for the identification of driver genes. I first describe how a simple integrative method enabled the identification of BIN3, a novel driver of metastasis in breast cancer. I then describe Helios, an unsupervised method for the identification of driver genes in regions of SCNA that integrates different data sources into a single probabilistic score. Applying Helios to breast cancer data identified a set of candidate drivers highly enriched with known drivers (p-value < e-14). In vitro validation of 12 novel candidates predicted by Helios found that 10 conferred enhanced anchorage-independent growth, demonstrating Helios's exquisite sensitivity and specificity. I further provide an extensive characterization of RSF-1, a driver identified by Helios whose amplification correlates with poor prognosis, which displayed increased tumorigenesis and metastasis in mouse models.
The second part of this dissertation addresses the problem of identifying tumor vulnerabilities using genome-wide shRNA screens across tumor cell lines. I approach this endeavor using a novel integrative method that employs different biomarkers of cellular state to facilitate the identification of clusters of hairpins with similar phenotype. When applied to breast cancer data, the method not only recapitulates the main subtypes and lethalities associated with this malignancy, but also identifies several novel putative lethalities.
Taken together, this research demonstrates the importance of the computational integration of genome-wide functional and observational data in cancer research, providing novel approaches that yield important insights into the biology of the disease.
Genome-wide prediction of breeding values and mapping of quantitative trait loci in stratified and admixed populations
Ideally, genome-wide association studies require homogeneous samples originating from randomly mating populations with minimal pedigree relationship. In reality, however, such samples are very hard to collect. Non-random mating combined with artificial selection has created complex patterns of population structure and relationship in commercial crop and livestock populations. This makes proper modeling of population structure and kinship a necessary step in all genome-wide association studies. Otherwise, the risk of both false positives (declaring a marker significant without it being linked to a QTL) and false negatives (declaring markers linked to a QTL non-significant) increases dramatically.
In this thesis, we first applied the genomic selection (GS) approach to develop equations for predicting breeding values of purebred candidates based on a model trained on an admixed or crossbred population. In this approach, all marker effects are treated as random and are fitted simultaneously. It was hypothesized that, given high-density marker data and the GS approach, training in a crossbred or admixed population could be as accurate as training in a purebred population that is the target of selection. In a stochastic simulation study, it was shown that both crossbred and admixed populations could predict breeding values of a purebred population without the need for explicit modeling of breed composition and pedigree relationship. However, the accuracy of GS was greatly reduced when genes from the target pure breed were not included in the admixed or crossbred training population. In addition, it was shown that the accuracy of GS depends on the genetic distance between the training and validation populations: the closer the relationship between the two, the higher the prediction accuracy. Further, increasing marker density improved the accuracy of prediction, especially when a crossbred population was used as the training dataset. Considering haplotypes with weak linkage disequilibrium (LD), the crossbreds showed extensive LD, whereas the LD in the purebreds was confined to smaller segments. In contrast, examination of the length of haplotypes with strong LD indicated that these haplotypes are much shorter in crossbreds than in purebreds. Our results showed that the number of haplotypes with strong LD is lower in crossbred populations than in purebred populations. The findings of this research suggest that crossbred populations are more suitable for QTL fine mapping than purebreds.
In addition, in another simulation study we compared the power, false-positive rate, accuracy and positive predictive value of QTL mapping in an admixed population with and without modeling of breed composition. The performance of ordinary least squares (OLS) and mixed model (MLM) methods, both fitting one marker at a time, was compared to that of a Bayesian multiple-regression (BMR) method that fitted all markers simultaneously. The OLS method showed the highest rate of false positives because it ignores breed composition and pedigree relationship. The MLM approach showed spurious false positives when breed composition was not accounted for. The BMR outperformed both the OLS and MLM approaches. It was shown that BMR could mitigate the confounding effects of breed composition and relationship without compromising its power. In contrast to the MLM, where fitting breed composition reduced both power and false-positive rate, considering breed composition in the BMR resulted in a loss of power without a change in false-positive rate. It was concluded that the BMR is able to self-correct for the effects of population structure and relatedness.
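The confounding that the abstract attributes to ignoring breed composition can be sketched with a toy single-marker OLS fit. The data, the two discrete breed labels, and the helper names below are hypothetical and far simpler than the simulated admixed populations in the thesis. Centering genotype and phenotype within each breed gives the same marker slope as fitting breed as a fixed effect.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def center_within_groups(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Hypothetical data: breed B carries more copies of the allele and also has
# a higher phenotypic mean, but the marker has no effect within either breed.
breed = ["A"] * 4 + ["B"] * 4
genotype = [0, 0, 1, 1, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 1.0, 1.2, 2.0, 2.2, 2.0, 2.2]

naive_slope = ols_slope(genotype, phenotype)
adjusted_slope = ols_slope(
    center_within_groups(genotype, breed),
    center_within_groups(phenotype, breed),
)
print(round(naive_slope, 3), round(adjusted_slope, 3))
```

The pooled fit reports an apparent marker effect that comes entirely from the breed difference, and the effect vanishes once breed is accounted for.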
Statistical Methods in Neuroimaging Genetics: Pathways Sparse Regression and Cluster Size Inference
In the field of neuroimaging genetics, brain images are used as phenotypes in the search for genetic variants associated with brain structure or function. This search presents a formidable statistical challenge, not least because of the very high dimensionality of genotype and phenotype data produced by modern SNP (single nucleotide polymorphism) arrays and high resolution MRI. This thesis focuses on the use of multivariate sparse regression models such as the group lasso and sparse group lasso for the identification of gene pathways associated with both univariate and multivariate quantitative traits.
The methods described here take particular account of various factors specific to pathways genome-wide association studies, including widespread correlation (linkage disequilibrium) between genetic predictors, and the fact that many variants overlap multiple pathways. A resampling strategy that exploits finite sample variability is employed to provide robust rankings for pathways, SNPs and genes. Comprehensive simulation studies are presented comparing one proposed method, pathways group lasso with adaptive weights, to a popular alternative. This method is extended to the case of a multivariate phenotype, and the resulting pathways sparse reduced-rank regression model and algorithm are applied to a study identifying gene pathways associated with structural change in the brain characteristic of Alzheimer's disease. The original model is also adapted for the task of 'pathways-driven' SNP and gene selection, and this latter model, pathways sparse group lasso with adaptive weights, is applied in a search for SNPs and genes associated with elevated lipid levels in two separate cohorts of Asian adults.
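One common way to turn resampling into a ranking is to count how often each pathway is selected across subsamples. The sketch below assumes that reading. The abstract does not spell out the thesis's exact procedure, so the function name, the pathway labels, and the selection sets are hypothetical, and the sparse model fit that would produce each set is omitted.

```python
from collections import Counter

def rank_by_selection_frequency(selections):
    """Rank items by the fraction of subsamples in which they were selected.

    selections: list of sets, one per subsample, each holding the names
    chosen by a sparse model fitted to that subsample.
    Returns (name, frequency) pairs, most frequent first; ties are broken
    alphabetically so the ordering is deterministic.
    """
    counts = Counter(name for chosen in selections for name in chosen)
    n_subsamples = len(selections)
    pairs = [(name, count / n_subsamples) for name, count in counts.items()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical selections from four subsamples of the data.
subsample_selections = [
    {"pathway_A", "pathway_B"},
    {"pathway_A"},
    {"pathway_A", "pathway_C"},
    {"pathway_B", "pathway_A"},
]
ranking = rank_by_selection_frequency(subsample_selections)
for name, freq in ranking:
    print(name, freq)
```

Items that are selected only occasionally fall to the bottom of the list, which is the sense in which such a ranking is robust to the variability of any single fit.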
Finally, in a separate section an existing method for the identification of spatially extended clusters of image voxels with heightened activation is evaluated in an imaging genetic context. This method, known as cluster size inference, rests on a number of assumptions. Using real imaging and SNP data, false positive rates are found to be poorly controlled outside of a narrow range of parameters related to image smoothness and activation thresholds for cluster formation.