The perpetual motion machine of AI-generated data and the distraction of ChatGPT-as-scientist
Since ChatGPT works so well, are we on the cusp of solving science with AI?
Is not AlphaFold2 suggestive that the potential of LLMs in biology and the
sciences more broadly is limitless? Can we use AI itself to bridge the lack of
data in the sciences in order to then train an AI? Herein we present a
discussion of these topics
A view of Estimation of Distribution Algorithms through the lens of Expectation-Maximization
We show that a large class of Estimation of Distribution Algorithms,
including, but not limited to, Covariance Matrix Adaptation, can be written as a
Monte Carlo Expectation-Maximization algorithm, and as exact EM in the limit of
infinite samples. Because EM sits on a rigorous statistical foundation and has
been thoroughly analyzed, this connection provides a new coherent framework
with which to reason about EDAs
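
As a rough illustration of the sample, select, and refit loop that EDAs share, the following is a minimal sketch of a generic Gaussian EDA with truncation selection. It is not the authors' construction and does not show the mapping to EM; the function names, the elite fraction, the objective, and the starting distribution are all assumptions chosen for the example. The refit step is a maximum-likelihood Gaussian fit to the selected samples, which is the kind of update the abstract relates to an M-step.

```python
import numpy as np


def gaussian_eda(objective, mean, cov, n_samples=200, elite_frac=0.2,
                 n_iters=30, seed=0):
    """Generic Gaussian EDA (illustrative sketch, minimizes `objective`)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n_elite = max(2, int(elite_frac * n_samples))
    for _ in range(n_iters):
        # Monte Carlo step: draw candidates from the current search distribution.
        samples = rng.multivariate_normal(mean, cov, size=n_samples)
        scores = np.array([objective(x) for x in samples])
        # Selection: keep the lowest-scoring fraction (0/1 weights on samples).
        elite = samples[np.argsort(scores)[:n_elite]]
        # Refit: maximum-likelihood Gaussian fit to the selected samples.
        mean = elite.mean(axis=0)
        centered = elite - mean
        # Small jitter (an assumption) keeps the covariance positive definite.
        cov = centered.T @ centered / n_elite + 1e-8 * np.eye(mean.size)
    return mean, cov


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return float(np.sum(x ** 2))


# Start with a broad search distribution that covers the optimum.
best_mean, best_cov = gaussian_eda(sphere, mean=[3.0, -2.0], cov=25.0 * np.eye(2))
```

The initial covariance is deliberately wide: with plain maximum-likelihood refits, a distribution that starts too narrow can shrink before it reaches the optimum.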
Gaussian Process Prior Variational Autoencoders
Variational autoencoders (VAE) are a powerful and widely-used class of models
to learn complex data distributions in an unsupervised fashion. One important
limitation of VAEs is the prior assumption that latent sample representations
are independent and identically distributed. However, for many important
datasets, such as time-series of images, this assumption is too strong:
accounting for covariances between samples, such as those in time, can yield
a more appropriate model specification and improve performance in downstream
tasks. In this work, we introduce a new model, the Gaussian Process (GP) Prior
Variational Autoencoder (GPPVAE), to specifically address this issue. The
GPPVAE aims to combine the power of VAEs with the ability to model correlations
afforded by GP priors. To achieve efficient inference in this new class of
models, we leverage structure in the covariance matrix, and introduce a new
stochastic backpropagation strategy that allows for computing stochastic
gradients in a distributed and low-memory fashion. We show that our method
outperforms conditional VAEs (CVAEs) and an adaptation of standard VAEs in two
image data applications.
Comment: Accepted at the 32nd Conference on Neural Information Processing Systems
(NIPS 2018), Montréal, Canada
A Statistical Framework for Modeling HLA-Dependent T Cell Response Data
The identification of T cell epitopes and their HLA (human leukocyte antigen) restrictions is important for applications such as the design of cellular vaccines for HIV. Traditional methods for such identification are costly and time-consuming. Recently, a more expeditious laboratory technique using ELISpot assays has been developed that allows for rapid screening of specific responses. However, this assay does not directly provide information concerning the HLA restriction of a response, a critical piece of information for vaccine design. Thus, we introduce, apply, and validate a statistical model for identifying HLA-restricted epitopes from ELISpot data. By looking at patterns across a broad range of donors, in conjunction with our statistical model, we can determine (probabilistically) which of the HLA alleles are likely to be responsible for the observed reactivities. Additionally, we can provide a good estimate of the number of false positives generated by our analysis (i.e., the false discovery rate). This model allows us to learn about new HLA-restricted epitopes from ELISpot data in an efficient, cost-effective, and high-throughput manner. We applied our approach to data from donors infected with HIV and identified many potential new HLA restrictions. Among 134 such predictions, six were confirmed in the lab and the remainder could not be ruled as invalid. These results shed light on the extent of HLA class I promiscuity, which has significant implications for the understanding of HLA class I antigen presentation and vaccine development
The benefits of selecting phenotype-specific variants for applications of mixed models in genomics
Applications of linear mixed models (LMMs) to problems in genomics include phenotype prediction, correction for confounding in genome-wide association studies, estimation of narrow sense heritability, and testing sets of variants (e.g., rare variants) for association. In each of these applications, the LMM uses a genetic similarity matrix, which encodes the pairwise similarity between every two individuals in a cohort. Although ideally these similarities would be estimated using strictly variants relevant to the given phenotype, the identity of such variants is typically unknown. Consequently, relevant variants are excluded and irrelevant variants are included, both having deleterious effects. For each application of the LMM, we review known effects and describe new effects showing how variable selection can be used to mitigate them.
National Institute on Aging (Brain eQTL Study (dbGaP phs000249.v1.p1)
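
To make the genetic similarity matrix concrete, here is a minimal sketch of one common way to build it: standardize each variant across individuals and average the products over variants. This particular normalization, the function name, the optional `variant_idx` argument, and the simulated genotypes are assumptions for illustration; the abstract does not say which estimator or selection procedure the authors use.

```python
import numpy as np


def genetic_similarity_matrix(genotypes, variant_idx=None):
    """Pairwise similarity between individuals from an (n x m) genotype matrix.

    `variant_idx` (hypothetical) restricts the computation to a chosen subset
    of variants, e.g. ones believed to be relevant to the phenotype.
    """
    G = np.asarray(genotypes, dtype=float)
    if variant_idx is not None:
        G = G[:, variant_idx]
    # Drop variants with no variation; they cannot be standardized.
    G = G[:, G.std(axis=0) > 0]
    # Standardize each variant to zero mean and unit variance.
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    # Average the per-variant products to get an (n x n) similarity matrix.
    return Z @ Z.T / Z.shape[1]


# Simulated minor-allele counts (0, 1, or 2) for 5 individuals, 1000 variants.
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.3, size=(5, 1000))

K_all = genetic_similarity_matrix(genotypes)
K_subset = genetic_similarity_matrix(genotypes, variant_idx=np.arange(100))
```

Comparing `K_all` with `K_subset` shows how the similarity estimate changes with the set of variants included, which is the dependence the abstract is concerned with.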
A powerful and efficient set test for genetic markers that handles confounders
Approaches for testing sets of variants, such as a set of rare or common
variants within a gene or pathway, for association with complex traits are
important. In particular, set tests allow for aggregation of weak signal within
a set, can capture interplay among variants, and reduce the burden of multiple
hypothesis testing. Until now, these approaches did not address confounding by
family relatedness and population structure, a problem that is becoming more
important as larger data sets are used to increase power.
Results: We introduce a new approach for set tests that handles confounders.
Our model is based on the linear mixed model and uses two random effects, one to
capture the set association signal and one to capture confounders. We also
introduce a computational speedup for two-random-effects models that makes this
approach feasible even for extremely large cohorts. Using this model with both
the likelihood ratio test and score test, we find that the former yields more
power while controlling type I error. Application of our approach to richly
structured GAW14 data demonstrates that our method successfully corrects for
population structure and family relatedness, while application of our method to
a 15,000-individual Crohn's disease case-control cohort demonstrates that it
additionally recovers genes not recoverable by univariate analysis.
Availability: A Python-based library implementing our approach is available
at http://mscompbio.codeplex.com
Comment: * denotes equal contribution
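
The two-random-effects idea can be sketched as a Gaussian model whose covariance is a weighted sum of a confounder similarity matrix, a set-specific similarity matrix, and independent noise. The code below is a generic textbook form under stated assumptions (zero mean, no fixed effects, variance components supplied rather than fitted); it is not the library referenced above, and the function and argument names are hypothetical. It also does not include the computational speedup the abstract mentions.

```python
import numpy as np


def two_kernel_loglik(y, K_conf, K_set, var_conf, var_set, var_noise):
    """Log-likelihood of y ~ N(0, var_conf*K_conf + var_set*K_set + var_noise*I)."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    # Marginal covariance: confounders + set signal + independent noise.
    V = var_conf * K_conf + var_set * K_set + var_noise * np.eye(n)
    # slogdet returns (sign, log|V|); V is positive definite when var_noise > 0.
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)


# Tiny synthetic example: similarity matrices built from random features.
rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, 20))   # stand-in for genome-wide variants
B = rng.normal(size=(n, 3))    # stand-in for variants in the tested set
K_conf = A @ A.T / A.shape[1]
K_set = B @ B.T / B.shape[1]
y = rng.normal(size=n)

ll_with_set = two_kernel_loglik(y, K_conf, K_set, 0.5, 0.3, 1.0)
ll_without_set = two_kernel_loglik(y, K_conf, K_set, 0.5, 0.0, 1.0)
```

A likelihood ratio test would compare these two log-likelihoods after maximizing each over its variance components; the fixed values here are placeholders, so the difference above is not a valid test statistic.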
Statistical Resolution of Ambiguous HLA Typing Data
High-resolution HLA typing plays a central role in many areas of immunology, such as in identifying immunogenetic risk factors for disease, in studying how the genomes of pathogens evolve in response to immune selection pressures, and also in vaccine design, where identification of HLA-restricted epitopes may be used to guide the selection of vaccine immunogens. Perhaps one of the most immediate applications is in direct medical decisions concerning the matching of stem cell transplant donors to unrelated recipients. However, high-resolution HLA typing is frequently unavailable due to its high cost or the inability to re-type historical data. In this paper, we introduce and evaluate a method for statistical, in silico refinement of ambiguous and/or low-resolution HLA data. Our method, which requires an independent, high-resolution training data set drawn from the same population as the data to be refined, uses linkage disequilibrium in HLA haplotypes as well as four-digit allele frequency data to probabilistically refine HLA typings. Central to our approach is the use of haplotype inference. We introduce new methodology to this area, improving upon the Expectation-Maximization (EM)-based approaches currently used within the HLA community. Our improvements are achieved by using a parsimonious parameterization for haplotype distributions and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner. We also show how to augment our method in order to incorporate ethnicity information (as HLA allele distributions vary widely according to race/ethnicity as well as geographic area), and demonstrate the potential utility of this experimentally. A tool based on our approach is freely available for research purposes at http://microsoft.com/science
Co-Operative Additive Effects between HLA Alleles in Control of HIV-1
Background: HLA class I genotype is a major determinant of the outcome of HIV infection, and the impact of certain alleles on HIV disease outcome is well studied. Recent studies have demonstrated that certain HLA class I alleles that are in linkage disequilibrium, such as HLA-A*74 and HLA-B*57, appear to function co-operatively to result in greater immune control of HIV than mediated by either single allele alone. We here investigate the extent to which HLA alleles - irrespective of linkage disequilibrium - function co-operatively. Methodology/Principal Findings: We here refined a computational approach to the analysis of >2000 subjects infected with C-clade HIV first to discern the individual effect of each allele on disease control, and second to identify pairs of alleles that mediate ‘co-operative additive’ effects, either to improve disease suppression or to contribute to immunological failure. We identified six pairs of HLA class I alleles that have a co-operative additive effect in mediating HIV disease control and four hazardous pairs of alleles that, occurring together, are predictive of worse disease outcomes (q<0.05 in each case). We developed a novel ‘sharing score’ to quantify the breadth of CD8+ T cell responses made by pairs of HLA alleles across the HIV proteome, and used this to demonstrate that successful viraemic suppression correlates with breadth of unique CD8+ T cell responses (p = 0.03). Conclusions/Significance: These results identify co-operative effects between HLA Class I alleles in the control of HIV-1 in an extended Southern African cohort, and underline complementarity and breadth of the CD8+ T cell targeting as one potential mechanism for this effect