Mixture-of-Experts Variational Autoencoder for Clustering and Generating from Similarity-Based Representations on Single Cell Data
Clustering high-dimensional data, such as images or biological measurements,
is a long-standing problem and has been studied extensively. Recently, Deep
Clustering has gained popularity due to its flexibility in fitting the specific
peculiarities of complex data. Here we introduce the Mixture-of-Experts
Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering
model. The model can learn multi-modal distributions of high-dimensional data
and use these to generate realistic data with high efficacy and efficiency.
MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder
consists of a Mixture-of-Experts (MoE) architecture. This specific architecture
allows for various modes of the data to be automatically learned by means of
the experts. Additionally, we encourage the lower dimensional latent
representation of our model to follow a Gaussian mixture distribution and to
accurately represent the similarities between the data points. We assess the
performance of our model on the MNIST benchmark data set and challenging
real-world tasks of clustering mouse organs from single-cell RNA-sequencing
measurements and defining cell subpopulations from mass cytometry (CyTOF)
measurements on hundreds of different datasets. MoE-Sim-VAE exhibits superior
clustering performance on all these tasks in comparison to the baselines as well
as competitor methods.
Comment: Submitted to PLOS Computational Biology
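As a rough illustration of a Mixture-of-Experts decoder, the sketch below decodes a latent vector as a gate-weighted sum of expert outputs. Everything in it is an assumption made for illustration: the function names, the linear experts, the softmax gate, and the hand-picked weights are hypothetical and are not taken from the MoE-Sim-VAE implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def moe_decode(z, gate_w, gate_b, experts):
    """Decode latent z as a gate-weighted sum of expert outputs.

    `experts` is a list of (weights, bias) pairs; each expert here is a
    plain linear layer, a stand-in for a real decoder network.
    """
    gate = softmax(linear(gate_w, gate_b, z))
    outputs = [linear(w, b, z) for w, b in experts]
    dim = len(outputs[0])
    recon = [sum(g * out[d] for g, out in zip(gate, outputs))
             for d in range(dim)]
    return recon, gate


# Hypothetical toy parameters: 2-D latent, 2 experts, 3-D output.
gate_w = [[4.0, 0.0], [-4.0, 0.0]]
gate_b = [0.0, 0.0]
experts = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]], [1.0, 1.0, 1.0]),
]

recon, gate = moe_decode([2.0, 0.0], gate_w, gate_b, experts)
cluster = gate.index(max(gate))  # the dominant expert doubles as a cluster label
```

In this toy setup the gate sends latents with a positive first coordinate to expert 0 and the rest to expert 1, which is one way the experts can come to represent different modes of the data.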
Isotropic Gaussian Processes on Finite Spaces of Graphs
We propose a principled way to define Gaussian process priors on various sets
of unweighted graphs: directed or undirected, with or without loops. We endow
each of these sets with a geometric structure, inducing the notions of
closeness and symmetries, by turning them into a vertex set of an appropriate
metagraph. Building on this, we describe the class of priors that respect this
structure and are analogous to the Euclidean isotropic processes, like squared
exponential or Matérn. We propose an efficient computational technique for
the ostensibly intractable problem of evaluating these priors' kernels, making
such Gaussian processes usable within the usual toolboxes and downstream
applications. We go further to consider sets of equivalence classes of
unweighted graphs and define the appropriate versions of priors thereon. We
prove a hardness result, showing that in this case, exact kernel computation
cannot be performed efficiently. However, we propose a simple Monte Carlo
approximation for handling moderately sized cases. Inspired by applications in
chemistry, we illustrate the proposed techniques on a real molecular property
prediction task in the small data regime.
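To make the metagraph idea concrete, the sketch below assumes that two undirected, loop-free graphs on the same labeled node set are neighbors in the metagraph when they differ in exactly one edge. Under that assumption the metagraph is a hypercube over edge indicator bits, and the normalized heat (diffusion) kernel on a hypercube depends only on the Hamming distance between the two graphs: k(x, y) = tanh(t)^d(x, y). This is one simple isotropic kernel chosen for illustration; the function names are hypothetical, and the paper's own computational technique may differ.

```python
import math
from itertools import combinations


def edge_bits(n_nodes, edges):
    """Encode an undirected loop-free graph as 0/1 flags, one per node pair."""
    edge_set = {frozenset(e) for e in edges}
    return tuple(int(frozenset(pair) in edge_set)
                 for pair in combinations(range(n_nodes), 2))


def hamming(x, y):
    """Number of node pairs on which the two graphs disagree."""
    return sum(a != b for a, b in zip(x, y))


def heat_kernel(x, y, t=0.5):
    """Normalized heat kernel on the hypercube: tanh(t) ** Hamming distance."""
    return math.tanh(t) ** hamming(x, y)


triangle = edge_bits(3, [(0, 1), (1, 2), (0, 2)])
path = edge_bits(3, [(0, 1), (1, 2)])
empty = edge_bits(3, [])

k_near = heat_kernel(triangle, path)   # graphs one edge apart
k_far = heat_kernel(triangle, empty)   # graphs three edges apart
```

Because the kernel is a function of distance alone, it never requires enumerating all 2^(n(n-1)/2) graphs, which is what makes a closed form like this usable in standard Gaussian process toolboxes.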
DockGame: Cooperative Games for Multimeric Rigid Protein Docking
Protein interactions and assembly formation are fundamental to most
biological processes. Predicting the assembly structure from constituent
proteins -- referred to as the protein docking task -- is thus a crucial step
in protein design applications. Most traditional and deep learning methods for
docking have focused mainly on binary docking, following either a search-based,
regression-based, or generative modeling paradigm. In this paper, we focus on
the less-studied multimeric (i.e., two or more proteins) docking problem. We
introduce DockGame, a novel game-theoretic framework for docking -- we view
protein docking as a cooperative game between proteins, where the final
assembly structure(s) constitute stable equilibria w.r.t. the underlying game
potential. Since we do not have access to the true potential, we consider two
approaches - i) learning a surrogate game potential guided by physics-based
energy functions and computing equilibria by simultaneous gradient updates, and
ii) sampling from the Gibbs distribution of the true potential by learning a
diffusion generative model over the action spaces (rotations and translations)
of all proteins. Empirically, on the Docking Benchmark 5.5 (DB5.5) dataset,
DockGame has much faster runtimes than traditional docking methods, can
generate multiple plausible assembly structures, and achieves comparable
performance to existing binary docking baselines, despite solving the harder
task of coordinating multiple protein chains.
Comment: Under Review
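The first approach, computing equilibria by simultaneous gradient updates on a shared potential, can be sketched on a toy problem. The setup below is entirely hypothetical: each "player" is a 2-D point whose only action is a translation (no rotations), and the potential is a made-up pairwise distance penalty rather than a learned or physics-based energy.

```python
import math

TARGET = 1.0  # hypothetical preferred distance between any two players


def potential(points):
    """Toy game potential: squared deviation of pairwise distances from TARGET."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - TARGET) ** 2
    return total


def gradient(points, i):
    """Gradient of the potential with respect to player i's position."""
    gx = gy = 0.0
    xi, yi = points[i]
    for j, (xj, yj) in enumerate(points):
        if j == i:
            continue
        dist = max(math.dist(points[i], points[j]), 1e-12)
        scale = 2.0 * (dist - TARGET) / dist
        gx += scale * (xi - xj)
        gy += scale * (yi - yj)
    return gx, gy


def simultaneous_descent(points, lr=0.1, steps=500):
    """Every player updates at once, using gradients from the same state."""
    for _ in range(steps):
        grads = [gradient(points, i) for i in range(len(points))]
        points = [(x - lr * gx, y - lr * gy)
                  for (x, y), (gx, gy) in zip(points, grads)]
    return points


start = [(0.0, 0.0), (2.0, 0.0), (0.5, 1.5)]
final = simultaneous_descent(start)
```

Since all players share one potential, a point where no player can lower it by moving alone is a stable configuration of the game; here that is an arrangement with all pairwise distances close to `TARGET`.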
Learning Graph Models for Retrosynthesis Prediction
Retrosynthesis prediction is a fundamental problem in organic synthesis,
where the task is to identify precursor molecules that can be used to
synthesize a target molecule. A key consideration in building neural models for
this task is aligning model design with strategies adopted by chemists.
Building on this viewpoint, this paper introduces a graph-based approach that
capitalizes on the idea that the graph topology of precursor molecules is
largely unaltered during a chemical reaction. The model first predicts the set
of graph edits transforming the target into incomplete molecules called
synthons. Next, the model learns to expand synthons into complete molecules by
attaching relevant leaving groups. This decomposition simplifies the
architecture, making its predictions more interpretable, and also amenable to
manual correction. Our model achieves a top-1 accuracy of 53.7%,
outperforming previous template-free and semi-template-based methods.
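The two-stage decomposition (graph edits, then leaving-group attachment) can be illustrated on a small hand-built example. In the sketch below the molecule is a plain atom dictionary plus a set of bonds, bond orders and hydrogens are ignored, and both the edit and the leaving group are specified by hand; in the paper these choices are predicted by the model. All function names are hypothetical.

```python
def connected_components(atoms, bonds):
    """Return the connected atom sets of a molecular graph, ordered by smallest atom id."""
    neighbors = {a: set() for a in atoms}
    for u, v in bonds:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in sorted(atoms):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbors[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def break_bond(bonds, bond):
    """Graph edit: delete one bond, leaving all other bonds untouched."""
    return {b for b in bonds if set(b) != set(bond)}


def attach_leaving_group(atoms, bonds, anchor, symbol):
    """Complete a synthon by bonding a new single-atom group to `anchor`."""
    new_id = max(atoms) + 1
    new_atoms = dict(atoms)
    new_atoms[new_id] = symbol
    return new_atoms, bonds | {(anchor, new_id)}


# Target: N-methylacetamide heavy atoms, C(0)-C(1)(-O(2))-N(3)-C(4).
atoms = {0: "C", 1: "C", 2: "O", 3: "N", 4: "C"}
bonds = {(0, 1), (1, 2), (1, 3), (3, 4)}

# Stage 1: one graph edit (break the C-N bond) yields two synthons.
edited = break_bond(bonds, (1, 3))
synthons = connected_components(atoms, edited)

# Stage 2: attach Cl to the acyl synthon, giving acetyl chloride + methylamine.
reactant_atoms, reactant_bonds = attach_leaving_group(atoms, edited, 1, "Cl")
reactants = connected_components(reactant_atoms, reactant_bonds)
```

Only one bond changes between target and precursors here, which reflects the observation that graph topology is largely preserved across a reaction.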
Multi-Scale Representation Learning on Proteins
Proteins are fundamental biological entities mediating key roles in cellular function and disease. This paper introduces a multi-scale graph construction of a protein, HoloProt, connecting surface to structure and sequence. The surface captures coarser details of the protein, while sequence as primary component and structure (comprising secondary and tertiary components) capture finer details. Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level. We test the learned representation on two tasks: (i) ligand binding affinity (regression), and (ii) protein function prediction (classification). On the regression task, contrary to previous methods, our model performs consistently and reliably across different dataset splits, outperforming all baselines on most splits. On the classification task, it achieves a performance close to the top-performing model while using 10x fewer parameters. To improve the memory efficiency of our construction, we segment the multiplex protein surface manifold into molecular superpixels and substitute the surface with these superpixels at little to no performance loss.
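The bottom-up flow of information between levels can be sketched as follows. This is a deliberately simplified stand-in: the real encoder is a graph neural network, whereas the hypothetical `lift` function below only mean-pools the encodings of lower-level nodes (say, residues) into the upper-level nodes they are assigned to (say, surface patches) and concatenates the result with the upper-level features. The names, the pooling choice, and the numbers are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def lift(lower_encodings, assignment, upper_features):
    """Combine each upper-level node's features with pooled lower-level encodings.

    assignment[k] lists the lower-level node indices attached to upper node k.
    """
    lifted = []
    for node, feats in enumerate(upper_features):
        members = [lower_encodings[i] for i in assignment[node]]
        lifted.append(feats + mean_vector(members))
    return lifted


# Hypothetical toy data: 4 residue encodings, 2 surface patches.
residue_enc = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
patch_feats = [[0.5], [0.9]]
assignment = [[0, 1], [2, 3]]

patch_inputs = lift(residue_enc, assignment, patch_feats)
```

The lifted vectors would then serve as input node features for message passing on the upper-level graph.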
DockGame: Cooperative Games for Multimeric Rigid Protein Docking
Raw DB5.5 and DIPS datasets used in DockGame, with associated code at https://github.com/vsomnath/dockgame
These datasets can also be downloaded from https://zlab.umassmed.edu/benchmark (DB5.5) and https://www.dropbox.com/s/sqknqofy58nlosh/DIPS.zip?dl=0 (DIPS). Note that our organization of the files differs slightly from these downloaded versions.
We follow EquiDock (https://github.com/octavian-ganea/equidock_public) for the DB5 dataset splits.
Paper Abstract
Protein interactions and assembly formation are fundamental to most biological processes. Predicting the assembly structure from constituent proteins -- referred to as the protein docking task -- is thus a crucial step in protein design applications. Most traditional and deep learning methods for docking have focused mainly on binary docking, following either a search-based, regression-based, or generative modeling paradigm. In this paper, we focus on the less-studied multimeric (i.e., two or more proteins) docking problem. We introduce DockGame, a novel game-theoretic framework for docking -- we view protein docking as a cooperative game between proteins, where the final assembly structure(s) constitute stable equilibria w.r.t. the underlying game potential. Since we do not have access to the true potential, we consider two approaches - i) learning a surrogate game potential guided by physics-based energy functions and computing equilibria by simultaneous gradient updates, and ii) sampling from the Gibbs distribution of the true potential by learning a diffusion generative model over the action spaces (rotations and translations) of all proteins. Empirically, on the Docking Benchmark 5.5 (DB5.5) dataset, DockGame has much faster runtimes than traditional docking methods, can generate multiple plausible assembly structures, and achieves comparable performance to existing binary docking baselines, despite solving the harder task of coordinating multiple protein chains
Learning Graph Models for Retrosynthesis Prediction
Retrosynthesis prediction is a fundamental problem in organic synthesis, where the task is to identify precursor molecules that can be used to synthesize a target molecule. A key consideration in building neural models for this task is aligning model design with strategies adopted by chemists. Building on this viewpoint, this paper introduces a graph-based approach that capitalizes on the idea that the graph topology of precursor molecules is largely unaltered during a chemical reaction. The model first predicts the set of graph edits transforming the target into incomplete molecules called synthons. Next, the model learns to expand synthons into complete molecules by attaching relevant leaving groups. This decomposition simplifies the architecture, making its predictions more interpretable, and also amenable to manual correction. Our model achieves a top-1 accuracy of 53.7%, outperforming previous template-free and semi-template-based methods
BaCaDI: Bayesian Causal Discovery with Unknown Interventions
Inferring causal structures from experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, the targets of the interventions are often uncertain or unknown and the number of observations limited. As a result, standard causal discovery methods can no longer be reliably used. To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering and reasoning about the causal structure that underlies data generated under various unknown experimental or interventional conditions. BaCaDI is fully differentiable, which allows us to infer the complex joint posterior over the intervention targets and the causal structure via efficient gradient-based variational inference. In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets.
ISSN: 2640-349
BaCaDI: Bayesian Causal Discovery with Unknown Interventions
Learning causal structures from observation and experimentation is a central task in many domains. For example, in biology, recent advances allow us to obtain single-cell expression data under multiple interventions such as drugs or gene knockouts. However, a key challenge is that often the targets of the interventions are uncertain or unknown. Thus, standard causal discovery methods can no longer be used.
To fill this gap, we propose a Bayesian framework (BaCaDI) for discovering the causal structure that underlies data generated under various unknown experimental/interventional conditions.
BaCaDI is fully differentiable and operates in the continuous space of latent probabilistic representations of both causal structures and interventions. This enables us to approximate complex posteriors via gradient-based variational inference and to reason about the epistemic uncertainty in the predicted structure.
In experiments on synthetic causal discovery tasks and simulated gene-expression data, BaCaDI outperforms related methods in identifying causal structures and intervention targets. Finally, we demonstrate that, thanks to its rigorous Bayesian approach, our method provides well-calibrated uncertainty estimates.
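Gradient-based inference over causal graphs generally needs a differentiable measure of how far a soft adjacency matrix is from a directed acyclic graph. The abstract does not say which measure BaCaDI uses, so the sketch below shows a generic trace-of-powers score of the kind found in continuous structure-learning methods: for a nonnegative d x d matrix A, the sum of tr(A^k) for k = 1..d is zero exactly when A has no directed cycles. The function names and example matrices are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cycle_score(adj):
    """Sum of tr(A^k) for k = 1..d; zero iff nonnegative A has no directed cycle."""
    n = len(adj)
    power = [row[:] for row in adj]
    total = 0.0
    for _ in range(n):
        total += sum(power[i][i] for i in range(n))
        power = matmul(power, adj)
    return total


# Hard graphs: a chain 0 -> 1 -> 2 and a cycle 0 -> 1 -> 2 -> 0.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

# Soft graph: entries are edge probabilities, with a weak back edge 2 -> 0.
soft = [[0.0, 0.9, 0.0], [0.0, 0.0, 0.9], [0.1, 0.0, 0.0]]

scores = (cycle_score(chain), cycle_score(cycle), cycle_score(soft))
```

The score shrinks smoothly as the probability of the back edge goes to zero, so it can be used as a penalty or prior term that steers gradient updates toward acyclic structures.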