Gibbs Max-margin Topic Models with Data Augmentation
Max-margin learning is a powerful approach to building classifiers and
structured output predictors. Recent work on max-margin supervised topic models
has successfully integrated it with Bayesian topic models to discover
discriminative latent semantic structures and make accurate predictions for
unseen testing data. However, the resulting learning problems are usually hard
to solve because of the non-smoothness of the margin loss. Existing approaches
to building max-margin supervised topic models rely on an iterative procedure
to solve multiple latent SVM subproblems with additional mean-field assumptions
on the desired posterior distributions. This paper presents an alternative
approach by defining a new max-margin loss. Namely, we present Gibbs max-margin
supervised topic models, a family of latent-variable Gibbs classifiers that
discover hidden
topic representations for various tasks, including classification, regression
and multi-task learning. Gibbs max-margin supervised topic models minimize an
expected margin loss, which is an upper bound of the existing margin loss
derived from an expected prediction rule. By introducing augmented variables
and integrating out the Dirichlet variables analytically by conjugacy, we
develop simple Gibbs sampling algorithms with no restricting assumptions and no
need to solve SVM subproblems. Furthermore, each step of the
"augment-and-collapse" Gibbs sampling algorithms has an analytical conditional
distribution, from which samples can be easily drawn. Experimental results
demonstrate significant improvements on time efficiency. The classification
performance is also significantly improved over competitors on binary,
multi-class and multi-label classification tasks.
Comment: 35 pages
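The topic-model machinery aside, the augmentation trick at the heart of the approach can be sketched for a plain Bayesian linear SVM: one augmented scale variable per example turns the non-smooth hinge loss into a Gaussian mixture, so every conditional is analytical and no SVM subproblem is solved. This is a minimal illustration in the style of Polson-Scott data augmentation, assuming a standard-normal prior on the weights; the variable names are ours, not the paper's.

```python
import numpy as np

def gibbs_svm(X, y, n_iter=400, seed=0):
    """Gibbs sampler for a Bayesian linear SVM via data augmentation.

    Each hinge-loss factor exp(-2 * max(0, 1 - y_i * x_i @ w)) is written
    as a scale mixture of Gaussians with one augmented variable lam_i, so
    both conditionals (w | lam) and (lam | w) are analytical.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    samples = []
    for it in range(n_iter):
        # 1/lam_i | w  ~  InverseGaussian(1 / |1 - y_i x_i @ w|, 1)
        margin = 1.0 - y * (X @ w)
        lam = 1.0 / rng.wald(1.0 / np.maximum(np.abs(margin), 1e-8), 1.0)
        # w | lam  ~  N(mu, V): Gaussian thanks to the mixture representation
        V = np.linalg.inv(np.eye(d) + (X.T * (1.0 / lam)) @ X)
        mu = V @ (X.T @ (y * (1.0 + lam) / lam))
        w = rng.multivariate_normal(mu, V)
        if it >= n_iter // 2:            # average post-burn-in draws
            samples.append(w)
    return np.mean(samples, axis=0)
```

In the paper's full model the same alternation runs jointly with topic assignments, with the Dirichlet variables collapsed out by conjugacy; the sketch above shows only the classifier half.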
Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction
Data-driven predictive methods which can efficiently and accurately transform
protein sequences into biologically active structures are highly valuable for
scientific research and medical development. Determining accurate folding
landscape using co-evolutionary information is fundamental to the success of
modern protein structure prediction methods. As the state of the art,
AlphaFold2 has dramatically raised the accuracy without performing explicit
co-evolutionary analysis. Nevertheless, its performance still shows strong
dependence on available sequence homologs. After investigating the cause of
this dependence, we present EvoGen, a meta generative model that remedies the
underperformance of AlphaFold2 on targets with poor MSAs. By prompting the
model with calibrated or virtually generated homologue sequences, EvoGen helps
AlphaFold2 fold accurately in the low-data regime and even achieve
encouraging performance with single-sequence predictions. Making accurate
predictions from few-shot MSAs not only generalizes AlphaFold2 better to
orphan sequences but also democratizes its use for high-throughput
applications. Moreover, EvoGen combined with AlphaFold2 yields a probabilistic
structure generation method which could explore alternative conformations of
protein sequences, and the task-aware differentiable algorithm for sequence
generation will benefit other related tasks including protein design.
Comment: version 2.0; 28 pages, 6 figures
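The abstract does not expose EvoGen's interface, so the following is only a schematic of the few-shot prompting loop it describes: a toy random-mutation generator stands in for the learned homolog sampler, and `fold_fn` stands in for an AlphaFold2-style predictor. All names here are hypothetical.

```python
import random

def mutate(seq, rate, rng):
    """Toy stand-in for a learned homolog generator: random point mutations."""
    aas = "ACDEFGHIKLMNPQRSTVWY"
    return "".join(rng.choice(aas) if rng.random() < rate else a for a in seq)

def prompt_with_virtual_msa(query, fold_fn, n_homologs=8, rate=0.1, seed=0):
    """Build a virtual MSA around the query and hand it to a folding model.

    fold_fn(msa) -> (structure, confidence) is a placeholder for an
    AlphaFold2-style predictor; the mutation-based generator merely stands
    in for EvoGen's learned, calibrated homolog sampler.
    """
    rng = random.Random(seed)
    msa = [query] + [mutate(query, rate, rng) for _ in range(n_homologs)]
    structure, confidence = fold_fn(msa)
    return msa, structure, confidence
```

Sampling several virtual MSAs and keeping the highest-confidence prediction is one way such a loop could explore alternative conformations, as the abstract suggests.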
Towards Data-centric Graph Machine Learning: Review and Outlook
Data-centric AI, with its primary focus on the collection, management, and
utilization of data to drive AI models and applications, has attracted
increasing attention in recent years. In this article, we conduct an in-depth
and comprehensive review, offering a forward-looking outlook on the current
efforts in data-centric AI pertaining to graph data, the fundamental data
structure for representing and capturing intricate dependencies among massive
and diverse real-life entities. We introduce a systematic framework,
Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of
the graph data lifecycle, including graph data collection, exploration,
improvement, exploitation, and maintenance. A thorough taxonomy of each stage
is presented to answer three critical graph-centric questions: (1) how to
enhance graph data availability and quality; (2) how to learn from graph data
of limited availability and low quality; (3) how to build graph MLOps systems
from the graph data-centric view. Lastly, we pinpoint the future prospects of
the DC-GML domain, providing insights to navigate its advancements and
applications.
Comment: 42 pages, 9 figures
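As a concrete, if toy, instance of the "graph data improvement" stage of the lifecycle, a quality audit over an adjacency-list graph might flag isolated nodes, missing labels, and dangling edges before any model sees the data. This is a sketch with names of our choosing, not an API from the article.

```python
def audit_graph(adj, labels):
    """Toy graph-data quality audit, run before training.

    adj:    dict mapping node -> list of neighbour nodes.
    labels: dict mapping node -> label, or None where the label is missing.
    Returns a dict of issues found, keyed by issue type.
    """
    issues = {"isolated": [], "unlabeled": [], "dangling_edges": []}
    for node, nbrs in adj.items():
        if not nbrs:
            issues["isolated"].append(node)          # no edges at all
        if labels.get(node) is None:
            issues["unlabeled"].append(node)         # missing supervision
        for n in nbrs:
            if n not in adj:                         # edge leaves the node set
                issues["dangling_edges"].append((node, n))
    return issues
```

Each issue type maps to a different remedy in a data-centric pipeline: isolated nodes may need edge augmentation, unlabeled nodes may need annotation or semi-supervised learning, and dangling edges indicate collection errors.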
Retrosynthesis prediction enhanced by in-silico reaction data augmentation
Recent advances in machine learning (ML) have expedited retrosynthesis
research by helping chemists design experiments more efficiently. However,
all ML-based methods consume substantial amounts of paired training data (i.e.,
chemical reaction: product-reactant(s) pair), which is costly to obtain.
Moreover, companies view reaction data as a valuable asset and restrict
researchers' access to it. These issues prevent the creation of more
powerful retrosynthesis models due to their data-driven nature. As a response,
we exploit easy-to-access unpaired data (i.e., one component of
product-reactant(s) pair) for generating in-silico paired data to facilitate
model training. Specifically, we present RetroWISE, a self-boosting framework
that employs a base model inferred from real paired data to perform in-silico
reaction generation and augmentation using unpaired data, ultimately leading to
a superior model. On three benchmark datasets, RetroWISE achieves the best
overall performance against state-of-the-art models (e.g., +8.6% top-1 accuracy
on the USPTO-50K test dataset). Moreover, it consistently improves the
prediction accuracy on rare transformations. These results show that RetroWISE
overcomes the training-data bottleneck with in-silico reactions, thereby paving
the way toward more effective ML-based retrosynthesis models.
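Abstracting away the chemistry, the self-boosting idea can be sketched as a round-trip-filtered augmentation step: a base model proposes reactants for unpaired products, and only proposals that regenerate the product survive into the augmented training set. This is a toy skeleton of our own devising, not RetroWISE's actual training code.

```python
def self_boost(train_pairs, unpaired_products, fit, predict, forward):
    """Generic self-boosting skeleton in the spirit of RetroWISE.

    fit(pairs) -> model            train a retro model on paired data
    predict(model, product)        propose reactants for a product
    forward(reactants) -> product  forward check used for round-trip filtering
    Returns (boosted_model, in_silico_pairs).
    """
    base = fit(train_pairs)
    in_silico = []
    for product in unpaired_products:
        reactants = predict(base, product)
        if forward(reactants) == product:       # keep only consistent pairs
            in_silico.append((product, reactants))
    # retrain on real + in-silico pairs
    return fit(train_pairs + in_silico), in_silico
```

In a usage sketch, `fit` could build a lookup table, `predict` a retro proposer, and `forward` a forward-reaction model; the filter then discards any proposal that fails to regenerate its product.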
Protein-Ligand Scoring with Convolutional Neural Networks
Computational approaches to drug discovery can reduce the time and cost
associated with experimental assays and enable the screening of novel
chemotypes. Structure-based drug design methods rely on scoring functions to
rank and predict binding affinities and poses. The ever-expanding amount of
protein-ligand binding and structural data enables the use of deep machine
learning techniques for protein-ligand scoring.
We describe convolutional neural network (CNN) scoring functions that take as
input a comprehensive 3D representation of a protein-ligand interaction. A CNN
scoring function automatically learns the key features of protein-ligand
interactions that correlate with binding. We train and optimize our CNN scoring
functions to discriminate between correct and incorrect binding poses and known
binders and non-binders. We find that our CNN scoring function outperforms the
AutoDock Vina scoring function when ranking poses both for pose prediction and
virtual screening.
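At its core, such a scoring function convolves a voxelized 3D grid of the protein-ligand complex. The following is a deliberately minimal single-channel sketch (one filter, ReLU, global average pooling, no training loop), not the authors' multi-channel trained network.

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid-mode, single-channel 3D convolution: the building block of a
    voxel-based CNN scoring function."""
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    out = np.empty((d - kd + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+kd, j:j+kh, k:k+kw] * kernel)
    return out

def score_pose(grid, kernel):
    """Toy scoring head: conv layer, ReLU, then global average as the score."""
    feat = np.maximum(conv3d(grid, kernel), 0.0)   # conv + ReLU
    return float(feat.mean())                       # global pooling -> scalar
```

A real voxel grid would carry one channel per atom type for both protein and ligand, and the learned filters would pick out complementary interaction patterns rather than this fixed kernel.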
Generative Augmented Flow Networks
The Generative Flow Network (GFlowNet) is a probabilistic framework where an agent
learns a stochastic policy for object generation, such that the probability of
generating an object is proportional to a given reward function. Its
effectiveness has been shown in discovering high-quality and diverse solutions,
compared to reward-maximizing reinforcement learning-based methods.
Nonetheless, GFlowNets only learn from rewards of the terminal states, which
can limit their applicability. Indeed, intermediate rewards play a critical
role in learning: intrinsic motivation, for example, can provide intermediate
feedback even in particularly challenging sparse-reward tasks. Inspired by
this, we propose Generative Augmented Flow Networks (GAFlowNets), a novel
learning framework to incorporate intermediate rewards into GFlowNets. We
specify intermediate rewards by intrinsic motivation to tackle the exploration
problem in sparse reward environments. GAFlowNets can leverage edge-based and
state-based intrinsic rewards in a joint way to improve exploration. Based on
extensive experiments on the GridWorld task, we demonstrate the effectiveness
and efficiency of GAFlowNet in terms of convergence, performance, and diversity
of solutions. We further show that GAFlowNet is scalable to a more complex and
large-scale molecule generation domain, where it achieves consistent and
significant performance improvements.
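On a DAG small enough to enumerate, the flow-matching condition that intermediate rewards augment, inflow(s) = R(s) + outflow(s), can be solved exactly by working from the leaves up under a uniform backward policy. This is a didactic sketch of the augmented-flow balance, not the paper's neural parameterization or its intrinsic-reward estimator.

```python
def _topo_order(children):
    """Return states of the DAG in a root-first topological order."""
    order, seen = [], set()
    def visit(s):
        if s in seen:
            return
        seen.add(s)
        for c in children.get(s, []):
            visit(c)
        order.append(s)
    for s in children:
        visit(s)
    return order[::-1]

def exact_flows(children, reward):
    """Exact edge flows on a DAG with intermediate rewards folded into
    every state, under a uniform backward policy.

    children: dict state -> list of child states (DAG with a single root).
    reward:   dict state -> nonnegative reward (terminal or intermediate).
    Returns (state_flow, edge_flow) satisfying, at every non-root state s:
        inflow(s) = reward(s) + outflow(s)
    """
    parents = {}
    for s, cs in children.items():
        for c in cs:
            parents.setdefault(c, []).append(s)
    state_flow, edge_flow = {}, {}
    for s in reversed(_topo_order(children)):      # leaves first
        out = 0.0
        for c in children.get(s, []):
            # uniform backward policy splits a child's flow among its parents
            edge_flow[(s, c)] = state_flow[c] / len(parents[c])
            out += edge_flow[(s, c)]
        state_flow[s] = reward.get(s, 0.0) + out
    return state_flow, edge_flow
```

Sampling forward with probabilities proportional to the outgoing edge flows then realizes the balance condition; in GAFlowNets the rewards at intermediate states would come from intrinsic motivation rather than being fixed by hand as here.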