Greedy MAXCUT Algorithms and their Information Content
MAXCUT is a classical NP-hard graph partitioning problem and serves as a
typical instance of symmetric non-monotone Unconstrained Submodular
Maximization (USM). Applications of MAXCUT are abundant in machine
learning, computer vision and statistical physics. Greedy algorithms to
approximately solve MAXCUT rely on greedy vertex labelling or on an edge
contraction strategy. These algorithms have been studied by measuring their
worst-case approximation ratios, but little is known about their robustness to
noise contamination of the input data in the average case. Adapting the
framework of Approximation Set Coding, we present a
method to exactly measure the cardinality of the algorithmic approximation sets
of five greedy MAXCUT algorithms. Their information contents are explored for
graph instances generated by two different noise models: the edge reversal
model and Gaussian edge weights model. The results provide insights into the
robustness of different greedy heuristics and techniques for MAXCUT, which can
be used for algorithm design of general USM problems.
Comment: This is a longer version of the paper published in the 2015 IEEE
Information Theory Workshop (ITW).
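The greedy vertex-labelling strategy mentioned above is easy to sketch: process the vertices in some order and place each one on whichever side of the partition cuts the most edge weight so far. A minimal illustration (the tuple-keyed edge-weight encoding is an assumption, not the paper's instrumented implementation):

```python
def greedy_maxcut(n, weights):
    """Greedy vertex labelling for MAXCUT.

    n: number of vertices, labelled 0..n-1.
    weights: dict mapping edges (u, v) with u < v to edge weights.
    Returns the side assignment and the resulting cut value.
    """
    side = {}
    for v in range(n):
        # gain[s] = weight cut if v is placed on side s, given earlier choices.
        gain = [0.0, 0.0]
        for u in range(v):
            w = weights.get((u, v), 0.0)
            gain[1 - side[u]] += w  # edge (u, v) is cut iff the sides differ
        side[v] = 0 if gain[0] >= gain[1] else 1
    cut = sum(w for (u, v), w in weights.items() if side[u] != side[v])
    return side, cut
```

On a unit-weight triangle this heuristic cuts 2 of the 3 edges, which is optimal; on a unit-weight 4-cycle it recovers the full cut of 4.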
From Sets to Multisets: Provable Variational Inference for Probabilistic Integer Submodular Models
Submodular functions have been studied extensively in machine learning and
data mining. In particular, the optimization of submodular functions over the
integer lattice (integer submodular functions) has recently attracted much
interest, because this domain relates naturally to many practical problem
settings, such as multilabel graph cut, budget allocation and revenue
maximization with discrete assignments. In contrast, the use of these functions
for probabilistic modeling has received surprisingly little attention so far.
In this work, we first propose the Generalized Multilinear Extension, a
continuous DR-submodular extension for integer submodular functions. We study
central properties of this extension and formulate a new probabilistic model
which is defined through integer submodular functions. Then, we introduce a
block-coordinate ascent algorithm to perform approximate inference for this
class of models. Finally, we demonstrate its effectiveness and viability on
several real-world social connection graph datasets with integer submodular
objectives.
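Budget allocation, one of the settings listed above, makes the integer-lattice domain concrete: placing x_i units of budget on channel i reaches customer j with probability 1 - (1 - p[i][j])**x_i, and the expected number of reached customers is DR-submodular in x. The sketch below maximizes that objective with a plain unit-by-unit greedy rather than the paper's variational inference; the problem encoding is illustrative:

```python
def influence(x, p):
    """Expected reached customers: sum_j (1 - prod_i (1 - p[i][j]) ** x[i])."""
    total = 0.0
    for j in range(len(p[0])):
        miss = 1.0  # probability that customer j is reached by no channel
        for i, units in enumerate(x):
            miss *= (1.0 - p[i][j]) ** units
        total += 1.0 - miss
    return total

def greedy_allocation(p, budget):
    """Spend `budget` indivisible units one at a time, each going to the
    channel with the largest marginal gain in influence."""
    x = [0] * len(p)
    for _ in range(budget):
        base = influence(x, p)
        gains = []
        for i in range(len(p)):
            x[i] += 1
            gains.append(influence(x, p) - base)
            x[i] -= 1
        x[max(range(len(p)), key=gains.__getitem__)] += 1
    return x
```

With two channels that each reach a different customer with probability 0.5 per unit, a budget of 2 is split evenly, since the second unit on the same channel has a smaller marginal gain than the first unit on the other.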
Activity Cliff Prediction: Dataset and Benchmark
Activity cliffs (ACs), which are generally defined as pairs of structurally
similar molecules that are active against the same bio-target but significantly
different in binding potency, are of great importance to drug discovery. To
date, the AC prediction problem, i.e., predicting whether a pair of molecules
exhibits the AC relationship, has not yet been fully explored. In this
paper, we first introduce ACNet, a large-scale dataset for AC prediction. ACNet
curates over 400K Matched Molecular Pairs (MMPs) against 190 targets, including
over 20K MMP-cliffs and 380K non-AC MMPs, and provides five subsets for model
development and evaluation. Then, we propose a baseline framework to benchmark
the predictive performance of molecular representations encoded by deep neural
networks for AC prediction, and 16 models are evaluated in experiments. Our
experimental results show that deep learning models can achieve good
performance when trained on tasks with an adequate amount of data, while the
imbalanced, low-data and out-of-distribution characteristics of the ACNet
dataset remain challenging for deep neural networks. In
addition, the traditional ECFP method shows a natural advantage on MMP-cliff
prediction, and outperforms other deep learning models on most of the data
subsets. To the best of our knowledge, our work constructs the first
large-scale dataset for AC prediction, which may stimulate the study of AC
prediction models and prompt further breakthroughs in AI-aided drug discovery.
The code and dataset can be accessed at https://drugai.github.io/ACNet/.
SyNDock: N Rigid Protein Docking via Learnable Group Synchronization
The regulation of various cellular processes heavily relies on the protein
complexes within a living cell, necessitating a comprehensive understanding of
their three-dimensional structures to elucidate the underlying mechanisms.
While neural docking techniques have exhibited promising outcomes in binary
protein docking, the application of advanced neural architectures to multimeric
protein docking remains uncertain. This study introduces SyNDock, an automated
framework that swiftly assembles precise multimeric complexes within seconds,
showcasing performance that can potentially surpass or be on par with recent
advanced approaches. SyNDock possesses several appealing advantages not present
in previous approaches. Firstly, SyNDock formulates multimeric protein docking
as a problem of learning global transformations to holistically depict the
placement of chain units of a complex, enabling a learning-centric solution.
Secondly, SyNDock proposes a trainable two-step SE(3) algorithm, involving
initial pairwise transformation and confidence estimation, followed by global
transformation synchronization. This enables effective learning for assembling
the complex in a globally consistent manner. Lastly, extensive experiments
conducted on our proposed benchmark dataset demonstrate that SyNDock
outperforms existing docking software in crucial performance metrics, including
accuracy and runtime. For instance, it achieves a 4.5% improvement in
performance and a remarkable millionfold acceleration in speed.
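Transformation synchronization, the core of SyNDock's second step, can be illustrated in its classical spectral form: stack the pairwise relative rotations into one block matrix, take its top three eigenvectors, and read off globally consistent absolute rotations (up to one shared global rotation). This is a noiseless, non-learnable sketch of the underlying group-synchronization idea, not SyNDock's trainable SE(3) module:

```python
import numpy as np

def synchronize_rotations(rel):
    """Classical spectral rotation synchronization.

    rel[i][j] holds the 3x3 relative rotation R_i @ R_j.T (rel[i][i] = I),
    so the block matrix below is symmetric. Returns absolute rotations
    consistent with every pair, up to one shared global rotation.
    """
    n = len(rel)
    # M = R_stack @ R_stack.T, where R_stack (3n x 3) stacks the unknowns.
    M = np.block([[rel[i][j] for j in range(n)] for i in range(n)])
    _, vecs = np.linalg.eigh(M)
    V = vecs[:, -3:]  # eigenvectors of the top eigenvalue span R_stack
    blocks = []
    for i in range(n):
        # Project each 3x3 slice onto the orthogonal group (polar factor).
        u, _, vt = np.linalg.svd(V[3 * i:3 * i + 3, :])
        blocks.append(u @ vt)
    if np.linalg.det(blocks[0]) < 0:
        # The eigenbasis is defined only up to an orthogonal factor, which may
        # be a reflection; one shared right-multiplication repairs all blocks
        # without changing any relative rotation.
        flip = np.diag([1.0, 1.0, -1.0])
        blocks = [b @ flip for b in blocks]
    return blocks

def random_rotation(rng):
    """A random proper rotation via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q
```

In the noiseless case the recovered rotations reproduce every pairwise relative rotation exactly; SyNDock replaces this fixed spectral step with learned pairwise estimates and confidence-weighted synchronization over full SE(3) transformations.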
Understanding and Improving Feature Learning for Out-of-Distribution Generalization
A common explanation for the failure of out-of-distribution (OOD)
generalization is that the model trained with empirical risk minimization (ERM)
learns spurious features instead of invariant features. However, several recent
studies challenged this explanation and found that deep networks may have
already learned sufficiently good features for OOD generalization. Despite the
contradictions at first glance, we theoretically show that ERM essentially
learns both spurious and invariant features, while ERM tends to learn spurious
features faster if the spurious correlation is stronger. Moreover, when the
ERM-learned features are fed to the OOD objectives, the quality of invariant
feature learning significantly affects the final OOD performance, as OOD objectives
rarely learn new features. Therefore, ERM feature learning can be a bottleneck
to OOD generalization. To alleviate this reliance, we propose Feature Augmented
Training (FeAT), which forces the model to learn richer features ready for OOD
generalization. FeAT iteratively augments the model to learn new features while
retaining the already learned features. In each round, the retention and
augmentation operations are performed on different subsets of the training data
that capture distinct features. Extensive experiments show that FeAT
effectively learns richer features thus boosting the performance of various OOD
objectives.
Comment: Yongqiang Chen, Wei Huang, and Kaiwen Zhou contributed equally;
NeurIPS 2023, 55 pages, 64 figures.
Rethinking and Simplifying Bootstrapped Graph Latents
Graph contrastive learning (GCL) has emerged as a representative paradigm in
graph self-supervised learning, where negative samples are commonly regarded as
the key to preventing model collapse and producing distinguishable
representations. Recent studies have shown that GCL without negative samples
can achieve state-of-the-art performance as well as scalability improvement,
with bootstrapped graph latent (BGRL) as a prominent step forward. However,
BGRL relies on a complex architecture to maintain the ability to scatter
representations, and the underlying mechanisms enabling its success remain
largely unexplored. In this paper, we introduce an instance-level decorrelation
perspective to tackle the aforementioned issue and leverage it as a springboard
to reveal the potential unnecessary model complexity within BGRL. Based on our
findings, we present SGCL, a simple yet effective GCL framework that utilizes
the outputs from two consecutive iterations as positive pairs, eliminating the
negative samples. SGCL only requires a single graph augmentation and a single
graph encoder without additional parameters. Extensive experiments conducted on
various graph benchmarks demonstrate that SGCL can achieve competitive
performance with fewer parameters, lower time and space costs, and significant
convergence speedup.
Comment: Accepted by WSDM 202
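The instance-level decorrelation view above can be made concrete with a toy objective: row-normalize the embeddings produced by two consecutive training iterations (the positive pairs), form the n x n cosine-similarity matrix between instances, and push it toward the identity. This is an illustrative loss in the spirit of that perspective, not the exact SGCL objective; `lam` is an assumed trade-off weight:

```python
import numpy as np

def decorrelation_loss(z1, z2, lam=0.01):
    """Instance-level decorrelation between two embedding matrices.

    z1, z2: (n, d) embeddings of the same n nodes from two consecutive
    iterations. Rows are L2-normalized and the n x n cosine-similarity
    matrix is driven toward the identity: each instance stays aligned with
    its own previous embedding while distinct instances decorrelate.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    c = z1 @ z2.T
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)     # alignment term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # scattering term
    return on_diag + lam * off_diag
```

When the two iterations agree and the instances are already mutually orthogonal the loss is zero, which is the degenerate-collapse-free fixed point such objectives aim for.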