Predicting Genetic Regulatory Response Using Classification
We present a novel classification-based method for learning to predict gene
regulatory response. Our approach is motivated by the hypothesis that in simple
organisms such as Saccharomyces cerevisiae, we can learn a decision rule for
predicting whether a gene is up- or down-regulated in a particular experiment
based on (1) the presence of binding site subsequences ("motifs") in the
gene's regulatory region and (2) the expression levels of regulators such as
transcription factors in the experiment ("parents"). Thus our learning task
integrates two qualitatively different data sources: genome-wide cDNA
microarray data across multiple perturbation and mutant experiments along with
motif profile data from regulatory sequences. We convert the regression task of
predicting real-valued gene expression measurement to a classification task of
predicting +1 and -1 labels, corresponding to up- and down-regulation beyond
the levels of biological and measurement noise in microarray measurements. The
learning algorithm employed is boosting with a margin-based generalization of
decision trees, alternating decision trees. This large-margin classifier is
sufficiently flexible to allow complex logical functions, yet sufficiently
simple to give insight into the combinatorial mechanisms of gene regulation. We
observe encouraging prediction accuracy on experiments based on the Gasch S.
cerevisiae dataset, and we show that we can accurately predict up- and
down-regulation on held-out experiments. Our method thus provides predictive
hypotheses, suggests biological experiments, and provides interpretable insight
into the structure of genetic regulatory networks.
Comment: 8 pages, 4 figures, presented at the Twelfth International Conference on
Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website:
http://www.cs.columbia.edu/compbio/geneclas
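The conversion from real-valued expression measurements to +1 and -1 labels described in the abstract above can be illustrated with a minimal sketch. The names `discretize` and `build_examples`, the symmetric `noise_threshold`, and the choice to drop within-noise measurements are assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def discretize(value: float, noise_threshold: float) -> int:
    """Map a log expression ratio to +1 (up), -1 (down), or 0 (within noise)."""
    if value > noise_threshold:
        return 1
    if value < -noise_threshold:
        return -1
    return 0


def build_examples(
    expression: Dict[str, Dict[str, float]],  # gene -> experiment -> log ratio
    motifs: Dict[str, Set[str]],              # gene -> motifs in its regulatory region
    parents: Dict[str, Dict[str, float]],     # experiment -> regulator -> log ratio
    noise_threshold: float,                   # hypothetical noise cutoff
) -> List[Tuple[Dict[str, int], int]]:
    """Build one (features, label) pair per gene-experiment measurement."""
    examples = []
    for gene, row in expression.items():
        for experiment, value in row.items():
            label = discretize(value, noise_threshold)
            if label == 0:
                continue  # assumption: skip measurements inside the noise band
            features = {f"motif:{m}": 1 for m in motifs.get(gene, set())}
            for regulator, level in parents.get(experiment, {}).items():
                # Regulator states are discretized the same way as targets.
                features[f"parent:{regulator}"] = discretize(level, noise_threshold)
            examples.append((features, label))
    return examples
```

Each resulting pair combines the two data sources named in the abstract: motif presence for the gene and regulator state for the experiment. Any classifier that accepts sparse features could then be trained on these pairs.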
A Flexible and Adaptive Framework for Abstention Under Class Imbalance
In practical applications of machine learning, it is often desirable to
identify and abstain on examples where the model's predictions are likely to be
incorrect. Much of the prior work on this topic focused on out-of-distribution
detection or performance metrics such as top-k accuracy. Comparatively little
attention was given to metrics such as area-under-the-curve or Cohen's Kappa,
which are extremely relevant for imbalanced datasets. Abstention strategies
aimed at top-k accuracy can produce poor results on these metrics when applied
to imbalanced datasets, even when all examples are in-distribution. We propose
a framework to address this gap. Our framework leverages the insight that
calibrated probability estimates can be used as a proxy for the true class
labels, thereby allowing us to estimate the change in an arbitrary metric if an
example were abstained on. Using this framework, we derive computationally
efficient metric-specific abstention algorithms for optimizing the sensitivity
at a target specificity level, the area under the ROC, and the weighted Cohen's
Kappa. Because our method relies only on calibrated probability estimates, we
further show that by leveraging recent work on domain adaptation under label
shift, we can generalize to test-set distributions that may have a different
class imbalance compared to the training set distribution. On various
experiments involving medical imaging, natural language processing, computer
vision and genomics, we demonstrate the effectiveness of our approach. Source
code available at https://github.com/blindauth/abstention. Colab notebooks
reproducing results available at
https://github.com/blindauth/abstention_experiments
Comparative analysis of metazoan chromatin organization
Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal ‘arms’, and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.
National Science Foundation (U.S.) (1122374
Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design
The efficient exploration of chemical space to design molecules with intended
properties enables the accelerated discovery of drugs, materials, and
catalysts, and is one of the most important outstanding challenges in
chemistry. Encouraged by the recent surge in computer power and artificial
intelligence development, many algorithms have been developed to tackle this
problem. However, despite the emergence of many new approaches in recent years,
comparatively little progress has been made in developing realistic benchmarks
that reflect the complexity of molecular design for real-world applications. In
this work, we develop a set of practical benchmark tasks relying on physical
simulation of molecular systems mimicking real-life molecular design problems
for materials, drugs, and chemical reactions. Additionally, we demonstrate the
utility and ease of use of our new benchmark set by comparing the performance
of several well-established families of algorithms.
Surprisingly, we find that model performance can strongly depend on the
benchmark domain. We believe that our benchmark suite will help move the field
towards more realistic molecular design benchmarks, and move the development of
inverse molecular design algorithms closer to designing molecules that solve
existing problems in both academia and industry.
Comment: 29+21 pages, 6+19 figures, 6+2 table