Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network
Naturally occurring networks exhibit quantitative features revealing
underlying growth mechanisms. Numerous network mechanisms have recently been
proposed to reproduce specific properties such as degree distributions or
clustering coefficients. We present a method for inferring the mechanism most
accurately capturing a given network topology, exploiting discriminative tools
from machine learning. The Drosophila melanogaster protein network is
confidently and robustly (to noise and training data subsampling) classified as
a duplication-mutation-complementation network over preferential attachment,
small-world, and other duplication-mutation mechanisms. Systematic
classification, rather than statistical study of specific properties, provides
a discriminative approach to understanding the design of complex networks.
Comment: 19 pages, 5 figures
Predicting Genetic Regulatory Response Using Classification
We present a novel classification-based method for learning to predict gene
regulatory response. Our approach is motivated by the hypothesis that in simple
organisms such as Saccharomyces cerevisiae, we can learn a decision rule for
predicting whether a gene is up- or down-regulated in a particular experiment
based on (1) the presence of binding site subsequences (``motifs'') in the
gene's regulatory region and (2) the expression levels of regulators such as
transcription factors in the experiment (``parents''). Thus our learning task
integrates two qualitatively different data sources: genome-wide cDNA
microarray data across multiple perturbation and mutant experiments along with
motif profile data from regulatory sequences. We convert the regression task of
predicting real-valued gene expression measurement to a classification task of
predicting +1 and -1 labels, corresponding to up- and down-regulation beyond
the levels of biological and measurement noise in microarray measurements. The
learning algorithm employed is boosting with a margin-based generalization of
decision trees, alternating decision trees. This large-margin classifier is
sufficiently flexible to allow complex logical functions, yet sufficiently
simple to give insight into the combinatorial mechanisms of gene regulation. We
observe encouraging prediction accuracy on experiments based on the Gasch S.
cerevisiae dataset, and we show that we can accurately predict up- and
down-regulation on held-out experiments. Our method thus provides predictive
hypotheses, suggests biological experiments, and provides interpretable insight
into the structure of genetic regulatory networks.
Comment: 8 pages, 4 figures, presented at the Twelfth International Conference on
Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website:
http://www.cs.columbia.edu/compbio/geneclas
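The conversion from regression to classification described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the threshold value, the gene names, and the choice to drop measurements that fall within the noise band are all assumptions made for the example.

```python
def discretize_expression(log_ratio, noise_threshold=1.0):
    """Map a real-valued log expression ratio to +1, -1, or None.

    noise_threshold is a hypothetical cutoff standing in for the level of
    biological and measurement noise; the abstract does not give a value.
    """
    if log_ratio > noise_threshold:
        return 1       # up-regulated beyond noise
    if log_ratio < -noise_threshold:
        return -1      # down-regulated beyond noise
    return None        # within noise: no label assigned (an assumption)


# Hypothetical measurements for three genes in one experiment.
measurements = {"geneA": 2.3, "geneB": -1.7, "geneC": 0.2}
labels = {gene: discretize_expression(v) for gene, v in measurements.items()}

# Keep only the examples that received a +1 or -1 label.
training_labels = {gene: y for gene, y in labels.items() if y is not None}
print(training_labels)
```

Under these assumptions only geneA and geneB would enter the training set, with labels +1 and -1 respectively.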
Information-theoretic approach to network modularity
Exploiting recent developments in information theory, we propose, illustrate, and validate a principled information-theoretic algorithm for module discovery and the resulting measure of network modularity. This measure is an order parameter (a dimensionless number between 0 and 1). Comparison is made with other approaches to module discovery and to quantifying network modularity (using Monte Carlo generated Erdös-like modular networks). Finally, the network information bottleneck (NIB) algorithm is applied to a number of real world networks, including the “social” network of coauthors at the 2004 APS March Meeting
Systematic identification of statistically significant network measures
We present a graph embedding space (i.e., a set of measures on graphs) for performing statistical analyses of networks. Key improvements over existing approaches include discovery of “motif hubs” (multiple overlapping significant subgraphs), computational efficiency relative to subgraph census, and flexibility (the method is easily generalizable to weighted and signed graphs). The embedding space is based on scalars, functionals of the adjacency matrix representing the network. Scalars are global, involving all nodes; although they can be related to subgraph enumeration, there is not a one-to-one mapping between scalars and subgraphs. Improvements in network randomization and significance testing—we learn the distribution rather than assuming Gaussianity—are also presented. The resulting algorithm establishes a systematic approach to the identification of the most significant scalars and suggests machine-learning techniques for network classification
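As a rough illustration of what a scalar functional of the adjacency matrix can look like, the sketch below computes three global quantities for a small undirected graph. The specific functionals chosen here (entry sum and traces of matrix powers) and the example graph are assumptions for illustration; the abstract does not list the scalars the method actually uses.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))


def sum_all(a):
    """Sum of all entries."""
    return sum(sum(row) for row in a)


# Hypothetical graph: a triangle (nodes 0, 1, 2) plus a pendant node 3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
A2 = matmul(A, A)
A3 = matmul(A2, A)

scalars = {
    "sum(A)": sum_all(A),       # twice the number of edges
    "trace(A^2)": trace(A2),    # sum of node degrees
    "trace(A^3)": trace(A3),    # six times the number of triangles
}
print(scalars)
```

Each value involves every node of the graph, and trace(A^3) shows how a scalar can relate to a subgraph count (triangles) without being an explicit enumeration of subgraphs.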
Discriminative Topological Features Reveal Biological Network Mechanisms
Recent genomic and bioinformatic advances have motivated the development of
numerous random network models purporting to describe graphs of biological,
technological, and sociological origin. The success of a model has been
evaluated by how well it reproduces a few key features of the real-world data,
such as degree distributions, mean geodesic lengths, and clustering
coefficients. Often pairs of models can reproduce these features with
indistinguishable fidelity despite being generated by vastly different
mechanisms. In such cases, these few target features are insufficient to
distinguish which of the different models best describes real world networks of
interest; moreover, it is not clear a priori that any of the presently-existing
algorithms for network generation offers a predictive description of the
networks inspiring them. To derive discriminative classifiers, we construct a
mapping from the set of all graphs to a high-dimensional (in principle
infinite-dimensional) ``word space.'' This map defines an input space for
classification schemes which allow us for the first time to state unambiguously
which models are most descriptive of the networks they purport to describe. Our
training sets include networks generated from 17 models either drawn from the
literature or introduced in this work, source code for which is freely
available. We anticipate that this new approach to network analysis will be of
broad impact to a number of communities.
Comment: supplemental website:
http://www.columbia.edu/itc/applied/wiggins/netclass
A classification-based framework for predicting and analyzing gene regulatory response
BACKGROUND: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the AdaBoost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.
METHODS: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.
RESULTS: Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from
Measurement of the cosmic ray spectrum above eV using inclined events detected with the Pierre Auger Observatory
A measurement of the cosmic-ray spectrum for energies exceeding
eV is presented, which is based on the analysis of showers
with zenith angles greater than detected with the Pierre Auger
Observatory between 1 January 2004 and 31 December 2013. The measured spectrum
confirms a flux suppression at the highest energies. Above
eV, the "ankle", the flux can be described by a power law with
index followed by
a smooth suppression region. For the energy () at which the
spectral flux has fallen to one-half of its extrapolated value in the absence
of suppression, we find
eV.
Comment: Replaced with published version. Added journal reference and DOI
Energy Estimation of Cosmic Rays with the Engineering Radio Array of the Pierre Auger Observatory
The Auger Engineering Radio Array (AERA) is part of the Pierre Auger
Observatory and is used to detect the radio emission of cosmic-ray air showers.
These observations are compared to the data of the surface detector stations of
the Observatory, which provide well-calibrated information on the cosmic-ray
energies and arrival directions. The response of the radio stations in the 30
to 80 MHz regime has been thoroughly calibrated to enable the reconstruction of
the incoming electric field. For the latter, the energy deposit per area is
determined from the radio pulses at each observer position and is interpolated
using a two-dimensional function that takes into account signal asymmetries due
to interference between the geomagnetic and charge-excess emission components.
The spatial integral over the signal distribution gives a direct measurement of
the energy transferred from the primary cosmic ray into radio emission in the
AERA frequency range. We measure 15.8 MeV of radiation energy for a 1 EeV air
shower arriving perpendicularly to the geomagnetic field. This radiation energy
-- corrected for geometrical effects -- is used as a cosmic-ray energy
estimator. Performing an absolute energy calibration against the
surface-detector information, we observe that this radio-energy estimator
scales quadratically with the cosmic-ray energy as expected for coherent
emission. We find an energy resolution of the radio reconstruction of 22% for
the data set and 17% for a high-quality subset containing only events with at
least five radio stations with signal.
Comment: Replaced with published version. Added journal reference and DOI
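The quadratic scaling stated in this abstract can be turned into a small worked example. The sketch below assumes an exactly quadratic relation anchored at the quoted 15.8 MeV for a 1 EeV shower arriving perpendicularly to the geomagnetic field; it ignores the geometrical corrections the abstract mentions, and the function names are hypothetical.

```python
import math

E_RAD_REF_MEV = 15.8  # radiation energy quoted in the abstract for a 1 EeV shower
E_CR_REF_EEV = 1.0    # reference cosmic-ray energy in EeV


def radiation_energy_mev(e_cr_eev):
    """Radiation energy in MeV, assuming exact quadratic scaling."""
    return E_RAD_REF_MEV * (e_cr_eev / E_CR_REF_EEV) ** 2


def cosmic_ray_energy_eev(e_rad_mev):
    """Invert the quadratic relation to estimate the cosmic-ray energy in EeV."""
    return E_CR_REF_EEV * math.sqrt(e_rad_mev / E_RAD_REF_MEV)


# Doubling the cosmic-ray energy quadruples the radiation energy.
print(radiation_energy_mev(2.0))
print(cosmic_ray_energy_eev(radiation_energy_mev(2.0)))
```

Because the radiation energy grows as the square of the cosmic-ray energy, a relative uncertainty on the measured radiation energy translates into roughly half that relative uncertainty on the estimated cosmic-ray energy.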