384 research outputs found
Objective Classification of Galaxy Spectra using the Information Bottleneck Method
A new method for classification of galaxy spectra is presented, based on a
recently introduced information theoretical principle, the `Information
Bottleneck'. For any desired number of classes, galaxies are classified such
that the information content about the spectra is maximally preserved. The
result is classes of galaxies with similar spectra, where the similarity is
determined via a measure of information. We apply our method to approximately
6000 galaxy spectra from the ongoing 2dF redshift survey, and a mock-2dF
catalogue produced by a Cold Dark Matter-based semi-analytic model of galaxy
formation. We find a good match between the mean spectra of the classes found
in the data and in the models. For the mock catalogue, we find that the classes
produced by our algorithm form an intuitively sensible sequence in terms of
physical properties such as colour, star formation activity, morphology, and
internal velocity dispersion. We also show the correlation of the classes with
the projections resulting from a Principal Component Analysis.
Comment: submitted to MNRAS, 17 pages, LaTeX, with 14 figures embedde
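To illustrate what "preserving information about the spectra" can mean, the sketch below measures how much mutual information a hard grouping of objects retains. It is a minimal illustration, not the authors' algorithm: the toy joint distribution, the two candidate groupings, and the helper names `mutual_information` and `merge_rows` are all hypothetical.

```python
from math import log2

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as rows of p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * log2(p / (px[i] * py[j]))
    return mi

def merge_rows(joint, assignment, n_classes):
    """Sum the rows of p(x, y) that share a class label, giving p(c, y)."""
    merged = [[0.0] * len(joint[0]) for _ in range(n_classes)]
    for row, c in zip(joint, assignment):
        for j, p in enumerate(row):
            merged[c][j] += p
    return merged

# Hypothetical toy data: 4 "galaxies" (rows) x 3 "wavelength bins" (columns),
# normalised so that all entries sum to 1.
joint = [
    [0.20, 0.04, 0.01],
    [0.18, 0.05, 0.02],
    [0.02, 0.05, 0.18],
    [0.01, 0.04, 0.20],
]
good = merge_rows(joint, [0, 0, 1, 1], 2)  # groups rows with similar shapes
bad = merge_rows(joint, [0, 1, 0, 1], 2)   # groups rows with dissimilar shapes

full = mutual_information(joint)
frac_good = mutual_information(good) / full  # fraction of information retained
frac_bad = mutual_information(bad) / full
```

Under these assumptions, the grouping that merges similar rows retains a larger fraction of the original mutual information than the one that merges dissimilar rows. That retained fraction is the kind of quantity an information-preserving classification seeks to keep high for a given number of classes.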
Information based clustering
In an age of increasingly large data sets, investigators in many different
disciplines have turned to clustering as a tool for data analysis and
exploration. Existing clustering methods, however, typically depend on several
nontrivial assumptions about the structure of data. Here we reformulate the
clustering problem from an information theoretic perspective which avoids many
of these assumptions. In particular, our formulation obviates the need for
defining a cluster "prototype", does not require an a priori similarity metric,
is invariant to changes in the representation of the data, and naturally
captures non-linear relations. We apply this approach to different domains and
find that it consistently produces clusters that are more coherent than those
extracted by existing algorithms. Finally, our approach provides a way of
clustering based on collective notions of similarity rather than the
traditional pairwise measures.
Comment: To appear in Proceedings of the National Academy of Sciences USA, 11
pages, 9 figure
Propagation of charged particle waves in a uniform magnetic field
This paper considers the probability density and current distributions
generated by a point-like, isotropic source of monoenergetic charges embedded
into a uniform magnetic field environment. Electron sources of this kind have
been realized in recent photodetachment microscopy experiments. Unlike the
total photocurrent cross section, which is largely understood, the spatial
profiles of charge and current emitted by the source display an unexpected
hierarchy of complex patterns, even though the distributions, apart from
scaling, depend only on a single physical parameter. We examine the electron
dynamics both by solving the quantum problem, i.e., finding the energy Green
function, and from a semiclassical perspective based on the simple cyclotron
orbits followed by the electron. Simulations suggest that the semiclassical
method, which involves here interference between an infinite set of paths,
faithfully reproduces the features observed in the quantum solution, even in
extreme circumstances, and lends itself to an interpretation of some (though
not all) of the rich structure exhibited in this simple problem.
Comment: 39 pages, 16 figure
Motif Discovery through Predictive Modeling of Gene Regulation
We present MEDUSA, an integrative method for learning motif models of
transcription factor binding sites by incorporating promoter sequence and gene
expression data. We use a modern large-margin machine learning approach, based
on boosting, to enable feature selection from the high-dimensional search space
of candidate binding sequences while avoiding overfitting. At each iteration of
the algorithm, MEDUSA builds a motif model whose presence in the promoter
region of a gene, coupled with activity of a regulator in an experiment, is
predictive of differential expression. In this way, we learn motifs that are
functional and predictive of regulatory response rather than motifs that are
simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model
of the transcriptional control logic that can predict the expression of any
gene in the organism, given the sequence of the promoter region of the target
gene and the expression state of a set of known or putative transcription
factors and signaling molecules. Each motif model is either a k-length
sequence, a dimer, or a PSSM that is built by agglomerative probabilistic
clustering of sequences with similar boosting loss. By applying MEDUSA to a set
of environmental stress response expression data in yeast, we learn motifs
whose ability to predict differential expression of target genes outperforms
motifs from the TRANSFAC dataset and from a previously published candidate set
of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed
binding sites associated with environmental stress response from the
literature.
Comment: RECOMB 200
Psoriasis prediction from genome-wide SNP profiles
Background: With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of single nucleotide polymorphisms (SNPs) for disease susceptibility prediction is a challenging task. This study aimed to predict psoriasis from SNPs selected by searching GWAS data.
Methods: In total, we had 2,798 samples and 451,724 SNPs. The search for a set of SNPs that predicts susceptibility to psoriasis consisted of two steps. The first was to find the top 1,000 SNPs with high accuracy for predicting psoriasis in the GWAS dataset. The second was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance.
Results: The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674 (95% CI: 0.650-0.698), while only 0.520 (95% CI: 0.472-0.524) was reported for predicting disease by LDA. Our results indicate that the new classifier sIB performs better than LDA in the study.
Conclusions: The fact that a small set of SNPs can predict disease status with an average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
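For readers unfamiliar with the metric in the Results, the harmonic mean of sensitivity and specificity can be computed from confusion-matrix counts as sketched below. The function name and the counts are hypothetical and are not taken from the study.

```python
def harmonic_mean_sens_spec(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of cases correctly identified
    specificity = tn / (tn + fp)  # fraction of controls correctly identified
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# Hypothetical counts, for illustration only.
balanced = harmonic_mean_sens_spec(tp=80, fn=20, tn=60, fp=40)

# A classifier that labels every sample as a case has perfect sensitivity
# but zero specificity, so the harmonic mean drops to zero.
degenerate = harmonic_mean_sens_spec(tp=100, fn=0, tn=0, fp=100)
```

Unlike plain accuracy, this score is high only when both sensitivity and specificity are high, which makes it a reasonable summary when cases and controls are unevenly represented.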
Discrete profile comparison using information bottleneck
Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck approach (IB). We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to the consensus sequence's 78%. Distant homolog search with IB sequences is 88% as sensitive as with profiles compared to 61% with consensus sequences (AUC scores 0.73, 0.83, and 0.51, respectively), but like simple sequence comparison, is 30 times faster. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible
Systems biology via redescription and ontologies (I): finding phase changes with applications to malaria temporal data
Biological systems are complex and often composed of many subtly interacting components. Furthermore, such systems evolve through time and, as the underlying biology executes its genetic program, the relationships between components change and undergo dynamic reorganization. Characterizing these relationships precisely is a challenging task, but one that must be undertaken if we are to understand these systems in sufficient detail. One set of tools that may prove useful are the formal principles of model building and checking, which could allow the biologist to frame these inherently temporal questions in a sufficiently rigorous framework. In response to these challenges, GOALIE (Gene ontology algorithmic logic and information extractor) was developed and has been successfully employed in the analysis of high throughput biological data (e.g. time-course gene-expression microarray data and neural spike train recordings). The method has applications to a wide variety of temporal data, indeed any data for which there exist ontological descriptions. This paper describes the algorithms behind GOALIE and its use in the study of the Intraerythrocytic Developmental Cycle (IDC) of Plasmodium falciparum, the parasite responsible for a deadly form of chloroquine resistant malaria. We focus in particular on the problem of finding phase changes, times of reorganization of transcriptional control
Ballistic matter waves with angular momentum: Exact solutions and applications
An alternative description of quantum scattering processes rests on
inhomogeneous terms amended to the Schroedinger equation. We detail the
structure of sources that give rise to multipole scattering waves of definite
angular momentum, and introduce pointlike multipole sources as their limiting
case. Partial wave theory is recovered for freely propagating particles. We
obtain novel results for ballistic scattering in an external uniform force
field, where we provide analytical solutions for both the scattering waves and
the integrated particle flux. Our theory directly applies to p-wave
photodetachment in an electric field. Furthermore, illustrating the effects of
extended sources, we predict some properties of vortex-bearing atom laser beams
outcoupled from a rotating Bose-Einstein condensate under the influence of
gravity.
Comment: 42 pages, 8 figures, extended version including photodetachment and
semiclassical theor
Paradigm of tunable clustering using binarization of consensus partition matrices (Bi-CoPaM) for gene discovery
Copyright © 2013 Abu-Jamous et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities of a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and being not assigned to any cluster are possible. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies.
National Institute for Health Researc
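The aggregate-then-binarize idea described above can be sketched as follows. This is a simplified illustration rather than the published method: it assumes the clusters from different runs have already been aligned, and it uses a single value threshold as a stand-in for the paper's tunable binarization techniques. All names and data are hypothetical.

```python
def consensus_partition_matrix(partitions):
    """Average binary partition matrices (clusters x genes) into fuzzy memberships."""
    n_runs = len(partitions)
    n_clusters = len(partitions[0])
    n_genes = len(partitions[0][0])
    return [
        [sum(p[k][g] for p in partitions) / n_runs for g in range(n_genes)]
        for k in range(n_clusters)
    ]

def binarize(copam, threshold):
    """Assign gene g to cluster k when its consensus membership reaches the threshold."""
    return [[1 if m >= threshold else 0 for m in row] for row in copam]

# Hypothetical output of three clustering runs: 2 clusters x 4 genes,
# with cluster labels assumed to be aligned across runs.
runs = [
    [[1, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 0, 0, 0], [0, 1, 1, 1]],
    [[1, 1, 0, 1], [0, 0, 1, 0]],
]
copam = consensus_partition_matrix(runs)

tight = binarize(copam, threshold=1.0)  # keep only unanimous assignments
wide = binarize(copam, threshold=0.3)   # keep any assignment seen in at least one run
```

With the strict threshold, the second and fourth genes are left unassigned. With the loose threshold, those two genes belong to both clusters. Tuning the threshold is one simple way to move between tight and wide clusters in the sense described above.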
The structure of the PapD-PapGII pilin complex reveals an open and flexible P5 pocket
P pili are hairlike polymeric structures that mediate binding of uropathogenic Escherichia coli to the surface of the kidney via the PapG adhesin at their tips. PapG is composed of two domains: a lectin domain at the tip of the pilus followed by a pilin domain that comprises the initial polymerizing subunit of the 1,000-plus-subunit heteropolymeric pilus fiber. Prior to assembly, periplasmic pilin domains bind to a chaperone, PapD. PapD mediates donor strand complementation, in which a beta strand of PapD temporarily completes the pilin domain's fold, preventing premature, nonproductive interactions with other pilin subunits and facilitating subunit folding. Chaperone-subunit complexes are delivered to the outer membrane usher where donor strand exchange (DSE) replaces PapD's donated beta strand with an amino-terminal extension on the next incoming pilin subunit. This occurs via a zip-in-zip-out mechanism that initiates at a relatively accessible hydrophobic space termed the P5 pocket on the terminally incorporated pilus subunit. Here, we solve the structure of PapD in complex with the pilin domain of isoform II of PapG (PapGIIp). Our data revealed that PapGIIp adopts an immunoglobulin fold with a missing seventh strand, complemented in parallel by the G1 PapD strand, typical of pilin subunits. Comparisons with other chaperone-pilin complexes indicated that the interactive surfaces are highly conserved. Interestingly, the PapGIIp P5 pocket was in an open conformation, which, as molecular dynamics simulations revealed, switches between an open and a closed conformation due to the flexibility of the surrounding loops. Our study reveals the structural details of the DSE mechanism