PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions
As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species _Drosophila_ genome alignments exceeds that of all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE.
Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation
In the traditional object recognition pipeline, descriptors are densely
sampled over an image, pooled into a high dimensional non-linear representation
and then passed to a classifier. In recent years, Fisher Vectors have proven
empirically to be the leading representation for a large variety of
applications. The Fisher Vector is typically taken as the gradients of the
log-likelihood of descriptors, with respect to the parameters of a Gaussian
Mixture Model (GMM). Motivated by the assumption that different distributions
should be applied for different datasets, we present two other Mixture Models
and derive their Expectation-Maximization and Fisher Vector expressions. The
first is a Laplacian Mixture Model (LMM), which is based on the Laplacian
distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian
Mixture Model (HGLMM) which is based on a weighted geometric mean of the
Gaussian and Laplacian distribution. An interesting property of the
Expectation-Maximization algorithm for the latter is that in the maximization
step, each dimension in each component is chosen to be either a Gaussian or a
Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we
achieve state-of-the-art results for both the image annotation and the image
search by a sentence tasks.
Comment: new version includes text synthesis by an RNN and experiments with
the COCO benchmark
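The abstract above describes the Fisher Vector as the gradient of the descriptors' log-likelihood with respect to the parameters of a Gaussian Mixture Model. A minimal sketch of that idea follows, under simplifying assumptions that are not from the paper: descriptors are one-dimensional, only the gradient with respect to the component means is computed, and the Fisher-information normalization usually applied in practice is omitted. All function and variable names are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(xs, weights, mus, sigmas):
    """Log-likelihood of the descriptors xs under a 1-D GMM."""
    total = 0.0
    for x in xs:
        mixture = sum(
            w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)
        )
        total += math.log(mixture)
    return total


def grad_wrt_means(xs, weights, mus, sigmas):
    """Gradient of the log-likelihood with respect to each component mean.

    For component k this is sum_x gamma_k(x) * (x - mu_k) / sigma_k**2,
    where gamma_k(x) is the posterior responsibility of component k for x.
    This is the raw (unnormalized) gradient, an assumption of this sketch.
    """
    grads = [0.0] * len(mus)
    for x in xs:
        joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        evidence = sum(joint)
        for k, (m, s) in enumerate(zip(mus, sigmas)):
            gamma = joint[k] / evidence  # posterior responsibility
            grads[k] += gamma * (x - m) / (s * s)
    return grads


if __name__ == "__main__":
    descriptors = [0.1, -0.4, 2.3, 1.9]
    print(grad_wrt_means(descriptors, [0.4, 0.6], [0.0, 2.0], [1.0, 0.5]))
```

The resulting list has one entry per mixture component; a full Fisher Vector would also include gradients with respect to the weights and variances. The Laplacian and hybrid mixtures mentioned in the abstract would need their own density and gradient expressions, which this sketch does not attempt.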
Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems
Most current methods for identifying coherent structures in
spatially-extended systems rely on prior information about the form which those
structures take. Here we present two new approaches to automatically filter the
changing configurations of spatial dynamical systems and extract coherent
structures. One, local sensitivity filtering, is a modification of the local
Lyapunov exponent approach suitable to cellular automata and other discrete
spatial systems. The other, local statistical complexity filtering, calculates
the amount of information needed for optimal prediction of the system's
behavior in the vicinity of a given point. By examining the changing
spatiotemporal distributions of these quantities, we can find the coherent
structures in a variety of pattern-forming cellular automata, without needing
to guess or postulate the form of that structure. We apply both filters to
elementary and cyclical cellular automata (ECA and CCA) and find that they
readily identify particles, domains and other more complicated structures. We
compare the results from ECA with earlier ones based upon the theory of formal
languages, and the results from CCA with a more traditional approach based on
an order parameter and free energy. While sensitivity and statistical
complexity are equally adept at uncovering structure, they are based on
different system properties (dynamical and probabilistic, respectively), and
provide complementary information.
Comment: 16 pages, 21 figures. Figures considerably compressed to fit arxiv
requirements; write first author for higher-resolution version
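The sensitivity idea in the abstract above can be illustrated on an elementary cellular automaton. The sketch below is a simplified stand-in, not the authors' definition of local sensitivity: it flips one cell at a time, evolves the original and perturbed configurations for a fixed number of steps on a ring, and counts how many cells end up different. The function names, the periodic boundary, and the choice of counting differences are all assumptions made for illustration.

```python
def eca_step(cells, rule):
    """Apply one step of an elementary cellular automaton on a ring.

    cells is a list of 0/1 values; rule is the Wolfram rule number (0-255).
    """
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[(i - 1) % n]
        center = cells[i]
        right = cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right
        out.append((rule >> index) & 1)
    return out


def evolve(cells, rule, steps):
    """Evolve a configuration for the given number of steps."""
    for _ in range(steps):
        cells = eca_step(cells, rule)
    return cells


def local_sensitivity(cells, rule, steps):
    """For each cell, count the cells that differ after `steps` steps
    when that single cell is flipped in the initial configuration."""
    baseline = evolve(cells, rule, steps)
    scores = []
    for i in range(len(cells)):
        perturbed = list(cells)
        perturbed[i] ^= 1  # flip one cell
        future = evolve(perturbed, rule, steps)
        scores.append(sum(a != b for a, b in zip(baseline, future)))
    return scores


if __name__ == "__main__":
    start = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
    print(local_sensitivity(start, rule=110, steps=3))
```

Plotting such scores over space and time is one way to see where perturbations spread and where they die out. A faithful reproduction of the filters in the paper would follow its own definitions.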
Multi-view Learning as a Nonparametric Nonlinear Inter-Battery Factor Analysis
Factor analysis aims to determine latent factors, or traits, which summarize
a given data set. Inter-battery factor analysis extends this notion to multiple
views of the data. In this paper we show how a nonlinear, nonparametric version
of these models can be recovered through the Gaussian process latent variable
model. This gives us a flexible formalism for multi-view learning where the
latent variables can be used both for exploratory purposes and for learning
representations that enable efficient inference for ambiguous estimation tasks.
Learning is performed in a Bayesian manner through the formulation of a
variational compression scheme which gives a rigorous lower bound on the log
likelihood. Our Bayesian framework provides strong regularization during
training, allowing the structure of the latent space to be determined
efficiently and automatically. We demonstrate this by producing the first (to
our knowledge) published results of learning from dozens of views, even when
data is scarce. We further show experimental results on several different types
of multi-view data sets and for different kinds of tasks, including exploratory
data analysis, generation, ambiguity modelling through latent priors and
classification.
Comment: 49 pages including appendi