30,392 research outputs found
Natural similarity measures between position frequency matrices with an application to clustering
Motivation: Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires the comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if they occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances of the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the number of PFM hits incorporating both strands. Furthermore, we introduce a second measure based on the same idea to cluster a set of the Jaspar PFMs. Results: We show that the asymptotic covariance can be efficiently computed by a two dimensional convolution of the score distributions. The asymptotic covariance approach shows strong correlation with simulated data. It outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class. Furthermore, a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons. Availability: A website to compute the similarity and to perform clustering, the source code and Supplementary Material are available at http://mosta.molgen.mpg.d
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
Computational Approaches to Measuring the Similarity of Short Contexts : A Review of Applications and Methods
Measuring the similarity of short written contexts is a fundamental problem
in Natural Language Processing. This article provides a unifying framework by
which short context problems can be categorized both by their intended
application and proposed solution. The goal is to show that various problems
and methodologies that appear quite different on the surface are in fact very
closely related. The axes by which these categorizations are made include the
format of the contexts (headed versus headless), the way in which the contexts
are to be measured (first-order versus second-order similarity), and the
information used to represent the features in the contexts (micro versus macro
views). The unifying thread that binds together many short context applications
and methods is the fact that similarity decisions must be made between contexts
that share few (if any) words in common.Comment: 23 page
Text Mining Infrastructure in R
During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.
MEMOFinder: combining _de_ _novo_ motif prediction methods with a database of known motifs
*Background:* Methods for finding overrepresented sequence motifs are useful in several key areas of computational biology. They aim at detecting very weak signals responsible for biological processes requiring robust sequence identification like transcription-factor binding to DNA or docking sites in proteins. Currently, general performance of the model-based motif-finding methods is unsatisfactory; however, different methods are successful in different cases. This leads to the practical problem of combining results of different motif-finding tools, taking into account current knowledge collected in motif databases.
*Results:* We propose a new complete service allowing researchers to submit their sequences for analysis by four different motif-finding methods for clustering and comparison with a reference motif database. It is tailored for regulatory motif detection, however it allows for substantial amount of configuration regarding sequence background, motif database and parameters for motif-finding methods.
*Availability:* The method is available online as a webserver at: http://bioputer.mimuw.edu.pl/software/mmf/. In addition, the source code is released on a GNU General Public License
Distinguishing Word Senses in Untagged Text
This paper describes an experimental comparison of three unsupervised
learning algorithms that distinguish the sense of an ambiguous word in untagged
text. The methods described in this paper, McQuitty's similarity analysis,
Ward's minimum-variance method, and the EM algorithm, assign each instance of
an ambiguous word to a known sense definition based solely on the values of
automatically identifiable features in text. These methods and feature sets are
found to be more successful in disambiguating nouns rather than adjectives or
verbs. Overall, the most accurate of these procedures is McQuitty's similarity
analysis in combination with a high dimensional feature set.Comment: 11 pages, latex, uses aclap.st
- …