192 research outputs found
Helmholtzian Eigenmap: Topological feature discovery & edge flow learning from point cloud data
The manifold Helmholtzian (1-Laplacian) operator elegantly
generalizes the Laplace-Beltrami operator to vector fields on a manifold.
In this work, we propose the estimation of the manifold
Helmholtzian from point cloud data by a weighted 1-Laplacian. While higher order Laplacians have been introduced and studied, this work
is the first to present a graph Helmholtzian constructed from a simplicial
complex as an estimator for the continuous operator in a non-parametric
setting. Equipped with the geometric and topological information about
the manifold, the Helmholtzian is a useful tool for the analysis of flows and
vector fields on the manifold via the Helmholtz-Hodge theorem. In addition, the
weighted 1-Laplacian allows the smoothing, prediction, and feature
extraction of the flows. We demonstrate these possibilities on substantial sets
of synthetic and real point cloud datasets with non-trivial topological
structures, and provide theoretical results on the limit of the weighted
1-Laplacian to the manifold Helmholtzian.
Learning with mixtures of trees
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (p. 125-129). By Marina Meilă-Predoviciu. Ph.D.
A Markov model for inferring flows in directed contact networks
Directed contact networks (DCNs) are a particularly flexible and convenient
class of temporal networks, useful for modeling and analyzing the transfer of
discrete quantities in communications, transportation, epidemiology, etc.
Transfers modeled by contacts typically underlie flows that associate multiple
contacts based on their spatiotemporal relationships. To infer these flows, we
introduce a simple inhomogeneous Markov model associated to a DCN and show how
it can be effectively used for data reduction and anomaly detection through an
example of kernel-level information transfers within a computer. Comment: 12 pages
An Approach to Web-Scale Named-Entity Disambiguation
We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the potential benefits for data-driven approaches of processing larger data sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
Defining functional distances over Gene Ontology
Background: A fundamental problem when trying to define the functional relationships between proteins is the difficulty in quantifying functional similarities, even when well-structured ontologies exist regarding the activity of proteins (i.e. 'gene ontology', GO). However, functional metrics can overcome the problems in comparing and evaluating functional assignments and predictions. As a reference of proximity, previous approaches to compare GO terms considered linkage in terms of ontology weighted by a probability distribution that balances the non-uniform 'richness' of different parts of the Directed Acyclic Graph. Here, we have followed a different approach to quantify functional similarities between GO terms. Results: We propose a new method to derive 'functional distances' between GO terms that is based on the simultaneous occurrence of terms in the same set of Interpro entries, instead of relying on the structure of the GO. The coincidence of GO terms reveals natural biological links between the GO functions and defines a distance model D_f which fulfils the properties of a Metric Space. The distances obtained in this way can be represented as a hierarchical 'Functional Tree'. Conclusion: The method proposed provides a new definition of distance that enables the similarity between GO terms to be quantified. Additionally, the 'Functional Tree' defines groups with biological meaning, enhancing its utility for protein function comparison and prediction. Finally, this approach could be used for function-based protein searches in databases, and for analysing the gene clusters produced by DNA array experiments.
Deciphering Network Community Structure by Surprise
The analysis of complex networks permeates all sciences, from biology to
sociology. A fundamental, unsolved problem is how to characterize the community
structure of a network. Here, using both standard and novel benchmarks, we show
that maximization of a simple global parameter, which we call Surprise (S),
leads to a very efficient characterization of the community structure of
complex synthetic networks. Particularly, S qualitatively outperforms the most
commonly used criterion to define communities, Newman and Girvan's modularity
(Q). Applying S maximization to real networks often provides natural,
well-supported partitions, but also sometimes counterintuitive solutions that
expose the limitations of our previous knowledge. These results indicate that
it is possible to define an effective global criterion for community structure
and open new routes for the understanding of complex networks. Comment: 7 pages, 5 figures
A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
K-means is undoubtedly the most widely used partitional clustering algorithm.
Unfortunately, due to its gradient descent nature, this algorithm is highly
sensitive to the initial placement of the cluster centers. Numerous
initialization methods have been proposed to address this problem. In this
paper, we first present an overview of these methods with an emphasis on their
computational efficiency. We then compare eight commonly used linear time
complexity initialization methods on a large and diverse collection of data
sets using various performance criteria. Finally, we analyze the experimental
results using non-parametric statistical tests and provide recommendations for
practitioners. We demonstrate that popular initialization methods often perform
poorly and that there are in fact strong alternatives to these methods. Comment: 17 pages, 1 figure, 7 tables