Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding the data
items whose distances to a query item are the smallest in a large database.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution, and learning to hash, which
learns hash functions according to the data distribution, and review them from
various aspects, including hash function design, distance measure, and search
scheme in the hash coding space.
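As an illustration of the first category (hash functions designed without looking at the data), here is a minimal sketch of random-hyperplane hashing followed by a linear scan in Hamming space. It is one well-known locality sensitive scheme, not necessarily the one the survey emphasises; the function names `make_hyperplane_hasher` and `hamming_search` and the parameter choices are assumptions made for this example.

```python
import numpy as np

def make_hyperplane_hasher(dim, n_bits, seed=0):
    """Return a function that maps vectors to n_bits-bit sign codes.

    The hyperplanes are drawn at random, independently of any data,
    which is what makes this a data-independent (LSH-style) hash.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))

    def hash_codes(x):
        x = np.atleast_2d(x)
        # One bit per hyperplane: which side of the plane the vector lies on.
        return (x @ planes.T >= 0).astype(np.uint8)

    return hash_codes

def hamming_search(query_code, db_codes, k=1):
    """Indices of the k database codes closest to query_code in Hamming distance."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")[:k]
```

A typical use is to hash the whole database once with `codes = hasher(db)` and then call `hamming_search(hasher(q)[0], codes, k)` per query. Vectors separated by a small angle agree on most bits, so short Hamming distance serves as a proxy for high cosine similarity.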
Similarity Learning for High-Dimensional Sparse Data
A good measure of similarity between data points is crucial to many tasks in
machine learning. Similarity and metric learning methods learn such measures
automatically from data, but they do not scale well with respect to the
dimensionality of the data. In this paper, we propose a method that can
efficiently learn similarity measures from high-dimensional sparse data. The core idea
is to parameterize the similarity measure as a convex combination of rank-one
matrices with specific sparsity structures. The parameters are then optimized
with an approximate Frank-Wolfe procedure to maximally satisfy relative
similarity constraints on the training data. Our algorithm greedily
incorporates one pair of features at a time into the similarity measure,
providing an efficient way to control the number of active features and thus
reduce overfitting. It enjoys very appealing convergence guarantees and its
time and memory complexity depend on the sparsity of the data instead of the
dimension of the feature space. Our experiments on real-world high-dimensional
datasets demonstrate its potential for classification, dimensionality reduction
and data exploration.
Comment: 14 pages. Proceedings of the 18th International Conference on
Artificial Intelligence and Statistics (AISTATS 2015). Matlab code:
https://github.com/bellet/HDS
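To make the parameterisation concrete, the sketch below evaluates a bilinear similarity x^T M y where M is a nonnegative combination of rank-one matrices, each touching only one pair of features. The specific basis form (e_i ± e_j)(e_i ± e_j)^T, the dict-based sparse vectors, and the name `pair_basis_similarity` are assumptions chosen for illustration; the abstract does not spell out the exact sparsity structure, and the Frank-Wolfe training loop is not shown.

```python
def pair_basis_similarity(x, y, terms):
    """Bilinear similarity x^T M y for a sparsely parameterised M.

    M is assumed to be sum_k w_k * (e_i + s_k e_j)(e_i + s_k e_j)^T, so each
    term involves a single pair of features (i, j).

    x, y  : sparse vectors given as {feature_index: value} dicts
    terms : list of (i, j, sign, weight) tuples with sign in {+1, -1}
    """
    total = 0.0
    for i, j, sign, weight in terms:
        # Project each vector onto the direction e_i + sign * e_j.
        x_proj = x.get(i, 0.0) + sign * x.get(j, 0.0)
        y_proj = y.get(i, 0.0) + sign * y.get(j, 0.0)
        total += weight * x_proj * y_proj
    return total
```

The cost of one evaluation grows with the number of active terms rather than with the dimension of the feature space, which is in the spirit of the complexity claim in the abstract.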
Metabolomics : a tool for studying plant biology
In recent years new technologies have allowed gene expression, protein and metabolite profiles in different tissues and developmental stages to be monitored. This is an emerging field in plant science and is applied to diverse plant systems in order to elucidate the regulation of growth and development. The goal in plant metabolomics is to analyze, identify and quantify all low molecular weight molecules of plant organisms. The plant metabolites are extracted and analyzed using various sensitive analytical techniques, usually mass spectrometry (MS) in combination with chromatography. In order to compare the metabolome of different plants in a high-throughput manner, a number of biological, analytical and data processing steps have to be performed. In the work underlying this thesis we developed a fast and robust method for routine analysis of plant metabolite patterns using Gas Chromatography-Mass Spectrometry (GC/MS). The method was developed using Design of Experiments (DOE) to investigate factors affecting the extraction and derivatization of the metabolites from leaves of the plant Arabidopsis thaliana. The outcome of metabolic analysis by GC/MS is a complex mixture of approximately 400 overlapping peaks. Resolving (deconvoluting) overlapping peaks is time-consuming and difficult to automate, and additional processing is needed in order to compare samples. To avoid deconvolution being a major bottleneck in high-throughput analyses we developed a new semi-automated strategy using hierarchical methods for processing GC/MS data that can be applied to all samples simultaneously. The two methods include baseline correction of the non-processed MS-data files, alignment, time-window determinations, Alternating Regression and multivariate analysis in order to detect metabolites that differ in relative concentrations between samples.
The developed methodology was applied to study the effects of the plant hormone GA on the metabolome, with specific emphasis on auxin levels in Arabidopsis thaliana mutants defective in GA biosynthesis and signalling. A large series of plant samples was analysed and the resulting data were processed in less than one week with minimal labour, similar to the time required for the GC/MS analyses of the samples.
Variational Characterisations of Separability and Entanglement of Formation
In this paper we develop a mathematical framework for the characterisation of
separability and entanglement of formation (EoF) of general bipartite states.
These characterisations are of the variational kind, meaning that separability
and EoF are given in terms of a function which is to be minimized over the
manifold of unitary matrices. A major benefit of such a characterisation is
that it directly leads to a numerical procedure for calculating EoF. We present
an efficient minimisation algorithm and apply it to the bound entangled 3×3
Horodecki states; we show that their EoF is very low and that their distance to
the set of separable states is also very low. Within the same variational
framework we rephrase the results by Wootters (W. Wootters, Phys. Rev. Lett.
80, 2245 (1998)) on EoF for 2×2 states and present progress in generalising
these results to higher dimensional systems.
Comment: 11 pages RevTeX, 4 figures
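For the 2×2 (two-qubit) case, the Wootters result cited above gives EoF in closed form through the concurrence. The sketch below implements that closed formula only; it is not the variational minimisation algorithm of the paper, and the function names are chosen for this example.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of a two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def concurrence(rho):
    """Wootters concurrence of a two-qubit density matrix rho (4x4)."""
    sigma_y = np.array([[0, -1j], [1j, 0]])
    yy = np.kron(sigma_y, sigma_y)
    rho_tilde = yy @ rho.conj() @ yy  # spin-flipped state
    eigvals = np.linalg.eigvals(rho @ rho_tilde)
    # Eigenvalues are real and nonnegative in exact arithmetic; clip round-off.
    lam = np.sort(np.sqrt(np.clip(eigvals.real, 0.0, None)))[::-1]
    return float(min(1.0, max(0.0, lam[0] - lam[1] - lam[2] - lam[3])))

def entanglement_of_formation(rho):
    """EoF of a two-qubit state from its concurrence."""
    c = concurrence(rho)
    return binary_entropy((1.0 + np.sqrt(max(0.0, 1.0 - c * c))) / 2.0)
```

For a maximally entangled Bell state the concurrence is 1 and the EoF is one ebit; for a product state both vanish. No comparable closed formula is used here for higher dimensions, which is where a numerical minimisation becomes necessary.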
Asymptotic inference for semiparametric association models
Association models for a pair of random elements $X$ and $Y$ (e.g., vectors)
are considered which specify the odds ratio function up to an unknown parameter
$\boldsymbol{\theta}$. These models are shown to be semiparametric in the sense that
they do not restrict the marginal distributions of $X$ and $Y$. Inference for
the odds ratio parameter $\boldsymbol{\theta}$ may be obtained from sampling either
$Y$ conditionally on $X$ or vice versa. Generalizing results from Prentice and
Pyke, Weinberg and Wacholder and Scott and Wild, we show that asymptotic
inference for $\boldsymbol{\theta}$ under sampling conditional on $Y$ is the same as
if sampling had been conditional on $X$. Common regression models, for example,
generalized linear models with canonical link or multivariate linear,
respectively, logistic models, are association models where the regression
parameter $\boldsymbol{\beta}$ is closely related to the odds ratio parameter
$\boldsymbol{\theta}$. Hence inference for $\boldsymbol{\beta}$ may be drawn from samples
conditional on $Y$ using an association model.
Comment: Published at http://dx.doi.org/10.1214/07-AOS572 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
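The symmetry behind this result can be seen in the simplest special case, a 2×2 table for two binary variables: the odds ratio is unchanged whether the table is normalised by rows or by columns, so it does not depend on which variable the sampling conditions on. The sketch below checks this numerically; the binary setting and the function names are assumptions for illustration, whereas the paper treats general odds ratio functions.

```python
import numpy as np

def odds_ratio(table):
    """Odds ratio a*d / (b*c) of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = np.asarray(table, dtype=float)
    return (a * d) / (b * c)

def condition_on_rows(joint):
    """Conditional distribution of the column variable given the row variable."""
    joint = np.asarray(joint, dtype=float)
    return joint / joint.sum(axis=1, keepdims=True)

def condition_on_columns(joint):
    """Conditional distribution of the row variable given the column variable."""
    joint = np.asarray(joint, dtype=float)
    return joint / joint.sum(axis=0, keepdims=True)
```

Normalising a row (or column) multiplies one factor in the numerator and one in the denominator of the odds ratio by the same constant, so the marginals cancel. This is the sense in which the odds ratio carries the association without restricting the marginal distributions.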