
    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of finding the data items in a large database whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
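
To make the data-independent category concrete, here is a minimal sketch of one well-known locality sensitive hashing scheme, random-hyperplane (sign random projection) hashing for angular similarity. It is an illustration only, not code from the survey; the function names, the 64-bit code length, and the toy vectors are assumptions chosen for the example.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals (no training data used)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(vector, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector lies on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(code_a, code_b):
    """Number of differing bits; used as the distance in the hash coding space."""
    return sum(a != b for a, b in zip(code_a, code_b))

# Toy data (assumed for illustration): `near` points almost the same way as
# `query`, while `far` points almost the opposite way.
planes = make_hyperplanes(num_bits=64, dim=3)
query = [1.0, 0.9, 0.1]
near = [0.9, 1.0, 0.0]
far = [-1.0, -0.8, 0.2]

d_near = hamming(hash_code(query, planes), hash_code(near, planes))
d_far = hamming(hash_code(query, planes), hash_code(far, planes))
print(d_near, d_far)
```

Each bit differs between two vectors with probability proportional to the angle between them, so vectors at a small angle tend to receive codes with a small Hamming distance. A learning-to-hash method would instead fit the hyperplanes (or other hash functions) to the data.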

    Similarity Learning for High-Dimensional Sparse Data

    A good measure of similarity between data points is crucial to many tasks in machine learning. Similarity and metric learning methods learn such measures automatically from data, but they do not scale well with respect to the dimensionality of the data. In this paper, we propose a method that can efficiently learn a similarity measure from high-dimensional sparse data. The core idea is to parameterize the similarity measure as a convex combination of rank-one matrices with specific sparsity structures. The parameters are then optimized with an approximate Frank-Wolfe procedure to maximally satisfy relative similarity constraints on the training data. Our algorithm greedily incorporates one pair of features at a time into the similarity measure, providing an efficient way to control the number of active features and thus reduce overfitting. It enjoys very appealing convergence guarantees, and its time and memory complexity depend on the sparsity of the data instead of the dimension of the feature space. Our experiments on real-world high-dimensional datasets demonstrate its potential for classification, dimensionality reduction and data exploration. Comment: 14 pages. Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015). Matlab code: https://github.com/bellet/HDS
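
The greedy, one-element-at-a-time behaviour described above is a general property of Frank-Wolfe over a convex hull. The sketch below is a generic textbook Frank-Wolfe loop on the probability simplex with a toy quadratic objective; it is not the paper's approximate variant, does not use rank-one basis matrices or similarity constraints, and every name and number in it is an assumption made for illustration.

```python
def frank_wolfe_simplex(grad, dim, num_iters):
    """Minimize a smooth convex function over the probability simplex.

    Each iteration moves toward a single vertex (one coordinate), so at most
    one new coordinate becomes active per iteration.
    """
    w = [0.0] * dim
    w[0] = 1.0  # start at an arbitrary vertex of the simplex
    for k in range(num_iters):
        g = grad(w)
        # Linear minimization oracle: the best vertex is the coordinate
        # with the smallest gradient entry.
        j = min(range(dim), key=lambda i: g[i])
        gamma = 2.0 / (k + 2.0)  # standard step-size schedule
        w = [(1.0 - gamma) * wi for wi in w]
        w[j] += gamma
    return w

# Toy objective (assumed): squared distance to a sparse target in the simplex.
target = [0.7, 0.2, 0.1, 0.0, 0.0]

def objective(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w):
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]

w_few = frank_wolfe_simplex(grad, dim=5, num_iters=3)
w_many = frank_wolfe_simplex(grad, dim=5, num_iters=500)
active_few = sum(1 for wi in w_few if wi > 0.0)
print(active_few, objective(w_many))
```

Stopping early therefore bounds the number of active coordinates, which is the mechanism the abstract credits for controlling the number of active features.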

    Metabolomics: a tool for studying plant biology

    In recent years, new technologies have allowed gene expression, protein and metabolite profiles in different tissues and developmental stages to be monitored. This is an emerging field in plant science and is applied to diverse plant systems in order to elucidate the regulation of growth and development. The goal in plant metabolomics is to analyze, identify and quantify all low molecular weight molecules of plant organisms. The plant metabolites are extracted and analyzed using various sensitive analytical techniques, usually mass spectrometry (MS) in combination with chromatography. In order to compare the metabolomes of different plants in a high-throughput manner, a number of biological, analytical and data processing steps have to be performed. In the work underlying this thesis we developed a fast and robust method for routine analysis of plant metabolite patterns using Gas Chromatography-Mass Spectrometry (GC/MS). The method was performed according to Design of Experiments (DOE) to investigate factors affecting the extraction and derivatization of the metabolites from leaves of the plant Arabidopsis thaliana. The outcome of metabolic analysis by GC/MS is a complex mixture of approximately 400 overlapping peaks. Resolving (deconvoluting) overlapping peaks is time-consuming and difficult to automate, and additional processing is needed in order to compare samples. To avoid deconvolution being a major bottleneck in high-throughput analyses, we developed a new semi-automated strategy using hierarchical methods for processing GC/MS data that can be applied to all samples simultaneously. The two methods include base-line correction of the non-processed MS-data files, alignment, time-window determinations, Alternating Regression and multivariate analysis in order to detect metabolites that differ in relative concentrations between samples.
    The developed methodology was applied to study the effects of the plant hormone GA on the metabolome, with specific emphasis on auxin levels in Arabidopsis thaliana mutants defective in GA biosynthesis and signalling. A large series of plant samples was analysed and the resulting data were processed in less than one week with minimal labour, similar to the time required for the GC/MS analyses of the samples.

    Variational Characterisations of Separability and Entanglement of Formation

    In this paper we develop a mathematical framework for the characterisation of separability and entanglement of formation (EoF) of general bipartite states. These characterisations are of the variational kind, meaning that separability and EoF are given in terms of a function which is to be minimized over the manifold of unitary matrices. A major benefit of such a characterisation is that it directly leads to a numerical procedure for calculating EoF. We present an efficient minimisation algorithm and apply it to the bound entangled 3×3 Horodecki states; we show that their EoF is very low and that their distance to the set of separable states is also very low. Within the same variational framework we rephrase the results by Wootters (W. Wootters, Phys. Rev. Lett. 80, 2245 (1998)) on EoF for 2×2 states and present progress in generalising these results to higher dimensional systems. Comment: 11 pages RevTeX, 4 figure

    Asymptotic inference for semiparametric association models

    Association models for a pair of random elements X and Y (e.g., vectors) are considered which specify the odds ratio function up to an unknown parameter θ. These models are shown to be semiparametric in the sense that they do not restrict the marginal distributions of X and Y. Inference for the odds ratio parameter θ may be obtained from sampling either Y conditionally on X or vice versa. Generalizing results from Prentice and Pyke, Weinberg and Wacholder, and Scott and Wild, we show that asymptotic inference for θ under sampling conditional on Y is the same as if sampling had been conditional on X. Common regression models, for example, generalized linear models with canonical link or multivariate linear or logistic models, are association models where the regression parameter β is closely related to the odds ratio parameter θ. Hence inference for β may be drawn from samples conditional on Y using an association model. Comment: Published at http://dx.doi.org/10.1214/07-AOS572 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org
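
The symmetry behind this result can be seen in the simplest case of two binary variables, where the odds ratio computed from the conditional distribution of Y given X equals the one computed from X given Y. The sketch below checks this on a made-up 2×2 table; the counts and function names are assumptions for illustration, and it shows only this population-level identity, not the paper's asymptotic theory.

```python
from fractions import Fraction

# Hypothetical joint counts, keyed by (x, y) with x, y in {0, 1}.
counts = {(0, 0): 40, (0, 1): 10, (1, 0): 20, (1, 1): 30}

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

def odds_ratio_given_x(counts):
    """Odds ratio built from P(Y = 1 | X = x), as under sampling conditional on X."""
    def p_y1(x):
        return Fraction(counts[(x, 1)], counts[(x, 0)] + counts[(x, 1)])
    return odds(p_y1(1)) / odds(p_y1(0))

def odds_ratio_given_y(counts):
    """Odds ratio built from P(X = 1 | Y = y), as under sampling conditional on Y."""
    def p_x1(y):
        return Fraction(counts[(1, y)], counts[(0, y)] + counts[(1, y)])
    return odds(p_x1(1)) / odds(p_x1(0))

print(odds_ratio_given_x(counts))  # 6
print(odds_ratio_given_y(counts))  # 6
```

Both directions reduce to the cross-product ratio (40 × 30) / (10 × 20) = 6, which does not depend on the row or column totals. This is why the marginal distributions of X and Y are left unrestricted.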