6 research outputs found

    Accelerating Sequence Searching: Dimensionality Reduction Method

    Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications, such as text retrieval and genetic sequence exploration. This paper proposes a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), to speed up search over long sequences. The SEM-tree is a multi-level structure in which each level represents the sequence data as multisets at a different compression level; multiset length increases towards the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they need not keep the character order of the sequence, which makes them a shorter representation, yet they preserve the majority of the distance information between sequences. Each level of the tree serves to prune the search space more efficiently, as the multisets exploit predictability to finish the search early and greatly reduce the computational cost. Comprehensive experiments evaluate the performance of the SEM-tree, and the results show that the proposed method is much more efficient than existing representative methods.
    Computer Science, Artificial Intelligence; Computer Science, Information Systems; SCI(E); 6; ARTICLE; 3301-3222
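    The pruning idea in this abstract can be illustrated with a minimal filter-and-refine sketch. It is not the SEM-tree itself: the paper's multisets come from sequence embedding algorithms that the abstract does not specify, so this sketch assumes plain character-count multisets as a stand-in, and the function names (`edit_distance`, `multiset_lower_bound`, `range_search`) are hypothetical. It relies on the fact that one edit operation changes each side's character surplus by at most one, so the larger surplus never exceeds the edit distance.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def multiset_lower_bound(a: str, b: str) -> int:
    """Order-free lower bound on edit distance from character counts.

    Assumption: character multisets stand in for the paper's embeddings.
    """
    ca, cb = Counter(a), Counter(b)
    surplus_a = sum((ca - cb).values())  # characters a has in excess of b
    surplus_b = sum((cb - ca).values())  # characters b has in excess of a
    return max(surplus_a, surplus_b)


def range_search(query: str, dataset: list[str], radius: int) -> list[str]:
    """Return all sequences within `radius` edits of `query`."""
    hits = []
    for seq in dataset:
        if multiset_lower_bound(query, seq) > radius:
            continue  # pruned without running the quadratic-time distance
        if edit_distance(query, seq) <= radius:
            hits.append(seq)
    return hits
```

    Because the bound never overestimates the true distance, pruning on it cannot discard a valid answer; it only skips exact computations that are certain to fail.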

    NNMap: A method to construct a good embedding for nearest neighbor classification

    This paper addresses the practical shortcomings of the nearest neighbor classifier. We define a quantitative criterion for assessing embedding quality for nearest neighbor classification, and present a method called NNMap to construct a good embedding. Furthermore, an efficient distance is obtained in the embedded vector space, which can speed up nearest neighbor classification. The quantitative quality criterion is proposed as a local structure descriptor of the sample data distribution; embedding quality corresponds to the quality of the local structure. In the NNMap framework, one-dimensional embeddings act as weak classifiers with pseudo-losses defined on the amount of local structure preserved by the embedding. Based on this property, NNMap reduces the problem of embedding construction to the classical boosting problem. An important property of NNMap is that the embedding optimization criterion is appropriate for both vector and non-vector data, and equally valid in metric and non-metric spaces. The effectiveness of the new method is demonstrated by experiments on the MNIST handwritten digit dataset, the CMU PIE face image dataset, and datasets from the UCI Machine Learning Repository.

    Extraction and Matching by Similarity: A Case Study of Supermarket Chains in Florianópolis

    TCC (undergraduate thesis) - Universidade Federal de Santa Catarina. Centro Tecnológico. Sistemas de Informação. With the rise of e-commerce for basic consumer goods, the number of offers available to consumers shopping at supermarkets has grown considerably. Given the Brazilian economic scenario, consumers increasingly look for ways to save money. Several online tools and platforms already compare product prices, but most focus on electronics, appliances, and other non-consumable goods. This undergraduate thesis proposes a tool to help identify, catalog, and classify similar products across multiple supermarket outlets in Florianópolis. The proposed solution includes an API that acts as a Web scraper, extracting product prices from supermarkets located in Florianópolis that sell online. The extracted data undergo transformation and normalization. After this pre-processing, the data pass through integration and indexing stages, in which similarity algorithms compare the extracted records in order to match products identified as similar and store those matches. The tool will build an indexed, consolidated database that facilitates price comparison between different supermarkets.
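    The normalization and similarity-matching stages described above might look roughly like the sketch below. The abstract does not name the similarity algorithms used, so token-set Jaccard similarity is an assumption chosen for illustration; the function names, the 0.6 threshold, and the example product names in the usage note are all hypothetical.

```python
import re
import unicodedata


def normalize(name: str) -> frozenset[str]:
    """Lowercase, strip accents, and split a product name into a token set."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return frozenset(re.findall(r"[a-z0-9]+", ascii_only))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Share of tokens common to both sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_products(store_a, store_b, threshold=0.6):
    """Pair each name in store_a with its best match in store_b.

    A pair is kept only if its similarity reaches `threshold`
    (an assumed value, not one taken from the thesis).
    """
    matches = []
    for name_a in store_a:
        tokens_a = normalize(name_a)
        best = max(store_b,
                   key=lambda n: jaccard(tokens_a, normalize(n)),
                   default=None)
        if best is not None and jaccard(tokens_a, normalize(best)) >= threshold:
            matches.append((name_a, best))
    return matches
```

    For instance, a listing named "Café Pilão 500g" in one store and "CAFE PILAO 500G TRADICIONAL" in another share three of four tokens after normalization, so they would be paired at this threshold, while an unrelated item such as "Leite Integral 1L" would not.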

    Aspects of Metric Spaces in Computation

    Metric spaces, which generalise the properties of commonly-encountered physical and abstract spaces into a mathematical framework, frequently occur in computer science applications. Three major kinds of questions about metric spaces are considered here: the intrinsic dimensionality of a distribution, the maximum number of distance permutations, and the difficulty of reverse similarity search. Intrinsic dimensionality measures the tendency for points to be equidistant, which is diagnostic of high-dimensional spaces. Distance permutations describe the order in which a set of fixed sites appears while moving away from a chosen point; the number of distinct permutations determines the amount of storage space required by some kinds of indexing data structure. Reverse similarity search problems are constraint satisfaction problems derived from distance-based index structures. Their difficulty reveals details of the structure of the space. Theoretical and experimental results are given for these three questions in a wide range of metric spaces, with commentary on the consequences for computer science applications and additional related results where appropriate.
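    As a small illustration of the distance permutations mentioned in this abstract, the sketch below orders a fixed list of sites by their distance from a point and counts how many distinct orderings occur over a sample of points. It assumes Euclidean distance and breaks ties by site index; both choices, and the function names, are assumptions made for the example and are not taken from the thesis.

```python
import math


def distance_permutation(point, sites):
    """Indices of `sites` ordered by increasing distance from `point`.

    Assumes Euclidean distance; ties are broken by site index.
    """
    return tuple(sorted(range(len(sites)),
                        key=lambda i: (math.dist(point, sites[i]), i)))


def count_distinct_permutations(points, sites):
    """Number of different site orderings seen across `points`."""
    return len({distance_permutation(p, sites) for p in points})
```

    With three collinear sites at (0, 0), (1, 0), and (3, 0), and sample points swept along the same line, only 4 of the 3! = 6 possible orderings appear, which hints at why the number of realizable permutations bounds the storage an index needs.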

    Distance Based Indexing for String Proximity Search

    In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on (a weighted) count of (i) character edit or (ii) block edit operations to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goa