429 research outputs found

    Efficient similarity search in high-dimensional data spaces

    Similarity search in high-dimensional data spaces is a popular paradigm for many modern database applications, such as content based image retrieval, time series analysis in financial and marketing databases, and data mining. Objects are represented as high-dimensional points or vectors based on their important features. Object similarity is then measured by the distance between feature vectors and similarity search is implemented via range queries or k-Nearest Neighbor (k-NN) queries. Implementing k-NN queries via a sequential scan of large tables of feature vectors is computationally expensive. Building multi-dimensional indexes on the feature vectors for k-NN search also tends to be unsatisfactory when the dimensionality is high. This is due to the poor index performance caused by the dimensionality curse. Dimensionality reduction using the Singular Value Decomposition method is the approach adopted in this study to deal with high-dimensional data. Noting that for many real-world datasets, data distribution tends to be heterogeneous, dimensionality reduction on the entire dataset may cause a significant loss of information. More efficient representation is sought by clustering the data into homogeneous subsets of points, and applying dimensionality reduction to each cluster respectively, i.e., utilizing local rather than global dimensionality reduction. The thesis deals with the improvement of the efficiency of query processing associated with local dimensionality reduction methods, such as the Clustering and Singular Value Decomposition (CSVD) and the Local Dimensionality Reduction (LDR) methods. Variations in the implementation of CSVD are considered and the two methods are compared from the viewpoint of the compression ratio, CPU time, and retrieval efficiency. An exact k-NN algorithm is presented for local dimensionality reduction methods by extending an existing multi-step k-NN search algorithm, which is designed for global dimensionality reduction. 
Experimental results show that the new method requires less CPU time than the approximate method originally proposed for CSVD at a comparable level of accuracy. Optimal subspace dimensionality reduction has the intent of minimizing total query cost. The problem is complicated in that each cluster can retain a different number of dimensions. A hybrid method is presented, combining the best features of the CSVD and LDR methods, to find optimal subspace dimensionalities for clusters generated by local dimensionality reduction methods. The experiments show that the proposed method works well for both real-world datasets and synthetic datasets.
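
The abstract above builds on a multi-step k-NN search designed for global dimensionality reduction. A minimal sketch of that general filter-and-refine idea follows; it is not the thesis's algorithm for local (per-cluster) reduction. It assumes the vectors are already rotated into an SVD basis, so that the distance over the first `m` coordinates never exceeds the full distance. The names `multistep_knn`, `brute_force_knn`, and the synthetic data are hypothetical.

```python
import heapq
import math
import random


def dist(a, b, dims=None):
    """Euclidean distance over the first `dims` coordinates (all if None)."""
    n = len(a) if dims is None else dims
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n)))


def multistep_knn(points, query, k, m):
    """Exact k-NN using an m-dimensional prefix as a lower-bounding filter."""
    # Step 1: k nearest candidates in the reduced (prefix) space.
    reduced = [(dist(p, query, m), i) for i, p in enumerate(points)]
    first = heapq.nsmallest(k, reduced)
    # Step 2: their true distances bound the k-th nearest neighbor distance.
    radius = max(dist(points[i], query) for _, i in first)
    # Step 3: range query in the reduced space. No true neighbor is missed,
    # because the reduced distance never exceeds the full distance.
    candidates = [i for d, i in reduced if d <= radius]
    # Step 4: refine the candidates with full distances.
    refined = [(dist(points[i], query), i) for i in candidates]
    return [i for _, i in heapq.nsmallest(k, refined)], len(candidates)


def brute_force_knn(points, query, k):
    """Reference answer: sequential scan with full distances."""
    full = [(dist(p, query), i) for i, p in enumerate(points)]
    return [i for _, i in heapq.nsmallest(k, full)]


random.seed(0)
# Synthetic data whose spread decays with the coordinate index, mimicking
# vectors already expressed in an SVD basis (an assumption of this sketch).
data = [[random.gauss(0, 1.0 / (j + 1)) for j in range(16)] for _ in range(500)]
q = [random.gauss(0, 1.0 / (j + 1)) for j in range(16)]
result, n_refined = multistep_knn(data, q, k=5, m=4)
```

Here `n_refined` counts how many points needed a full-dimensional distance computation; the fewer of these, the more the reduced representation saves over a sequential scan.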

    Dimensionality Reduced Clustered Data and Order Partition and Stepwise Dimensionality Increasing Indices

    One of the goals of a NASA-funded project at the IBM T. J. Watson Research Center was to build an index for similarity search over satellite images, which were characterized by high-dimensional image texture feature vectors. We review our efforts on data clustering, dimensionality reduction via Singular Value Decomposition (SVD), and indexing, aimed at building a smaller index and processing k-Nearest Neighbor (k-NN) queries more efficiently for similarity search. Processing k-NN queries by scanning the feature vectors of all images is too costly for an ever-increasing number of images. The ubiquitous multidimensional R-tree index and its extensions were not an option given their limited scalability with respect to dimensionality. The cost of processing k-NN queries was further reduced by building memory-resident Ordered Partition (OP) indices on dimensionality-reduced clusters. Further research in a university setting included the following: (1) Clustered SVD was extended to yield exact k-NN queries by issuing appropriate, less costly range queries; (2) the Stepwise Dimensionality Increasing (SDI) index outperformed other known indices; (3) an optimal number of dimensions was selected to reduce query processing cost; (4) two methods were developed to make the OP-trees persistent and loadable with a single file access.

    High-dimensional indexing methods utilizing clustering and dimensionality reduction

    The emergence of novel database applications has resulted in the prevalence of a new paradigm for similarity search. These applications include multimedia databases, medical imaging databases, time series databases, DNA and protein sequence databases, and many others. Features of data objects are extracted and transformed into high-dimensional data points. Searching for objects becomes a search on points in the high-dimensional feature space. The dissimilarity between two objects is determined by the distance between their feature vectors. Similarity search is usually implemented as nearest neighbor search in feature vector spaces. The cost of processing k-nearest neighbor (k-NN) queries via a sequential scan increases as the number of objects and the number of features increase. A variety of multi-dimensional index structures have been proposed to improve the efficiency of k-NN query processing; they work well in low-dimensional space but lose their efficiency in high-dimensional space due to the curse of dimensionality. This inefficiency is dealt with in this study by Clustering and Singular Value Decomposition - CSVD with indexing, Persistent Main Memory - PMM index, and Stepwise Dimensionality Increasing - SDI-tree index. CSVD is an approximate nearest neighbor search method. The performance of CSVD with indexing is studied and the approximation to the distance in the original space is investigated. For a given Normalized Mean Square Error - NMSE, the higher the degree of clustering, the higher the recall. However, more clusters require more disk page accesses. A certain number of clusters can be chosen to achieve a higher recall while maintaining a relatively lower query processing cost. 
The Clustering and Indexing using Persistent Main Memory - CIPMM framework is motivated by the following considerations: (a) a significant fraction of index pages are accessed randomly, incurring a high positioning time for each access; (b) disk transfer rate is improving 40% annually, while the improvement in positioning time is only 8%; (c) query processing incurs less CPU time for main memory resident than disk resident indices. CIPMM aims at reducing the elapsed time for query processing by utilizing sequential, rather than random disk accesses. A specific instance of the CIPMM framework, CIPOP, which indexes using the Persistent Ordered Partition - OP-tree, is elaborated and compared with clustering and indexing using the SR-tree, CISR. The results show that CIPOP outperforms CISR, and the higher the dimensionality, the higher the performance gains. The SDI-tree index is motivated by two observations: fanout decreases as dimensionality increases, and shorter vectors reduce cache misses. The index is built by using feature vectors transformed via principal component analysis, resulting in a structure with fewer dimensions at higher levels and an increasing number of dimensions from one level to the next. Dimensions are retained in nonincreasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. Experiments on three datasets have shown that SDI-trees with carefully tuned parameters incur fewer disk accesses than SR-trees and VAMSR-trees and, in addition, incur less CPU time than VA-Files.
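
The role of the parameter p in the SDI-tree description can be illustrated with a small sketch. It reads the abstract as follows, which is an assumption rather than the thesis's stated rule: level 1 keeps the fewest leading principal components whose cumulative variance reaches p, level 2 reaches 2p, and so on until all variance is covered. The function name `dims_per_level` is hypothetical.

```python
def dims_per_level(variances, p):
    """Number of leading dimensions kept at each index level for step p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Dimensions are considered in nonincreasing order of variance.
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    levels = []
    kept, cumulative, step = 0, 0.0, 1
    while kept < len(ordered):
        target = min(step * p, 1.0)
        # Add dimensions until this level's cumulative variance target is met.
        while kept < len(ordered) and cumulative / total < target:
            cumulative += ordered[kept]
            kept += 1
        # Skip a level that would not add any dimension over the previous one.
        if not levels or kept > levels[-1]:
            levels.append(kept)
        if target >= 1.0:
            break
        step += 1
    return levels


print(dims_per_level([8, 4, 2, 1, 1], 0.25))  # → [1, 2, 5]
```

With variances 8, 4, 2, 1, 1 and p = 0.25, the first dimension alone already covers half of the variance, so the levels keep 1, 2, and finally all 5 dimensions.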

    AdANNS: A Framework for Adaptive Semantic Search

    Web-scale search systems learn an encoder to embed a given query which is then hooked into an approximate nearest neighbor search (ANNS) pipeline to retrieve similar data points. To accurately capture tail queries and data points, learned representations typically are rigid, high-dimensional vectors that are generally used as-is in the entire ANNS pipeline and can lead to computationally expensive retrieval. In this paper, we argue that instead of rigid representations, different stages of ANNS can leverage adaptive representations of varying capacities to achieve significantly better accuracy-compute trade-offs, i.e., stages of ANNS that can get away with more approximate computation should use a lower-capacity representation of the same data point. To this end, we introduce AdANNS, a novel ANNS design framework that explicitly leverages the flexibility of Matryoshka Representations. We demonstrate state-of-the-art accuracy-compute trade-offs using novel AdANNS-based key ANNS building blocks like search data structures (AdANNS-IVF) and quantization (AdANNS-OPQ). For example, on ImageNet retrieval, AdANNS-IVF is up to 1.5% more accurate than the rigid representations-based IVF at the same compute budget; and matches accuracy while being up to 90x faster in wall-clock time. For Natural Questions, 32-byte AdANNS-OPQ matches the accuracy of the 64-byte OPQ baseline constructed using rigid representations -- same accuracy at half the cost! We further show that the gains from AdANNS translate to modern-day composite ANNS indices that combine search structures and quantization. Finally, we demonstrate that AdANNS can enable inference-time adaptivity for compute-aware search on ANNS indices built non-adaptively on matryoshka representations. Code is open-sourced at https://github.com/RAIVNLab/AdANNS. Comment: 25 pages, 15 figures. NeurIPS 2023 camera ready publicatio
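
The idea of giving different search stages representations of different capacity can be sketched with a toy inverted-file (IVF) index. This is not the AdANNS code and does not use its API. It assumes that a prefix of each vector is itself a usable lower-capacity representation, which is the property Matryoshka Representations are trained to have; the random vectors below do not have that property and only exercise the mechanics. All names (`build_ivf`, `search`, `d_cluster`, `d_search`) are hypothetical.

```python
import heapq
import random


def sqdist(a, b, dims):
    """Squared Euclidean distance over the first `dims` coordinates."""
    return sum((a[i] - b[i]) ** 2 for i in range(dims))


def build_ivf(vectors, n_clusters, d_cluster, seed=0):
    """Assign each vector to its nearest centroid, comparing only a prefix."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)  # sampled, no k-means refinement
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(n_clusters),
                      key=lambda c: sqdist(v, centroids[c], d_cluster))
        lists[nearest].append(idx)
    return centroids, lists


def search(vectors, index, query, k, n_probe, d_cluster, d_search):
    """Two-stage search: short prefix picks clusters, longer prefix ranks."""
    centroids, lists = index
    # Coarse stage: a cheap low-dimensional prefix selects clusters to probe.
    probe = heapq.nsmallest(n_probe, range(len(centroids)),
                            key=lambda c: sqdist(query, centroids[c], d_cluster))
    # Fine stage: a longer prefix ranks the points inside the probed clusters.
    scored = [(sqdist(query, vectors[i], d_search), i)
              for c in probe for i in lists[c]]
    return [i for _, i in heapq.nsmallest(k, scored)]


rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(300)]
query = [rng.gauss(0, 1) for _ in range(32)]
index = build_ivf(vectors, n_clusters=10, d_cluster=8)
approx = search(vectors, index, query, k=5, n_probe=3, d_cluster=8, d_search=16)
exhaustive = search(vectors, index, query, k=5, n_probe=10, d_cluster=8, d_search=32)
```

Probing every cluster with the full 32 dimensions degenerates to an exact scan, which gives a baseline against which cheaper settings of `n_probe`, `d_cluster`, and `d_search` can be compared.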

    Multi-Variate Time Series Similarity Measures and Their Robustness Against Temporal Asynchrony

    The amount of time series data generated is increasing due to the integration of sensor technologies with everyday applications, such as gesture recognition, energy optimization, health care, and video surveillance. The simultaneous use of multiple sensors for capturing different aspects of real-world attributes has also led to an increase in dimensionality from uni-variate to multi-variate time series. This has facilitated richer data representation but has also necessitated algorithms that determine the similarity between two multi-variate time series for search and analysis. Various algorithms have been extended from the uni-variate to the multi-variate case, such as multi-variate versions of Euclidean distance, edit distance, and dynamic time warping. However, it has not been studied how these algorithms account for asynchrony in time series. Human gestures, for example, exhibit asynchrony in their patterns as different subjects perform the same gesture with varying movements at different speeds. In this thesis, we propose several algorithms (some of which also leverage metadata describing the relationships among the variates). In particular, we present several techniques that leverage the contextual relationships among the variates when measuring multi-variate time series similarities. Based on the way correlation is leveraged, various weighting mechanisms have been proposed that determine the importance of a dimension for discriminating between the time series, as giving the same weight to each dimension can lead to misclassification. We next study the robustness of the considered techniques against different temporal asynchronies, including shifts and stretching. Exhaustive experiments were carried out on datasets with multiple types and amounts of temporal asynchronies. 
It has been observed that the accuracy of algorithms that rely on data to discover variate relationships can be low in the presence of temporal asynchrony, whereas for algorithms that rely on external metadata, robustness against asynchronous distortions tends to be stronger. Specifically, algorithms using external metadata have better classification accuracy and cluster separation than existing state-of-the-art work, such as EROS, PCA, and naive dynamic time warping. Dissertation/Thesis. Masters Thesis, Computer Science 201
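
For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of textbook dynamic time warping extended to multi-variate series, with an optional weight per variate. It is not one of the thesis's proposed algorithms, and the weights are supplied by hand rather than derived from correlation or metadata. The function name `dtw` and the toy series are hypothetical.

```python
import math


def dtw(a, b, weights=None):
    """DTW distance between two multi-variate series (lists of equal-width rows)."""
    n_vars = len(a[0])
    w = weights or [1.0] * n_vars

    def cost(x, y):
        # Weighted Euclidean distance between two observations.
        return math.sqrt(sum(w[v] * (x[v] - y[v]) ** 2 for v in range(n_vars)))

    inf = float("inf")
    # table[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    table = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    table[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = cost(a[i - 1], b[j - 1]) + step
    return table[len(a)][len(b)]


gesture = [[0, 0], [1, 1], [2, 2]]
# The same gesture performed at half speed (each observation repeated).
stretched = [[0, 0], [0, 0], [1, 1], [1, 1], [2, 2], [2, 2]]
print(dtw(gesture, stretched))  # → 0.0
```

Because warping may repeat observations, the stretched copy aligns at zero cost, whereas a lockstep Euclidean comparison cannot even be applied to series of different lengths.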

    Techniques For Boosting The Performance In Content-based Image Retrieval Systems

    Content-Based Image Retrieval (CBIR) has been an active research area for decades. In a CBIR system, one or more images are used as a query to search for similar images. Similarity is measured on low-level features, such as color, shape, edge, and texture. First, each image is processed and its visual features are extracted, so that each image becomes a point in the feature space. Then, if two images are close to each other in the feature space, they are considered similar. That is, the k nearest neighbors are considered the most similar images to the query image. In this k-Nearest Neighbor (k-NN) model, semantically similar images are assumed to be clustered together in a single neighborhood in the high-dimensional feature space. Unfortunately, semantically similar images with different appearances are often clustered into distinct neighborhoods, which might be scattered in the feature space. Hence, confinement of the search results to a single neighborhood is the latent reason for the low recall rate of typical nearest neighbor techniques. In this dissertation, a new image retrieval technique - the Query Decomposition (QD) model is introduced. QD facilitates retrieval of semantically similar images from multiple neighborhoods in the feature space and hence bridges the semantic gap between the images’ low-level features and their high-level semantic meaning. In the QD model, a query may be decomposed into multiple subqueries based on the user’s relevance feedback to cover multiple image clusters which contain semantically similar images. The retrieval results are the k most similar images from multiple discontinuous relevant clusters. To apply the benefits of the QD study, a mobile client-side relevance feedback study was conducted. With the proliferation of handheld devices, the demand for multimedia information retrieval on mobile devices has attracted more attention. A relevance feedback information retrieval process usually includes several rounds of query refinement. 
Each round incurs an exchange of tens of images between the mobile device and the server. With limited wireless bandwidth, this process can incur substantial delay, making the system unfriendly to use. The Relevance Feedback Support (RFS) structure that was designed in the QD technique was adopted for Client-side Relevance Feedback (CRF). Since relevance feedback is done on the client side, system response is instantaneous, significantly enhancing system usability. Furthermore, since the server is not involved in relevance feedback processing, it is able to support thousands more users simultaneously. While the QD technique improves the accuracy of CBIR systems, another study in this dissertation, called In-Memory relevance feedback, improves their efficiency. Current methods rely on searching the database, stored on disks, in each round of relevance feedback. This strategy incurs a long delay, making relevance feedback less friendly to the user, especially for very large databases. Thus, scalability is a limitation of existing solutions. The proposed in-memory relevance feedback technique substantially reduces the delay associated with feedback processing, and therefore improves system usability. A data-independent dimensionality-reduction technique is used to compress the metadata to build a small in-memory database to support relevance feedback operations with minimal disk accesses. The performance of this approach is compared with conventional relevance feedback techniques in terms of computation efficiency and retrieval accuracy. The results indicate that the new technique substantially reduces response time for user feedback while maintaining the quality of the retrieval. In the previous studies, the QD technique relies on a pre-defined Relevance Feedback Support structure. The results and user experience indicated that this structure might confine the search range and affect the result. 
In this dissertation, a novel Multiple Direction Search framework for semi-automatic annotation propagation is studied. In this system, the user interacts with the system to provide example images and the corresponding annotations during the annotation propagation process. In each iteration, the example images are dynamically clustered and the corresponding annotations are propagated separately to each cluster: images in the local neighborhood are annotated. Furthermore, some of those images are returned to the user for further annotation. As the user marks more images, the annotation process goes into multiple directions in the feature space. The query movements can be treated as multiple path navigation. Each path could be further split based on the user’s input. In this manner, the system provides accurate annotation assistance to the user - images with the same semantic meaning but different visual characteristics can be handled effectively. In comprehensive experiments on the Corel and U. of Washington image databases, the proposed technique shows accuracy and efficiency in annotating image databases.
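
The Query Decomposition idea described earlier in this abstract, retrieving the k most similar images from several separate neighborhoods, can be sketched as a multi-point query. The merge rule used here (score each image by its distance to the nearest subquery) is an assumption made for illustration; the abstract does not state how the dissertation combines subquery results. The function name and the toy 2-D features are hypothetical.

```python
import math


def query_decomposition_knn(images, subqueries, k):
    """Return the k images closest to any of the subquery points."""
    scored = []
    for name, feature in images.items():
        # Score an image by its distance to the nearest subquery.
        nearest = min(math.dist(feature, q) for q in subqueries)
        scored.append((nearest, name))
    return [name for _, name in sorted(scored)[:k]]


# Two relevant neighborhoods around (0, 0) and (10, 10), with unrelated
# images lying between them in the feature space.
images = {
    "a": (0, 0), "b": (0, 1),
    "c": (10, 10), "d": (10, 11),
    "e": (5, 5), "f": (5, 6),
}
multi = query_decomposition_knn(images, [(0, 0), (10, 10)], k=4)
single = query_decomposition_knn(images, [(5, 5)], k=2)
print(multi)   # → ['a', 'c', 'b', 'd']
print(single)  # → ['e', 'f']
```

A single query placed at the centroid of the two neighborhoods returns the unrelated images in between, while the decomposed query covers both relevant clusters.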

    Exploratory Search on Mobile Devices

    The goal of this thesis is to provide a general framework (MobEx) for exploratory search, especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction that are well suited for exploratory search on mobile devices and that make up the MobEx framework. These core processing elements, combined with a multitouch-able user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have been implemented in a research prototype. The initial information request, in the form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets by using standard search engines. These snippets are passed through a chain of NLP components which perform on-demand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction tasks. By on-demand or ad-hoc we mean that the components are capable of performing their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relationships as labelled edges between the nodes. The Topic Graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potential and limitations of the framework and the prototype.