80 research outputs found

    R-Forest for Approximate Nearest Neighbor Queries in High Dimensional Space

    Searching high-dimensional space has been a challenge and an area of intense research for many years. The curse of dimensionality has rendered most existing index methods all but useless, prompting researchers to explore other techniques. In my dissertation I try to resurrect one of the best-known index structures, the R-Tree, which most have given up on as a viable method of answering high-dimensional queries. I point out the various advantages of the R-Tree as a method for answering approximate nearest neighbor queries, as well as the advantages of locality sensitive hashing and the locality sensitive B-Tree, which are the most successful methods today. I started by looking at improving the maintenance of the R-Tree through bulk loading and insertion. I proposed and implemented a new bulk-loading method for the index that improves on the standard method. I then turned my attention to nearest neighbor queries, which are a much more challenging problem, especially in high-dimensional space. Initially I developed a set of heuristics, easily implemented in the R-Tree, which improved the efficiency of high-dimensional approximate nearest neighbor queries. To further refine my method I took another approach, developing a new model, known as R-Forest, which takes advantage of space partitioning while still using the R-Tree as its index structure. With this new approach I was able to implement new heuristics and can show that R-Forest, composed of a set of R-Trees, is a viable solution to high-dimensional approximate nearest neighbor queries when compared to established methods.

    Subseries Join and Compression of Time Series Data Based on Non-uniform Segmentation

    A time series is composed of a sequence of data items that are measured at uniform intervals. Many application areas generate or manipulate time series, including finance, medicine, digital audio, and motion capture. Efficiently searching a large time series database is still a challenging problem, especially when partial or subseries matches are needed. This thesis proposes a new definition of subseries join, a symmetric generalization of subseries matching, which finds similar subseries in two or more time series datasets. A solution is proposed to compute the subseries join based on a hierarchical feature representation. This hierarchical feature representation is generated by an anisotropic diffusion scale-space analysis and a non-uniform segmentation method. Each segment is represented by a minimal polynomial envelope in a reduced-dimensionality space. Based on the hierarchical feature representation, all features in a dataset are indexed in an R-tree, and candidate matching features of two datasets are found by an R-tree join operation. Given candidate matching features, a dynamic programming algorithm is developed to compute the final subseries join. To improve storage efficiency, a hierarchical compression scheme is proposed to compress features. The minimal polynomial envelope representation is transformed to a Bezier spline envelope representation. The control points of each Bezier spline are then hierarchically differenced, and arithmetic coding is used to compress these differences. To empirically evaluate their effectiveness, the proposed subseries join and compression techniques are tested on various publicly available datasets. A large motion capture database is also used to verify the techniques in a real-world application. The experiments show that the proposed subseries join technique can better tolerate noise and local scaling than previous work, and that the proposed compression technique can achieve about 85% higher compression rates than previous work with the same distortion error.

    Content-Based Image Retrieval Using Self-Organizing Maps

    Coping with distance and location dependencies in spatial, temporal and uncertain data

    Intelligent Data Analytics using Deep Learning for Data Science

    Data science currently attracts the interest of academics and practitioners because it can assist in the extraction of significant insights from massive amounts of data. Between 2018 and 2025, the Global Datasphere is expected to grow from 33 zettabytes to 175 zettabytes, according to the International Data Corporation. This dissertation proposes an intelligent data analytics framework that uses deep learning to tackle several difficulties that arise when implementing a data science application. These difficulties include dealing with high inter-class similarity, the availability and quality of hand-labeled data, and designing a feasible approach for modeling significant correlations in features gathered from various data sources. The proposed intelligent data analytics framework employs a novel strategy for improving data representation learning by incorporating supplemental data from various sources and structures. First, the research presents a multi-source fusion approach that utilizes confident learning techniques to improve the data quality from many noisy sources. Meta-learning methods based on advanced techniques such as the mixture of experts and differential evolution combine the predictive capacity of individual learners with a gating mechanism, ensuring that only the most trustworthy features or predictions are integrated to train the model. Then, a Multi-Level Convolutional Fusion is presented to train a model on the correspondence between local-global deep feature interactions to identify easily confused samples of different classes. The convolutional fusion is further enhanced with the power of Graph Transformers, aggregating the relevant neighboring features in graph-based input data structures and achieving state-of-the-art performance on a large-scale building damage dataset. Finally, weakly-supervised strategies, noise regularization, and label propagation are proposed to train a model on sparse input labeled data, ensuring the model's robustness to errors and supporting the automatic expansion of the training set. The suggested approaches outperformed competing strategies in effectively training a model on a large-scale dataset of 500k photos, with just about 7% of the images annotated by a human. The proposed framework's capabilities have benefited various data science applications, including fluid dynamics, geometric morphometrics, building damage classification from satellite pictures, disaster scene description, and storm-surge visualization.

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov

    Π-Avida - A Personalized Interactive Audio and Video Portal

    We describe a system for recording, storing and distributing multimedia data streams. For each modality (audio, speech, video), characteristic features are extracted and used to classify the content into a range of topic categories. Using data mining techniques, classifier models are determined from training data. These models are able to assign existing and new multimedia documents to one or several topic categories. We describe the features used as inputs for these classifiers. We demonstrate that the classification of audio material may be improved by using phonemes and syllables instead of words. Finally, we show that the categorization performance mainly depends on the quality of speech recognition and that the simple video features we tested are of only marginal utility.

    Toward autonomous harbor surveillance

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Includes bibliographical references (p. 105-113). In this thesis we address the problem of drift-free navigation for underwater vehicles performing harbor surveillance and ship hull inspection. Maintaining accurate localization for the duration of a mission is important for a variety of tasks, such as planning the vehicle trajectory and ensuring coverage of the area to be inspected. Our approach uses only onboard sensors in a simultaneous localization and mapping setting and removes the need for any external infrastructure such as acoustic beacons. We extract dense features from a forward-looking imaging sonar and apply pair-wise registration between sonar frames. The registrations are combined with onboard velocity, attitude and acceleration sensors to obtain an improved estimate of the vehicle trajectory. In addition, an architecture for persistent mapping is proposed, with the intention of handling long-term operations and repetitive surveillance tasks. The proposed architecture is flexible and supports different types of vehicles and mapping methods. The design of the system is demonstrated with an implementation of some of its key features. In addition, methods for re-localization are considered. Finally, results from several experiments that demonstrate drift-free navigation in various underwater environments are presented. By Hordur Johannsson. S.M.

    High Dimensional Data Set Analysis Using a Large-Scale Manifold Learning Approach

    Because of technological advances, data sets are growing in both size and dimensionality. Processing these large-scale data sets is challenging for conventional computers due to computational limitations. A framework for nonlinear dimensionality reduction on large databases is presented that alleviates the issue of large data sets through sampling, graph construction, manifold learning, and embedding. Neighborhood selection is a key step in this framework and a potential area of improvement. The standard approach to neighborhood selection is setting a fixed neighborhood. This could be a fixed number of neighbors or a fixed neighborhood size. Each of these has its limitations due to variations in data density. A novel adaptive neighbor-selection algorithm is presented to enhance performance by incorporating sparse ℓ1-norm based optimization. These enhancements are applied to the graph construction and embedding modules of the original framework. As validation of the proposed ℓ1-based enhancement, experiments are conducted on these modules using publicly available benchmark data sets. The two approaches are then applied to a large-scale magnetic resonance imaging (MRI) data set for brain tumor progression prediction. Results showed that the proposed approach outperformed linear methods and other traditional manifold learning algorithms.
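
    The two fixed-neighborhood strategies named in the abstract above (a fixed number of neighbors versus a fixed neighborhood size) can be illustrated with a minimal sketch. The function names and the toy data are assumptions made for illustration and do not come from the thesis. The sketch does not implement the ℓ1-based adaptive method; it only shows why varying data density is a problem for both fixed rules.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_neighbors(points, k):
    """Fixed number of neighbors: each point keeps its k closest points."""
    dist = pairwise_distances(points)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    order = np.argsort(dist, axis=1)[:, :k]
    return [set(row.tolist()) for row in order]

def radius_neighbors(points, radius):
    """Fixed neighborhood size: each point keeps all points within `radius`."""
    dist = pairwise_distances(points)
    np.fill_diagonal(dist, np.inf)
    return [set(np.flatnonzero(row <= radius).tolist()) for row in dist]

# Hypothetical toy data: a dense cluster of three points plus one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])

knn = knn_neighbors(points, k=2)
ball = radius_neighbors(points, radius=0.5)

for i, (a, b) in enumerate(zip(knn, ball)):
    print(f"point {i}: k-NN neighbors {sorted(a)}, radius neighbors {sorted(b)}")
```

    In this toy example the fixed-k rule links the isolated point to two far-away cluster members, while the fixed-radius rule leaves it with no neighbors at all. Both outcomes reflect the density sensitivity that the abstract gives as motivation for an adaptive rule.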