3 research outputs found

    DROP: Dimensionality Reduction Optimization for Time Series

    Dimensionality reduction (DR) is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for dimensionality reduction, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, stochastic PCA methods terminate either after a predetermined number of iterations or once the solution converges, so they frequently sample too many or too few datapoints to improve end-to-end runtime. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to efficiently terminate after operating over small (e.g., 1%) subsamples of input data, reducing whole workload runtime. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5x over Singular-Value-Decomposition-based PCA techniques, and outperforms conventional approaches like FFT and PAA by up to 16x in end-to-end workloads.

    Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques

    Research on similarity join techniques is a growing practical area of study, especially with the increasing E-availability of vast amounts of digital data from more and more source systems. This research focuses on clustering-based pre-processing techniques to improve existing similarity join approaches. Identifying and extracting the same real-world entities from different data sources is still a big challenge and a significant task in the digital information era. Dissimilar extracts may in fact represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Therefore, discovering efficient and accurate approaches to determine the similarity of data objects or values is of theoretical as well as practical significance. Semantic problems arise even around the concept of similarity itself, regarding its usage and foundation. Existing similarity join approaches often take a very specific view of similarity measures and pre-defined predicates, which gives them a narrow focus on the context of similarity for a given scenario. The predicates have been assumed to be a group of clustering [MSW 72] related attributes on the join. Identifying those entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system. This study focuses on string similarity joins, specifically those based on the Levenshtein (edit) distance and Q-grams. It proposes effective and efficient clustering-based pre-processing techniques that identify clustering-related predicates, based on either attribute value or data value, to improve existing similarity join techniques in enterprise data integration scenarios.

    Application of Multi-GNSS Positioning in Landslide Surface Deformation Monitoring

    With the modernization of the legacy GPS and GLONASS systems and the completion of the new European Galileo and Chinese BeiDou systems, about 120 navigation satellites are presently available to Global Navigation Satellite System (GNSS) users around the world. The use of multi-GNSS constellations has therefore become an important research topic in recent years, including in the area of landslide monitoring. The main goal of this dissertation was to analyze the positioning accuracy and performance of different combinations of satellite systems, with a focus on finding the optimal strategy for multi-GNSS data collection and processing in landslide monitoring applications. Five stabilized monitoring points allowing repeated GNSS observation campaigns were established at the selected Recica landslide in the Czech Republic. The quality of current multi-GNSS precise products provided by different analysis centers (ACs) was evaluated to allow selection of the optimal one. Although no substantial differences were found, the products provided by GeoForschungsZentrum (GFZ) and the Center for Orbit Determination in Europe (CODE) can be recommended overall. Subsequently, the positioning accuracy provided by various constellation combinations was analyzed using data from well-established GNSS reference stations while simulating the observation conditions of the Recica landslide. The best results were obtained when processing signals from a combination of GPS and GLONASS, or of GPS, GLONASS and Galileo, with a static relative differential technique and observation periods exceeding eight hours. Finally, data from repeated GNSS campaigns carried out at the Recica landslide over two years were processed with the optimal setup, and the resulting displacements were compared with standard geotechnical measurements. A horizontal displacement of about 3 cm per year was found for three monitoring points, while the other two points were more stable.
    548 - Katedra geoinformatiky