
    Fast and Accurate Dual-Way Streaming PARAFAC2 for Irregular Tensors -- Algorithm and Application

    How can we efficiently and accurately analyze an irregular tensor in a dual-way streaming setting, where the sizes of two dimensions of the tensor grow over time? What types of anomalies arise in this setting? An irregular tensor is a collection of matrices that share the same number of columns but have different numbers of rows. In a dual-way streaming setting, both new rows of existing matrices and entirely new matrices arrive over time. PARAFAC2 decomposition is a crucial tool for analyzing irregular tensors. Although real-time analysis is necessary in the dual-way streaming setting, static PARAFAC2 decomposition methods work inefficiently here because they re-decompose the accumulated tensor whenever new data arrive. Existing streaming PARAFAC2 decomposition methods operate only in limited settings and fail to handle new rows of matrices efficiently. In this paper, we propose Dash, an efficient and accurate PARAFAC2 decomposition method for the dual-way streaming setting. When new data arrive, Dash performs PARAFAC2 decomposition efficiently by carefully separating the terms related to old and new data and avoiding naive recomputation over old data. Furthermore, a forgetting factor lets Dash track recent trends. Extensive experiments show that Dash runs up to 14.0x faster than existing PARAFAC2 decomposition methods on newly arrived data. We also present discoveries of anomalies in real-world datasets, including the Subprime Mortgage Crisis and COVID-19.
    Comment: 12 pages; accepted to the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 202
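As an illustrative aside (not the Dash algorithm itself), the data layout the abstract describes can be sketched in a few lines of NumPy: an irregular tensor is a list of matrices sharing a column count, and dual-way streaming means both kinds of growth shown below. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# An irregular tensor: matrices sharing the column count J,
# but with differing row counts I_k (hypothetical sizes).
J = 4
slices = [rng.standard_normal((rows, J)) for rows in (5, 7, 3)]

# Dual-way streaming: (1) new rows appended to an existing slice,
# (2) an entirely new slice arriving.
new_rows = rng.standard_normal((2, J))
slices[0] = np.vstack([slices[0], new_rows])  # grow an existing matrix
slices.append(rng.standard_normal((6, J)))    # a new matrix arrives

print([X.shape[0] for X in slices])  # row counts differ across slices
```

A streaming PARAFAC2 method must update its factors incrementally under both kinds of arrival instead of re-decomposing the accumulated list.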

    A hybrid method to select morphometric features using tensor completion and F-score rank for gifted children identification

    Gifted children are able to learn in a more advanced way than their peers, probably due to neurophysiological differences in the communication efficiency of neural pathways. Topological features contribute to understanding the correlation between brain structure and intelligence. Despite decades of neuroscience research using MRI, methods based on brain-region connectivity patterns are limited by MRI artifacts, which motivates revisiting MRI morphometric features with the aim of using them to identify gifted children directly instead of relying on brain connectivity. However, the small, high-dimensional morphometric feature dataset with outliers makes finding good classification models challenging. To this end, a hybrid method is proposed that combines tensor completion and feature selection to handle outliers and then select the discriminative features. The proposed method achieves a classification accuracy of 93.1%, higher than other existing algorithms, and is thus suitable for small MRI datasets with outliers in supervised classification scenarios.
    Authors: Jin Zhang, Fan Feng, TianYi Han, Feng Duan (Nankai University, China); Zhe Sun (RIKEN Brain Science Institute, Japan); César Federico Caiafa (Instituto Argentino de Radioastronomía, CONICET, Argentina); Jordi Solé Casals (Central University of Catalonia, Spain)
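The F-score ranking in the title is a standard per-feature statistic for binary classification. A minimal NumPy sketch (not the paper's full hybrid pipeline, and with made-up toy data) might look like:

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score for a binary labelling y in {0, 1}.

    Higher scores indicate features whose class means are far apart
    relative to their within-class variance.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    Xp, Xn = X[y == 1], X[y == 0]
    mu, mup, mun = X.mean(0), Xp.mean(0), Xn.mean(0)
    num = (mup - mu) ** 2 + (mun - mu) ** 2
    den = Xp.var(0, ddof=1) + Xn.var(0, ddof=1)
    return num / den

# Toy example: feature 0 separates the classes, feature 1 is noise.
X = np.array([[5.0, 0.1], [5.2, -0.2], [0.9, 0.0], [1.1, 0.2]])
y = np.array([1, 1, 0, 0])
scores = f_score(X, y)
ranking = np.argsort(scores)[::-1]  # best feature first
print(ranking)  # feature 0 should rank first
```

In the paper's setting, the ranked features would then feed a downstream classifier after tensor completion has repaired outlier-contaminated entries.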

    Streaming data recovery via Bayesian tensor train decomposition

    In this paper, we study a Bayesian tensor train (TT) decomposition method that recovers streaming data by approximating the latent structure of high-order streaming data. Drawing on the streaming variational Bayes method, we introduce the TT format into Bayesian tensor decomposition for streaming data and formulate posteriors over the TT cores. Thanks to the Bayesian treatment of the TT format, the proposed algorithm (SPTT) excels at recovering streaming data that are high-order, incomplete, and noisy. Experiments on synthetic and real-world datasets show the accuracy of our method compared to state-of-the-art Bayesian tensor decomposition methods for streaming data.
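For readers unfamiliar with the TT format the abstract builds on, here is a minimal NumPy sketch of contracting TT cores back into a full tensor (illustrative only; the paper's SPTT algorithm is Bayesian and streaming, which this does not show):

```python
import numpy as np

def tt_to_full(cores):
    """Contract a list of TT cores G_k of shape (r_{k-1}, n_k, r_k)
    (with r_0 = r_d = 1) back into the full tensor."""
    out = cores[0]                              # (1, n_1, r_1)
    for G in cores[1:]:
        out = np.tensordot(out, G, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))            # drop the unit ranks

rng = np.random.default_rng(1)
shape, ranks = (3, 4, 5), (1, 2, 2, 1)
cores = [rng.standard_normal((ranks[k], shape[k], ranks[k + 1]))
         for k in range(3)]
T = tt_to_full(cores)
print(T.shape)  # (3, 4, 5)
```

Each entry of `T` is a product of matrix slices, e.g. `T[i, j, k] = G0[0, i] @ G1[:, j, :] @ G2[:, k, 0]`, which is what makes the format cheap to store for high-order tensors.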

    Theory and Algorithms for Reliable Multimodal Data Analysis, Machine Learning, and Signal Processing

    Modern engineering systems collect large volumes of data measurements across diverse sensing modalities. These measurements can naturally be arranged in higher-order arrays of scalars, commonly referred to as tensors. Tucker decomposition (TD) is a standard method for tensor analysis with applications in diverse fields of science and engineering. Despite its success, TD exhibits severe sensitivity to outliers, i.e., heavily corrupted entries that appear sporadically in modern datasets. We study L1-norm TD (L1-TD), a reformulation of TD that promotes robustness. For 3-way tensors, we show, for the first time, that L1-TD admits an exact solution via combinatorial optimization and present algorithms for computing it. We propose two novel algorithmic frameworks for approximating the exact solution to L1-TD for general N-way tensors. We also propose a novel algorithm for dynamic L1-TD, i.e., efficient joint analysis of streaming tensors. Principal-Component Analysis (PCA), a special case of TD, is likewise sensitive to outliers. We consider Lp-quasinorm PCA (Lp-PCA) for
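The outlier sensitivity of least-squares (L2) objectives that motivates L1-TD is easiest to see in one dimension, where the L2-optimal summary is the mean and the L1-optimal summary is the median. This toy NumPy comparison (not any of the dissertation's algorithms) illustrates the intuition:

```python
import numpy as np

# The L2 objective (mean) is pulled far off by a single corrupted
# entry, while the L1 objective (median) barely moves -- the same
# intuition behind replacing least-squares fitting with L1-norm
# fitting in Tucker decomposition.
clean = np.array([1.0, 1.1, 0.9, 1.0, 1.05])
corrupted = clean.copy()
corrupted[0] = 100.0                           # one sporadic outlier

print(clean.mean(), corrupted.mean())          # mean shifts dramatically
print(np.median(clean), np.median(corrupted))  # median stays near 1
```

L1-TD generalizes this robustness from a scalar summary to the multilinear subspaces fitted by TD.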

    Randomized Algorithms for Computation of Tucker decomposition and Higher Order SVD (HOSVD)

    Big data analysis has become a crucial part of new emerging technologies such as the Internet of Things, cyber-physical analysis, deep learning, and anomaly detection. Among many other techniques, dimensionality reduction plays a key role in such analyses and facilitates feature selection and feature extraction. Randomized algorithms are efficient tools for handling big data tensors. They accelerate the decomposition of large-scale data tensors by reducing both the computational complexity of deterministic algorithms and the communication among different levels of the memory hierarchy, which is the main bottleneck in modern computing environments and architectures. In this paper, we review recent advances in randomization for the computation of the Tucker decomposition and the Higher Order SVD (HOSVD). We discuss random projection and sampling approaches, single-pass and multi-pass randomized algorithms, and how to utilize them in computing the Tucker decomposition and the HOSVD. Simulations on synthetic and real datasets compare the performance of some of the best and most promising algorithms.
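A minimal randomized HOSVD in the spirit of the random-projection approaches surveyed here can be sketched in NumPy. This is an illustrative implementation with hypothetical sizes, not any specific algorithm from the paper: each mode unfolding is sketched with a Gaussian test matrix, orthonormalized by QR, and the leading columns serve as that mode's factor.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: rows indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def randomized_hosvd(T, ranks, oversample=5, seed=0):
    """Sketch each unfolding with a Gaussian test matrix, take a QR
    factorization, and keep the leading columns as the mode factor."""
    rng = np.random.default_rng(seed)
    factors = []
    for k, r in enumerate(ranks):
        Tk = unfold(T, k)
        Y = Tk @ rng.standard_normal((Tk.shape[1], r + oversample))
        Q, _ = np.linalg.qr(Y)
        factors.append(Q[:, :r])
    core = T
    for Q in factors:                   # core = T x_k Q_k^T, mode by mode
        core = np.tensordot(core, Q, axes=([0], [0]))
    return core, factors

# Test tensor with exact multilinear rank (2, 2, 2).
rng = np.random.default_rng(42)
G = rng.standard_normal((2, 2, 2))
A, B, C = (rng.standard_normal((n, 2)) for n in (6, 7, 8))
T = np.einsum('abc,ia,jb,kc->ijk', G, A, B, C)

core, factors = randomized_hosvd(T, ranks=(2, 2, 2))
That = core
for Q in factors:                       # reconstruct: core x_k Q_k
    That = np.tensordot(That, Q, axes=([0], [1]))
print(np.linalg.norm(T - That) / np.linalg.norm(T))  # ~0 for exact rank
```

Because each mode is touched only through a thin sketch `Y`, the dominant cost is one pass of matrix products per mode rather than full deterministic SVDs of the unfoldings.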

    Similarity learning in the era of big data

    This dissertation studies the problem of similarity learning in the era of big data, with heavy emphasis on real-world applications in social media. As in the saying "birds of a feather flock together," in similarity learning we aim to identify the notion of being similar in a data-driven and task-specific way, which is a central problem for maximizing the value of big data. Despite many successes of similarity learning over past decades, social media networks, among the most typical big data media, contain large-volume, varied, and high-velocity data, which makes conventional learning paradigms and off-the-shelf algorithms insufficient. Thus, we focus on addressing the emerging challenges brought by the inherent "three-Vs" characteristics of big data by answering the following questions: 1) Similarity is characterized by both links and node contents in networks; how can we identify the contribution of each network component to seamlessly construct an application-oriented similarity function? 2) Social media data are massive and contain much noise; how can we efficiently learn the similarity between node pairs in large and noisy environments? 3) Node contents in social media networks are multi-modal; how can we effectively measure cross-modal similarity by bridging the so-called "semantic gap"? 4) User wants and needs, and item characteristics, continuously evolve, generating data at an unprecedented rate; how can we model the nature of temporal dynamics in principle and support timely decision-making? The goal of this dissertation is to provide solutions to these questions via innovative research and novel methods. We hope this dissertation sheds more light on similarity learning in the big data era and broadens its applications in social media.