Fast and Accurate Dual-Way Streaming PARAFAC2 for Irregular Tensors -- Algorithm and Application
How can we efficiently and accurately analyze an irregular tensor in a
dual-way streaming setting where the sizes of two dimensions of the tensor
increase over time? What types of anomalies are there in the dual-way streaming
setting? An irregular tensor is a collection of matrices whose column lengths
are the same while their row lengths are different. In a dual-way streaming
setting, both new rows of existing matrices and new matrices arrive over time.
PARAFAC2 decomposition is a crucial tool for analyzing irregular tensors.
Although real-time analysis is necessary in the dual-way streaming setting,
static PARAFAC2 decomposition methods work inefficiently here since they re-run
PARAFAC2 decomposition on the accumulated tensor whenever new data arrive.
Existing streaming PARAFAC2 decomposition methods work only in limited settings
and fail to handle new rows of matrices efficiently. In this paper, we
propose Dash, an efficient and accurate PARAFAC2 decomposition method working
in the dual-way streaming setting. When new data are given, Dash efficiently
performs PARAFAC2 decomposition by carefully dividing the terms related to old
and new data and avoiding naive computations involved with old data.
Furthermore, applying a forgetting factor makes Dash follow recent movements.
Extensive experiments show that Dash achieves up to 14.0x faster speed than
existing PARAFAC2 decomposition methods for newly arrived data. We also provide
discoveries for detecting anomalies in real-world datasets, including Subprime
Mortgage Crisis and COVID-19.
Comment: 12 pages; accepted to the 29th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD) 2023
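The core efficiency idea described above, splitting the update into terms for old and new data so old data is never revisited, and applying a forgetting factor, can be illustrated with a minimal numpy sketch. This is a generic streaming least-squares update, not the Dash algorithm itself; the function name `update_gram` and the block sizes are illustrative assumptions.

```python
import numpy as np

def update_gram(G, H, X_new, Y_new, lam=0.9):
    """Fold a newly arrived block of rows into cached Gram-style terms.

    G caches X^T X and H caches X^T Y over all data seen so far; the
    forgetting factor lam down-weights old data so that recent
    movements dominate, without ever recomputing over old rows.
    """
    G = lam * G + X_new.T @ X_new
    H = lam * H + X_new.T @ Y_new
    return G, H

rng = np.random.default_rng(0)
d, k = 4, 3
G = np.zeros((d, d))
H = np.zeros((d, k))
for _ in range(5):                       # five arriving row blocks
    Xb = rng.standard_normal((10, d))
    Yb = rng.standard_normal((10, k))
    G, H = update_gram(G, H, Xb, Yb, lam=0.9)
W = np.linalg.solve(G, H)                # factor update from cached terms only
print(W.shape)                           # (4, 3)
```

The cost per update depends only on the size of the new block, which is the property that makes dual-way streaming updates tractable.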
A hybrid method to select morphometric features using tensor completion and F-score rank for gifted children identification
Gifted children are able to learn in a more advanced way than others, probably due to neurophysiological differences in the communication efficiency in neural pathways. Topological features contribute to understanding the correlation between brain structure and intelligence. Despite decades of neuroscience research using MRI, methods based on brain-region connectivity patterns are limited by MRI artifacts, which motivates revisiting MRI morphometric features with the aim of using them to directly identify gifted children instead of using brain connectivity. However, the small, high-dimensional morphometric feature dataset with outliers makes the task of finding good classification models challenging. To this end, a hybrid method is proposed that combines tensor completion and feature selection methods to handle outliers and then select the discriminative features. The proposed method achieves a classification accuracy of 93.1%, higher than other existing algorithms, making it suitable for small MRI datasets with outliers in supervised classification scenarios.
Fil: Zhang, Jin. Nankai University; China
Fil: Feng, Fan. Nankai University; China
Fil: Han, TianYi. Nankai University; China
Fil: Duan, Feng. Nankai University; China
Fil: Sun, Zhe. Riken. Brain Science Institute; Japón
Fil: Caiafa, César Federico. Provincia de Buenos Aires. Gobernación. Comisión de Investigaciones Científicas. Instituto Argentino de Radioastronomía. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto Argentino de Radioastronomía; Argentina
Fil: Solé Casals, Jordi. Central University of Catalonia; España
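The F-score ranking step mentioned in the title can be sketched in a few lines. This is the standard two-class F-score (between-class separation over within-class spread), not the paper's full pipeline; the `f_scores` helper and the synthetic data are illustrative assumptions.

```python
import numpy as np

def f_scores(X, y):
    """Two-class F-score per feature: how far each class mean sits
    from the overall mean, relative to the within-class variances."""
    pos, neg = X[y == 1], X[y == 0]
    m, mp, mn = X.mean(0), pos.mean(0), neg.mean(0)
    num = (mp - m) ** 2 + (mn - m) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (50, 5))
X1 = rng.normal(0.0, 1.0, (50, 5))
X1[:, 2] += 3.0                       # make feature 2 discriminative
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])
scores = f_scores(X, y)
print(scores.argmax())                # feature 2 ranks highest
```

Features are then kept in decreasing F-score order, which is why the method is well suited to small, high-dimensional datasets where wrapper-style selection would overfit.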
Streaming data recovery via Bayesian tensor train decomposition
In this paper, we study a Bayesian tensor train (TT) decomposition method to
recover streaming data by approximating the latent structure in high-order
streaming data. Drawing on the streaming variational Bayes method, we introduce
the TT format into Bayesian tensor decomposition methods for streaming data,
and formulate posteriors of TT cores. Thanks to the Bayesian framework of the
TT format, the proposed algorithm (SPTT) excels in recovering streaming data
with high-order, incomplete, and noisy properties. Experiments on synthetic
and real-world datasets demonstrate the accuracy of our method compared to
state-of-the-art Bayesian tensor decomposition methods for streaming data.
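The TT format underlying the method above represents a d-way tensor as a chain of small 3-way cores. A minimal numpy sketch of reconstructing a full tensor from its TT cores (the deterministic format only, not the Bayesian SPTT inference; the `tt_to_full` helper is an illustrative name):

```python
import numpy as np

def tt_to_full(cores):
    """Contract TT cores G_k of shape (r_{k-1}, n_k, r_k), with
    boundary ranks r_0 = r_d = 1, back into the full tensor."""
    out = cores[0]                        # shape (1, n_1, r_1)
    for core in cores[1:]:
        # contract the trailing rank index with the next core's leading one
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape(out.shape[1:-1])   # drop the two boundary ranks

rng = np.random.default_rng(2)
shapes = [(1, 4, 3), (3, 5, 2), (2, 6, 1)]   # TT ranks (1, 3, 2, 1)
cores = [rng.standard_normal(s) for s in shapes]
T = tt_to_full(cores)
print(T.shape)                            # (4, 5, 6)
```

Here the cores hold 12 + 30 + 12 = 54 parameters versus 120 entries in the full tensor, which is the compression that lets streaming methods carry the latent structure forward cheaply.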
Theory and Algorithms for Reliable Multimodal Data Analysis, Machine Learning, and Signal Processing
Modern engineering systems collect large volumes of data measurements across diverse sensing modalities. These measurements can naturally be arranged in higher-order arrays of scalars, commonly referred to as tensors. Tucker decomposition (TD) is a standard method for tensor analysis with applications in diverse fields of science and engineering. Despite its success, TD is severely sensitive to outliers, i.e., heavily corrupted entries that appear sporadically in modern datasets. We study L1-norm TD (L1-TD), a reformulation of TD that promotes robustness. For 3-way tensors, we show, for the first time, that L1-TD admits an exact solution via combinatorial optimization, and we present algorithms for its solution. We propose two novel algorithmic frameworks for approximating the exact solution to L1-TD for general N-way tensors. We also propose a novel algorithm for dynamic L1-TD, i.e., efficient and joint analysis of streaming tensors. Principal-Component Analysis (PCA), a special case of TD, is also outlier-sensitive. We consider Lp-quasinorm PCA (Lp-PCA) for
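The combinatorial flavor of exact L1-norm analysis can be illustrated with the simplest related special case: exact rank-1 L1-PCA, where the L1 principal component of a D x N matrix is obtained by exhaustive search over binary antipodal sign vectors. This sketch shows the known rank-1 result only, not the dissertation's N-way L1-TD algorithms; the `l1_pca_rank1` name is illustrative.

```python
from itertools import product

import numpy as np

def l1_pca_rank1(X):
    """Exact rank-1 L1-PCA by exhaustive combinatorial search:
    the component is q = X b* / ||X b*||_2, where b* maximizes
    ||X b||_2 over all b in {-1, +1}^N. Cost is O(2^N), so this
    is only feasible for small N."""
    _, N = X.shape
    best, b_best = -1.0, None
    for signs in product([-1.0, 1.0], repeat=N):
        b = np.array(signs)
        val = np.linalg.norm(X @ b)
        if val > best:
            best, b_best = val, b
    q = X @ b_best
    return q / np.linalg.norm(q)

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 8))          # N = 8: only 2^8 candidates
q = l1_pca_rank1(X)
print(q.shape)                           # unit-norm 3-vector
```

The exponential cost of the exact search is exactly why the approximation frameworks and dynamic algorithms described above matter for large and streaming tensors.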
Randomized Algorithms for Computation of Tucker decomposition and Higher Order SVD (HOSVD)
Big data analysis has become a crucial part of new emerging technologies such
as the internet of things, cyber-physical analysis, deep learning, anomaly
detection, etc. Among many other techniques, dimensionality reduction plays a
key role in such analyses and facilitates feature selection and feature
extraction. Randomized algorithms are efficient tools for handling big data
tensors. They accelerate decomposing large-scale data tensors by reducing the
computational complexity of deterministic algorithms and the communication
among different levels of the memory hierarchy, which is the main bottleneck in
modern computing environments and architectures. In this paper, we review
recent advances in randomization for the computation of Tucker decomposition
and Higher Order SVD (HOSVD). We discuss random projection and sampling
approaches, single-pass and multi-pass randomized algorithms, and how to
utilize them in the computation of the Tucker decomposition and the HOSVD.
Simulations on synthetic and real datasets are provided to compare the
performance of some of the best and most promising algorithms.
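The random-projection approach surveyed above can be sketched compactly: for each mode, a Gaussian test matrix compresses the unfolding before the SVD that yields the factor matrix. This is a generic randomized HOSVD sketch under the usual randomized range-finder recipe, not any specific algorithm from the review; `unfold`, `randomized_hosvd`, and the oversampling value are illustrative choices.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: the chosen mode becomes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def randomized_hosvd(T, ranks, oversample=5, seed=0):
    """HOSVD with a randomized range finder per mode: sketch each
    unfolding with a Gaussian test matrix instead of taking a full SVD."""
    rng = np.random.default_rng(seed)
    factors = []
    for mode, r in enumerate(ranks):
        M = unfold(T, mode)
        Y = M @ rng.standard_normal((M.shape[1], r + oversample))
        Q, _ = np.linalg.qr(Y)                    # orthonormal range estimate
        U, _, _ = np.linalg.svd(Q.T @ M, full_matrices=False)
        factors.append(Q @ U[:, :r])
    core = T
    for U in factors:
        # contracting axis 0 each time cycles the modes back into order
        core = np.tensordot(core, U.conj().T, axes=([0], [1]))
    return core, factors

# low-multilinear-rank test tensor: recovery should be near-exact
rng = np.random.default_rng(4)
G = rng.standard_normal((2, 3, 2))
Us = [np.linalg.qr(rng.standard_normal((n, r)))[0]
      for n, r in zip((8, 9, 10), (2, 3, 2))]
T = np.einsum('abc,ia,jb,kc->ijk', G, *Us)
core, F = randomized_hosvd(T, (2, 3, 2))
R = np.einsum('abc,ia,jb,kc->ijk', core, *F)
print(np.linalg.norm(T - R) / np.linalg.norm(T))  # ~0 up to round-off
```

Because the expensive SVD is applied to the small sketched matrix rather than the full unfolding, the dominant cost becomes the matrix product with the test matrix, which is what reduces both complexity and memory traffic.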
Similarity learning in the era of big data
This dissertation studies the problem of similarity learning in the era of big data, with heavy emphasis on real-world applications in social media. As in the saying “birds of a feather flock together,” in similarity learning we aim to identify the notion of being similar in a data-driven and task-specific way, which is a central problem for maximizing the value of big data. Despite many successes of similarity learning over past decades, social media networks, as one of the most typical big data media, contain large-volume, varied, and high-velocity data, which makes conventional learning paradigms and off-the-shelf algorithms insufficient. Thus, we focus on addressing the emerging challenges brought by the inherent “three-Vs” characteristics of big data by answering the following questions: 1) Similarity is characterized by both links and node contents in networks; how can we identify the contribution of each network component to seamlessly construct an application-oriented similarity function? 2) Social media data are massive and contain much noise; how can we efficiently learn the similarity between node pairs in large and noisy environments? 3) Node contents in social media networks are multi-modal; how can we effectively measure cross-modal similarity by bridging the so-called “semantic gap”? 4) User wants and needs, and item characteristics, are continuously evolving, generating data at an unprecedented rate; how can we model temporal dynamics in a principled way and support timely decision making? The goal of this dissertation is to provide solutions to these questions via innovative research and novel methods. We hope this dissertation sheds more light on similarity learning in the big data era and broadens its applications in social media.