Statistical Methods for Clustering and High Dimensional Time Series Analysis

Abstract

Thesis (Ph.D.)--University of Washington, 2022

This dissertation explores two statistical tasks: clustering and the analysis of high-dimensional time series. Clustering, a central unsupervised learning problem, studies the structure of unlabeled datasets. The goal of clustering is to partition the data points into subsets such that points in the same subset are similar to one another and dissimilar to points in other subsets. Mode-clustering is a cluster analysis method that partitions the data into groups according to the local modes of the underlying density function. Finding clusters, however, is not always the ultimate goal; the connectivity among clusters may itself yield valuable scientific information. This dissertation presents a new clustering method, inspired by mode-clustering, that not only finds clusters but also assigns each cluster an attribute label. Clusters obtained from our method reveal the connectivity of the underlying distribution. We also design a local two-sample test based on the clustering result that has more power than a conventional method. We apply our method to astronomy and GvHD data and show that it finds meaningful clusters. In addition, we derive the statistical and computational theory of our method.

Motivated by the challenges of modeling time series data that exhibit non-linear patterns, especially in high dimensions, this dissertation also considers the threshold auto-regressive (TAR) process. The TAR process provides a family of non-linear auto-regressive time series models in which the process dynamics are step functions of a thresholding variable. While estimation and inference for low-dimensional TAR models have been investigated, high-dimensional TAR models have received less attention. In this dissertation, we develop a new framework for estimating high-dimensional TAR models and propose two different sparsity-inducing penalties. The first penalty corresponds to a natural extension of the classical TAR model to high-dimensional settings, where the same threshold is enforced for all model parameters. The second penalty yields a more flexible TAR model, in which different thresholds are allowed for different auto-regressive coefficients. We show that both penalized estimation strategies can be used in a three-step procedure that consistently learns both the thresholds and the corresponding auto-regressive coefficients. However, our theoretical and empirical investigations show that the direct extension of the TAR model is not appropriate for high-dimensional settings and is better suited to moderate dimensions. In contrast, the more flexible extension leads to consistent estimation and superior empirical performance in high dimensions. In addition to the three-step procedure, a dynamic programming approach can handle high dimensions with a diverging number of thresholds; extensive numerical analysis and theoretical results demonstrate its advantages. Finally, we also discuss a method for automatically selecting the optimal thresholding variable.
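For context, a standard two-regime TAR model of order p can be written as below. This is a generic textbook formulation given only as a sketch; the threshold r, delay d, and thresholding variable z_{t-d} are illustrative symbols rather than the dissertation's own notation.

\[
  y_t = \Bigl(\alpha_0 + \sum_{i=1}^{p} \alpha_i\, y_{t-i}\Bigr)\,\mathbf{1}\{z_{t-d} \le r\}
      + \Bigl(\beta_0 + \sum_{i=1}^{p} \beta_i\, y_{t-i}\Bigr)\,\mathbf{1}\{z_{t-d} > r\}
      + \varepsilon_t .
\]

The high-dimensional extensions summarized in the abstract replace these scalar coefficients with sparse vectors of auto-regressive parameters; the first penalty keeps a single shared threshold r, while the more flexible variant allows each coefficient to switch at its own threshold.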
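Similarly, as a loose illustration of the mode-clustering idea from the first part of the abstract, the sketch below assigns each point to the local mode of a Gaussian kernel density estimate via mean-shift ascent. It is a minimal generic implementation, not the dissertation's method; the bandwidth, merge radius, and convergence tolerance are assumptions chosen only for the example.

import numpy as np

def mean_shift_mode_clustering(X, bandwidth=0.5, max_iter=300, tol=1e-5):
    """Assign each point in X (n x d) to a local mode of a Gaussian kernel
    density estimate by repeated mean-shift updates.
    Returns (modes, labels): one representative mode per cluster and an
    integer cluster label per data point."""
    n, d = X.shape
    points = X.copy()
    for _ in range(max_iter):
        shifted = np.empty_like(points)
        for i, x in enumerate(points):
            # Gaussian kernel weights of all data points relative to x
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            shifted[i] = w @ X / w.sum()  # mean-shift update toward higher density
        converged = np.max(np.linalg.norm(shifted - points, axis=1)) < tol
        points = shifted
        if converged:
            break

    # Merge points that converged to (numerically) the same mode.
    modes, labels = [], np.empty(n, dtype=int)
    for i, p in enumerate(points):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < bandwidth / 2:  # heuristic merge radius
                labels[i] = k
                break
        else:
            labels[i] = len(modes)
            modes.append(p)
    return np.array(modes), labels

# Example: two well-separated Gaussian blobs should yield two modes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])
modes, labels = mean_shift_mode_clustering(X, bandwidth=0.6)
print(modes.shape[0], "modes found")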
