14 research outputs found

    Distance-Based Independence Screening for Canonical Analysis

    This paper introduces a new method, Distance-based Independence Screening for Canonical Analysis (DISCA), to reduce the dimensions of two random vectors of arbitrary dimensions. The objective is to identify low-dimensional linear projections of the two random vectors such that any further dimension reduction by linear projection would necessarily destroy some dependence structure -- the removed components are not independent. The essence of DISCA is to use distance correlation to eliminate "redundant" dimensions until no more can be removed. Unlike existing canonical analysis methods, DISCA does not require the reduced subspaces of the two random vectors to have equal dimensions, nor does it impose distributional assumptions on the random vectors. We show that under mild conditions our approach does uncover the lowest-dimensional linear dependency structure between the two random vectors, and that our conditions are weaker than those of some sufficient-linear-subspace-based methods. Numerically, DISCA amounts to solving a non-convex optimization problem. We formulate it as a difference-of-convex (DC) optimization problem, and then adopt the alternating direction method of multipliers (ADMM) on the convex step of the DC algorithm to parallelize and accelerate the computation. Some sufficient-linear-subspace-based methods rely on a potentially compute-intensive bootstrap to determine the dimensions of the reduced subspaces in advance; our method avoids this complexity. In simulations, we present cases that DISCA solves effectively while other methods cannot. In both the simulation studies and real data cases, when other state-of-the-art dimension reduction methods are applicable, we observe that DISCA performs comparably to or better than most of them. Code and an R package can be found on GitHub: https://github.com/ChuanpingYu/DISCA
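    As a reading aid (not part of the paper): the screening criterion DISCA builds on is the sample distance correlation of Székely and Rizzo. Below is a minimal NumPy sketch of that statistic under its standard definition; the function names are ours, and this direct version costs O(m^2) in the sample size m, unlike the accelerated calculations discussed further down this page.

```python
import numpy as np

def _double_centered_dists(x):
    """Pairwise Euclidean distance matrix of x (m, p), double-centered."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation between paired 2-D samples x (m, p), y (m, q)."""
    a = _double_centered_dists(x)
    b = _double_centered_dists(y)
    dcov2_xy = (a * b).mean()      # squared sample distance covariance
    dcov2_xx = (a * a).mean()
    dcov2_yy = (b * b).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    return np.sqrt(dcov2_xy / denom) if denom > 0 else 0.0

# The population distance correlation is zero iff the vectors are independent,
# which is what makes it usable as an independence-screening criterion.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))
y_dep = x[:, :2] ** 2 + 0.1 * rng.normal(size=(300, 2))  # nonlinear dependence
y_ind = rng.normal(size=(300, 2))                        # independent of x
print(distance_correlation(x, y_dep))  # noticeably > 0
print(distance_correlation(x, y_ind))  # near 0
```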

    An Overview of Computational Approaches for Interpretation Analysis

    It is said that beauty is in the eye of the beholder. But how exactly can we characterize such discrepancies in interpretation? For example, are there specific features of an image that make person A regard it as beautiful while person B finds the same image displeasing? Such questions ultimately aim at explaining our individual ways of interpretation, an intention that has been of fundamental importance to the social sciences from the beginning. More recently, advances in computer science have raised two related questions: First, can computational tools be adopted for analyzing ways of interpretation? Second, what if the "beholder" is a computer model, i.e., how can we explain a computer model's point of view? Numerous efforts have been made on both of these points, but many existing approaches focus on particular aspects and remain rather separate. With this paper, in order to connect these approaches, we introduce a theoretical framework for analyzing interpretation that is applicable to the interpretation of both human beings and computer models. We give an overview of relevant computational approaches from various fields, and discuss the most common and promising application areas. The focus of this paper lies on the interpretation of text and image data, although many of the presented approaches are applicable to other types of data as well.
    Comment: Preprint submitted to Digital Signal Processing

    Statistics, Computation & Applications

    When statistics meets real applications, the computational aspect of statistical methods becomes critical. In this dissertation, I try to improve the computational efficiency of several statistical methods so that they become both computationally and statistically optimal. Inspired by the recent development of distance-based methods in statistics, I first propose a novel distance-based canonical analysis method. Second, an efficient algorithm for calculating distance-based statistics is studied. Moreover, a new semidefinite programming algorithm is developed for applications in power flow analysis problems; it appears to be more robust than existing methods. I give more details in the following.

    In the first part of this dissertation, we introduce a novel dimension reduction method called distance-based independence screening for canonical analysis (DISCA), which can be used to reduce the dimensions of two random vectors of arbitrary dimensions. The essence of our method -- DISCA -- is to use the distance-based independence measure, distance correlation, proposed by Székely and Rizzo in 2007, to eliminate the "redundant" dimensions until no more can be removed. Numerically, DISCA amounts to solving a non-convex optimization problem. Algorithms and theoretical justifications are provided, and comparisons with other existing methods demonstrate its accuracy, universality, and effectiveness. An R package, DISCA, can be found on GitHub.

    Noticing that the distance correlation used in DISCA becomes computationally expensive as the dimension and sample size grow, in the second part of this dissertation we accelerate the calculation of distance-based statistics by projecting multidimensional variables onto pre-specified projection directions, improving the computational complexity from O(m^2) to O(nm log(m)), where n is the number of projection directions and m is the sample size. Computational savings are achieved when n â‰Ș m/log(m). The optimal pre-specified projection directions can be obtained by minimizing the worst-case difference between the true distance and the approximated distance. We provide solutions and greedy algorithms for different scenarios, and confirm the advantage of our technique in comparison with the pure Monte Carlo approach, in which the directions are randomly selected rather than pre-calculated.

    In the third part of this dissertation, we turn our focus to applications of statistical computational algorithms in the power systems area. A new semidefinite programming algorithm is proposed to solve the power flow and power system state estimation problems. Both kinds of problems are non-convex, and convex relaxation is the typical approach to non-convexity in the power systems area, where the objective functions must be carefully designed to preserve equivalence before and after relaxation. We first reformulate the two types of complex-valued problems as non-convex real-valued ones. We show that an alternating semidefinite programming algorithm can be applied and is not sensitive to the starting point, without sacrificing accuracy. Furthermore, it performs well even when the voltage angles are not close to zero. Convergence analysis is provided, and numerical studies on representative power systems datasets demonstrate the accuracy of our proposed algorithm and its applicability to various scenarios with different given measurements.
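    To make the second part's complexity claim concrete, here is a hedged sketch (ours, not code from the dissertation) of the underlying mechanics: once the data are projected onto a direction u they are one-dimensional, and the row sums of pairwise distances can be computed in O(m log m) by sorting and prefix sums; averaging over n directions, scaled by the known spherical constant c_p, approximates the Euclidean row sums in O(nm log m) overall. Random directions are used below purely for illustration, whereas the dissertation instead optimizes pre-specified directions.

```python
import math
import numpy as np

def rowsums_abs_diffs_1d(x):
    """s[i] = sum_j |x[i] - x[j]| for 1-D x, in O(m log m) instead of O(m^2)."""
    m = x.shape[0]
    order = np.argsort(x)
    xs = x[order]
    csum = np.cumsum(xs)
    k = np.arange(m)
    left = k * xs - np.concatenate(([0.0], csum[:-1]))  # sum_{j<k} (xs[k] - xs[j])
    right = (csum[-1] - csum) - (m - 1 - k) * xs        # sum_{j>k} (xs[j] - xs[k])
    s = np.empty(m)
    s[order] = left + right
    return s

def approx_euclidean_rowsums(X, n_dirs=64, seed=0):
    """Approximate s[i] = sum_j ||X[i] - X[j]|| by averaging projected 1-D
    row sums over n_dirs directions (random here; pre-optimized in the thesis)."""
    m, p = X.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # For u uniform on the unit sphere in R^p, E|u . d| = c_p * ||d||.
    c_p = math.gamma(p / 2) / (math.sqrt(math.pi) * math.gamma((p + 1) / 2))
    acc = np.zeros(m)
    for u in U:                         # n_dirs * O(m log m) total
        acc += rowsums_abs_diffs_1d(X @ u)
    return acc / (n_dirs * c_p)

X = np.random.default_rng(1).normal(size=(1000, 5))
exact = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).sum(axis=1)  # O(m^2)
rel_err = np.abs(approx_euclidean_rowsums(X) - exact) / exact
print(rel_err.max())  # modest relative error, shrinking as n_dirs grows
```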

    Modeling of mutual dependencies

    Data analysis means applying computational models to the analysis of large collections of data, such as video signals, text collections, or measurements of gene activities in human cells. Unsupervised or exploratory data analysis refers to a subtask of data analysis in which the goal is to find novel knowledge based on the data alone. A central challenge in unsupervised data analysis is separating relevant information from irrelevant information. In this thesis, novel solutions for focusing on the more relevant findings are presented.

    Measurement noise is one source of irrelevant information. If we have several measurements of the same objects, the noise can be suppressed by averaging over the measurements. Simple averaging is, however, only possible when the measurements share a common representation. In this thesis, we show how irrelevant information can be suppressed or ignored also in cases where the measurements come from different kinds of sensors or sources, such as video and audio recordings of the same scene. For combining the measurements, we use the mutual dependencies between them. Measures of dependency, such as mutual information, characterize commonalities between two sets of measurements. Two measurements can hence be combined to reduce irrelevant variation by finding new representations for the objects such that the representations are maximally dependent. The combination is optimal under the assumption that what is in common between the measurements is more relevant than information specific to any one of the sources.

    Several practical models for the task are introduced. In particular, novel Bayesian generative models, including a Bayesian version of the classical method of canonical correlation analysis, are given. Bayesian modeling is an especially well-justified approach to learning from small data sets. Hence, generative models can be used to extract dependencies more reliably in, for example, medical applications, where obtaining a large number of samples is difficult. Novel non-Bayesian models are also presented: dependent component analysis finds linear projections which capture more general dependencies than earlier methods. Mutual dependencies can also be used for supervising traditional unsupervised learning methods. The learning metrics principle describes how a new distance metric focusing on relevant information can be derived from the dependency between the measurements and a supervising signal. In this thesis, the approximations and optimization methods required for using the learning metrics principle are improved.
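    For readers unfamiliar with the classical canonical correlation analysis (CCA) that this thesis extends into a Bayesian generative model, the following minimal NumPy sketch (ours, not the thesis's Bayesian version) finds linear projections of two views that are maximally correlated, via whitening and an SVD of the cross-covariance.

```python
import numpy as np

def cca(X, Y, k=1):
    """Classical CCA: directions wx, wy maximizing corr(X @ wx, Y @ wy).
    Returns the top-k canonical correlations and the paired directions."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    m = X.shape[0]
    Cxx = Xc.T @ Xc / (m - 1)
    Cyy = Yc.T @ Yc / (m - 1)
    Cxy = Xc.T @ Yc / (m - 1)
    Lx = np.linalg.cholesky(Cxx)          # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    # Canonical correlations are the singular values of Lx^{-1} Cxy Ly^{-T}.
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T)
    wx = np.linalg.solve(Lx.T, U[:, :k])  # back-transform to the original spaces
    wy = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return s[:k], wx, wy

# Two "views" sharing a latent signal z: CCA should recover a correlation
# close to the true shared-signal correlation, ignoring the noise-only columns.
rng = np.random.default_rng(2)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.3 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([z + 0.3 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
rho, wx, wy = cca(X, Y, k=1)
print(rho)  # close to 1 / (1 + 0.3**2) ~= 0.92
```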