
    PROBABILISTIC AND GEOMETRIC APPROACHES TO THE ANALYSIS OF NON-STANDARD DATA

    This dissertation explores topics in machine learning, network analysis, and the foundations of statistics using tools from geometry, probability and optimization. The rise of machine learning has brought powerful new (and old) algorithms for data analysis. Much of classical statistics research is about understanding how statistical algorithms behave depending on various aspects of the data. The first part of this dissertation examines the support vector machine classifier (SVM). Leveraging the Karush-Kuhn-Tucker conditions, we find surprising connections between SVM and several other simple classifiers. We use these connections to explain SVM’s behavior in a variety of data scenarios and demonstrate how these insights are directly relevant to the data analyst. The next part of this dissertation studies networks that evolve over time. We first develop a method to empirically evaluate vertex centrality metrics in an evolving network. We then apply this methodology to investigate the role of precedent in the US legal system. Next, we shift to a probabilistic perspective on temporally evolving networks. We study a general probabilistic model of an evolving network that undergoes an abrupt change in its evolution dynamics. In particular, we examine the effect of such a change on the network’s structural properties. We develop mathematical techniques using continuous-time branching processes to derive quantitative error bounds for functionals of a major class of these models about their large network limits. Using these results, we develop general theory to understand the role of abrupt changes in the evolution dynamics of these models. Based on this theory, we derive a consistent, non-parametric change point detection estimator. We conclude with a discussion of foundational topics in statistics, commenting on debates both old and new.
    First, we examine the false confidence theorem, which raises questions for data practitioners making inferences based on epistemic uncertainty measures such as Bayesian posterior distributions. Second, we give an overview of the rise of “data science” and what it means for statistics (and vice versa), touching on topics such as reproducibility, computation, education, communication and statistical theory.
    Doctor of Philosophy

    Joint and individual analysis of breast cancer histologic images and genomic covariates

    A key challenge in modern data analysis is understanding connections between complex and differing modalities of data. For example, two of the main approaches to the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genetics. While histopathology is the gold standard for diagnostics and there have been many recent breakthroughs in genetics, there is little overlap between these two fields. We aim to bridge this gap by developing methods based on Angle-based Joint and Individual Variation Explained (AJIVE) to directly explore similarities and differences between these two modalities. Our approach exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction to address some of the challenges presented by statistical analysis of histopathology image data. CNNs raise issues of interpretability that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g., PCA or AJIVE) applied to CNN features. Our results provide many interpretable connections and contrasts between histopathology and genetics.