
    cglasso: An R Package for Conditional Graphical Lasso Inference with Censored and Missing Values

    Sparse graphical models have revolutionized multivariate inference. With the advent of high-dimensional multivariate data in many applied fields, these methods are able to detect a much lower-dimensional structure, often represented via a sparse conditional independence graph. There have been numerous extensions of such methods in the past decade. Many practical applications have additional covariates or suffer from missing or censored data. Despite the development of these extensions of sparse inference methods for graphical models, there have so far been no implementations of, e.g., conditional graphical models. Here we present the general-purpose package cglasso for estimating sparse conditional Gaussian graphical models with potentially missing or censored data. The method employs an efficient expectation-maximization estimation of an ℓ1-penalized likelihood via a block-coordinate descent algorithm. The package has a user-friendly data manipulation interface. It estimates a solution path and includes various automatic selection algorithms for the two ℓ1 tuning parameters, associated with the sparse precision matrix and the sparse regression coefficients, respectively. The package pays particular attention to the visualization of the results, both by means of marginal tables and figures and through the inferred conditional independence graphs. This package provides a unique and computationally efficient implementation of a conditional Gaussian graphical model that is able to deal with the additional complications of missing and censored data. As such, it constitutes an important contribution for empirical scientists wishing to detect sparse structures in high-dimensional data.
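
    As a point of reference, a schematic form of the doubly penalized objective behind such models (notation chosen here for illustration, not taken from the package documentation) is:

```latex
% Conditional Gaussian graphical model: Y | X = x ~ N(B' x, Theta^{-1}),
% with l1 penalties on both the regression coefficients and the precision matrix.
\max_{B,\; \Theta \succ 0} \;\; \ell(B, \Theta)
  \;-\; \lambda \sum_{h,k} \lvert \beta_{hk} \rvert
  \;-\; \rho \sum_{i \neq j} \lvert \theta_{ij} \rvert
```

    Here \ell is the observed-data Gaussian log-likelihood, \lambda controls the sparsity of the regression coefficients, and \rho controls the sparsity of the off-diagonal entries of the precision matrix; with missing or censored observations, \ell is replaced by its conditional expectation inside each EM iteration.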

    High-dimensional variable selection for GLMs and survival models


    Interpretable machine learning models for predicting with missing values

    Machine learning models are often used in situations where model inputs are missing, either during training or at the time of prediction. If missing values are not handled appropriately, they can lead to increased bias or to models that are not applicable in practice without imputing the values of the unobserved variables. However, the imputation of missing values is often inadequate and difficult to interpret for complex imputation functions. In this thesis, we focus on predictions in the presence of incomplete data at test time, using interpretable models that allow humans to understand the predictions. Interpretability is especially necessary when important decisions are at stake, such as in healthcare. First, we investigate the situation where variables are missing in recurrent patterns and sample sizes per pattern are small. We propose SPSM, which allows coefficient sharing between a main model and pattern submodels in order to make efficient use of data and to be independent of imputation. To enable interpretability, the model can be expressed as a short description induced by sparsity. Then, we explore situations where missingness does not occur in patterns and suggest the sparse linear rule model MINTY, which naturally trades off interpretability against goodness of fit while being sensitive to missing values at test time. To this end, we learn replacement variables, indicating which features in a rule can be used as alternatives when the original feature was not measured, assuming some redundancy in the covariates. Our results show that the proposed interpretable models can be used for prediction with missing values without depending on imputation. We conclude that more work can be done in evaluating interpretable machine learning models in the context of missing values at test time.
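
    To illustrate the general pattern-submodel idea in code (a minimal sketch of the concept only, not the SPSM or MINTY implementations; all names and behavior below are assumptions), one can fit a separate linear model per missingness pattern and route each test sample to the submodel matching its pattern:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_pattern_submodels(X, y):
    """Fit one ridge submodel per missingness pattern (illustrative sketch).

    X is an (n, p) array that may contain NaNs; the pattern of a row is the
    tuple of observed column indices.  Unlike SPSM, there is no coefficient
    sharing with a main model here.
    """
    observed = ~np.isnan(X)
    models = {}
    for pattern in {tuple(np.flatnonzero(row)) for row in observed}:
        mask = np.isin(np.arange(X.shape[1]), pattern)
        in_pattern = np.all(observed == mask, axis=1)   # rows with exactly this pattern
        cols = list(pattern)
        models[pattern] = Ridge().fit(X[in_pattern][:, cols], y[in_pattern])
    return models

def predict_with_pattern(models, x_new):
    """Predict for a (possibly incomplete) sample using the submodel of its pattern."""
    pattern = tuple(np.flatnonzero(~np.isnan(x_new)))
    return models[pattern].predict(x_new[list(pattern)].reshape(1, -1))[0]
```

    For patterns unseen at training time this sketch simply fails, which is precisely where sharing coefficients with a main model becomes useful.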

    Bayesian Learning in the Counterfactual World

    Recent years have witnessed a surging interest in the use of machine learning tools for causal inference. In contrast to the usual large-data settings where the primary goal is prediction, many disciplines, such as the health, economic, and social sciences, are instead interested in causal questions. Learning individualized responses to an intervention is a crucial task in many applied fields (e.g., precision medicine, targeted advertising, precision agriculture, etc.) where the ultimate goal is to design optimal and highly personalized policies based on individual features. In this work, I thus tackle the problem of estimating causal effects of an intervention that are heterogeneous across a population of interest and depend on an individual set of characteristics (e.g., a patient's clinical record, a user's browsing history, etc.) in high-dimensional observational data settings. This is done by utilizing Bayesian Nonparametric or Probabilistic Machine Learning tools that are specifically adjusted for the causal setting and have desirable uncertainty quantification properties, with a focus on the issues of interpretability/explainability and the inclusion of domain experts' prior knowledge. I begin by introducing terminology and concepts from causality and causal reasoning in the first chapter. I then include a literature review of some of the state-of-the-art regression-based methods for heterogeneous treatment effect estimation, with an attempt to build a unifying taxonomy and lay down the finite-sample empirical properties of these models. The chapters forming the core of the dissertation then present novel methods addressing existing issues in individualized causal effect estimation: Chapter 3 develops both a Bayesian tree ensemble method and a deep learning architecture to tackle interpretability, uncertainty coverage, and targeted regularization; Chapter 4 introduces a novel multi-task Deep Kernel Learning method particularly suited for multi-outcome, multi-action scenarios. The last chapter concludes with a discussion.
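
    The estimand at the heart of this line of work, the conditional average treatment effect (CATE), can be written in standard potential-outcome notation (the notation is generic, not specific to the dissertation):

```latex
% CATE at covariate value x, with potential outcomes Y(1) (treated) and Y(0) (untreated):
\tau(x) \;=\; \mathbb{E}\left[\, Y(1) - Y(0) \,\middle|\, X = x \,\right]
```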

    Quantile regression methods in economics and finance

    In recent years, quantile regression methods have attracted considerable interest in the statistical and econometric literature. This is due to the advantages of the quantile regression approach, mainly the robustness of the results and the possibility of analysing several quantiles of a given random variable. Such features are particularly appealing in the context of economic and financial data, where extreme events assume critical importance. The present thesis is based on quantile regression, with a focus on the economic and financial environment. First, we propose new approaches for developing asset allocation strategies on the basis of quantile regression and regularization techniques. It is well known that the quantile regression model minimizes the portfolio's extreme risk whenever attention is placed on estimating the left quantiles of the response variable. We show that, by considering the entire conditional distribution of the dependent variable, it is possible to optimize different risk and performance indicators. In particular, we introduce a risk-adjusted profitability measure, useful in evaluating financial portfolios from a pessimistic perspective, since the reward contribution is net of the most favorable outcomes. Moreover, as we consider large portfolios, we also cope with the dimensionality issue by introducing an ℓ1-norm penalty on the asset weights. Secondly, we focus on the determinants of equity risk and their forecasting implications. Several market and macro-level variables influence the evolution of equity risk in addition to the well-known volatility persistence. However, the impact of those covariates might change depending on the risk level, differing between low and high volatility states. By combining equity risk estimates, obtained from the Realized Range Volatility corrected for microstructure noise and jumps, with quantile regression methods, we evaluate, from a forecasting perspective, the impact of the equity risk determinants in different volatility states and, without distributional assumptions on the realized range innovations, we recover both point and conditional distribution forecasts. In addition, we analyse how the relationships among the involved variables evolve over time through a rolling window procedure. The results show that the selected variables have relevant impacts and, particularly during periods of market stress, highlight heterogeneous effects across quantiles. Finally, we study the dynamic impact of uncertainty in causing and forecasting the distribution of oil returns and risk. We analyse the relevance of recently developed news-based measures of economic policy uncertainty and equity market uncertainty in causing and predicting the conditional quantiles and distribution of crude oil variations, defined both as returns and squared returns. For this purpose, on the one hand, we study the causality relations in quantiles through a non-parametric testing method; on the other hand, we forecast the conditional distribution on the basis of the quantile regression approach, and the predictive accuracy is evaluated by means of several suitable tests. Given the presence of structural breaks over time, we implement a rolling window procedure to capture the dynamic relations among the variables.
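
    As a generic reference for the approach sketched above (not the thesis's exact formulation), the ℓ1-penalized quantile regression problem for quantile level τ combines the pinball (check) loss with a lasso-type penalty on the coefficients:

```latex
% Penalized quantile regression: pinball loss plus an l1 penalty on the weights.
\min_{\beta} \;\; \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right)
  \;+\; \lambda \lVert \beta \rVert_1,
\qquad
\rho_\tau(u) \;=\; u\left(\tau - \mathbf{1}\{u < 0\}\right)
```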

    A Study on Shape Clustering and Odor Prediction

    This thesis is divided into two main parts. In the first part, we develop a new methodology and the necessary computational tools for shape clustering. In the second part, we tackle the challenging problem of odor prediction in electronic nose (e-nose) technology. Chapters 1 and 2 describe our proposed methodology for shape clustering. In Chapter 3, we present a new approach for odor prediction quality. The following is a brief overview of the two problems, i.e., shape clustering and odor prediction, and our proposed solutions. 1- Shape Clustering: Shapes can be interpreted as closed contours in an infinite-dimensional space which can morph into different shapes over time. The main goal in shape modeling is to provide a mathematical model that represents each shape. Statistical shape analysis is a powerful tool for studying the anatomical structures in medical images. In this thesis, motivated by biological applications, we suggest a methodology for surface modeling of cells. Furthermore, we propose a novel technique for clustering cell shapes. The methodology can be applied to other geometrical objects as well. Many studies have been conducted to track possible deformations of cells using qualitative descriptions. Our interest is rather in providing an accurate numerical assessment of cells. In Chapter 1, statistical models using different basis functions are adapted for modeling the surface of cell shapes in both 2D and 3D spaces. To this end, the surface of a cell is first converted to a set of numerical data. Afterwards, a curve is fitted to these data. At this stage, each cell is represented by a continuous function. The fundamental question now is how to distinguish between different cells using their functional forms. In Chapter 2, we formulate a clustering Bayesian information criterion (CLUSBIC) for hierarchical clustering of shapes. In this new approach, we treat shapes as continuous curves and we compute the marginal probability associated with each curve.
    Accordingly, we build the dendrogram for hierarchical clustering employing CLUSBIC. The dendrogram is cut when the marginal probability reaches its maximum. We show in Chapter 2 that CLUSBIC is a natural extension of Ward's linkage, a well-known clustering measure. Similar to the Bayesian information criterion (BIC) in the regression setting, we demonstrate the consistency of CLUSBIC in clustering. CLUSBIC is an extension of BIC which coincides with BIC if the data fall into a single cluster. The usefulness of our proposed methodology in modeling and clustering shapes is examined on simulated and real data. 2- Odor Prediction: An e-nose, or artificial olfaction device, analyzes the air to identify odors using an array of gas sensors. The e-nose produces multi-dimensional data for each measurement that it takes from the surrounding environment. A small sub-sample of these measurements is sent to olfactometry, where they are analyzed for odor activities. In olfactometry, for instance, each e-nose measurement is assigned an odor concentration value which describes the odor identifiability by humans. The process of transferring the measurements to olfactometry and analyzing their odor concentration is time-consuming and costly. For this reason, pattern recognition methods have been applied to e-nose data for automatic prediction of the odor concentration. It is essential to assess the validity of the measurements due to the sensitivity of the e-nose to environmental and physical changes. Imprecise measurements lead to unreliable pattern recognition outcomes. Therefore, continuous monitoring of e-nose samples and taking necessary actions in the presence of anomalies are vital. We devise an improved variant of the existing e-nose which is capable of assessing the validity of samples automatically in an online manner, and of predicting odor using suitable pattern recognition methods.
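
    As a rough illustration of cutting a dendrogram with a model-based criterion (a generic Gaussian BIC over the induced partition, not CLUSBIC itself; every modelling choice below is an assumption made for the sketch):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import multivariate_normal

def bic_of_partition(X, labels):
    """BIC of a partition under independent per-cluster Gaussians with diagonal covariance."""
    n, p = X.shape
    loglik, n_params = 0.0, 0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-6   # jitter avoids zero variance
        loglik += multivariate_normal(mean=mu, cov=np.diag(var)).logpdf(Xc).sum()
        n_params += 2 * p                                   # a mean and a variance per dimension
    return n_params * np.log(n) - 2.0 * loglik              # lower is better

def choose_cut(X, max_clusters=10):
    """Build a Ward dendrogram and cut it at the number of clusters with the lowest BIC."""
    tree = linkage(X, method="ward")
    scores = {k: bic_of_partition(X, fcluster(tree, t=k, criterion="maxclust"))
              for k in range(1, max_clusters + 1)}
    best_k = min(scores, key=scores.get)
    return best_k, fcluster(tree, t=best_k, criterion="maxclust")
```

    CLUSBIC replaces the Gaussian-partition BIC above with the marginal probability of the fitted curves, and the cut is placed where that criterion is best.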

    High-dimensional variable selection for GLMs and survival models

    The focus of this thesis is on statistical and numerical approaches for fitting sparse genomic data with GLM and survival models. The thesis describes the selection of explanatory variables that may influence a univariate outcome, where the outcome has a probability distribution belonging to the class of the exponential dispersion family. The approach investigated is differential-geometric least angle regression (dgLARS), which was developed for generalized linear models. The dgLARS approach is compared with alternative methods for variable selection in generalized linear models. The numerical procedures of dgLARS have been improved for the general setting, and this extension is referred to as the extended dgLARS. In addition, we investigate how well the dispersion parameter of the exponential family of distributions can be estimated. We also focus on survival data and genomic effects, using the relative risk function. All chapters show that the improved and newly developed numerical procedures are fast and accurate in estimating parameters. Finally, a complete description of the R package developed to carry out all the analyses is presented.
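
    dgLARS itself is implemented in R; as a loose Python analogue of tracing a variable-selection path for a GLM (a sketch of the general idea, not the dgLARS algorithm), one can follow the coefficient path of an ℓ1-penalized logistic regression over a grid of penalties:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_path(X, y, n_grid=25):
    """Coefficient path of l1-penalized logistic regression (generic illustration).

    Returns the grid of inverse penalties and an (n_grid, p) array of coefficients;
    the order in which variables become non-zero mimics a selection order.
    """
    Cs = np.logspace(-2, 2, n_grid)
    path = [LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y).coef_.ravel()
            for C in Cs]
    return Cs, np.vstack(path)
```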

    Data Fusion and Systems Engineering Approaches for Quality and Performance Improvement of Health Care Systems: From Diagnosis to Care to System-level Decision-making

    Technology advancements in diagnostic imaging, smart sensing, and health information systems have resulted in a data-rich environment in health care, which offers a great opportunity for Precision Medicine. The objective of my research is to develop data fusion and systems informatics approaches for quality and performance improvement of health care. In my dissertation, I focus on three emerging problems in health care and develop novel statistical models and machine learning algorithms to tackle these problems, from diagnosis to care to system-level decision-making. The first topic is the diagnosis/subtyping of migraine to customize effective treatment for different subtypes of patients. Existing clinical definitions of subtypes use somewhat arbitrary boundaries based primarily on patient self-reported symptoms, which are subjective and error-prone. My research develops a novel Multimodality Factor Mixture Model that discovers subtypes of migraine from multimodality MRI imaging data, which provide complementary, accurate measurements of the disease. Patients in the different subtypes show significantly different clinical characteristics of the disease. Treatment tailored and optimized for patients of the same subtype paves the road toward Precision Medicine. The second topic focuses on coordinated patient care. Care coordination between nurses and with other health care team members is important for providing high-quality and efficient care to patients. The recently developed Nurse Care Coordination Instrument (NCCI) is the first of its kind that enables large-scale quantitative data to be collected. My research develops a novel Multi-response Multi-level Model (M3) that enables transfer learning in NCCI data fusion. M3 identifies key factors that contribute to improving care coordination, and facilitates the design and optimization of nurses' training, workload assignment, and practice environment, which leads to improved patient outcomes. The last topic is about system-level decision-making for early detection of Alzheimer's disease (AD) at the early stage of Mild Cognitive Impairment (MCI), by predicting each MCI patient's risk of converting to AD using imaging and proteomic biomarkers. My research proposes a systems engineering approach that integrates multiple perspectives, including prediction accuracy, biomarker cost/availability, patient heterogeneity, and diagnostic efficiency, and allows for system-wide optimized decisions regarding the biomarker testing process for prediction of MCI conversion.
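
    As a purely illustrative sketch of the last task, predicting an MCI patient's risk of converting to AD from biomarkers (the features, data, and model below are hypothetical and not taken from the dissertation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for imaging and proteomic biomarkers of n MCI patients.
X = rng.normal(size=(n, 4))
converted = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(size=n) > 0).astype(int)  # synthetic conversion label

risk_model = make_pipeline(StandardScaler(), LogisticRegression())
auc = cross_val_score(risk_model, X, converted, cv=5, scoring="roc_auc")
print("cross-validated AUC on synthetic data:", auc.mean())
```

    A system-level formulation, as in the dissertation, would additionally weigh biomarker cost and availability against this predictive accuracy.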