413 research outputs found

    Heterogeneous Sensor Signal Processing for Inference with Nonlinear Dependence

    Get PDF
    Inferring events of interest by fusing data from multiple heterogeneous sources has been an interesting and important topic in recent years. Several issues related to inference using heterogeneous data with complex and nonlinear dependence are investigated in this dissertation. We apply copula theory to characterize the dependence among heterogeneous data. In centralized detection, where sensor observations are available at the fusion center (FC), we study copula-based fusion. We design detection algorithms based on sample-wise copula selection and mixture of copulas model in different scenarios of the true dependence. The proposed approaches are theoretically justified and perform well when applied to fuse acoustic and seismic sensor data for personnel detection. Besides traditional sensors, the access to the massive amount of social media data provides a unique opportunity for extracting information about unfolding events. We further study how sensor networks and social media complement each other in facilitating the data-to-decision making process. We propose a copula-based joint characterization of multiple dependent time series from sensors and social media. As a proof-of-concept, this model is applied to the fusion of Google Trends (GT) data and stock/flu data for prediction, where the stock/flu data serves as a surrogate for sensor data. In energy constrained networks, local observations are compressed before they are transmitted to the FC. In these cases, conditional dependence and heterogeneity complicate the system design particularly. We consider the classification of discrete random signals in Wireless Sensor Networks (WSNs), where, for communication efficiency, only local decisions are transmitted. We derive the necessary conditions for the optimal decision rules at the sensors and the FC by introducing a hidden random variable. An iterative algorithm is designed to search for the optimal decision rules. Its convergence and asymptotical optimality are also proved. The performance of the proposed scheme is illustrated for the distributed Automatic Modulation Classification (AMC) problem. Censoring is another communication efficient strategy, in which sensors transmit only informative observations to the FC, and censor those deemed uninformative . We design the detectors that take into account the spatial dependence among observations. Fusion rules for censored data are proposed with continuous and discrete local messages, respectively. Their computationally efficient counterparts based on the key idea of injecting controlled noise at the FC before fusion are also investigated. In this thesis, with heterogeneous and dependent sensor observations, we consider not only inference in parallel frameworks but also the problem of collaborative inference where collaboration exists among local sensors. Each sensor forms coalition with other sensors and shares information within the coalition, to maximize its inference performance. The collaboration strategy is investigated under a communication constraint. To characterize the influence of inter-sensor dependence on inference performance and thus collaboration strategy, we quantify the gain and loss in forming a coalition by introducing the copula-based definitions of diversity gain and redundancy loss for both estimation and detection problems. A coalition formation game is proposed for the distributed inference problem, through which the information contained in the inter-sensor dependence is fully explored and utilized for improved inference performance

    Analysis and Optimization of Classifier Error Estimator Performance within a Bayesian Modeling Framework

    Get PDF
    With the advent of high-throughput genomic and proteomic technologies, in conjunction with the difficulty in obtaining even moderately sized samples, small-sample classifier design has become a major issue in the biological and medical communities. Training-data error estimation becomes mandatory, yet none of the popular error estimation techniques have been rigorously designed via statistical inference or optimization. In this investigation, we place classifier error estimation in a framework of minimum mean-square error (MMSE) signal estimation in the presence of uncertainty, where uncertainty is relative to a prior over a family of distributions. This results in a Bayesian approach to error estimation that is optimal and unbiased relative to the model. The prior addresses a trade-off between estimator robustness (modeling assumptions) and accuracy. Closed-form representations for Bayesian error estimators are provided for two important models: discrete classification with Dirichlet priors (the discrete model) and linear classification of Gaussian distributions with fixed, scaled identity or arbitrary covariances and conjugate priors (the Gaussian model). We examine robustness to false modeling assumptions and demonstrate that Bayesian error estimators perform especially well for moderate true errors. The Bayesian modeling framework facilitates both optimization and analysis. It naturally gives rise to a practical expected measure of performance for arbitrary error estimators: the sample-conditioned mean-square error (MSE). Closed-form expressions are provided for both Bayesian models. We examine the consistency of Bayesian error estimation and illustrate a salient application in censored sampling, where sample points are collected one at a time until the conditional MSE reaches a stopping criterion. We address practical considerations for gene-expression microarray data, including the suitability of the Gaussian model, a methodology for calibrating normal-inverse-Wishart priors from unused data, and an approximation method for non-linear classification. We observe superior performance on synthetic high-dimensional data and real data, especially for moderate to high expected true errors and small feature sizes. Finally, arbitrary error estimators may be optimally calibrated assuming a fixed Bayesian model, sample size, classification rule, and error estimation rule. Using a calibration function mapping error estimates to their optimally calibrated values off-line, error estimates may be calibrated on the fly whenever the assumptions apply

    Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates

    Full text link
    We establish the first nonasymptotic error bounds for Kaplan-Meier-based nearest neighbor and kernel survival probability estimators where feature vectors reside in metric spaces. Our bounds imply rates of strong consistency for these nonparametric estimators and, up to a log factor, match an existing lower bound for conditional CDF estimation. Our proof strategy also yields nonasymptotic guarantees for nearest neighbor and kernel variants of the Nelson-Aalen cumulative hazards estimator. We experimentally compare these methods on four datasets. We find that for the kernel survival estimator, a good choice of kernel is one learned using random survival forests.Comment: International Conference on Machine Learning (ICML 2019

    Vol. 13, No. 2 (Full Issue)

    Get PDF

    Data-Driven Modeling For Decision Support Systems And Treatment Management In Personalized Healthcare

    Get PDF
    Massive amount of electronic medical records (EMRs) accumulating from patients and populations motivates clinicians and data scientists to collaborate for the advanced analytics to create knowledge that is essential to address the extensive personalized insights needed for patients, clinicians, providers, scientists, and health policy makers. Learning from large and complicated data is using extensively in marketing and commercial enterprises to generate personalized recommendations. Recently the medical research community focuses to take the benefits of big data analytic approaches and moves to personalized (precision) medicine. So, it is a significant period in healthcare and medicine for transferring to a new paradigm. There is a noticeable opportunity to implement a learning health care system and data-driven healthcare to make better medical decisions, better personalized predictions; and more precise discovering of risk factors and their interactions. In this research we focus on data-driven approaches for personalized medicine. We propose a research framework which emphasizes on three main phases: 1) Predictive modeling, 2) Patient subgroup analysis and 3) Treatment recommendation. Our goal is to develop novel methods for each phase and apply them in real-world applications. In the fist phase, we develop a new predictive approach based on feature representation using deep feature learning and word embedding techniques. Our method uses different deep architectures (Stacked autoencoders, Deep belief network and Variational autoencoders) for feature representation in higher-level abstractions to obtain effective and more robust features from EMRs, and then build prediction models on the top of them. Our approach is particularly useful when the unlabeled data is abundant whereas labeled one is scarce. We investigate the performance of representation learning through a supervised approach. We perform our method on different small and large datasets. Finally we provide a comparative study and show that our predictive approach leads to better results in comparison with others. In the second phase, we propose a novel patient subgroup detection method, called Supervised Biclustring (SUBIC) using convex optimization and apply our approach to detect patient subgroups and prioritize risk factors for hypertension (HTN) in a vulnerable demographic subgroup (African-American). Our approach not only finds patient subgroups with guidance of a clinically relevant target variable but also identifies and prioritizes risk factors by pursuing sparsity of the input variables and encouraging similarity among the input variables and between the input and target variables. Finally, in the third phase, we introduce a new survival analysis framework using deep learning and active learning with a novel sampling strategy. First, our approach provides better representation with lower dimensions from clinical features using labeled (time-to-event) and unlabeled (censored) instances and then actively trains the survival model by labeling the censored data using an oracle. As a clinical assistive tool, we propose a simple yet effective treatment recommendation approach based on our survival model. In the experimental study, we apply our approach on SEER-Medicare data related to prostate cancer among African-Americans and white patients. The results indicate that our approach outperforms significantly than baseline models

    Variable Selection for Nonparametric Gaussian Process Priors: Models and Computational Strategies

    Full text link
    This paper presents a unified treatment of Gaussian process models that extends to data from the exponential dispersion family and to survival data. Our specific interest is in the analysis of data sets with predictors that have an a priori unknown form of possibly nonlinear associations to the response. The modeling approach we describe incorporates Gaussian processes in a generalized linear model framework to obtain a class of nonparametric regression models where the covariance matrix depends on the predictors. We consider, in particular, continuous, categorical and count responses. We also look into models that account for survival outcomes. We explore alternative covariance formulations for the Gaussian process prior and demonstrate the flexibility of the construction. Next, we focus on the important problem of selecting variables from the set of possible predictors and describe a general framework that employs mixture priors. We compare alternative MCMC strategies for posterior inference and achieve a computationally efficient and practical approach. We demonstrate performances on simulated and benchmark data sets.Comment: Published in at http://dx.doi.org/10.1214/11-STS354 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org
    corecore