
    New Statistical Transfer Learning Models for Health Care Applications

    Get PDF
    abstract: Transfer learning is a sub-field of statistical modeling and machine learning. It refers to methods that integrate the knowledge of other domains (called source domains) with the data of the target domain in a mathematically rigorous and intelligent way, in order to develop a better model for the target domain than a model built from the target-domain data alone. While transfer learning is a promising approach in various application domains, my dissertation research focuses on applications in health care, including telemonitoring of Parkinson’s Disease (PD) and radiomics for glioblastoma. The first topic is a Mixed Effects Transfer Learning (METL) model that can flexibly incorporate mixed effects and a general-form covariance matrix to better account for similarity and heterogeneity across subjects. I further develop computationally efficient procedures to handle unknown parameters and large covariance structures. Domain relations, such as domain similarity and domain covariance structure, are automatically quantified in the estimation steps. I demonstrate METL in an application of smartphone-based telemonitoring of PD. The second topic focuses on an MRI-based transfer learning algorithm for non-invasive surgical guidance of glioblastoma patients. Limited biopsy samples per patient make it challenging to build a patient-specific model for glioblastoma; a transfer learning framework helps leverage other patients’ knowledge to build a better predictive model. When modeling a target patient, not every other patient’s information is helpful, so deciding the subset of patients from which to transfer information is an important task in building an accurate predictive model. I define the subset of “transferable” patients as those who have a positive rCBV-cell density correlation, because a positive correlation is supported by imaging theory and the respective literature. The last topic is a Privacy-Preserving Positive Transfer Learning (P3TL) model. Although negative transfer has been recognized as an important issue by the transfer learning research community, there is a lack of theoretical studies evaluating the risk of negative transfer for a transfer learning method and identifying its causes. My work addresses this issue. Driven by the theoretical insights, I extend Bayesian Parameter Transfer (BPT) to a new method, P3TL. The unique features of P3TL include intelligent selection of patients to transfer from, in order to avoid negative transfer, and maintenance of patient privacy. These features make P3TL an excellent model for telemonitoring of PD using an At-Home Testing Device. Dissertation/Thesis. Doctoral Dissertation, Industrial Engineering, 201
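
    A minimal sketch of the "transferable patient" selection rule described in the second topic: keep only those source patients whose rCBV-cell density correlation is positive. The variable names and the significance threshold are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np
from scipy.stats import pearsonr

def select_transferable_patients(patients, alpha=0.05):
    """Return IDs of patients with a significantly positive
    rCBV-cell density correlation.

    `patients` maps a patient ID to a dict with equal-length arrays
    'rcbv' (relative cerebral blood volume per biopsy site) and
    'density' (measured cell density per biopsy site).
    """
    transferable = []
    for pid, data in patients.items():
        r, p = pearsonr(data["rcbv"], data["density"])
        if r > 0 and p < alpha:  # positive and significant correlation
            transferable.append(pid)
    return transferable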

    Some New Results on the Estimation of Sinusoids in Noise

    Get PDF

    Multiresolution image models and estimation techniques

    Get PDF

    Vol. 15, No. 1 (Full Issue)

    Get PDF

    Unsupervised methods for speaker diarization

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-95). Given a stream of unlabeled audio data, speaker diarization is the process of determining "who spoke when." We propose a novel approach to solving this problem by taking advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features and by exploiting the inherent variabilities in the data through the use of unsupervised methods. Upon initial evaluation, our system achieves state-of-the-art results of 0.9% Diarization Error Rate in the diarization of two-speaker telephone conversations. The approach is then generalized to the problem of K-speaker diarization, for which we take measures to address issues of data sparsity and experiment with the use of the von Mises-Fisher distribution for clustering on a unit hypersphere. Our extended system performs competitively on the diarization of conversations involving two or more speakers. Finally, we present promising initial results obtained from applying variational inference on our front-end speaker representation to estimate the unknown number of speakers in a given utterance. by Stephen Shum. S.M
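
    As one concrete illustration of the clustering step mentioned above, the sketch below clusters length-normalized speaker factors on the unit hypersphere by cosine similarity; its mean-direction update corresponds to the maximum-likelihood mean of a von Mises-Fisher component with shared concentration. It is a generic spherical k-means under those assumptions, not the thesis's actual system.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """Cluster row vectors by cosine similarity after projecting them
    onto the unit hypersphere."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize features
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial directions
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)      # assign by cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):
                mu = members.sum(axis=0)
                centers[j] = mu / np.linalg.norm(mu)   # renormalized mean direction
    return labels, centers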

    Methods in machine learning for probabilistic modelling of environment, with applications in meteorology and geology

    Get PDF
    Earth scientists increasingly deal with ‘big data’. Where once we may have struggled to obtain a handful of relevant measurements, we now often have data being collected from multiple sources: on the ground, in the air, and from space. These observations are accumulating at a rate that far outpaces our ability to make sense of them using traditional methods with limited scalability (e.g., mental modelling, or trial-and-error improvement of process-based models). The revolution in machine learning offers a new paradigm for modelling the environment: rather than tweaking every aspect of models developed from the top down, based largely on prior knowledge, we can now set up more abstract machine learning systems that ‘do the tweaking for us’, learning models from the bottom up that are optimal in terms of how well they agree with our (rapidly growing) observations of reality while still being guided by our prior beliefs. In this thesis, with the help of spatial, temporal, and spatio-temporal examples in meteorology and geology, I present methods for probabilistic modelling of environmental variables using machine learning, and explore the considerations involved in developing and adopting these technologies, as well as the potential benefits they stand to bring, which include improved knowledge acquisition and decision-making. In each application, the common theme is that we would like to learn predictive distributions for the variables of interest that are well calibrated and as sharp as possible (i.e., that provide answers as precise as possible while remaining honest about their uncertainty). Achieving this requires the adoption of statistical approaches, but the volume and complexity of the data available mean that scalability is an important factor: we can only realise the value of available data if it can be successfully incorporated into our models. Engineering and Physical Sciences Research Council (EPSRC)
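
    As a minimal illustration of the ‘well-calibrated and sharp’ goal described above, the sketch below fits a Gaussian process regressor to synthetic observations and extracts a predictive mean together with an honest uncertainty band. The kernel choice and the data are illustrative assumptions; the thesis explores a broader range of probabilistic machine learning methods.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))             # e.g., sensor locations
y = np.sin(X).ravel() + rng.normal(0, 0.2, 50)   # noisy observations

# RBF captures smooth spatial structure; WhiteKernel absorbs observation noise.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

X_new = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)        # sharp mean, honest spread
lower, upper = mean - 1.96 * std, mean + 1.96 * std   # ~95% predictive interval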

    Flexible Methods for the Analysis of Clustered Event Data in Observational Studies

    Full text link
    Clustered event data are frequently encountered in observational studies. In this dissertation, I focus on correlated event outcomes clustered by subjects (multivariate events), by facilities, and by both hierarchically. The main approaches to analyzing correlated event data include frailty models with random effects and marginal models with robust variance estimation. Difficulties for existing methods include a) computational demands and speed in the presence of numerous clusters (e.g., recurrent events); b) the lack of rigorous diagnostic tools for prespecifying the distribution of the random effects; and c) analyzing a multi-state model that follows a semi-Markov renewal process. The growing need for flexible, computationally fast, and accurate estimating approaches for clustered event data motivates my methodological exploration in the following chapters. In Chapter II, I propose a log-normal correlated frailty model to analyze recurrent event incidence rates and duration jointly. The regression parameters are estimated through a penalized partial likelihood, and the variance-covariance matrix of the frailty is estimated via a recursive estimating formula. The proposed methods are more flexible and faster than existing approaches and have the potential to be extended to other frequently encountered data structures (e.g., joint modeling with longitudinal outcomes). In Chapter III, I propose a class of semiparametric frailty models that leave the distribution of the frailties unspecified. Parameter estimation proceeds through estimating equations derived from first- and second-moment conditions. Estimation techniques are developed for three different models: a shared frailty model for a single event; a correlated frailty model for multiple events; and a hierarchically structured nested failure time model. Extensive simulation studies demonstrate that the proposed approach can accurately estimate the regression parameters, baseline event rates, and variance components. Moreover, the computation time is fast, permitting application to very large data sets. In Chapter IV, I develop a class of multi-state rate models to study the association of exposure to lead, a major endocrine-disruptive agent, with behavioral changes captured by accelerometer measurements from the wearable device ActiGraph GT3X. Activity states, categorized from personal activity counts over time by validated cutoffs, are defined and analyzed through their in-state transitions using the proposed multi-state rate models, in which the baseline rates are estimated nonparametrically. The proposed models combine the advantages of regular event rate models with the concept of competing risks, allowing a daily renewal property to be incorporated and baselines to be shared across the activity transition rates of different days. The regression parameters are specified in the event rate functions, leading to a semiparametric modeling framework. Statistical inference is based on a robust sandwich variance estimator that accounts for correlations between different event types and their recurrences. I found that the evaluated exposure to lead is associated with an increased transition from low activity to vigorous activity. Chapter V is a special project on modeling COVID-19 surveillance data in China, in which I develop two extended susceptible-infected-recovered (SIR) state-space models under a Bayesian state-space framework. I propose to include a time-varying transmission rate or a time-dependent quarantine process in the classical SIR model to assess the effectiveness of macro-control measures issued by the government to mitigate the pandemic. The proposed compartment models enable prediction of both the short-term and long-term prevalence of COVID-19 infection with quantification of prediction uncertainty. I provide and maintain an open-source R package on GitHub (lilywang1988/eSIR) for the developed analytics. PHD Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163013/1/lilywang_1.pd
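
    A minimal sketch of the extended SIR idea from Chapter V: the classical SIR equations with the transmission rate modulated by a time-varying factor pi(t) standing in for macro-control measures. The eSIR R package implements the full Bayesian state-space version; the exponential-decay form of pi(t) and all rate values here are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.35, 0.15   # baseline transmission and recovery rates (illustrative)

def pi_t(t, t0=20.0, rate=0.05):
    """Transmission modifier: 1 before interventions start at t0,
    then exponential decay as control measures take effect."""
    return 1.0 if t < t0 else np.exp(-rate * (t - t0))

def sir(t, y):
    S, I, R = y                               # population fractions
    new_infections = pi_t(t) * beta * S * I   # time-varying transmission
    return [-new_infections, new_infections - gamma * I, gamma * I]

sol = solve_ivp(sir, (0, 200), [0.999, 0.001, 0.0], dense_output=True)
S, I, R = sol.y  # prevalence trajectories under the assumed control schedule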

    Statistical methods to detect allele specific expression, alterations of allele specific expression and differential expression

    Get PDF
    The advent of next-generation sequencing (NGS) technology has facilitated the recent development of RNA sequencing (RNA-seq), a novel method for mapping and quantifying transcriptomes. With RNA-seq, one can measure the expression of different features, such as gene expression, allelic expression, and intragenic expression, in the form of read counts. These features have provided new opportunities to study and interpret the molecular intricacies and variations that are potentially associated with the occurrence of specific diseases. There has therefore been an emerging interest in statistical methods to analyze RNA-seq data from different perspectives. In this dissertation, we focus on three important challenges: identifying allele specific expression (ASE) at the gene level and single nucleotide polymorphism (SNP) level simultaneously; detecting ASE regions in a control group and regions of ASE alterations in a case group simultaneously; and detecting genes whose expression levels are significantly different across treatment groups (DE genes). In Chapter 2, we propose a method to test ASE of a gene as a whole and variation in ASE within a gene across exons, separately and simultaneously. A generalized linear mixed model is employed to incorporate variation due to genes, SNPs, and biological replicates. To improve the reliability of statistical inferences, we assign priors to each effect in the model so that information is shared across genes in the entire genome. We utilize the Bayes factor to test the hypothesis of ASE for each gene and variation across SNPs within a gene. We compare the proposed method to competing approaches through simulation studies that mimic real datasets. The proposed method exhibits improved control of the false discovery rate and improved power over existing methods when SNP variation and biological variation are present. It also maintains low computational requirements, allowing for whole-genome analysis. As an example of real data analysis, we apply the proposed method to four tissue types in a bovine study to detect ASE genes de novo in the bovine genome, and uncover intriguing predictions of regulatory ASE across gene exons and across tissue types. In Chapter 3, we propose a new and powerful algorithm for detecting ASE regions in a healthy control group and regions of ASE alterations in a disease/case group compared to the control. Specifically, we develop a bivariate Bayesian hidden Markov model (HMM) and an expectation-maximization inferential procedure. The proposed algorithm gains advantages over existing methods by addressing their limitations and by recognizing the complexity of the biology. First, the bivariate Bayesian HMM detects ASE for different mRNA isoforms arising from alternative splicing and RNA variants. Second, it models spatial correlations among genomic observations, unlike existing methods that often assume independence. Finally, the bivariate HMM draws inferences simultaneously for control and case samples, which maximizes the utilization of the information available in the data. Real data analyses and simulation studies that mimic real data sets illustrate the improved performance and practical utility of the proposed method. In Chapter 4, we present a new method to detect DE genes in any sequencing experiment. The read counts for the two treatment groups are modelled by two Negative Binomial distributions which may have different means but share the same dispersion parameter. We propose a mixture prior for the dispersion parameters with a point mass at zero and a lognormal distribution. The mixture model allows shrinkage across genes within each of the two mixture components, thus preventing the overcorrection that results from shrinkage across all genes. Simulation studies demonstrate that the proposed method yields better dispersion estimation and FDR control, and higher accuracy in gene ranking. In addition, the proposed method is robust to misspecification of the bimodal distribution for the dispersion parameters, and is thus flexible and easily generalized.
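
    A minimal sketch of the Chapter 4 setup: counts in two groups follow Negative Binomial distributions with possibly different means but a shared dispersion, and a likelihood-ratio test flags differential expression. The dispersion is treated as known here for simplicity; the dissertation's mixture-prior shrinkage estimator for it is not reproduced.

```python
import numpy as np
from scipy.stats import nbinom, chi2

def nb_loglik(counts, mu, alpha):
    """Log-likelihood under NB with mean mu and dispersion alpha
    (variance = mu + alpha * mu**2)."""
    n = 1.0 / alpha
    p = n / (n + mu)
    return nbinom.logpmf(counts, n, p).sum()

def de_lrt(counts_a, counts_b, alpha):
    """Likelihood-ratio p-value for differential expression between groups.
    With a fixed dispersion, the NB mean MLE is the sample mean."""
    mu0 = np.mean(np.concatenate([counts_a, counts_b]))  # null: common mean
    ll0 = nb_loglik(counts_a, mu0, alpha) + nb_loglik(counts_b, mu0, alpha)
    ll1 = (nb_loglik(counts_a, np.mean(counts_a), alpha)
           + nb_loglik(counts_b, np.mean(counts_b), alpha))
    stat = 2 * (ll1 - ll0)
    return chi2.sf(stat, df=1)  # one extra mean parameter under the alternative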

    Behavioral Privacy Risks and Mitigation Approaches in Sharing of Wearable Inertial Sensor Data

    Get PDF
    Wrist-worn inertial sensors in activity trackers and smartwatches are increasingly being used for daily tracking of activity and sleep. Wearable devices, with their onboard sensors, provide an appealing mobile health (mHealth) platform that can be leveraged for continuous and unobtrusive monitoring of an individual in daily life. As a result, adoption of wrist-worn devices is increasing in many applications (such as health, sport, and recreation). Additionally, a growing number of sensory datasets consisting of motion sensor data from wrist-worn devices are becoming publicly available for research. However, releasing or sharing these wearable sensor data creates serious privacy concerns for the user. First, in many application domains (such as mHealth, insurance, and health providers), user identity is an integral part of the shared data. In such settings, instead of identity privacy preservation, the focus is on the behavioral privacy problem, that is, the disclosure of sensitive behaviors from the shared sensor data. Second, different datasets usually focus on only a select subset of these behaviors. But in the event that users can be re-identified from accelerometry data, different databases of motion data (contributed by the same user) can be linked, revealing sensitive behaviors or health diagnoses of a user that were neither originally declared by a data collector nor consented to by the user. The contributions of this dissertation are multifold. First, to show the behavioral privacy risk in sharing raw sensor data, this dissertation presents a detailed case study of detecting cigarette smoking in the field. It proposes a new machine learning model, called puffMarker, that achieves a false positive rate of 1/6 (or 0.17) per day with a recall rate of 87.5%, when tested in a field study with 61 newly abstinent daily smokers. Second, it proposes a model-based data substitution mechanism, namely mSieve, to protect behavioral privacy. It evaluates the efficacy of the scheme using 660 hours of raw sensor data and demonstrates that it is possible to retain meaningful utility, in terms of inference accuracy (90%), while simultaneously preserving the privacy of sensitive behaviors. Third, it analyzes the risks of user re-identification from wrist-worn sensor data, even after applying mSieve for protecting behavioral privacy. It presents a deep learning architecture that can identify the unique micro-movement patterns in each wearer's wrists. A new consistency-distinction loss function is proposed to train the deep learning model for open set learning, so as to maximize re-identification consistency for known users and amplify distinction from any unknown user. In 10 weeks of daily sensor wearing by 353 participants, we show that a known user can be re-identified with a 99.7% true matching rate while keeping the false acceptance rate to 0.1% for an unknown user. Finally, for mitigation, we show that injecting even a low level of Laplace noise into the data stream can limit the re-identification risk. This dissertation creates new research opportunities for understanding and mitigating the risks and ethical challenges associated with behavioral privacy
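
    A minimal sketch of the Laplace-noise mitigation mentioned at the end of the abstract: perturb a raw accelerometer stream with i.i.d. Laplace noise before sharing. The noise scale is an illustrative assumption; the dissertation's evaluation of the re-identification-versus-utility trade-off is not reproduced here.

```python
import numpy as np

def add_laplace_noise(signal, scale=0.05, seed=None):
    """Return the signal plus i.i.d. Laplace(0, scale) noise per sample."""
    rng = np.random.default_rng(seed)
    return signal + rng.laplace(loc=0.0, scale=scale, size=signal.shape)

# Example: a (n_samples, 3) array of x/y/z wrist accelerometry (synthetic here).
accel = np.random.default_rng(0).normal(size=(1000, 3))
shared = add_laplace_noise(accel, scale=0.05, seed=1)  # noisier copy for release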

    Recent Advances in Signal Processing

    Get PDF
    Signal processing is a critical task in the majority of new technological inventions and challenges, across a variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand; in order, these categories address image processing, speech processing, communication systems, time-series analysis, and educational packages. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.