
    Assessment and Improvement of a Sequential Regression Multivariate Imputation Algorithm.

    Sequential regression multivariate imputation (SRMI, also known as chained equations or fully conditional specification) is a popular approach for handling missing values in highly complex data structures with many types of variables, structural dependencies among the variables, and bounds on plausible imputation values. It is a Gibbs-style algorithm with iterative draws from the posterior predictive distribution of missing values in any given variable, conditional on all observed and imputed values of all other variables. A theoretical weakness of this approach, however, is that a set of fully conditional regression models may not be compatible with any joint distribution of the variables being imputed, so the convergence properties of the iterative algorithm are not well understood. This dissertation focuses on assessing and improving the SRMI algorithm. Chapter 2 develops conditions for convergence and assesses the properties of inferences from both compatible and incompatible sequences of generalized linear regression models. The results are established for the missing data pattern in which each subject may be missing a value on at most one variable, and are used to develop criteria for the choice of regression models. Chapter 3 proposes a modified block sequential regression multivariate imputation (BSRMI) approach that divides the data into blocks for each variable based on missing data patterns and tunes the regression models through compatibility restrictions. This helps avoid divergence when the data are missing in general patterns and when it is difficult to find well-fitting models across all missing data patterns. Conditions for the convergence of the algorithm are established, and the repeated sampling properties of inferences are studied using several simulated data sets.
Chapter 4 extends the imputation model selection to quasi-likelihood regression models in both SRMI and BSRMI to better capture structure in the prediction model for the missing values. The performance of the modified approach is examined through simulation studies. The results show that the extension to quasi-likelihood regression models makes it easier to choose better-fitting model sequences that yield desirable repeated sampling properties of the multiple imputation estimates.
PhD dissertation, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/133402/1/jianzhu_1.pd
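The chained-equations idea behind SRMI can be sketched for the simplest continuous-data case. The sketch below is an illustrative assumption, not the dissertation's SRMI or BSRMI algorithm: it uses ordinary least-squares regressions and a crude noise draw in place of proper posterior predictive draws, and the function name and toy data are invented for the example.

```python
import numpy as np

def chained_imputation(X, n_iters=10, rng=None):
    """Minimal chained-equations sketch: iteratively regress each variable
    with missing entries on all other variables and redraw the missing
    values from the fitted linear model plus Gaussian noise.
    (Illustrative only: real SRMI draws from a posterior predictive.)"""
    rng = np.random.default_rng(rng)
    X = X.copy()
    miss = np.isnan(X)
    # Initialize missing entries with the column means of observed values
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            m = miss[:, j]
            if not m.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            # Fit the regression on rows where variable j is observed
            beta, *_ = np.linalg.lstsq(A[~m], X[~m, j], rcond=None)
            sigma = (X[~m, j] - A[~m] @ beta).std()
            # Redraw the missing values (a crude stand-in for a posterior draw)
            X[m, j] = A[m] @ beta + rng.normal(0.0, sigma, m.sum())
    return X

# Toy data: two correlated columns, a few values missing in column 0
rng = np.random.default_rng(0)
z = rng.normal(size=200)
data = np.column_stack([z + 0.1 * rng.normal(size=200),
                        2 * z + 0.1 * rng.normal(size=200)])
data[:10, 0] = np.nan          # at most one missing value per row
completed = chained_imputation(data, rng=1)
```

The missing-at-most-one-value-per-row pattern here matches the setting of Chapter 2; general patterns are where the block-based BSRMI modification becomes relevant.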

    Lightweight Automated Feature Monitoring for Data Streams

    Monitoring the behavior of automated real-time stream processing systems has become one of the most relevant problems in real-world applications. Such systems have grown in complexity, relying heavily on high-dimensional input data and data-hungry Machine Learning (ML) algorithms. We propose a flexible system, Feature Monitoring (FM), that detects data drifts in such data sets, with a small and constant memory footprint and a small computational cost in streaming applications. The method is based on a multivariate statistical test and is data-driven by design (full reference distributions are estimated from the data). It monitors all features used by the system, while providing an interpretable feature ranking whenever an alarm occurs (to aid in root cause analysis). The computational and memory lightness of the system results from the use of Exponential Moving Histograms. In our experimental study, we analyze the system's behavior with respect to its parameters and, more importantly, show examples where it detects problems that are not directly related to a single feature. This illustrates how FM eliminates the need to add custom signals to detect specific types of problems, and that monitoring the available space of features is often enough.
    Comment: 10 pages, 5 figures. AutoML, KDD22, August 14-17, 2022, Washington, DC, U
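Per-feature drift scoring with an interpretable ranking can be illustrated in a few lines. The sketch below is an assumption-laden stand-in for FM, not its algorithm: it compares a fixed reference window against a recent window using a two-sample Kolmogorov-Smirnov statistic instead of the paper's Exponential-Moving-Histogram-based test, and all names are hypothetical.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between ECDFs."""
    a, b = np.sort(a), np.sort(b)
    both = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, both, side="right") / len(a)
    cdf_b = np.searchsorted(b, both, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def rank_drifting_features(reference, recent):
    """Score every feature by the distribution shift between a reference
    window and a recent window; return (feature index, score) pairs,
    highest score first, as a crude interpretable ranking."""
    scores = [ks_statistic(reference[:, j], recent[:, j])
              for j in range(reference.shape[1])]
    order = np.argsort(scores)[::-1]
    return [(int(j), float(scores[j])) for j in order]

# Toy stream: three features, with a mean shift injected into feature 1
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 3))
cur = rng.normal(size=(500, 3))
cur[:, 1] += 1.5
ranking = rank_drifting_features(ref, cur)
```

Unlike this windowed sketch, the point of Exponential Moving Histograms is that the reference summaries update in constant memory as the stream advances.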

    Real-time detection of overlapping sound events with non-negative matrix factorization

    In this paper, we investigate the problem of real-time detection of overlapping sound events by employing non-negative matrix factorization techniques. We consider a setup where audio streams arrive at the system in real time and are decomposed onto a dictionary of event templates learned offline prior to the decomposition. An important drawback of existing approaches in this context is the lack of control over the decomposition. We propose and compare two provably convergent algorithms that address this issue by controlling, respectively, the sparsity of the decomposition and the trade-off of the decomposition between the different frequency components. Sparsity regularization is considered in the framework of convex quadratic programming, while the frequency compromise is introduced by employing the beta-divergence as a cost function. The two algorithms are evaluated on the multi-source detection tasks of polyphonic music transcription, drum transcription, and environmental sound recognition. The obtained results show how the proposed approaches can improve detection in such applications while maintaining low computational costs that are suitable for real-time processing.
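The core step, projecting an incoming spectrum onto a fixed, pre-learned dictionary, can be sketched with standard multiplicative updates for the KL divergence (the beta = 1 case of the beta-divergence), with an optional L1 term on the activations. This is a generic textbook illustration under assumed toy data, not either of the paper's two provably convergent algorithms.

```python
import numpy as np

def decompose_frame(v, W, n_iters=100, sparsity=0.0, eps=1e-9):
    """Decompose a nonnegative spectrum v onto a fixed dictionary W
    (v ~ W h) using multiplicative updates for the KL divergence,
    with an optional L1 sparsity penalty on the activations h."""
    h = np.ones(W.shape[1])
    denom = W.sum(axis=0) + sparsity      # L1 penalty enters the denominator
    for _ in range(n_iters):
        v_hat = W @ h + eps               # current reconstruction
        h *= (W.T @ (v / v_hat)) / denom  # KL multiplicative update
    return h

# Toy dictionary of two "event templates"; observe a mixture of both
W = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
v = 2.0 * W[:, 0] + 0.5 * W[:, 1]         # true activations: (2, 0.5)
h = decompose_frame(v, W)
```

Because only the activations are updated while the dictionary stays fixed, each frame is processed independently, which is what makes the streaming, real-time setting feasible.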

    Scalable transformed additive signal decomposition by non-conjugate Gaussian process inference

    Many functions and signals of interest are formed by the addition of multiple underlying components, often nonlinearly transformed and modified by noise. Examples may be found in the literature on Generalized Additive Models [1] and underdetermined source separation [2], or in other mode decomposition techniques. Recovery of the underlying component processes often depends on finding and exploiting statistical regularities within them. Gaussian Processes (GPs) [3] have become the dominant way to model statistical expectations over functions. Recent advances make inference of the GP posterior efficient for large-scale datasets and arbitrary likelihoods [4, 5]. Here we extend these methods to the additive GP case [6, 7], thus achieving scalable marginal posterior inference over each latent function in settings such as those above.
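In the conjugate special case (Gaussian likelihood, no nonlinear transform), additive GP decomposition has a closed form: the sum of independent GPs is itself a GP with the summed kernel, and each component's posterior mean is K_i (sum_j K_j + sigma^2 I)^{-1} y. The sketch below shows only that baseline, with assumed toy kernels and data; the scalable non-conjugate inference the abstract describes is not reproduced here.

```python
import numpy as np

def rbf(x1, x2, lengthscale, variance=1.0):
    """Squared-exponential kernel matrix."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def additive_gp_posterior_means(x, y, lengthscales, noise=0.01):
    """Posterior mean of each additive GP component given
    y = sum_i f_i + Gaussian noise, in the conjugate case:
    mean_i = K_i @ inv(sum_j K_j + noise * I) @ y."""
    Ks = [rbf(x, x, l) for l in lengthscales]
    K_sum = sum(Ks) + noise * np.eye(len(x))
    alpha = np.linalg.solve(K_sum, y)
    return [K @ alpha for K in Ks]

# Toy signal: slow trend plus fast oscillation, observed with noise
x = np.linspace(0, 10, 200)
slow = np.sin(0.5 * x)
fast = 0.3 * np.sin(5.0 * x)
rng = np.random.default_rng(0)
y = slow + fast + 0.05 * rng.normal(size=x.size)
f_slow, f_fast = additive_gp_posterior_means(
    x, y, lengthscales=[3.0, 0.3], noise=0.05**2)
```

The separation works because the two kernels encode different statistical regularities (long versus short lengthscales), so each component's posterior mean absorbs the part of the signal its kernel can explain.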