
    Robust mixture modeling

    Doctor of Philosophy. Department of Statistics. Weixin Yao and Kun Chen. Ordinary least-squares (OLS) estimators for a linear model are very sensitive to unusual values in the design space or outliers among y values. Even a single atypical value may have a large effect on the parameter estimates. In this proposal, we first review and describe some available and popular robust techniques, including some recently developed ones, and compare them in terms of breakdown point and efficiency. In addition, we use a simulation study and a real data application to compare the performance of existing robust methods under different scenarios. Finite mixture models are widely applied to a variety of random phenomena. However, inference for mixture models is challenging when outliers exist in the data. The traditional maximum likelihood estimator (MLE) is sensitive to outliers. In this proposal, we propose Robust Mixture via Mean shift penalization (RMM) for mixture models and Robust Mixture Regression via Mean shift penalization (RMRM) for mixture regression, to achieve simultaneous outlier detection and parameter estimation. A mean shift parameter is added to the mixture models and penalized by a nonconvex penalty function. With this model setting, we develop an iterative thresholding embedded EM algorithm to maximize the penalized objective function. Compared with other existing robust methods, the proposed methods show outstanding performance in both identifying outliers and estimating the parameters.
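    The mean-shift idea in this abstract can be illustrated on a much simpler problem. The sketch below is not the RMM or RMRM algorithm: it assumes a single-component location model y_i = mu + gamma_i + noise, replaces the EM step with a plain mean, and uses hard thresholding as the nonconvex penalty. The function names and the threshold value `lam` are hypothetical choices made for illustration.

```python
import numpy as np

def hard_threshold(r, lam):
    """Keep entries with |r| > lam and set the rest to zero."""
    return np.where(np.abs(r) > lam, r, 0.0)

def mean_shift_location(y, lam, n_iter=100, tol=1e-10):
    """Alternate between estimating mu and the per-observation shifts gamma.

    Assumed model (illustrative only): y_i = mu + gamma_i + noise,
    where gamma_i is nonzero only for outliers.
    """
    y = np.asarray(y, dtype=float)
    gamma = np.zeros_like(y)
    mu = np.median(y)  # robust starting value
    for _ in range(n_iter):
        # Shift update: residuals larger than lam are absorbed into gamma.
        gamma = hard_threshold(y - mu, lam)
        # Location update: ordinary mean of the shift-corrected data.
        mu_new = np.mean(y - gamma)
        if abs(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu, gamma

# Five clean observations near zero and two gross outliers.
y = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 8.0, -9.0])
mu_hat, gamma_hat = mean_shift_location(y, lam=2.0)
outliers = np.flatnonzero(gamma_hat)  # indices with a nonzero mean shift
```

    Observations whose shift estimate is nonzero are flagged as outliers, so detection and estimation happen in the same loop, which is the property the abstract emphasizes.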

    Regularized Estimation of High-dimensional Covariance Matrices.

    Many signal processing methods are fundamentally related to the estimation of covariance matrices. In cases where there are a large number of covariates the dimension of covariance matrices is much larger than the number of available data samples. This is especially true in applications where data acquisition is constrained by limited resources such as time, energy, storage and bandwidth. This dissertation attempts to develop necessary components for covariance estimation in the high-dimensional setting. The dissertation makes contributions in two main areas of covariance estimation: (1) high dimensional shrinkage regularized covariance estimation and (2) recursive online complexity regularized estimation with applications of anomaly detection, graph tracking, and compressive sensing. New shrinkage covariance estimation methods are proposed that significantly outperform previous approaches in terms of mean squared error. Two multivariate data scenarios are considered: (1) independently Gaussian distributed data; and (2) heavy tailed elliptically contoured data. For the former scenario we improve on the Ledoit-Wolf (LW) shrinkage estimator using the principle of Rao-Blackwell conditioning and iterative approximation of the clairvoyant estimator. In the latter scenario, we apply a variance normalizing transformation and propose an iterative robust LW shrinkage estimator that is distribution-free within the elliptical family. The proposed robustified estimator is implemented via fixed point iterations with provable convergence and unique limit. A recursive online covariance estimator is proposed for tracking changes in an underlying time-varying graphical model. Covariance estimation is decomposed into multiple decoupled adaptive regression problems. A recursive group lasso is derived using a homotopy approach that generalizes online lasso methods to group sparse system identification.
Reducing the memory of the objective function leads to a group lasso regularized LMS that provably dominates standard LMS. Finally, we introduce a state-of-the-art sampling system, the Modulated Wideband Converter (MWC), which is based on recently developed analog compressive sensing theory. By inferring the block-sparse structures of the high-dimensional covariance matrix from a set of random projections, the MWC is capable of achieving sub-Nyquist sampling for multiband signals with arbitrary carrier frequency over a wide bandwidth. Ph.D. Electrical Engineering: Systems. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/86396/1/yilun_1.pd
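    As a rough illustration of shrinkage covariance estimation, the sketch below shrinks the sample covariance toward a scaled identity target. It is a minimal sketch under stated assumptions: the shrinkage coefficient `rho` is fixed by hand here, whereas Ledoit-Wolf-type estimators and the improved estimators described in the abstract choose it from the data. The function name `shrink_covariance` is hypothetical.

```python
import numpy as np

def shrink_covariance(X, rho):
    """Return ((1 - rho) * S + rho * target, S) for data X of shape (n, p).

    S is the sample covariance and the target is (trace(S) / p) * I.
    rho in [0, 1] is a hand-picked shrinkage coefficient (an assumption).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)      # center each variable
    S = Xc.T @ Xc / n            # sample covariance (rank at most n - 1)
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - rho) * S + rho * target, S

# High-dimensional setting: fewer samples (n = 10) than variables (p = 30).
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
sigma_hat, S = shrink_covariance(X, rho=0.3)
```

    With fewer samples than variables the sample covariance is singular, while the shrunk estimate stays positive definite for any `rho` greater than zero and keeps the same trace.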

    Approaches for Outlier Detection in Sparse High-Dimensional Regression Models

    Modern regression studies often encompass a very large number of potential predictors, possibly larger than the sample size, and sometimes growing with the sample size itself. This increases the chances that a substantial portion of the predictors is redundant, as well as the risk of data contamination. Tackling these problems is of utmost importance to facilitate scientific discoveries, since model estimates are highly sensitive both to the choice of predictors and to the presence of outliers. In this thesis, we contribute to this area considering the problem of robust model selection in a variety of settings, where outliers may arise both in the response and the predictors. Our proposals simplify model interpretation, guarantee predictive performance, and allow us to study and control the influence of outlying cases on the fit. First, we consider the co-occurrence of multiple mean-shift and variance-inflation outliers in low-dimensional linear models. We rely on robust estimation techniques to identify outliers of each type, exclude mean-shift outliers, and use restricted maximum likelihood estimation to down-weight and accommodate variance-inflation outliers into the model fit. Second, we extend our setting to high-dimensional linear models. We show that mean-shift and variance-inflation outliers can be modeled as additional fixed and random components, respectively, and evaluated independently. Specifically, we perform feature selection and mean-shift outlier detection through a robust class of nonconcave penalization methods, and variance-inflation outlier detection through the penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination – which allows the number of features to exponentially increase with the sample size – and detects truly outlying cases of each type with asymptotic probability one. 
This provides an optimal trade-off between a high breakdown point and efficiency. Third, focusing on high-dimensional linear models affected by mean-shift outliers, we develop a general framework in which L0-constraints coupled with mixed-integer programming techniques are used to perform simultaneous feature selection and outlier detection with provably optimal guarantees. In particular, we provide necessary and sufficient conditions for a robustly strong oracle property, where again the number of features can increase exponentially with the sample size, and prove optimality for parameter estimation and the resulting breakdown point. Finally, we consider generalized linear models and rely on logistic slippage to perform outlier detection and removal in binary classification. Here we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem of feature selection and outlier detection, and the framework again allows us to pursue optimality guarantees. For all the proposed approaches, we also provide computationally lean heuristic algorithms, tuning procedures, and diagnostic tools which help to guide the analysis. We consider several real-world applications, including the study of the relationships between childhood obesity and the human microbiome, and of the main drivers of honey bee loss. All methods developed and data used, as well as the source code to replicate our analyses, are publicly available.
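    To make the L0-constrained view of outlier detection concrete, the sketch below caps the number of flagged cases at `k` and alternates between a least-squares fit on the retained cases and re-flagging the `k` largest absolute residuals. This is a simple alternating heuristic in a low-dimensional setting, assumed here for illustration. It is not the mixed-integer programming formulation from the thesis, carries no optimality guarantee, and performs no feature selection. The function name `trimmed_ls` and the simulated data are hypothetical.

```python
import numpy as np

def trimmed_ls(X, y, k, n_iter=50):
    """Heuristic regression with at most k cases flagged as outliers."""
    n = len(y)
    keep = np.ones(n, dtype=bool)
    for _ in range(n_iter):
        # Fit ordinary least squares on the cases currently retained.
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = np.abs(y - X @ beta)
        # Flag the k cases with the largest absolute residuals.
        flagged = np.argsort(resid)[n - k:]
        new_keep = np.ones(n, dtype=bool)
        new_keep[flagged] = False
        if np.array_equal(new_keep, keep):
            break  # the flagged set is stable
        keep = new_keep
    return beta, np.flatnonzero(~keep)

# Simulated data: y = 1 + 2 x + small noise, with three shifted responses.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(40)
y[[3, 17, 30]] += 15.0  # mean-shift outliers in the response
beta_hat, outlier_idx = trimmed_ls(X, y, k=3)
```

    The cap `k` plays the role of the L0 budget on outliers. Choosing it is a tuning problem that this sketch does not address.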

    Employing data fusion & diversity in the applications of adaptive signal processing

    The paradigm of adaptive signal processing is a simple yet powerful method for the class of system identification problems. The classical approaches consider standard one-dimensional signals, whereby the model can be formulated in a flat-view matrix/vector framework. Nevertheless, the rapidly increasing availability of large-scale multisensor/multinode measurement technology has rendered the traditional way of representing the data insufficient. To this end, the author, who from this point onward shall be referred to as `we', `us', and `our' to signify the author myself and other supporting contributors, i.e. my supervisor, my colleagues and other overseas academics specializing in the specific pieces of research endeavor throughout this thesis, has applied the adaptive filtering framework to problems that employ the techniques of data diversity and fusion, which include quaternions, tensors and graphs. At first glance, all these structures share one common important feature: invertible isomorphism. In other words, they are algebraically one-to-one related in real vector space. Furthermore, it is our continual course of research that affords a segue across all three data types. Firstly, we proposed novel quaternion-valued adaptive algorithms named the n-moment widely linear quaternion least mean squares (WL-QLMS) and c-moment WL-LMS. Both are as fast as the recursive-least-squares method but more numerically robust thanks to the lack of matrix inversion. Secondly, the adaptive filtering method is applied to a more complex task: online tensor dictionary learning, named online multilinear dictionary learning (OMDL). The OMDL is partly inspired by the derivation of the c-moment WL-LMS due to its parsimonious formulae. In addition, the sequential higher-order compressed sensing (HO-CS) is also developed to couple with the OMDL to maximally utilize the learned dictionary for the best possible compression.
Lastly, we consider graph random processes, which are multivariate random processes with a spatiotemporal (or vertex-time) relationship. As with the tensor dictionary, one of the main challenges in graph signal processing is the sparsity constraint on the graph topology, a challenging issue for online methods. We introduced a novel splitting gradient projection into this adaptive graph filtering to successfully achieve a sparse topology. Extensive experiments were conducted to support the analysis of all the algorithms proposed in this thesis, and to point out potentials, limitations and as-yet-unaddressed issues in these research endeavors.
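    For readers unfamiliar with the adaptive filtering framework this abstract builds on, the sketch below shows the standard real-valued least mean squares (LMS) update applied to system identification. It is the textbook baseline only, not the quaternion-valued WL-QLMS or the tensor and graph extensions proposed in the thesis. The filter length, step size `mu` and simulated system `h_true` are assumptions chosen for illustration.

```python
import numpy as np

def lms_identify(x, d, n_taps, mu):
    """Estimate FIR weights w so that w @ [x[k], x[k-1], ...] tracks d[k]."""
    w = np.zeros(n_taps)
    for k in range(n_taps - 1, len(x)):
        u = x[k - n_taps + 1:k + 1][::-1]  # regressor, most recent sample first
        e = d[k] - w @ u                   # a priori estimation error
        w = w + mu * e * u                 # stochastic gradient step, no matrix inversion
    return w

# Unknown system: a 4-tap FIR filter driven by white noise (noiseless output).
rng = np.random.default_rng(1)
h_true = np.array([0.5, -0.3, 0.2, 0.1])
x = rng.standard_normal(5000)
d = np.convolve(x, h_true)[:len(x)]
w_hat = lms_identify(x, d, n_taps=4, mu=0.05)
```

    Each update costs a number of operations proportional to the filter length, which is why LMS-type methods are attractive when matrix inversion is to be avoided.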

    Online Machine Learning for Inference from Multivariate Time-series

    Inference and data analysis over networks have become significant areas of research due to the increasing prevalence of interconnected systems and the growing volume of data they produce. Many of these systems generate data in the form of multivariate time series, which are collections of time series data that are observed simultaneously across multiple variables. For example, EEG measurements of the brain produce multivariate time series data that record the electrical activity of different brain regions over time. Cyber-physical systems generate multivariate time series that capture the behaviour of physical systems in response to cybernetic inputs. Similarly, financial time series reflect the dynamics of multiple financial instruments or market indices over time. Through the analysis of these time series, one can uncover important details about the behavior of the system, detect patterns, and make predictions. Therefore, designing effective methods for data analysis and inference over networks of multivariate time series is a crucial area of research with numerous applications across various fields. In this Ph.D. Thesis, our focus is on identifying the directed relationships between time series and leveraging this information to design algorithms for data prediction as well as missing data imputation. This Ph.D. thesis is organized as a compendium of papers, which consists of seven chapters and appendices. The first chapter is dedicated to motivation and literature survey, whereas in the second chapter, we present the fundamental concepts that readers should understand to grasp the material presented in the dissertation with ease. In the third chapter, we present three online nonlinear topology identification algorithms, namely NL-TISO, RFNL-TISO, and RFNL-TIRSO. In this chapter, we assume the data is generated from a sparse nonlinear vector autoregressive model (VAR), and propose online data-driven solutions for identifying nonlinear VAR topology. 
We also provide convergence guarantees in terms of dynamic regret for the proposed algorithm RFNL-TIRSO. Chapters four and five of the dissertation delve into the issue of missing data and explore how the learned topology can be leveraged to address this challenge. Chapter five is distinct from other chapters in its exclusive focus on edge flow data and introduces an online imputation strategy based on a simplicial complex framework that leverages the known network structure in addition to the learned topology. Chapter six of the dissertation takes a different approach, assuming that the data is generated from nonlinear structural equation models. In this chapter, we propose an online topology identification algorithm using a time-structured approach, incorporating information from both the data and the model evolution. The algorithm is shown to have convergence guarantees achieved by bounding the dynamic regret. Finally, chapter seven of the dissertation provides concluding remarks and outlines potential future research directions.

    Attitudes towards old age and age of retirement across the world: findings from the future of retirement survey

    The 21st century has been described as the first era in human history in which the world will no longer be young, and there will be drastic changes in many aspects of our lives, including socio-demographics, finances and attitudes towards old age and retirement. This talk briefly introduces the Global Ageing Survey (GLAS) 2004 and 2005, popularly known as “The Future of Retirement”. These surveys provide a unique data source, collected in 21 countries and territories, that allows researchers to better understand individual as well as societal changes as we age with regard to savings, retirement and healthcare. In 2004, approximately 10,000 people aged 18+ were surveyed in nine countries and one territory (Brazil, Canada, China, France, Hong Kong, India, Japan, Mexico, UK and USA). In 2005, the number was increased to twenty-one by adding Egypt, Germany, Indonesia, Malaysia, Poland, Russia, Saudi Arabia, Singapore, Sweden, Turkey and South Korea. Moreover, an additional 6320 private sector employers were surveyed in 2005, some 300 in each country, with a view to elucidating the attitudes of employers to issues relating to older workers. The paper aims to examine attitudes towards old age and retirement across the world and will indicate some policy implications.

    Variable selection in linear regression models with large number of predictors

    Doctoral thesis, Doctoral Program in Mathematics and Applications. In this thesis, we study the problem of variable selection in linear regression models in the presence of a large number of predictors. Usually, some of these predictors are correlated, so including all of them in a regression model will not essentially improve the model's predictive ability. Also, models with a reasonable and tractable number of predictors are easier to interpret than models with a large number of predictors. Therefore, variable selection is an important problem to study. Because some popular regression methods can handle collinearity in the data but still require the removal of irrelevant predictors, we present an algorithm that enables these methods to perform variable selection. We review the well-known variable selection methods, and investigate the performance of these methods as well as the proposed approach on both simulated and real data sets. The results show that the new algorithm performs well in selecting the relevant variables. Also, when the data contain outliers, outlier detection and variable selection are not two separable problems. Therefore, we propose a method capable of outlier detection and variable selection. We review the well-known robust variable selection methods and evaluate the performance of these methods with the proposed approach on contaminated simulation data sets as well as on real data. The results show that the proposed method performs well concerning both outlier detection and robust variable selection. In this dissertation, the problem of variable selection in linear regression models was studied in the presence of a large number of explanatory or predictor variables, some of which are usually correlated.
One principle to take into account is the "principle of parsimony": simpler models should be preferred to more complex ones, provided the quality of fit or prediction is similar. Such models are easier to interpret than models with a large number of predictors. Therefore, the study of variable selection methods is a very important problem in regression models. Given that some well-known regression methods can handle multicollinearity in the data but still do not remove irrelevant predictors, we present an algorithm that makes variable selection possible. Variable selection methods are studied and their performance, as well as that of the proposed algorithm, is investigated on simulated and real data. The results show that the new algorithm performs well in selecting the variables relevant to the model. Moreover, when the data contain outliers, outlier detection and variable selection cannot be studied as two separable problems. Thus, this dissertation proposes a method capable of simultaneous outlier detection and variable selection. The best-known robust variable selection methods were studied in order to evaluate and compare their performance with the approach proposed in this work, through simulation studies under contamination as well as with real data. The results show that the developed method performs well both in outlier detection and in robust variable selection. This work was funded by the Portuguese Foundation for Science and Technology (FCT) under the grant SFRH/BD/51164/2010

    Graphical model driven methods in adaptive system identification

    Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, September 2016. Identifying and tracking an unknown linear system from observations of its inputs and outputs is a problem at the heart of many different applications. Due to the complexity and rapid variability of modern systems, there is extensive interest in solving the problem with as little data and computation as possible. This thesis introduces the novel approach of reducing problem dimension by exploiting statistical structure on the input. By modeling the input to the system of interest as a graph-structured random process, it is shown that a large parameter identification problem can be reduced into several smaller pieces, making the overall problem considerably simpler. Algorithms that can leverage this property in order to either improve the performance or reduce the computational complexity of the estimation problem are developed. The first of these, termed the graphical expectation-maximization least squares (GEM-LS) algorithm, can utilize the reduced dimensional problems induced by the structure to improve the accuracy of the system identification problem in the low sample regime over conventional methods for linear learning with limited data, including regularized least squares methods. Next, a relaxation of the GEM-LS algorithm termed the relaxed approximate graph structured least squares (RAGS-LS) algorithm is obtained that exploits structure to perform highly efficient estimation. The RAGS-LS algorithm is then recast into a recursive framework termed the relaxed approximate graph structured recursive least squares (RAGS-RLS) algorithm, which can be used to track time-varying linear systems with low complexity while achieving tracking performance comparable to much more computationally intensive methods.
The performance of the algorithms developed in the thesis in applications such as channel identification, echo cancellation and adaptive equalization demonstrates that the gains admitted by the graph framework are realizable in practice. The methods have wide applicability, and in particular show promise as the estimation and adaptation algorithms for a new breed of fast, accurate underwater acoustic modems. The contributions of the thesis illustrate the power of graphical model structure in simplifying difficult learning problems, even when the target system is not directly structured. The work in this thesis was supported primarily by the Office of Naval Research through an ONR Special Research Award in Ocean Acoustics; and at various times by the National Science Foundation, the WHOI Academic Programs Office and the MIT Presidential Fellowship Program.
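    The conventional recursive least squares (RLS) method, the kind of more computationally intensive baseline this abstract compares against, can be sketched as follows. This is the standard exponentially weighted RLS recursion, not the GEM-LS, RAGS-LS or RAGS-RLS algorithms, and it does not exploit any graph structure. The forgetting factor `lam`, the initialization constant `delta` and the simulated channel are assumptions made for illustration.

```python
import numpy as np

def rls_identify(x, d, n_taps, lam=0.99, delta=100.0):
    """Exponentially weighted RLS estimate of FIR weights from input x and output d."""
    w = np.zeros(n_taps)
    P = delta * np.eye(n_taps)             # initial inverse correlation matrix
    for k in range(n_taps - 1, len(x)):
        u = x[k - n_taps + 1:k + 1][::-1]  # regressor, most recent sample first
        Pu = P @ u
        g = Pu / (lam + u @ Pu)            # gain vector
        e = d[k] - w @ u                   # a priori estimation error
        w = w + g * e
        P = (P - np.outer(g, Pu)) / lam    # rank-one update (P is symmetric)
    return w

# Unknown channel: a 4-tap FIR filter driven by white noise (noiseless output).
rng = np.random.default_rng(2)
h_true = np.array([0.8, -0.4, 0.2, -0.1])
x = rng.standard_normal(500)
d = np.convolve(x, h_true)[:len(x)]
w_hat = rls_identify(x, d, n_taps=4)
```

    Each RLS update costs on the order of the square of the filter length, which gives a sense of why reducing the problem into smaller pieces can lower complexity.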