60 research outputs found
Robust mixture modeling
Doctor of Philosophy, Department of Statistics, Weixin Yao and Kun Chen
Ordinary least-squares (OLS) estimators for a linear model are very sensitive to unusual values in the design space or outliers among the y values. Even a single atypical value may have a large effect on the parameter estimates. In this proposal, we first review and describe some available and popular robust techniques, including some recently developed ones, and compare them in terms of breakdown point and efficiency. We also use a simulation study and a real data application to compare the performance of existing robust methods under different scenarios. Finite mixture models are widely applied to a variety of random phenomena. However, inference for mixture models is challenging when outliers exist in the data, because the traditional maximum likelihood estimator (MLE) is sensitive to outliers. We propose Robust Mixture via Mean shift penalization (RMM) for mixture models and Robust Mixture Regression via Mean shift penalization (RMRM) for mixture regression, to achieve simultaneous outlier detection and parameter estimation. A mean shift parameter is added to the mixture model and penalized by a nonconvex penalty function. With this model setting, we develop an iterative thresholding embedded EM algorithm to maximize the penalized objective function. Compared with other existing robust methods, the proposed methods show outstanding performance in both identifying outliers and estimating the parameters.
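To illustrate the mean shift idea in the abstract above, here is a minimal Python sketch. It is not the RMM algorithm: it assumes a single location model y_i = mu + gamma_i + e_i instead of a mixture, replaces the EM step with a plain mean update, and uses hard thresholding as one possible nonconvex penalty. The function name, the threshold value and the toy data are all hypothetical.

```python
import statistics

def mean_shift_location(y, threshold, num_iters=50):
    """Alternate between hard-thresholding the mean shifts and updating mu.

    Simplified single-component illustration (an assumption), not the
    mixture-model EM algorithm described in the abstract.
    """
    mu = statistics.median(y)  # robust starting point (an assumption)
    gamma = [0.0] * len(y)
    for _ in range(num_iters):
        # Shift step: keep a nonzero gamma_i only where the residual is large.
        gamma = [r if abs(r) > threshold else 0.0 for r in (yi - mu for yi in y)]
        # Location step: average the shift-corrected observations.
        mu = sum(yi - gi for yi, gi in zip(y, gamma)) / len(y)
    outliers = [i for i, gi in enumerate(gamma) if gi != 0.0]
    return mu, gamma, outliers

# Hypothetical toy data: five well-behaved points and one gross outlier.
y = [0.1, -0.2, 0.05, 0.0, -0.1, 10.0]
mu_hat, gamma_hat, flagged = mean_shift_location(y, threshold=3.0)
```

On this toy data only the last observation receives a nonzero shift, and the location estimate settles at the mean of the remaining five points. Observations with a nonzero shift are the ones declared outliers, which is how detection and estimation happen in the same pass.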
Regularized Estimation of High-dimensional Covariance Matrices.
Many signal processing methods are fundamentally related to the
estimation of covariance matrices. In cases where there are a large
number of covariates the dimension of covariance matrices is much
larger than the number of available data samples. This is especially
true in applications where data acquisition is constrained by limited
resources such as time, energy, storage and bandwidth. This
dissertation attempts to develop necessary components for covariance
estimation in the high-dimensional setting. The dissertation makes
contributions in two main areas of covariance estimation: (1) high
dimensional shrinkage regularized covariance estimation and (2)
recursive online complexity regularized estimation with applications of
anomaly detection, graph tracking, and compressive sensing.
New shrinkage covariance estimation methods are proposed that
significantly outperform previous approaches in terms of mean squared
error. Two multivariate data scenarios are considered: (1)
independently Gaussian distributed data; and (2) heavy tailed
elliptically contoured data. For the former scenario we improve on
the Ledoit-Wolf (LW) shrinkage estimator using the principle of
Rao-Blackwell conditioning and iterative approximation of the
clairvoyant estimator. In the latter scenario, we apply a variance
normalizing transformation and propose an iterative robust LW
shrinkage estimator that is distribution-free within the elliptical
family. The proposed robustified estimator is implemented via fixed
point iterations with provable convergence and unique limit.
A recursive online covariance estimator is proposed for tracking
changes in an underlying time-varying graphical model. Covariance
estimation is decomposed into multiple decoupled adaptive regression
problems. A recursive group lasso is derived using a
homotopy approach that generalizes online lasso methods to group
sparse system identification. By reducing the memory of the objective
function this leads to a group lasso regularized LMS that provably
dominates standard LMS. Finally, we introduce a state-of-the-art
sampling system, the Modulated Wideband Converter (MWC) which is based
on recently developed analog compressive sensing theory. By inferring
the block-sparse structures of the high-dimensional covariance matrix
from a set of random projections, the MWC is capable of achieving
sub-Nyquist sampling for multiband signals with arbitrary carrier
frequency over a wide bandwidth.
Ph.D., Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/86396/1/yilun_1.pd
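The shrinkage idea described above can be sketched in a few lines of Python. This is only the generic convex combination of the sample covariance with a scaled identity target. The shrinkage weight rho is passed in by hand here, whereas the Ledoit-Wolf estimator and the refinements in the dissertation choose it from the data. Function names and the toy data are hypothetical.

```python
def sample_covariance(rows):
    """Sample covariance (divisor n) of a list of equal-length observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
             for j in range(p)] for i in range(p)]

def shrink_to_identity(S, rho):
    """Return (1 - rho) * S + rho * (trace(S) / p) * I for 0 <= rho <= 1."""
    p = len(S)
    target = sum(S[i][i] for i in range(p)) / p  # average variance
    return [[(1.0 - rho) * S[i][j] + (rho * target if i == j else 0.0)
             for j in range(p)] for i in range(p)]

# Hypothetical toy data: perfectly collinear columns give a singular S.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
S = sample_covariance(rows)
S_shrunk = shrink_to_identity(S, rho=0.5)  # rho chosen by hand (assumption)

det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
det_shrunk = S_shrunk[0][0] * S_shrunk[1][1] - S_shrunk[0][1] * S_shrunk[1][0]
```

The toy sample covariance is singular, while the shrunk estimate keeps the same trace and becomes invertible. That is the practical motivation for shrinkage when the dimension is large relative to the number of samples.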
Approaches for Outlier Detection in Sparse High-Dimensional Regression Models
Modern regression studies often encompass a very large number of potential predictors,
possibly larger than the sample size, and sometimes growing with the sample
size itself. This increases the chances that a substantial portion of the predictors
is redundant, as well as the risk of data contamination. Tackling these problems is
of utmost importance to facilitate scientific discoveries, since model estimates are
highly sensitive both to the choice of predictors and to the presence of outliers. In
this thesis, we contribute to this area considering the problem of robust model selection
in a variety of settings, where outliers may arise both in the response and
the predictors. Our proposals simplify model interpretation, guarantee predictive
performance, and allow us to study and control the influence of outlying cases on
the fit.
First, we consider the co-occurrence of multiple mean-shift and variance-inflation
outliers in low-dimensional linear models. We rely on robust estimation techniques
to identify outliers of each type, exclude mean-shift outliers, and use restricted
maximum likelihood estimation to down-weight and accommodate variance-inflation
outliers into the model fit. Second, we extend our setting to high-dimensional linear
models. We show that mean-shift and variance-inflation outliers can be modeled as
additional fixed and random components, respectively, and evaluated independently.
Specifically, we perform feature selection and mean-shift outlier detection through
a robust class of nonconcave penalization methods, and variance-inflation outlier
detection through the penalization of the restricted posterior mode. The resulting
approach satisfies a robust oracle property for feature selection in the presence of
data contamination – which allows the number of features to exponentially increase
with the sample size – and detects truly outlying cases of each type with asymptotic
probability one. This provides an optimal trade-off between a high breakdown point
and efficiency. Third, focusing on high-dimensional linear models affected by mean-shift
outliers, we develop a general framework in which L0-constraints coupled with
mixed-integer programming techniques are used to perform simultaneous feature
selection and outlier detection with provably optimal guarantees. In particular,
we provide necessary and sufficient conditions for a robustly strong oracle property,
where again the number of features can increase exponentially with the sample size,
and prove optimality for parameter estimation and the resulting breakdown point.
Finally, we consider generalized linear models and rely on logistic slippage to perform
outlier detection and removal in binary classification. Here we use L0-constraints
and mixed-integer conic programming techniques to solve the underlying double
combinatorial problem of feature selection and outlier detection, and the framework
allows us again to pursue optimality guarantees.
For all the proposed approaches, we also provide computationally lean heuristic
algorithms, tuning procedures, and diagnostic tools which help to guide the analysis.
We consider several real-world applications, including the study of the relationships
between childhood obesity and the human microbiome, and of the main drivers of
honey bee loss. All methods developed and data used, as well as the source code to
replicate our analyses, are publicly available.
Employing data fusion & diversity in the applications of adaptive signal processing
The paradigm of adaptive signal processing is a simple yet powerful method for the class of system identification problems. The classical approaches consider standard one-dimensional signals, whereby the model can be formulated in a flat-view matrix/vector framework. Nevertheless, the rapidly increasing availability of large-scale multisensor/multinode measurement technology has rendered the traditional way of representing the data insufficient. To this end, the author (referred to from this point onward as `we', `us', and `our', to signify the author together with supporting contributors, i.e. the supervisor, colleagues and overseas academics specializing in the specific pieces of research throughout this thesis) has applied the adaptive filtering framework to problems that employ techniques of data diversity and fusion, including quaternions, tensors and graphs. At first glance, all these structures share one important common feature: invertible isomorphism. In other words, they are algebraically one-to-one related in real vector space. Furthermore, our continuing course of research affords a segue between all three data types. Firstly, we proposed novel quaternion-valued adaptive algorithms named the n-moment widely linear quaternion least mean squares (WL-QLMS) and c-moment WL-LMS. Both are as fast as the recursive-least-squares method but more numerically robust thanks to the lack of matrix inversion. Secondly, the adaptive filtering method is applied to a more complex task: online tensor dictionary learning, named online multilinear dictionary learning (OMDL). The OMDL is partly inspired by the derivation of the c-moment WL-LMS due to its parsimonious formulae. In addition, sequential higher-order compressed sensing (HO-CS) is developed to couple with the OMDL and make maximal use of the learned dictionary for the best possible compression.
Lastly, we consider graph random processes, which are multivariate random processes with a spatiotemporal (or vertex-time) relationship. As with the tensor dictionary, one of the main challenges in graph signal processing is the sparsity constraint on the graph topology, a challenging issue for online methods. We introduced a novel splitting gradient projection into adaptive graph filtering to achieve a sparse topology. Extensive experiments were conducted to support the analysis of all the algorithms proposed in this thesis, and to point out potentials, limitations and as-yet-unaddressed issues in these research endeavors.
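For readers unfamiliar with the adaptive filtering framework this abstract builds on, the following Python sketch shows the classical real-valued least mean squares (LMS) update used for system identification. It is not the WL-QLMS, OMDL or graph algorithms of the thesis. The two-tap unknown system, the step size and the noiseless setting are assumptions made for illustration.

```python
import random

def lms_identify(x, d, num_taps, step):
    """Estimate FIR weights w so that sum_k w[k] * x[n-k] tracks d[n]."""
    w = [0.0] * num_taps
    for n in range(num_taps - 1, len(x)):
        u = [x[n - k] for k in range(num_taps)]      # regressor, newest first
        y_hat = sum(wk * uk for wk, uk in zip(w, u))  # filter output
        err = d[n] - y_hat                            # a priori error
        # Stochastic-gradient step: no matrix inversion is needed.
        w = [wk + step * err * uk for wk, uk in zip(w, u)]
    return w

# Hypothetical unknown system and noiseless observations.
rng = random.Random(0)
true_w = [0.5, -0.3]
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
d = [sum(true_w[k] * x[n - k] for k in range(len(true_w)) if n - k >= 0)
     for n in range(len(x))]

w_hat = lms_identify(x, d, num_taps=2, step=0.05)
```

Each update costs a number of operations proportional to the filter length, which is the low-complexity baseline that more elaborate adaptive algorithms are usually compared against.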
Online Machine Learning for Inference from Multivariate Time-series
Inference and data analysis over networks have become significant areas of research due to the increasing prevalence of interconnected systems and the growing volume of data they produce. Many of these systems generate data in the form of multivariate time series, which are collections of time series data that are observed simultaneously across multiple variables. For example, EEG measurements of the brain produce multivariate time series data that record the electrical activity of different brain regions over time. Cyber-physical systems generate multivariate time series that capture the behaviour of physical systems in response to cybernetic inputs. Similarly, financial time series reflect the dynamics of multiple financial instruments or market indices over time. Through the analysis of these time series, one can uncover important details about the behavior of the system, detect patterns, and make predictions. Therefore, designing effective methods for data analysis and inference over networks of multivariate time series is a crucial area of research with numerous applications across various fields. In this Ph.D. Thesis, our focus is on identifying the directed relationships between time series and leveraging this information to design algorithms for data prediction as well as missing data imputation. This Ph.D. thesis is organized as a compendium of papers, which consists of seven chapters and appendices. The first chapter is dedicated to motivation and literature survey, whereas in the second chapter, we present the fundamental concepts that readers should understand to grasp the material presented in the dissertation with ease. In the third chapter, we present three online nonlinear topology identification algorithms, namely NL-TISO, RFNL-TISO, and RFNL-TIRSO. In this chapter, we assume the data is generated from a sparse nonlinear vector autoregressive model (VAR), and propose online data-driven solutions for identifying nonlinear VAR topology. 
We also provide convergence guarantees in terms of dynamic regret for the proposed algorithm RFNL-TIRSO. Chapters four and five of the dissertation delve into the issue of missing data and explore how the learned topology can be leveraged to address this challenge. Chapter five is distinct from other chapters in its exclusive focus on edge flow data and introduces an online imputation strategy based on a simplicial complex framework that leverages the known network structure in addition to the learned topology. Chapter six of the dissertation takes a different approach, assuming that the data is generated from nonlinear structural equation models. In this chapter, we propose an online topology identification algorithm using a time-structured approach, incorporating information from both the data and the model evolution. The algorithm is shown to have convergence guarantees achieved by bounding the dynamic regret. Finally, chapter seven of the dissertation provides concluding remarks and outlines potential future research directions.
Attitudes towards old age and age of retirement across the world: findings from the future of retirement survey
The 21st century has been described as the first era in human history when the world will no longer be young, and there will be drastic changes in many aspects of our lives, including socio-demographics, finances and attitudes towards old age and retirement. This talk will briefly introduce the Global Ageing Survey (GLAS) 2004 and 2005, also popularly known as “The Future of Retirement”. These surveys provide a unique data source, collected in 21 countries and territories, that allows researchers to better understand individual as well as societal changes as we age with regard to savings, retirement and healthcare. In 2004, approximately 10,000 people aged 18+ were surveyed in nine countries and one territory (Brazil, Canada, China, France, Hong Kong, India, Japan, Mexico, UK and USA). In 2005, the number was increased to twenty-one by adding Egypt, Germany, Indonesia, Malaysia, Poland, Russia, Saudi Arabia, Singapore, Sweden, Turkey and South Korea. Moreover, an additional 6320 private sector employers were surveyed in 2005, some 300 in each country, with a view to elucidating the attitudes of employers to issues relating to older workers. The paper aims to examine attitudes towards old age and retirement across the world and will indicate some policy implications.
Variable selection in linear regression models with large number of predictors
Doctoral thesis, Doctoral Programme in Mathematics and Applications
In this thesis, we study the problem of variable selection in linear regression models
in the presence of a large number of predictors. Usually, some of these predictors are
correlated, so including all of them in a regression model will not essentially improve
the model's predictive ability. Also, models with a reasonable and tractable number
of predictors are easier to interpret than models with a large number of predictors.
Therefore, variable selection is an important problem to study. Some popular
regression methods can handle collinearity in data but still require the removal of
irrelevant predictors, so we present an algorithm that enables
these methods to perform variable selection. We review the well-known variable
selection methods, and investigate the performance of these methods as well as the
proposed approach on both simulated and real data sets. The results show that the
new algorithm performs well in selecting the relevant variables.
Also, when the data contains outliers, outlier detection and variable selection are
not two separable problems. Therefore, we propose a method capable of outlier
detection and variable selection. We review the well-known robust variable selection
methods and evaluate the performance of these methods with the proposed approach
on contaminated simulation data sets as well as on real data. The results show
that the proposed method performs well concerning both outlier detection and robust
variable selection.
This dissertation studies the problem of variable selection in linear regression
models in the presence of a large number of explanatory variables, or predictors,
some of which are usually correlated.
One principle to take into account is the "principle of parsimony": simpler models
should be preferred to more complex ones, provided that the quality of the
fit/prediction is similar. Such models are easier to interpret than
models with a large number of predictors. The study of variable selection
methods is therefore a very important problem in regression models.
Given that there are some well-known regression methods capable of handling
multicollinearity in the data but that still do not remove irrelevant
predictors, we present an algorithm that performs variable selection.
Variable selection methods are studied and their performance, as well as that of
the proposed algorithm, is investigated on simulated and
real data. The results show that the new algorithm performs well
in selecting the variables relevant to the model. In addition, when the data
contain outliers, outlier detection and variable selection cannot
be studied as two separable problems. Thus, this dissertation proposes
a method capable of simultaneous outlier detection and variable selection.
The best-known robust variable selection methods were studied in order
to evaluate and compare their performance with the approach proposed
in this work, using simulation studies under contamination as well as
real data. The results show that the developed method performs well
in terms of both outlier detection and robust variable selection.
This work was funded by the Portuguese Foundation for Science and Technology (FCT) under the grant SFRH/BD/51164/2010
Graphical model driven methods in adaptive system identification
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, September 2016
Identifying and tracking an unknown linear system from observations of its inputs and outputs
is a problem at the heart of many different applications. Due to the complexity and
rapid variability of modern systems, there is extensive interest in solving the problem with
as little data and computation as possible.
This thesis introduces the novel approach of reducing problem dimension by exploiting
statistical structure on the input. By modeling the input to the system of interest as a
graph-structured random process, it is shown that a large parameter identification problem
can be reduced into several smaller pieces, making the overall problem considerably simpler.
Algorithms that can leverage this property in order to either improve the performance
or reduce the computational complexity of the estimation problem are developed. The first
of these, termed the graphical expectation-maximization least squares (GEM-LS) algorithm,
can utilize the reduced dimensional problems induced by the structure to improve the accuracy
of the system identification problem in the low sample regime over conventional methods
for linear learning with limited data, including regularized least squares methods.
Next, a relaxation of the GEM-LS algorithm termed the relaxed approximate graph
structured least squares (RAGS-LS) algorithm is obtained that exploits structure to perform
highly efficient estimation. The RAGS-LS algorithm is then recast into a recursive
framework termed the relaxed approximate graph structured recursive least squares (RAGS-RLS)
algorithm, which can be used to track time-varying linear systems with low complexity
while achieving tracking performance comparable to much more computationally intensive
methods.
The performance of the algorithms developed in the thesis in applications such as channel
identification, echo cancellation and adaptive equalization demonstrates that the gains admitted
by the graph framework are realizable in practice. The methods have wide applicability,
and in particular show promise as the estimation and adaptation algorithms for a new breed
of fast, accurate underwater acoustic modems.
The contributions of the thesis illustrate the power of graphical model structure in simplifying
difficult learning problems, even when the target system is not directly structured.
The work in this thesis was supported primarily by the Office of Naval Research through
an ONR Special Research Award in Ocean Acoustics; and at various times by the National
Science Foundation, the WHOI Academic Programs Office and the MIT Presidential Fellowship
Program.
- …