143 research outputs found

    Comparison and classification of flexible distributions for multivariate skew and heavy-tailed data

    We present, compare and classify popular families of flexible multivariate distributions. Our classification is based on the type of symmetry (spherical, elliptical, central symmetry or asymmetry) and on the tail behaviour (a single tail-weight parameter or multiple tail-weight parameters). We compare the families both theoretically (relevant properties and distinctive features) and through a Monte Carlo study (comparing their fitting ability in finite samples).
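    As a toy illustration of what a single tail-weight parameter controls (a sketch under assumed parameter values, not taken from the paper), the snippet below contrasts a light-tailed elliptical family, the multivariate normal, with a heavy-tailed one, the multivariate t with degrees of freedom nu:

```python
# Toy contrast (assumed parameter values) between a light-tailed and a heavy-tailed
# elliptical family sharing the same scatter matrix: multivariate normal vs. multivariate t.
import numpy as np

rng = np.random.default_rng(0)
d, n, nu = 3, 100_000, 4                    # dimension, sample size, t tail-weight (df)
mean, scatter = np.zeros(d), np.eye(d)

z_norm = rng.multivariate_normal(mean, scatter, size=n)
# Multivariate t draw: a normal vector divided by an independent sqrt(chi2_nu / nu).
z_t = z_norm / np.sqrt(rng.chisquare(nu, size=n) / nu)[:, None]

def tail_fraction(x, radius=4.0):
    """Fraction of points falling outside a sphere of the given radius."""
    return np.mean(np.linalg.norm(x, axis=1) > radius)

print("normal tail fraction:", tail_fraction(z_norm))   # very small
print("t(4)   tail fraction:", tail_fraction(z_t))      # noticeably larger
```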

    Adjusting inverse regression for predictors with clustered distribution

    A major family of sufficient dimension reduction (SDR) methods, called inverse regression, commonly requires the distribution of the predictor $X$ to have a linear $E(X \mid \beta^{\mathsf{T}}X)$ and a degenerate $\mathrm{var}(X \mid \beta^{\mathsf{T}}X)$ for the desired reduced predictor $\beta^{\mathsf{T}}X$. In this paper, we adjust the first- and second-order inverse regression methods by modeling $E(X \mid \beta^{\mathsf{T}}X)$ and $\mathrm{var}(X \mid \beta^{\mathsf{T}}X)$ under a mixture model assumption on $X$, which allows these terms to convey more complex patterns and is most suitable when $X$ has a clustered sample distribution. The proposed SDR methods build a natural path between inverse regression and the localized SDR methods, and in particular inherit the advantages of both: they are $\sqrt{n}$-consistent, efficiently implementable, directly adjustable to high-dimensional settings, and capable of fully recovering the desired reduced predictor. These findings are illustrated by simulation studies and a real data example, which also suggest the effectiveness of the proposed methods for nonclustered data.
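    For context, here is a minimal sketch of classical sliced inverse regression (SIR), the prototypical first-order inverse regression method that the paper adjusts; it is illustrative only, does not implement the mixture-model adjustment, and the toy single-index model is an assumption:

```python
# Minimal sketch of classical (unadjusted) sliced inverse regression (SIR).
import numpy as np

def sir_directions(X, y, n_slices=10, n_directions=1):
    """Estimate SDR directions beta such that y depends on X only through beta^T X."""
    n, p = X.shape
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    # Standardize the predictors: Z = (X - mu) Sigma^{-1/2}.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ inv_sqrt
    # Slice on y and average Z within each slice (the inverse regression E[Z | y]).
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original predictor scale.
    _, vecs = np.linalg.eigh(M)
    return inv_sqrt @ vecs[:, -n_directions:]

# Toy usage: a single-index model y = (beta^T X)^3 + noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = (X @ beta) ** 3 + rng.normal(scale=0.5, size=2000)
print(sir_directions(X, y).ravel())   # roughly proportional to (1, -1, 0, 0, 0)
```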

    Methodological and Computational Advances for High–Dimensional Bayesian Regression with Binary and Categorical Responses

    Probit and logistic regressions are among the most popular and well-established formulations to model binary observations, thanks to their plain structure and high interpretability. Despite their simplicity, their use poses non-trivial hindrances to the inferential procedure, particularly from a computational perspective and in high-dimensional scenarios. This still motivates active research for probit, logit, and a number of their generalizations, especially within the Bayesian community. Conjugacy results for standard probit regression under normal and unified skew-normal (SUN) priors appeared only recently in the literature. Such findings were rapidly extended to different generalizations of probit regression, including multinomial probit, dynamic multivariate probit and skewed Gaussian processes, among others. Nonetheless, these recent developments focus on specific subclasses of models, which can all be regarded as instances of a potentially broader family of formulations that rely on partially or fully discretized Gaussian latent utilities. As such, we develop a unified and comprehensive framework that encompasses all the above constructions and many others, such as tobit regression and its extensions, for which conjugacy results are still missing. We show that the SUN family of distributions is conjugate for all models within the broad class considered, which notably encompasses all formulations whose likelihoods are given by the product of multivariate Gaussian densities and cumulative distribution functions evaluated at a linear combination of the parameter of interest. Such a unifying framework is practically and conceptually useful for studying general theoretical properties and developing future extensions. This includes new avenues for improved posterior inference exploiting i.i.d. samplers from the exact SUN posteriors, as well as recent accurate and scalable variational Bayes (VB) and expectation-propagation approximations, for which we derive a novel efficient implementation.

    Along a parallel research line, we focus on binary regression under the logit mapping, for which computations in high dimensions still pose open challenges. To overcome such difficulties, several contributions focus on iteratively solving a series of surrogate problems, entailing the sequential refinement of tangent lower bounds for the logistic log-likelihoods. For instance, tractable quadratic minorizers can be exploited to obtain maximum likelihood (ML) and maximum a posteriori estimates via minorize-maximize and expectation-maximization schemes, with desirable convergence guarantees. Likewise, quadratic surrogates can be used to construct Gaussian approximations of the posterior distribution in mean-field VB routines, which may however suffer from low accuracy in high dimensions. This issue can be mitigated by resorting to more flexible but involved piece-wise quadratic bounds, which, however, are typically defined implicitly and become less tractable as the number of pieces increases. For this reason, we derive a novel tangent minorizer for logistic log-likelihoods that combines a quadratic term with a single piece-wise linear contribution for each observation, proportional to the absolute value of the corresponding linear predictor. The proposed bound is guaranteed to improve the accuracy over the sharpest quadratic minorizer, while minimizing the reduction in tractability compared to general piece-wise quadratic bounds. As opposed to the latter, its explicit analytical expression allows computations to be simplified by exploiting a well-known scale-mixture representation of Laplace random variables. We investigate the benefits of the proposed methodology both in the context of penalized ML estimation, where it leads to a faster convergence rate of the optimization procedure, and of VB approximation, where the resulting accuracy improvement over mean-field strategies can be substantial in skewed and high-dimensional scenarios.
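    As a concrete illustration of the quadratic-minorizer idea discussed above (not the new piece-wise linear-plus-quadratic bound derived in the thesis), the sketch below runs a minorize-maximize scheme for logistic maximum likelihood using the classical fixed-curvature bound, on simulated data:

```python
# Minorize-maximize for logistic regression: the log-likelihood is minorized by a
# quadratic with fixed curvature X^T X / 4 (a classical Boehning-type bound), so every
# iteration solves the same linear system with an updated gradient.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_mm(X, y, n_iter=200):
    """Maximum likelihood for logistic regression via a quadratic minorizer."""
    beta = np.zeros(X.shape[1])
    # Fixed curvature bound: -X^T X / 4 lower-bounds the Hessian everywhere.
    B_inv = np.linalg.inv(X.T @ X / 4.0)
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ beta))   # score at the current iterate
        beta = beta + B_inv @ grad             # maximize the quadratic surrogate
    return beta

# Toy usage with simulated data (all values are assumptions for illustration).
rng = np.random.default_rng(2)
X = np.c_[np.ones(500), rng.normal(size=(500, 2))]
beta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, sigmoid(X @ beta_true))
print(logistic_mm(X, y))                       # close to beta_true
```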

    Robust Nonparametric Inference

    In this article, we provide a personal review of the literature on nonparametric and robust tools for the standard univariate and multivariate location and scatter problems, as well as for linear regression, with a special focus on sign and rank methods, their equivariance and invariance properties, and their robustness and efficiency. Beyond parametric models, the population quantities of interest are often formulated as location, scatter, skewness, kurtosis and other functionals. Some old and recent tools for model checking, dimension reduction, and subspace estimation in wide semiparametric models are discussed. We also discuss recent extensions of these procedures to certain nonstandard semiparametric cases, including clustered and matrix-valued data. Our personal list of important unsolved and future issues is provided.
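    As one concrete example of a robust, sign-based multivariate location estimator of the kind surveyed in such reviews (an illustrative sketch, not code from the article), the snippet below computes the spatial median with the Weiszfeld fixed-point iteration:

```python
# The spatial median: the point minimizing the sum of Euclidean distances to the data,
# a classical robust location estimate closely tied to spatial sign methods.
import numpy as np

def spatial_median(X, n_iter=100, eps=1e-10):
    """Weiszfeld iteration for the multivariate spatial median."""
    m = X.mean(axis=0)                       # start from the sample mean
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, eps)         # guard against zero distances
        m = (w[:, None] * X).sum(axis=0) / w.sum()
    return m

rng = np.random.default_rng(3)
X = rng.standard_t(df=2, size=(500, 3))      # heavy-tailed data with outliers
print("mean:          ", X.mean(axis=0))
print("spatial median:", spatial_median(X))  # much less affected by the outliers
```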

    On Independent Component Analysis and Supervised Dimension Reduction for Time Series

    The main goal of this thesis has been to develop tools to recover hidden structures, latent variables, or latent subspaces in multivariate and dependent time series data. The secondary goal has been to implement computationally efficient algorithms for the methods in an R package.

    In Blind Source Separation (BSS) the goal is to find uncorrelated latent sources by transforming the observed data in an appropriate way. In Independent Component Analysis (ICA) the latent sources are assumed to be independent. The well-known ICA methods FOBI and JADE are generalized to work with multivariate time series in which the latent components exhibit stochastic volatility. In such time series the volatility cannot be regarded as constant in time, as there are often periods of high and periods of low volatility. The new methods are called gFOBI and gJADE. SOBI, a classic method that works well when the volatility can be assumed constant, is also given a variant, called vSOBI, that works for time series with stochastic volatility.

    In dimension reduction the idea is to transform the data into a new coordinate system, in which the components are uncorrelated or even independent, and then keep only some of the transformed variables, in such a way that not too much of the important information in the data is lost. The aforementioned BSS methods can be used for unsupervised dimension reduction, where all the variables or time series have the same role. In supervised dimension reduction the relationship between a response and the predictor variables needs to be considered as well. Well-known supervised dimension reduction methods for independent and identically distributed data, SIR and SAVE, are generalized to work for time series data. The methods TSIR and TSAVE are introduced and shown to work well for time series, as they also use the information in the past values of the predictor time series. TSSH, a hybrid version of TSIR and TSAVE, is also introduced. All the methods developed in this thesis have been implemented in the R package tsBSS.
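    For reference, here is a minimal sketch of the classical FOBI procedure for i.i.d. data, the starting point that gFOBI generalizes to time series with stochastic volatility; it is illustrative Python rather than the tsBSS R implementation, and the toy sources are assumptions:

```python
# Classical FOBI: whiten the observations, then eigendecompose the fourth-moment
# scatter matrix E[||z||^2 z z^T] of the whitened data to finish the rotation.
import numpy as np

def fobi(X):
    """Return an unmixing matrix W such that W @ (x - mean) recovers the sources."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ whiten
    # Fourth-moment scatter matrix of the whitened data.
    M = (Z * (np.linalg.norm(Z, axis=1) ** 2)[:, None]).T @ Z / n
    _, U = np.linalg.eigh(M)
    return U.T @ whiten

# Toy usage: mix two independent non-Gaussian sources and try to unmix them.
rng = np.random.default_rng(4)
S = np.c_[rng.exponential(size=5000), rng.uniform(-1, 1, size=5000)]
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = S @ A.T
print(fobi(X) @ A)   # close to a permutation/sign/scale version of the identity
```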

    Linear and nonlinear mixed-effects models with censored response using the multivariate normal and Student-t distributions

    Advisor: Víctor Hugo Lachos Dávila. Master's dissertation, Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica. Abstract: Mixed models are commonly used to represent longitudinal or repeated measures data. An additional complication arises when the response is censored, for example, due to limits of quantification of the assay used. Normal distributions for the random effects and residual errors are usually assumed, but such assumptions make inferences vulnerable to the presence of outliers. Motivated by a concern about sensitivity to potential outliers or data with tails longer than normal, in this dissertation we aim to develop inference for linear and nonlinear mixed effects models with censored response (NLMEC/LMEC) based on the multivariate Student-t distribution, a flexible alternative to the corresponding normal distribution. We propose an ECM algorithm for computing the maximum likelihood estimates for NLMEC/LMEC. This algorithm uses closed-form expressions at the E-step, which rely on formulas for the mean and variance of a truncated multivariate-t distribution. The proposed algorithm is implemented in the R package tlmec. We also propose an exact ECM algorithm for linear and nonlinear mixed effects models with censored response based on the multivariate normal distribution, which enables us to develop local influence analysis for mixed effects models on the basis of the conditional expectation of the complete-data log-likelihood function. The developed procedures are illustrated with the analysis of longitudinal HIV viral load data from two recent AIDS studies.
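    To illustrate the role of closed-form truncated moments in an E-step (in a far simpler setting than the LMEC/NLMEC models of the dissertation, and not using the tlmec package), the sketch below runs EM for the mean and standard deviation of a univariate normal sample with left-censored observations:

```python
# EM for a normal sample with values left-censored at a detection limit; the E-step
# uses closed-form truncated-normal moments, the univariate-normal analogue of the
# truncated multivariate-t moments used in the dissertation's ECM algorithm.
import numpy as np
from scipy.stats import norm

def em_censored_normal(y, censored, limit, n_iter=200):
    """y: observed values (censored entries hold the limit); censored: boolean mask."""
    mu, sigma = y.mean(), y.std()
    for _ in range(n_iter):
        a = (limit - mu) / sigma
        lam = norm.pdf(a) / norm.cdf(a)               # inverse Mills ratio
        e1 = mu - sigma * lam                         # E[y | y < limit]
        v = sigma**2 * (1 - a * lam - lam**2)         # Var[y | y < limit]
        e2 = v + e1**2                                # E[y^2 | y < limit]
        s1 = np.where(censored, e1, y)                # expected sufficient statistics
        s2 = np.where(censored, e2, y**2)
        mu = s1.mean()
        sigma = np.sqrt(s2.mean() - mu**2)
    return mu, sigma

rng = np.random.default_rng(5)
y_full = rng.normal(2.0, 1.0, size=2000)
limit = 1.0
censored = y_full < limit
y_obs = np.where(censored, limit, y_full)
print(em_censored_normal(y_obs, censored, limit))     # close to (2.0, 1.0)
```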

    Copula models for epidemiological research and practice

    Investigating associations between random variables (rvs) is one of many topics at the heart of statistical science. Graphical displays show emerging patterns between rvs, and the strength of their association is conventionally quantified via correlation coefficients. When two or more of these rvs are thought of as outcomes, their association is governed by a joint probability distribution function (pdf). When the joint pdf is bivariate normal, scalar correlation coefficients will produce a satisfactory summary of the association; otherwise alternative measures are needed. Local dependence functions, together with their corresponding graphical displays, quantify and show how the strength of the association varies across the span of the data. Additionally, the multivariate distribution function can be explicitly formulated and explored. Copulas model joint distributions of varying shapes by combining the separate (univariate) marginal cumulative distribution functions of each rv under a specified correlation structure. Copula models can be used to analyse complex relationships and can incorporate covariates into their parameters; they therefore offer increased flexibility in modelling dependence between rvs. Copula models may also be used to construct bivariate analogues of centiles, an application for which few references are available in the literature, though it is of particular interest for many paediatric applications. Population centiles are widely used to highlight children or adults who have unusual univariate outcomes. Whilst the methodology for the construction of univariate centiles is well established, there has been very little work in the area of bivariate analogues of centiles where two outcomes are jointly considered. Conditional models can increase the efficiency of centile analogues in detecting individuals who require some form of intervention. Such adjustments can be readily incorporated into the modelling of the marginal distributions and of the dependence parameter within the copula model.
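    As a small illustration of the copula construction described above (a sketch with assumed marginals and dependence parameter, not an analysis from the thesis), the snippet below samples from a bivariate Gaussian copula and attaches two different marginal distributions:

```python
# Sampling from a bivariate Gaussian copula with arbitrary marginals: draw from a
# bivariate normal, map to (0, 1) with the normal CDF, then apply the inverse CDFs
# of the desired marginals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
rho, n = 0.7, 10_000

z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
u = stats.norm.cdf(z)                              # the copula sample on (0, 1)^2
x1 = stats.gamma.ppf(u[:, 0], a=2.0, scale=1.5)    # e.g. a skewed outcome
x2 = stats.norm.ppf(u[:, 1], loc=100, scale=15)    # e.g. a symmetric outcome

# The marginals are preserved while the dependence comes from the copula.
print("Spearman correlation:", round(stats.spearmanr(x1, x2)[0], 2))
```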

    Asymmetry and fat-tails in financial time series


    Machine Learning Developments in Dependency Modelling and Feature Extraction

    Three complementary feature extraction approaches are developed in this thesis, addressing the challenge of dimensionality reduction in the presence of multivariate heavy-tailed and asymmetric distributions. First, we demonstrate how to improve the robustness of standard Probabilistic Principal Component Analysis by adapting the concept of robust mean and covariance estimation within the standard framework. We then introduce feature extraction methods that extend standard Principal Component Analysis by exploring distribution-based robustification. This is achieved via Probabilistic Principal Component Analysis (PPCA), for which new, statistically robust variants are derived that also treat missing data. We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which (1) accounts for asymmetric distributions of the observation data, (2) provides a framework for grouped and generalised multiple-degree-of-freedom structures, giving a more flexible way to model groups of marginal tail dependence in the observation data, and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete-data setting to efficiently handle the presence of missing values in the observation vector, and we discuss the statistical properties underlying their robustness.

    In the next part of the thesis, we demonstrate the applicability of feature extraction methods to the statistical analysis of multidimensional dynamics. We introduce the class of Hybrid Factor models that combine classical state-space model formulations with the incorporation of exogenous factors, and show how to utilize the information obtained from features extracted with the introduced robust PPCA in a modelling framework in a meaningful and parsimonious manner. In the first application study, we show the applicability of robust feature extraction methods in the real-data environment of financial markets and combine the obtained results with a stochastic multi-factor panel-regression-based state-space model in order to model the dynamics of yield curves whilst incorporating regression factors. We embed the rank-reduced feature extractions into a stochastic representation of state-space models for yield curve dynamics and compare the results with classical multi-factor dynamic Nelson-Siegel state-space models. This leads to important new representations of yield curve models that can have practical importance for addressing questions of financial stress testing and monetary policy interventions, and which can efficiently incorporate financial big data. We illustrate our results on various financial and macroeconomic data sets from the Euro Zone and international markets.

    In the second study, we develop a multi-factor extension of the family of Lee-Carter stochastic mortality models. We build upon the time, period and cohort stochastic model structure to include exogenous observable demographic features that can be used as additional factors to improve model fit and forecasting accuracy. We develop a framework in which (a) we employ projection-based techniques of dimensionality reduction that are amenable to different structures of demographic data; (b) we analyse demographic data sets in terms of their patterns of missingness and the impact of such missingness on the feature extraction; and (c) we introduce a class of multi-factor stochastic mortality models incorporating time, period, cohort and demographic features, which we develop within a Bayesian state-space estimation framework. Finally, (d) we develop an efficient combined Markov chain and filtering framework for sampling the posterior and forecasting. We undertake a detailed case study on Human Mortality Database demographic data from European countries and use the extracted features to better explain the term structure of mortality in the UK over time for male and female populations. This is compared to a pure Lee-Carter stochastic mortality model, demonstrating that our feature extraction framework and the consequent multi-factor mortality model improve both the in-sample fit and, importantly, out-of-sample mortality forecasts by a non-trivial gain in performance.
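    As a baseline illustration of the PPCA machinery that the robust variants build on (standard Gaussian PPCA via the Tipping-Bishop closed-form maximum likelihood solution, assuming complete data; not the Student-t or skewed extensions developed in the thesis), consider the following sketch:

```python
# Standard Gaussian PPCA: the ML loadings come from the top eigenvectors of the sample
# covariance, and the isotropic noise variance is the average of the discarded eigenvalues.
import numpy as np

def ppca_ml(X, q):
    """Return the ML loading matrix W (p x q) and isotropic noise variance sigma2."""
    S = np.cov(X, rowvar=False)                       # sample covariance
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]        # sort descending
    sigma2 = evals[q:].mean()                         # average discarded variance
    W = evecs[:, :q] @ np.diag(np.sqrt(evals[:q] - sigma2))
    return W, sigma2

# Toy usage: a 2-factor structure observed in 6 dimensions with isotropic noise.
rng = np.random.default_rng(7)
W_true = rng.normal(size=(6, 2))
Z = rng.normal(size=(3000, 2))
X = Z @ W_true.T + 0.3 * rng.normal(size=(3000, 6))
W_hat, s2_hat = ppca_ml(X, q=2)
print("estimated noise variance:", round(s2_hat, 3))  # close to 0.3**2 = 0.09
```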