5 research outputs found

    Influence of single observations on the choice of the penalty parameter in ridge regression

    Penalized regression methods, such as ridge regression, rely heavily on the choice of a tuning, or penalty, parameter, which is often computed via cross-validation. Discrepancies in the value of the penalty parameter may lead to substantial differences in regression coefficient estimates and predictions. In this paper, we investigate the effect of single observations on the optimal choice of the tuning parameter, showing how the presence of influential points can dramatically change it. We classify points as "expanders" or "shrinkers", based on their effect on the model complexity. Our approach supplies a visual exploratory tool to identify influential points, naturally implementable for high-dimensional data where traditional approaches usually fail. Applications to real data examples, both low- and high-dimensional, and a simulation study are presented. Comment: 26 pages, 6 figures
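As a rough illustration of the idea in this abstract, the sketch below deletes one observation at a time and records how the cross-validated ridge penalty moves. It is not the authors' procedure: the function names, the penalty grid, the synthetic data, and the use of leave-one-out cross-validation without an intercept are all assumptions made for the example. How a shift maps onto "expanders" and "shrinkers" follows the paper's own definitions, which are not reproduced here.

```python
import numpy as np

def loocv_error(X, y, lam):
    """Leave-one-out CV mean squared error for ridge (no intercept)."""
    n, p = X.shape
    # Ridge hat matrix H = X (X'X + lam I)^{-1} X'
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    # Exact shortcut for linear smoothers: LOO residual = e_i / (1 - h_ii)
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def best_lambda(X, y, grid):
    """Penalty on the grid with the smallest leave-one-out CV error."""
    errors = [loocv_error(X, y, lam) for lam in grid]
    return grid[int(np.argmin(errors))]

def lambda_shifts(X, y, grid):
    """Change in the selected penalty when each observation is deleted."""
    full = best_lambda(X, y, grid)
    shifts = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        shifts[i] = best_lambda(X[keep], y[keep], grid) - full
    return full, shifts

# Hypothetical synthetic data with one planted unusual response value.
rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
y[0] += 8.0

grid = np.logspace(-3, 3, 25)  # assumed candidate penalties
full_lam, shifts = lambda_shifts(X, y, grid)
print("penalty selected on all data:", full_lam)
print("observation with the largest shift:", int(np.argmax(np.abs(shifts))))
```

A positive shift means the selected penalty grows once the observation is removed, and a negative shift means it shrinks. Plotting `shifts` against the observation index gives a simple exploratory display in the spirit of the visual tool the abstract mentions.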

    Dealing with influential observations in accounting empirical research

    JEL Classification System: C51 – Econometric Modelling: Model Construction and Estimation; M41 – Accounting. The main objective of this dissertation is the study of influential observations and their treatment in the classical linear regression model. When the linear regression model is applied, observations differ in their influence on the estimation results, and that influence can lead to misleading conclusions if it is not handled correctly in empirical studies. Detecting these influential observations requires diagnostic measures, followed by the appropriate treatment (generally exclusion). The purpose of this investigation is therefore to analyse several published accounting articles whose statistical treatment of influential observations departs from the recommendations of econometrics textbooks, which may distort their results. The investigation has three parts. First, a theoretical framework sets out what influential observations are, why they matter, and the methodology that should be used to identify them. Second, the methodology used to detect influential observations in various published empirical accounting studies is analysed. Finally, an empirical study treats the influential observations in the technically correct way suggested by econometrics textbooks and compares the resulting regression estimates with those that would be obtained under the criteria traditionally adopted to identify influential observations in empirical accounting.

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.
Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses

    Measures of influence for the functional linear model with scalar response

    This paper studies how to identify influential observations in the functional linear model, in which the predictor is functional and the response is scalar. It measures the effect of a single observation on estimation and prediction when the model is estimated by the principal components method. To that end, three statistics are introduced for measuring the influence of each observation on estimation and prediction in the functional linear model with scalar response; they generalize the measures proposed for the standard regression model by [D.R. Cook, Detection of influential observations in linear regression, Technometrics 19 (1977) 15-18; D. Peña, A new statistic for influence in linear regression, Technometrics 47 (2005) 1-12]. A smoothed bootstrap method is proposed to estimate the quantiles of the influence measures, which makes it possible to point out which observations have the largest influence on estimation and prediction. The behavior of the three statistics and of the bootstrap-based quantile estimation method is analyzed via a simulation study. Finally, the practical use of the proposed statistics is illustrated by the analysis of a real data example, which shows that the proposed measures are useful for detecting heterogeneity in the functional linear model with scalar response. MSC: 62J99, 62J05, 62H12. Keywords: Cook's distance; functional linear model; functional principal components; influential observations; Peña's distance.
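For orientation, the sketch below computes Cook's distance in the standard (non-functional) linear regression model, which is the starting point that the abstract's statistics generalize. It does not implement the functional versions or the smoothed bootstrap described in the paper. The function name and the synthetic data are assumptions made for the example, and the formula used is the usual closed form D_i = e_i^2 h_ii / (p s^2 (1 - h_ii)^2).

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for ordinary least squares.

    X must already contain an intercept column if one is wanted.
    """
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)

# Hypothetical synthetic data with one planted high-leverage outlier.
rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 4.0, -5.0

X = np.column_stack([np.ones(n), x])
D = cooks_distance(X, y)
print("observation with the largest Cook's distance:", int(np.argmax(D)))
```

The closed form avoids refitting the model n times. It agrees with the deletion definition, in which D_i is the scaled squared distance between the coefficient estimates with and without observation i.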