The bootstrap - A review
The bootstrap, extensively studied during the last decade, has become a powerful tool in different areas of statistical inference. In this work, we present the main ideas of bootstrap methodology in several contexts, citing the most relevant contributions and illustrating some interesting aspects with examples and simulation studies.
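The core resampling idea the review surveys can be sketched in a few lines. The following is a minimal percentile-bootstrap confidence interval for an arbitrary statistic; the function and parameter names are illustrative choices, not taken from the review.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic:
    resample with replacement, recompute the statistic, and take
    empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

With 2000 replicates the interval for the mean of a small sample stabilizes quickly; more refined variants (BCa, studentized) follow the same resampling pattern.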
Inference on Counterfactual Distributions
Counterfactual distributions are important ingredients for policy analysis
and decomposition analysis in empirical economics. In this article we develop
modeling and inference tools for counterfactual distributions based on
regression methods. The counterfactual scenarios that we consider consist of
ceteris paribus changes in either the distribution of covariates related to the
outcome of interest or the conditional distribution of the outcome given
covariates. For either of these scenarios we derive joint functional central
limit theorems and bootstrap validity results for regression-based estimators
of the status quo and counterfactual outcome distributions. These results allow
us to construct simultaneous confidence sets for function-valued effects of the
counterfactual changes, including the effects on the entire distribution and
quantile functions of the outcome as well as on related functionals. These
confidence sets can be used to test functional hypotheses such as no-effect,
positive effect, or stochastic dominance. Our theory applies to general
counterfactual changes and covers the main regression methods including
classical, quantile, duration, and distribution regressions. We illustrate the
results with an empirical application to wage decompositions using data for the
United States.
As a part of developing the main results, we introduce distribution
regression as a comprehensive and flexible tool for modeling and estimating the
\textit{entire} conditional distribution. We show that distribution regression
encompasses the Cox duration regression and represents a useful alternative to
quantile regression. We establish functional central limit theorems and
bootstrap validity results for the empirical distribution regression process
and various related functionals.
Comment: 55 pages, 1 table, 3 figures; supplementary appendix with additional results available from the authors' web site.
Inference on counterfactual distributions
In this paper we develop procedures for performing inference in regression models about how potential policy interventions affect the entire marginal distribution of an outcome of interest. These policy interventions consist of either changes in the distribution of covariates related to the outcome holding the conditional distribution of the outcome given covariates fixed, or changes in the conditional distribution of the outcome given covariates holding the marginal distribution of the covariates fixed. Under either of these assumptions, we obtain uniformly consistent estimates and functional central limit theorems for the counterfactual and status quo marginal distributions of the outcome as well as other function-valued effects of the policy, including, for example, the effects of the policy on the marginal distribution function, quantile function, and other related functionals. We construct simultaneous confidence sets for these functions; these sets take into account the sampling variation in the estimation of the relationship between the outcome and covariates. Our procedures rely on, and our theory covers, all main regression approaches for modeling and estimating conditional distributions, focusing especially on classical, quantile, duration, and distribution regressions. Our procedures are general and accommodate both simple unitary changes in the values of a given covariate as well as changes in the distribution of the covariates or the conditional distribution of the outcome given covariates of general form. We apply the procedures to examine the effects of labor market institutions on the U.S. wage distribution.
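The plug-in construction behind these counterfactual estimands, integrating a conditional outcome distribution from one scenario over a covariate distribution from another, can be sketched as follows. The function names and the toy conditional CDF are illustrative, not the authors' estimators.

```python
def counterfactual_cdf(y, cond_cdf, cov_sample):
    """Plug-in estimate of F(y) = E_X[ F_{Y|X}(y | X) ]: average the
    conditional CDF over a sample from the chosen covariate distribution."""
    return sum(cond_cdf(y, x) for x in cov_sample) / len(cov_sample)

# Toy conditional distribution: Y | X = x is Uniform(x, x + 1).
def uniform_cond_cdf(y, x):
    return min(1.0, max(0.0, y - x))

# Status quo covariates versus a counterfactual covariate distribution:
status_quo = [0.0, 0.0, 1.0, 1.0]
counterfactual = [1.0, 1.0, 2.0, 2.0]
```

Holding the conditional distribution fixed and swapping the covariate sample changes the implied marginal distribution of the outcome, which is exactly the ceteris paribus comparison the paper formalizes.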
Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data
In recent years, the use of motion tracking systems for the acquisition of functional biomechanical gait data has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, which makes the dimensionality of the collected data far higher than the number of available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small sample size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for the selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are made using robust resampling techniques such as the bootstrap and k-fold cross-validation. Results from experiments with subjects suffering from pathological plantar hyperkeratosis show that the proposed method can lead to correct classification rates with less than 10% of the original features.
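The resampling-based performance assessment mentioned above can be illustrated with a bare-bones k-fold cross-validation loop. The nearest-class-mean classifier and all names below are simplifications for illustration, not the paper's Bayesian classifiers.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_accuracy(xs, ys, k=5):
    """k-fold cross-validated accuracy of a nearest-class-mean classifier
    on one-dimensional inputs: each held-out point is predicted by the
    class whose training-fold mean is closest."""
    correct = 0
    for fold in kfold_indices(len(xs), k):
        train = [i for i in range(len(xs)) if i not in fold]
        means = {}
        for c in set(ys[i] for i in train):
            vals = [xs[i] for i in train if ys[i] == c]
            means[c] = sum(vals) / len(vals)
        for i in fold:
            pred = min(means, key=lambda c: abs(xs[i] - means[c]))
            correct += (pred == ys[i])
    return correct / len(xs)
```

The point of the loop is that every accuracy count comes from data the classifier never saw during fitting, which is what makes the estimate honest in small-sample settings.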
Dawai: an R Package for Discriminant Analysis With Additional Information
The incorporation of additional information into discriminant rules is receiving increasing attention, as rules that include this information perform better than the usual rules. In this paper we introduce an R package called dawai, which provides functions to define rules that take into account this additional information, expressed in terms of restrictions on the means, to classify samples, and to evaluate the accuracy of the results. Moreover, we extend the results and definitions given in previous papers (Fernández, Rueda, and Salvador 2006; Conde, Fernández, Rueda, and Salvador 2012; Conde, Salvador, Rueda, and Fernández 2013) to the case of unequal covariances among the populations, and consequently define the corresponding restricted quadratic discriminant rules. We also define estimators of the accuracy of the rules for the general case of more than two populations. The wide range of applications of these procedures is illustrated with two data sets from two different fields, i.e., biology and pattern recognition.
Spanish Ministerio de Ciencia e Innovación (MTM2012-37129)
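A quadratic discriminant rule of the kind extended here, where each population keeps its own covariance, reduces in one dimension to comparing Gaussian log-densities plus log-priors. This unrestricted sketch (all names are mine) omits the package's defining feature, the restrictions on the means.

```python
import math

def qda_rule_1d(x, params):
    """Quadratic discriminant rule in one dimension.
    params maps each class to (prior, mean, variance); the rule picks the
    class maximizing log-prior + Gaussian log-density. Because variances
    may differ across classes, the decision boundary is quadratic in x."""
    def score(c):
        p, m, v = params[c]
        return math.log(p) - 0.5 * math.log(v) - (x - m) ** 2 / (2 * v)
    return max(params, key=score)
```

With equal variances the rule collapses to the linear discriminant; with unequal variances the variance term itself influences the decision, which is the case the paper's restricted quadratic rules address.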
An Object-Oriented Framework for Robust Multivariate Analysis
Taking advantage of the S4 class system of the programming environment R, which facilitates the creation and maintenance of reusable and modular components, an object-oriented framework for robust multivariate analysis was developed. The framework resides in the packages robustbase and rrcov and includes an almost complete set of algorithms for computing robust multivariate location and scatter, various robust methods for principal component analysis as well as robust linear and quadratic discriminant analysis. The design of these methods follows common patterns which we call statistical design patterns in analogy to the design patterns widely used in software engineering. The application of the framework to data analysis as well as possible extensions by the development of new methods is demonstrated on examples which themselves are part of the package rrcov.
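The flavor of the robust estimators the framework collects can be conveyed by their simplest univariate analogues, the median and the MAD. This sketch is in Python (the framework itself lives in the R packages robustbase and rrcov) and uses the standard 1.4826 factor that makes the MAD consistent for the standard deviation under normality.

```python
import statistics

def robust_location_scale(data):
    """Median and scaled MAD as robust analogues of mean and standard
    deviation: a single wild outlier barely moves either estimate."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) * 1.4826
    return med, mad
```

The multivariate estimators in the framework (MCD, OGK, S-estimators) generalize exactly this idea of downweighting outlying observations when estimating location and scatter.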
Accuracy of logistic models and receiver operating characteristic curves
The accuracy of prediction is a commonly studied topic in modern statistics. The performance of a predictor is becoming increasingly important as real-life decisions are made on the basis of predictions. In this thesis we investigate the prediction accuracy of logistic models from two different approaches.
Logistic regression is often used to discriminate between two groups or populations based on a number of covariates. The receiver operating characteristic (ROC) curve is a commonly used tool (especially in medical statistics) to assess the performance of such a score or test. By using the same data to fit the logistic regression and calculate the ROC curve, we overestimate the performance that the score would give if validated on a sample of future cases. This overestimation is studied and we propose a correction for the ROC curve and the area under the curve. The methods are illustrated by way of two medical examples and a simulation study, and we show that the overestimation can be quite substantial for small sample sizes.
The idea of shrinkage pertains to the notion that by including some prior information about the data under study we can improve prediction. Until now, the study of shrinkage has been concentrated almost exclusively on continuous measurements. We propose a methodology to study shrinkage for logistic regression modelling of categorical data with a binary response. Categorical data with a large number of levels are often grouped for modelling purposes, which discards useful information about the data. By using this information we can apply Bayesian methods to update model parameters, and we show through examples and simulations that in some circumstances the updated estimates are better predictors than the model.
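The performance measure at issue in the first part, the area under the ROC curve, has a simple empirical form: the probability that a randomly chosen case scores above a randomly chosen control, with ties counted half. A minimal sketch follows (the thesis's optimism correction itself is not reproduced here).

```python
def auc(scores, labels):
    """Empirical AUC: the fraction of (case, control) pairs in which the
    case receives the higher score, counting tied pairs as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Computing this quantity on the same data used to fit the score is exactly the in-sample evaluation whose optimism the thesis quantifies and corrects.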
Inference on counterfactual distributions
We develop inference procedures for policy analysis based on regression methods. We consider policy interventions that correspond to either changes in the distribution of covariates, or changes in the conditional distribution of the outcome given covariates, or both. Under either of these policy scenarios, we derive functional central limit theorems for regression-based estimators of the status quo and counterfactual marginal distributions. This result allows us to construct simultaneous confidence sets for function-valued policy effects, including the effects on the marginal distribution function, quantile function, and other related functionals. We use these confidence sets to test functional hypotheses such as no-effect, positive effect, or stochastic dominance. Our theory applies to general policy interventions and covers the main regression methods including classical, quantile, duration, and distribution regressions. We illustrate the results with an empirical application on wage decompositions using data for the United States. Of independent interest is the use of distribution regression as a tool for modeling the entire conditional distribution, encompassing duration/transformation regression, and representing an alternative to quantile regression. This is a revision of CWP09/09.
Wrapper algorithms and their performance assessment on high-dimensional molecular data
Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models in order to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) have used the term "wrapper algorithm" for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they have proposed the method of nested cross-validation (NCV) as a way of estimating their prediction error, which has by now evolved into the gold standard.
In the first part, this thesis provides further theoretical and empirical justification for the usage of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed, which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimation of the prediction error.
The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative concept to the repetition of separate within-study validations if several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data on a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms.
Eventually, the last part approaches computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis as well as the evaluation of the prediction models requires the usage of large computing resources. Parallel computing approaches are illustrated on cluster, cloud and high-performance computing resources using the R programming language. The usage of heterogeneous hardware and the processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction
from censored data.
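Nested cross-validation as used for wrapper algorithms, an inner loop that tunes on the outer training data only and an outer loop that measures error, can be sketched as follows. The threshold classifier and all names are illustrative placeholders, not the thesis's models.

```python
import random

def folds(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_error(xs, ys, fit, candidates, outer_k=5, inner_k=3):
    """Nested CV: the inner loop selects a tuning value using the outer
    training data only; the outer loop measures the error of the whole
    wrapper (tuning included), avoiding selection bias."""
    n = len(xs)
    errors = 0
    for test in folds(n, outer_k):
        train = [i for i in range(n) if i not in test]

        def inner_error(c):
            err = 0
            for inner_test in folds(len(train), inner_k, seed=1):
                inner_tr = [train[j] for j in range(len(train))
                            if j not in inner_test]
                model = fit(c, [xs[i] for i in inner_tr],
                            [ys[i] for i in inner_tr])
                err += sum(model(xs[train[j]]) != ys[train[j]]
                           for j in inner_test)
            return err

        best = min(candidates, key=inner_error)
        model = fit(best, [xs[i] for i in train], [ys[i] for i in train])
        errors += sum(model(xs[i]) != ys[i] for i in test)
    return errors / n

# Illustrative wrapper: a one-parameter threshold classifier. It ignores
# its training sample; the cut-off c itself is what the inner loop tunes.
def threshold_fit(c, train_xs, train_ys):
    return lambda x: 1 if x > c else 0
```

The key property is that the tuning decision is re-made inside every outer split, so the outer error estimate reflects the full wrapper procedure rather than a single fixed model.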
- ā¦