
    The bootstrap - A review

    The bootstrap, extensively studied during the last decade, has become a powerful tool in different areas of statistical inference. In this work, we present the main ideas of bootstrap methodology in several contexts, citing the most relevant contributions and illustrating some interesting aspects with examples and simulation studies.
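
    As a minimal illustration of the resampling idea surveyed above (a generic sketch in Python, not taken from the review): the bootstrap approximates the sampling distribution of a statistic by recomputing it on samples drawn with replacement from the observed data, and a percentile interval is read off the resulting replicates.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a generic statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Recompute the statistic on samples drawn with replacement from the data.
    replicates = np.array([
        statistic(rng.choice(data, size=n, replace=True)) for _ in range(n_boot)
    ])
    lower, upper = np.quantile(replicates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Illustrative use: interval for the median of a skewed sample.
sample = np.random.default_rng(1).exponential(scale=2.0, size=100)
print(bootstrap_ci(sample, np.median))
```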

    Inference on Counterfactual Distributions

    Counterfactual distributions are important ingredients for policy analysis and decomposition analysis in empirical economics. In this article we develop modeling and inference tools for counterfactual distributions based on regression methods. The counterfactual scenarios that we consider consist of ceteris paribus changes in either the distribution of covariates related to the outcome of interest or the conditional distribution of the outcome given covariates. For either of these scenarios we derive joint functional central limit theorems and bootstrap validity results for regression-based estimators of the status quo and counterfactual outcome distributions. These results allow us to construct simultaneous confidence sets for function-valued effects of the counterfactual changes, including the effects on the entire distribution and quantile functions of the outcome as well as on related functionals. These confidence sets can be used to test functional hypotheses such as no-effect, positive effect, or stochastic dominance. Our theory applies to general counterfactual changes and covers the main regression methods including classical, quantile, duration, and distribution regressions. We illustrate the results with an empirical application to wage decompositions using data for the United States. As a part of developing the main results, we introduce distribution regression as a comprehensive and flexible tool for modeling and estimating the entire conditional distribution. We show that distribution regression encompasses the Cox duration regression and represents a useful alternative to quantile regression. We establish functional central limit theorems and bootstrap validity results for the empirical distribution regression process and various related functionals. (Comment: 55 pages, 1 table, 3 figures; supplementary appendix with additional results available from the authors' web site.)
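
    A hedged sketch of the distribution-regression construction described above (illustrative Python code, not the authors' implementation; the data and function names are invented for the example): for each threshold on a grid, a binary regression of the indicator that the outcome lies below the threshold is fit on the covariates, and the status quo or counterfactual distribution is obtained by averaging the fitted conditional distribution over the corresponding covariate sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def distribution_regression(X, y, grid):
    """Fit logit distribution regressions P(Y <= t | X), one per threshold t."""
    return [
        LogisticRegression(max_iter=1000).fit(X, (y <= t).astype(int))
        for t in grid
    ]

def marginal_cdf(models, X_target):
    """Average the fitted conditional CDF over a (possibly counterfactual) covariate sample."""
    return np.array([m.predict_proba(X_target)[:, 1].mean() for m in models])

# Illustrative use: group 0 supplies the conditional model, group 1 the covariates.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(300, 2))
X1 = rng.normal(loc=0.5, size=(300, 2))
y0 = X0 @ np.array([1.0, -0.5]) + rng.normal(size=300)
grid = np.quantile(y0, np.linspace(0.05, 0.95, 19))
models = distribution_regression(X0, y0, grid)
F_status_quo = marginal_cdf(models, X0)        # observed covariate distribution
F_counterfactual = marginal_cdf(models, X1)    # counterfactual covariate distribution
```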

    Inference on counterfactual distributions

    In this paper we develop procedures for performing inference in regression models about how potential policy interventions affect the entire marginal distribution of an outcome of interest. These policy interventions consist of either changes in the distribution of covariates related to the outcome holding the conditional distribution of the outcome given covariates fixed, or changes in the conditional distribution of the outcome given covariates holding the marginal distribution of the covariates fixed. Under either of these assumptions, we obtain uniformly consistent estimates and functional central limit theorems for the counterfactual and status quo marginal distributions of the outcome as well as other function-valued effects of the policy, including, for example, the effects of the policy on the marginal distribution function, quantile function, and other related functionals. We construct simultaneous confidence sets for these functions; these sets take into account the sampling variation in the estimation of the relationship between the outcome and covariates. Our procedures rely on, and our theory covers, all main regression approaches for modeling and estimating conditional distributions, focusing especially on classical, quantile, duration, and distribution regressions. Our procedures are general and accommodate both simple unitary changes in the values of a given covariate and changes of general form in the distribution of the covariates or in the conditional distribution of the outcome given covariates. We apply the procedures to examine the effects of labor market institutions on the U.S. wage distribution.
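
    A small companion sketch (again illustrative, not the paper's code): once status quo and counterfactual distribution functions have been estimated on a grid of outcome values, the function-valued quantile effect follows by taking left-inverses and differencing.

```python
import numpy as np

def cdf_to_quantile(grid, F, taus):
    """Left-inverse of an estimated CDF evaluated on a grid of outcome values."""
    F = np.maximum.accumulate(F)                              # enforce monotonicity
    idx = np.clip(np.searchsorted(F, taus), 0, len(grid) - 1)
    return grid[idx]

def quantile_effect(grid, F_status_quo, F_counterfactual, taus):
    """Function-valued effect Q_counterfactual(tau) - Q_status_quo(tau)."""
    return cdf_to_quantile(grid, F_counterfactual, taus) - cdf_to_quantile(grid, F_status_quo, taus)

# Illustrative inputs: two smooth CDFs on a grid, differing by a location shift.
grid = np.linspace(-3.0, 3.0, 121)
F0 = 1.0 / (1.0 + np.exp(-grid))
F1 = 1.0 / (1.0 + np.exp(-(grid - 0.3)))
print(quantile_effect(grid, F0, F1, np.linspace(0.1, 0.9, 9)))   # roughly 0.3 throughout
```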

    Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data

    In recent years, the use of motion tracking systems for the acquisition of functional biomechanical gait data has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, and this makes the dimensionality of the collected data far higher than the number of available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small sample size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are carried out using robust resampling techniques such as the bootstrap .632+ and k-fold cross-validation. Results from experiments with subjects suffering from pathological plantar hyperkeratosis show that the proposed method can lead to approximately 96% correct classification rates with less than 10% of the original features.
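
    A hedged sketch of the .632+ bootstrap error estimate referred to above (a generic scikit-learn version, not the paper's pipeline): the estimate blends the resubstitution error with the out-of-bag bootstrap error, weighting by an estimate of the relative overfitting rate.

```python
import numpy as np
from sklearn.base import clone
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def err632plus(clf, X, y, n_boot=100, seed=0):
    """Efron-Tibshirani .632+ bootstrap estimate of the misclassification rate."""
    rng = np.random.default_rng(seed)
    n, labels = len(y), np.unique(y)
    full_fit = clone(clf).fit(X, y)
    pred_full = full_fit.predict(X)
    err_app = np.mean(pred_full != y)                        # resubstitution error
    oob_errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                          # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)                # out-of-bag observations
        if oob.size == 0 or len(np.unique(y[idx])) < 2:
            continue
        m = clone(clf).fit(X[idx], y[idx])
        oob_errs.append(np.mean(m.predict(X[oob]) != y[oob]))
    err_oob = np.mean(oob_errs)                              # leave-one-out bootstrap error
    # No-information error rate and relative overfitting rate.
    p = np.array([np.mean(y == c) for c in labels])          # observed class proportions
    q = np.array([np.mean(pred_full == c) for c in labels])  # predicted class proportions
    gamma = np.sum(p * (1 - q))
    R = (err_oob - err_app) / (gamma - err_app) if gamma > err_app else 0.0
    R = float(np.clip(R, 0.0, 1.0))
    w = 0.632 / (1 - 0.368 * R)
    return (1 - w) * err_app + w * min(err_oob, gamma)

# Illustrative use with LDA on synthetic two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(0.8, 1.0, (40, 5))])
y = np.array([0] * 40 + [1] * 40)
print(err632plus(LinearDiscriminantAnalysis(), X, y))
```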

    Dawai: an R Package for Discriminant Analysis With Additional Information

    The incorporation of additional information into discriminant rules is receiving increasing attention, as rules including this information perform better than the usual rules. In this paper we introduce an R package called dawai, which provides functions to define rules that take into account additional information expressed in terms of restrictions on the means, to classify samples, and to evaluate the accuracy of the results. Moreover, we extend the results and definitions given in previous papers (Fernández, Rueda, and Salvador 2006; Conde, Fernández, Rueda, and Salvador 2012; Conde, Salvador, Rueda, and Fernández 2013) to the case of unequal covariances among the populations, and consequently define the corresponding restricted quadratic discriminant rules. We also define estimators of the accuracy of the rules for the general case of more than two populations. The wide range of applications of these procedures is illustrated with two data sets from two different fields, i.e., biology and pattern recognition. Spanish Ministerio de Ciencia e Innovación (MTM2012-37129).
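
    The core idea can be sketched outside R as well (an illustrative Python analogue with a componentwise order restriction on the means, not the dawai implementation): estimate the class means, project them onto the restriction, and plug the restricted means into the usual Gaussian discriminant rule.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def restricted_lda_fit(X, y):
    """LDA-type rule whose class means obey a componentwise increasing order."""
    classes = np.unique(y)
    raw_means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Project the class means onto the restriction mu_1 <= mu_2 <= ... (per feature).
    iso = IsotonicRegression(increasing=True)
    means = np.column_stack([
        iso.fit_transform(np.arange(len(classes)), raw_means[:, j])
        for j in range(X.shape[1])
    ])
    # Pooled within-class covariance from the unrestricted fit.
    resid = np.vstack([X[y == c] - raw_means[k] for k, c in enumerate(classes)])
    prec = np.linalg.inv(np.cov(resid, rowvar=False))
    return classes, means, prec

def restricted_lda_predict(model, X):
    """Assign each row to the class with the nearest restricted mean (equal priors assumed)."""
    classes, means, prec = model
    d = np.array([np.einsum('ij,jk,ik->i', X - m, prec, X - m) for m in means])
    return classes[np.argmin(d, axis=0)]
```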

    An Object-Oriented Framework for Robust Multivariate Analysis

    Taking advantage of the S4 class system of the programming environment R, which facilitates the creation and maintenance of reusable and modular components, an object-oriented framework for robust multivariate analysis was developed. The framework resides in the packages robustbase and rrcov and includes an almost complete set of algorithms for computing robust multivariate location and scatter, various robust methods for principal component analysis as well as robust linear and quadratic discriminant analysis. The design of these methods follows common patterns which we call statistical design patterns in analogy to the design patterns widely used in software engineering. The application of the framework to data analysis as well as possible extensions by the development of new methods is demonstrated on examples which themselves are part of the package rrcov.
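
    As a language-agnostic illustration of the robust location/scatter building block at the heart of such a framework (a Python analogue using scikit-learn's MCD estimator, not the rrcov API): a robust estimate resists a cluster of outliers that visibly pulls the classical estimate.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.6], [0.6, 1]], size=200)
X[:10] += 8                                   # a cluster of outliers

classical = EmpiricalCovariance().fit(X)      # pulled toward the outliers
robust = MinCovDet(random_state=0).fit(X)     # Minimum Covariance Determinant estimate

print(classical.location_, robust.location_)  # robust location stays near the origin
# Robust Mahalanobis distances flag the contaminated observations.
print(np.sort(robust.mahalanobis(X))[-12:])
```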

    Accuracy of logistic models and receiver operating characteristic curves

    The accuracy of prediction is a commonly studied topic in modern statistics. The performance of a predictor is becoming increasingly important as real-life decisions are made on the basis of prediction. In this thesis we investigate the prediction accuracy of logistic models from two different approaches. Logistic regression is often used to discriminate between two groups or populations based on a number of covariates. The receiver operating characteristic (ROC) curve is a commonly used tool (especially in medical statistics) to assess the performance of such a score or test. By using the same data to fit the logistic regression and calculate the ROC curve we overestimate the performance that the score would give if validated on a sample of future cases. This overestimation is studied and we propose a correction for the ROC curve and the area under the curve. The methods are illustrated by way of two medical examples and a simulation study, and we show that the overestimation can be quite substantial for small sample sizes. The idea of shrinkage pertains to the notion that by including some prior information about the data under study we can improve prediction. Until now, the study of shrinkage has almost exclusively been concentrated on continuous measurements. We propose a methodology to study shrinkage for logistic regression modelling of categorical data with a binary response. Categorical data with a large number of levels are often grouped for modelling purposes, which discards useful information about the data. By using this information we can apply Bayesian methods to update model parameters and show through examples and simulations that in some circumstances the updated estimates are better predictors than the model.
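
    A hedged sketch of one standard bootstrap optimism correction for the AUC (a generic procedure in Python, not necessarily the thesis's exact proposal): refit the model on bootstrap samples, measure how much the bootstrap-sample AUC exceeds the AUC of the refit model on the original data, and subtract the average excess from the apparent AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Bootstrap optimism correction for the AUC of a logistic regression score."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(
        y, LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    )
    optimism, n = [], len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:
            continue                         # need both classes to fit and score
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)  # optimism of this bootstrap refit
    return apparent - np.mean(optimism)
```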

    Inference on counterfactual distributions

    We develop inference procedures for policy analysis based on regression methods. We consider policy interventions that correspond to either changes in the distribution of covariates, or changes in the conditional distribution of the outcome given covariates, or both. Under either of these policy scenarios, we derive functional central limit theorems for regression-based estimators of the status quo and counterfactual marginal distributions. This result allows us to construct simultaneous confidence sets for function-valued policy effects, including the effects on the marginal distribution function, quantile function, and other related functionals. We use these confidence sets to test functional hypotheses such as no-effect, positive effect, or stochastic dominance. Our theory applies to general policy interventions and covers the main regression methods including classical, quantile, duration, and distribution regressions. We illustrate the results with an empirical application on wage decompositions using data for the United States. Of independent interest is the use of distribution regression as a tool for modeling the entire conditional distribution, encompassing duration/transformation regression, and representing an alternative to quantile regression. This is a revision of CWP09/09.
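
    A minimal sketch of how a simultaneous confidence band for a function-valued estimate can be formed from bootstrap draws (a generic sup-t construction, not the paper's exact algorithm): studentize the deviations of the bootstrap replicates, take the supremum over the grid, and use its quantile as the common critical value.

```python
import numpy as np

def uniform_band(est, boot_draws, alpha=0.05):
    """Sup-t simultaneous confidence band for a function estimated on a grid.

    est        : (G,) point estimate on a grid of G points
    boot_draws : (B, G) bootstrap replicates of the same function
    Assumes a positive bootstrap standard error at every grid point.
    """
    se = boot_draws.std(axis=0, ddof=1)
    sup_t = np.max(np.abs(boot_draws - est) / se, axis=1)   # sup of studentized deviations
    crit = np.quantile(sup_t, 1 - alpha)                    # bootstrap critical value
    return est - crit * se, est + crit * se

# Illustrative use with synthetic bootstrap replicates of a distribution function.
rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 41)
est = 1.0 / (1.0 + np.exp(-grid))
boot = est + rng.normal(scale=0.02, size=(500, grid.size))
lower, upper = uniform_band(est, boot)
```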

    Wrapper algorithms and their performance assessment on high-dimensional molecular data

    Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models in order to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) have used the term 'wrapper algorithm' for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they have proposed the method of nested cross-validation (NCV) as a way of estimating the prediction error of such algorithms, which has by now evolved into the gold standard. In the first part, this thesis provides further theoretical and empirical justification for the usage of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimation of the prediction error. The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative concept to the repetition of separate within-study validations if several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data on a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms. Finally, the last part approaches computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis as well as the evaluation of the prediction models requires the usage of large computing resources. Parallel computing approaches are illustrated on cluster, cloud, and high-performance computing resources using the R programming language. Usage of heterogeneous hardware and processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction from censored data.
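
    A minimal sketch of the nested cross-validation scheme discussed in the first part (a generic scikit-learn pipeline, not the thesis's survHD code): the inner loop selects the tuning parameters of the whole wrapper, and the outer loop estimates its prediction error, so model selection never sees the outer test folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Wrapper: preprocessing (variable selection) + classifier, tuned as one unit.
wrapper = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])
param_grid = {"select__k": [10, 50, 100], "clf__C": [0.01, 0.1, 1.0]}

# Inner CV chooses the tuning parameters; outer CV estimates the error of the
# whole wrapper, so the selection step never sees the outer test folds.
inner = GridSearchCV(wrapper, param_grid, cv=5)
X, y = make_classification(n_samples=100, n_features=1000, n_informative=20, random_state=0)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```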