The bootstrap - A review
The bootstrap, extensively studied during the last decade, has become a powerful tool in different areas of statistical inference. In this work, we present the main ideas of bootstrap methodology in several contexts, citing the most relevant contributions and illustrating some interesting aspects with examples and simulation studies.
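The core resampling idea the review surveys can be sketched in a few lines. The following is a minimal percentile-bootstrap confidence interval for an arbitrary statistic; the function and parameter names are illustrative choices, not taken from the review.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic:
    resample with replacement, recompute the statistic, and take
    empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

With 2000 replicates the interval for the mean of a small sample stabilizes quickly; more refined variants (BCa, studentized) follow the same resampling pattern.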
Inference on Counterfactual Distributions
Counterfactual distributions are important ingredients for policy analysis
and decomposition analysis in empirical economics. In this article we develop
modeling and inference tools for counterfactual distributions based on
regression methods. The counterfactual scenarios that we consider consist of
ceteris paribus changes in either the distribution of covariates related to the
outcome of interest or the conditional distribution of the outcome given
covariates. For either of these scenarios we derive joint functional central
limit theorems and bootstrap validity results for regression-based estimators
of the status quo and counterfactual outcome distributions. These results allow
us to construct simultaneous confidence sets for function-valued effects of the
counterfactual changes, including the effects on the entire distribution and
quantile functions of the outcome as well as on related functionals. These
confidence sets can be used to test functional hypotheses such as no-effect,
positive effect, or stochastic dominance. Our theory applies to general
counterfactual changes and covers the main regression methods including
classical, quantile, duration, and distribution regressions. We illustrate the
results with an empirical application to wage decompositions using data for the
United States.
As a part of developing the main results, we introduce distribution
regression as a comprehensive and flexible tool for modeling and estimating the
\textit{entire} conditional distribution. We show that distribution regression
encompasses the Cox duration regression and represents a useful alternative to
quantile regression. We establish functional central limit theorems and
bootstrap validity results for the empirical distribution regression process
and various related functionals.
Comment: 55 pages, 1 table, 3 figures; supplementary appendix with additional results available from the authors' web site.
Inference on counterfactual distributions
In this paper we develop procedures for performing inference in regression models about how potential policy interventions affect the entire marginal distribution of an outcome of interest. These policy interventions consist of either changes in the distribution of covariates related to the outcome holding the conditional distribution of the outcome given covariates fixed, or changes in the conditional distribution of the outcome given covariates holding the marginal distribution of the covariates fixed. Under either of these assumptions, we obtain uniformly consistent estimates and functional central limit theorems for the counterfactual and status quo marginal distributions of the outcome as well as other function-valued effects of the policy, including, for example, the effects of the policy on the marginal distribution function, quantile function, and other related functionals. We construct simultaneous confidence sets for these functions; these sets take into account the sampling variation in the estimation of the relationship between the outcome and covariates. Our procedures rely on, and our theory covers, all main regression approaches for modeling and estimating conditional distributions, focusing especially on classical, quantile, duration, and distribution regressions. Our procedures are general and accommodate both simple unitary changes in the values of a given covariate as well as changes in the distribution of the covariates or the conditional distribution of the outcome given covariates of general form. We apply the procedures to examine the effects of labor market institutions on the U.S. wage distribution.
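The plug-in construction behind these counterfactual estimands, integrating a conditional outcome distribution from one scenario over a covariate distribution from another, can be sketched as follows. The function names and the toy conditional CDF are illustrative, not the authors' estimators.

```python
def counterfactual_cdf(y, cond_cdf, cov_sample):
    """Plug-in estimate of F(y) = E_X[ F_{Y|X}(y | X) ]: average the
    conditional CDF over a sample from the chosen covariate distribution."""
    return sum(cond_cdf(y, x) for x in cov_sample) / len(cov_sample)

# Toy conditional distribution: Y | X = x is Uniform(x, x + 1).
def uniform_cond_cdf(y, x):
    return min(1.0, max(0.0, y - x))

# Status quo covariates versus a counterfactual covariate distribution:
status_quo = [0.0, 0.0, 1.0, 1.0]
counterfactual = [1.0, 1.0, 2.0, 2.0]
```

Holding the conditional distribution fixed and swapping the covariate sample changes the implied marginal distribution of the outcome, which is exactly the ceteris paribus comparison the paper formalizes.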
Automated design of robust discriminant analysis classifier for foot pressure lesions using kinematic data
In recent years, the use of motion tracking systems for the acquisition of functional biomechanical gait data has received increasing interest due to the richness and accuracy of the measured kinematic information. However, costs frequently restrict the number of subjects employed, which makes the dimensionality of the collected data far higher than the number of available samples. This paper applies discriminant analysis algorithms to the classification of patients with different types of foot lesions, in order to establish an association between foot motion and lesion formation. With primary attention to small sample size situations, we compare different types of Bayesian classifiers and evaluate their performance with various dimensionality reduction techniques for feature extraction, as well as search methods for the selection of raw kinematic variables. Finally, we propose a novel integrated method which fine-tunes the classifier parameters and selects the most relevant kinematic variables simultaneously. Performance comparisons are made using robust resampling techniques such as the bootstrap and k-fold cross-validation. Results from experiments with subjects suffering from pathological plantar hyperkeratosis show that the proposed method can lead to correct classification rates with less than 10% of the original features.
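The resampling-based performance assessment mentioned above can be illustrated with a bare-bones k-fold cross-validation loop. The nearest-class-mean classifier and all names below are simplifications for illustration, not the paper's Bayesian classifiers.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_accuracy(xs, ys, k=5):
    """k-fold cross-validated accuracy of a nearest-class-mean classifier
    on one-dimensional inputs: each held-out point is predicted by the
    class whose training-fold mean is closest."""
    correct = 0
    for fold in kfold_indices(len(xs), k):
        train = [i for i in range(len(xs)) if i not in fold]
        means = {}
        for c in set(ys[i] for i in train):
            vals = [xs[i] for i in train if ys[i] == c]
            means[c] = sum(vals) / len(vals)
        for i in fold:
            pred = min(means, key=lambda c: abs(xs[i] - means[c]))
            correct += (pred == ys[i])
    return correct / len(xs)
```

The point of the loop is that every accuracy count comes from data the classifier never saw during fitting, which is what makes the estimate honest in small-sample settings.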
Dawai: an R Package for Discriminant Analysis With Additional Information
The incorporation of additional information into discriminant rules is receiving increasing attention, as rules that include this information perform better than the usual rules. In this paper we introduce an R package called dawai, which provides functions to define rules that take into account this additional information, expressed in terms of restrictions on the means, to classify samples, and to evaluate the accuracy of the results. Moreover, we extend the results and definitions given in previous papers (Fernández, Rueda, and Salvador 2006; Conde, Fernández, Rueda, and Salvador 2012; Conde, Salvador, Rueda, and Fernández 2013) to the case of unequal covariances among the populations, and consequently define the corresponding restricted quadratic discriminant rules. We also define estimators of the accuracy of the rules for the general case of more than two populations. The wide range of applications of these procedures is illustrated with two data sets from two different fields, i.e., biology and pattern recognition.
Spanish Ministerio de Ciencia e Innovación (MTM2012-37129)
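A quadratic discriminant rule of the kind extended here, where each population keeps its own covariance, reduces in one dimension to comparing Gaussian log-densities plus log-priors. This unrestricted sketch (all names are mine) omits the package's defining feature, the restrictions on the means.

```python
import math

def qda_rule_1d(x, params):
    """Quadratic discriminant rule in one dimension.
    params maps each class to (prior, mean, variance); the rule picks the
    class maximizing log-prior + Gaussian log-density. Because variances
    may differ across classes, the decision boundary is quadratic in x."""
    def score(c):
        p, m, v = params[c]
        return math.log(p) - 0.5 * math.log(v) - (x - m) ** 2 / (2 * v)
    return max(params, key=score)
```

With equal variances the rule collapses to the linear discriminant; with unequal variances the variance term itself influences the decision, which is the case the paper's restricted quadratic rules address.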
An Object-Oriented Framework for Robust Multivariate Analysis
Taking advantage of the S4 class system of the programming environment R, which facilitates the creation and maintenance of reusable and modular components, an object-oriented framework for robust multivariate analysis was developed. The framework resides in the packages robustbase and rrcov and includes an almost complete set of algorithms for computing robust multivariate location and scatter, various robust methods for principal component analysis as well as robust linear and quadratic discriminant analysis. The design of these methods follows common patterns which we call statistical design patterns in analogy to the design patterns widely used in software engineering. The application of the framework to data analysis as well as possible extensions by the development of new methods is demonstrated on examples which themselves are part of the package rrcov.
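The flavor of the robust estimators the framework collects can be conveyed by their simplest univariate analogues, the median and the MAD. This sketch is in Python (the framework itself lives in the R packages robustbase and rrcov) and uses the standard 1.4826 factor that makes the MAD consistent for the standard deviation under normality.

```python
import statistics

def robust_location_scale(data):
    """Median and scaled MAD as robust analogues of mean and standard
    deviation: a single wild outlier barely moves either estimate."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) * 1.4826
    return med, mad
```

The multivariate estimators in the framework (MCD, OGK, S-estimators) generalize exactly this idea of downweighting outlying observations when estimating location and scatter.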
Accuracy of logistic models and receiver operating characteristic curves
The accuracy of prediction is a commonly studied topic in modern statistics. The performance of a predictor is becoming increasingly important as real-life decisions are made on the basis of predictions. In this thesis we investigate the prediction accuracy of logistic models from two different approaches.
Logistic regression is often used to discriminate between two groups or populations based on a number of covariates. The receiver operating characteristic (ROC) curve is a commonly used tool (especially in medical statistics) to assess the performance of such a score or test. By using the same data to fit the logistic regression and calculate the ROC curve, we overestimate the performance that the score would give if validated on a sample of future cases. This overestimation is studied and we propose a correction for the ROC curve and the area under the curve. The methods are illustrated by way of two medical examples and a simulation study, and we show that the overestimation can be quite substantial for small sample sizes.
The idea of shrinkage pertains to the notion that by including some prior information about the data under study we can improve prediction. Until now, the study of shrinkage has been concentrated almost exclusively on continuous measurements. We propose a methodology to study shrinkage for logistic regression modelling of categorical data with a binary response. Categorical data with a large number of levels are often grouped for modelling purposes, which discards useful information about the data. By using this information we can apply Bayesian methods to update model parameters, and we show through examples and simulations that in some circumstances the updated estimates are better predictors than the model.
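The performance measure at issue in the first part, the area under the ROC curve, has a simple empirical form: the probability that a randomly chosen case scores above a randomly chosen control, with ties counted half. A minimal sketch follows (the thesis's optimism correction itself is not reproduced here).

```python
def auc(scores, labels):
    """Empirical AUC: the fraction of (case, control) pairs in which the
    case receives the higher score, counting tied pairs as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Computing this quantity on the same data used to fit the score is exactly the in-sample evaluation whose optimism the thesis quantifies and corrects.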
Inference on counterfactual distributions
We develop inference procedures for policy analysis based on regression methods. We consider policy interventions that correspond to either changes in the distribution of covariates, or changes in the conditional distribution of the outcome given covariates, or both. Under either of these policy scenarios, we derive functional central limit theorems for regression-based estimators of the status quo and counterfactual marginal distributions. This result allows us to construct simultaneous confidence sets for function-valued policy effects, including the effects on the marginal distribution function, quantile function, and other related functionals. We use these confidence sets to test functional hypotheses such as no-effect, positive effect, or stochastic dominance. Our theory applies to general policy interventions and covers the main regression methods including classical, quantile, duration, and distribution regressions. We illustrate the results with an empirical application on wage decompositions using data for the United States. Of independent interest is the use of distribution regression as a tool for modeling the entire conditional distribution, encompassing duration/transformation regression, and representing an alternative to quantile regression. This is a revision of CWP09/09.
Wrapper algorithms and their performance assessment on high-dimensional molecular data
Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models in order to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) have used the term "wrapper algorithm" for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they have proposed the method of nested cross-validation (NCV) as a way of estimating their prediction error, which has by now evolved into the gold standard.
In the first part, this thesis provides further theoretical and empirical justification for the usage of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed, which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimation of the prediction error.
The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative concept to the repetition of separate within-study validations if several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data on a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms.
Eventually, the last part approaches computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis as well as the evaluation of the prediction models requires the usage of large computing resources. Parallel computing approaches are illustrated on cluster, cloud and high-performance computing resources using the R programming language. The usage of heterogeneous hardware and the processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction
from censored data.
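Nested cross-validation as used for wrapper algorithms, an inner loop that tunes on the outer training data only and an outer loop that measures error, can be sketched as follows. The threshold classifier and all names are illustrative placeholders, not the thesis's models.

```python
import random

def folds(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_error(xs, ys, fit, candidates, outer_k=5, inner_k=3):
    """Nested CV: the inner loop selects a tuning value using the outer
    training data only; the outer loop measures the error of the whole
    wrapper (tuning included), avoiding selection bias."""
    n = len(xs)
    errors = 0
    for test in folds(n, outer_k):
        train = [i for i in range(n) if i not in test]

        def inner_error(c):
            err = 0
            for inner_test in folds(len(train), inner_k, seed=1):
                inner_tr = [train[j] for j in range(len(train))
                            if j not in inner_test]
                model = fit(c, [xs[i] for i in inner_tr],
                            [ys[i] for i in inner_tr])
                err += sum(model(xs[train[j]]) != ys[train[j]]
                           for j in inner_test)
            return err

        best = min(candidates, key=inner_error)
        model = fit(best, [xs[i] for i in train], [ys[i] for i in train])
        errors += sum(model(xs[i]) != ys[i] for i in test)
    return errors / n

# Illustrative wrapper: a one-parameter threshold classifier. It ignores
# its training sample; the cut-off c itself is what the inner loop tunes.
def threshold_fit(c, train_xs, train_ys):
    return lambda x: 1 if x > c else 0
```

The key property is that the tuning decision is re-made inside every outer split, so the outer error estimate reflects the full wrapper procedure rather than a single fixed model.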
- ā¦