Estimation of bootstrap confidence intervals for freight transport matrices
Freight transport studies require, as a preliminary step, a survey to be conducted on a sample of the universe of agents, vehicles and/or companies in the transportation system. The statistical reliability of the data determines the quality of the outcomes and conclusions that can be inferred from the analyses and models generated.
The methodology presented herein, based on bootstrapping techniques, allows us to generate confidence intervals for the origin-destination pairs defined by each cell of the matrix derived from a freight transport survey. For this study, a data set from a statistically reliable freight transport study conducted in Spain at the level of multi-province inter-regions was used.
Funding: Public Road Agency of the Andalusian Regional Government (AOP-JA, Spain, Project G-GI3000/IDII); EU FEDER.
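As a rough illustration of the bootstrap idea (a sketch only, not the authors' exact procedure; the zone numbering, trip records and od_matrix helper are hypothetical), the following resamples surveyed trips with replacement and takes percentile confidence intervals for every origin-destination cell:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey: each record is one surveyed trip (origin zone, destination zone).
trips = np.array([(0, 1), (0, 1), (1, 2), (2, 0), (1, 1),
                  (0, 2), (2, 2), (1, 0), (0, 1), (2, 1)])
n_zones = 3

def od_matrix(records):
    """Tally trip records into an origin-destination matrix."""
    m = np.zeros((n_zones, n_zones))
    for o, d in records:
        m[o, d] += 1
    return m

# Percentile bootstrap: resample records with replacement and rebuild the matrix.
B = 2000
boot = np.empty((B, n_zones, n_zones))
for b in range(B):
    resample = trips[rng.integers(0, len(trips), size=len(trips))]
    boot[b] = od_matrix(resample)

lower = np.percentile(boot, 2.5, axis=0)   # per-cell 95% CI lower bounds
upper = np.percentile(boot, 97.5, axis=0)  # per-cell 95% CI upper bounds
print(lower, upper, sep="\n")
```

In practice each record would also carry an expansion factor, and the resampling would respect the survey's stratification.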
Confidence Intervals for Unobserved Events
Consider a finite sample from an unknown distribution over a countable
alphabet. Unobserved events are alphabet symbols which do not appear in the
sample. Estimating the probabilities of unobserved events is a basic problem in
statistics and related fields, which was extensively studied in the context of
point estimation. In this work we introduce a novel interval estimation scheme
for unobserved events. Our proposed framework applies selective inference, as
we construct confidence intervals (CIs) for the desired set of parameters.
Interestingly, we show that the obtained CIs are dimension-free, as they do not
grow with the alphabet size. Further, we show that these CIs are (almost)
tight, in the sense that they cannot be further improved without violating the
prescribed coverage rate. We demonstrate the performance of our proposed scheme
in synthetic and real-world experiments, showing a significant improvement over
the alternatives. Finally, we apply our proposed scheme to large alphabet
modeling. We introduce a novel simultaneous CI scheme for large alphabet
distributions which outperforms currently known methods while maintaining the
prescribed coverage rate.
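The selective-inference construction itself is not reproduced here, but a minimal sketch of the setting may help: the classical Good-Turing estimator assigns the unobserved symbols a total probability equal to the fraction of singletons in the sample, and a naive percentile bootstrap gives a (non-selective) interval around it for comparison. The sample data and interval method below are illustrative assumptions, not the paper's scheme.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample from a large (countable) alphabet.
sample = rng.zipf(2.0, size=500)

def missing_mass(xs):
    """Good-Turing estimate of the total probability of unobserved symbols:
    (number of symbols seen exactly once) / (sample size)."""
    counts = Counter(xs.tolist())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(xs)

print("Good-Turing missing mass:", missing_mass(sample))

# Naive percentile-bootstrap interval -- illustrative only, not the
# selective-inference CI proposed in the paper.
B = 1000
estimates = [missing_mass(rng.choice(sample, size=len(sample), replace=True))
             for _ in range(B)]
print("naive 95% interval:", np.percentile(estimates, [2.5, 97.5]))
```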
How many clones need to be sequenced from a single forensic or ancient DNA sample in order to determine a reliable consensus sequence?
Forensic and ancient DNA (aDNA) extracts are mixtures of endogenous aDNA, existing in a more or less damaged state, and contaminant DNA. To obtain the true aDNA sequence, it is not sufficient to generate a single direct sequence of the mixture, even where the authentic aDNA is the most abundant (e.g. 25% or more) in the component mixture. Only bacterial cloning can elucidate the components of this mixture. We calculate the number of clones that need to be sampled (for various mixture ratios) in order to be confident (at various levels of confidence) of having identified the major component. We demonstrate that to be >95% confident of identifying the most abundant sequence present at 70% in the ancient sample, 20 clones must be sampled. We make recommendations and offer a free-access web-based program, which constructs the most reliable consensus sequence from the user's input clone sequences and analyses the confidence limits for each nucleotide position and for the whole consensus sequence. Accepted authentication methods must be employed in order to assess the authenticity and endogeneity of the resulting consensus sequences (e.g. quantification and replication by another laboratory, blind testing, amelogenin sex versus morphological sex, the effective use of controls, etc.) and determine whether they are indeed aDNA.
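Under a simple binomial model, the clone-number question reduces to finding the smallest n such that the authentic sequence, present at proportion p, forms a strict majority of the n sampled clones with the desired probability. The sketch below illustrates that calculation only; the paper's criterion (and hence its figure of 20 clones) may be stricter, e.g. requiring confidence at every nucleotide position.

```python
from scipy.stats import binom

p, target = 0.7, 0.95  # authentic-sequence proportion; required confidence

# Smallest n with P(authentic sequence is a strict majority of n clones) >= target.
for n in range(1, 101):
    # Strict majority: more than n/2 of the n clones carry the authentic sequence.
    prob = 1 - binom.cdf(n // 2, n, p)
    if prob >= target:
        print(f"n = {n} clones: P(majority) = {prob:.3f}")
        break
```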
Statistical Inference on Optimal Points to Evaluate Multi-State Classification Systems
In decision making, an optimal point represents the settings at which a classification system should be operated to achieve maximum performance. Clearly, these optimal points are of great importance in classification theory. Not only is the selection of the optimal point of interest, but quantifying the uncertainty in the optimal point and its performance is also important. The Youden index is a metric currently employed for selection and performance quantification of optimal points for classification system families. The Youden index quantifies the correct classification rates of a classification system, and its confidence interval quantifies the uncertainty in this measurement. This metric currently focuses on two or three classes, and only allows for the utility of correct classifications and the cost of total misclassifications to be considered. An alternative to this metric for three or more classes is a cost function which considers the sum of incorrect classification rates. This new metric is preferable as it can include class prevalences and costs associated with every classification. In multi-class settings, this informs better decisions and inferences on optimal points. The work in this dissertation develops theory and methods for confidence intervals on a metric based on misclassification rates, Bayes Cost, and, where possible, the thresholds found for an optimal point using Bayes Cost. Hypothesis tests for Bayes Cost are also developed to test a classification system's performance or compare systems, with an emphasis on classification systems involving three or more classes. Performance of the newly proposed methods is demonstrated with simulation.
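To make the Bayes Cost metric concrete, a small hypothetical three-class example follows: given class prevalences, a cost for each misclassification, and the classification rates produced by a particular threshold setting, the metric is the prevalence- and cost-weighted sum of the incorrect classification rates. This sketches the metric only, not the dissertation's confidence-interval or testing procedures; all numbers are made up.

```python
import numpy as np

prevalence = np.array([0.5, 0.3, 0.2])  # P(true class = i), hypothetical

# cost[i, j]: cost of classifying a class-i subject as class j (0 on the
# diagonal, since only misclassifications are penalized).
cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [4.0, 2.0, 0.0]])

# rate[i, j]: P(classified as j | true class = i) at a given threshold setting.
rate = np.array([[0.85, 0.10, 0.05],
                 [0.10, 0.80, 0.10],
                 [0.05, 0.15, 0.80]])

def bayes_cost(prevalence, cost, rate):
    """Prevalence- and cost-weighted sum of misclassification rates."""
    return float(np.sum(prevalence[:, None] * cost * rate))

print("Bayes Cost:", bayes_cost(prevalence, cost, rate))
```

An optimal point is then the threshold setting whose rate matrix minimizes this quantity.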
Flexible Models for Competing Risks and Weighted Analyses of Composite Endpoints
In many clinical studies, the occurrence of different types of disease events over time is of interest. For example, in cardiovascular studies, disease events such as death, stroke or myocardial infarction are of interest. As another example, in central nervous system infections such as cryptococcal meningitis, unfavourable events such as death or neurological events and favourable events such as coma or fungal clearance are relevant. In statistical terminology, competing risks refer to data where the time and type of the first disease event are analysed. Such data arise naturally if a nonfatal disease event is of interest but is precluded by death in a substantial proportion of subjects. Competing risks are the topic of the first four chapters of this thesis. An alternative approach used in many randomized controlled clinical trials is to combine different harmful events into a single composite endpoint. The analysis of trials with composite endpoints is the topic of the fifth chapter. This thesis is organised as follows:
Chapters 1 and 2 are introductory chapters and provide an overview of statistical approaches to competing risks and semi-nonparametric (SNP) density estimation. Two concepts that form the basis for the work in Chapters 3 and 4 are introduced here: the cumulative incidence function (CIF) and SNP densities. For competing risks data, the CIF describes the absolute risk of different event types depending on time and is the most important quantity for data description, prognostic modelling, and medical decision making. SNP densities are densities that can be expressed as the product of a squared polynomial (of variable degree) and a base density which is chosen as the standard normal or the exponential density in this work.
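In symbols (standard notation, consistent with the definitions above): for event time T and event type D taking values k = 1, …, K, the CIF is

$$F_k(t) = \Pr(T \le t,\ D = k), \qquad k = 1, \dots, K,$$

and an SNP density with polynomial degree m and base density φ (standard normal or exponential here) has the form

$$f(x) = \Big(\sum_{j=0}^{m} a_j x^j\Big)^{2} \varphi(x),$$

with the coefficients a_j constrained so that f integrates to one.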
Chapter 3 presents a novel approach to CIF estimation. The underlying statistical model is specified via a mixture factorization of the joint distribution of the event type and time, and the time-to-event distributions conditional on the event type are modelled using SNP densities. One key strength of the approach is that it can handle arbitrary censoring and truncation. A stepwise forward algorithm for model estimation and adaptive selection of SNP polynomial degrees is presented, implemented in the statistical software R, evaluated in a sequence of simulation studies, and applied to data sets from clinical trials in central nervous system infections. The simulations demonstrate that the SNP approach frequently outperforms both parametric and nonparametric alternatives. They also support the use of “ad hoc” asymptotic inference to derive confidence intervals despite a lack of a formal mathematical verification for the relevant asymptotic properties.
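As a sketch of the SNP building block (arbitrary illustrative coefficients, not a fitted model from the chapter), the following evaluates a normalized SNP density with a standard normal base:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

coeffs = [1.0, 0.5, -0.3]  # P(x) = 1 + 0.5x - 0.3x^2, hypothetical

def snp_unnormalized(x, coeffs):
    """Squared polynomial times the standard normal base density."""
    p = np.polynomial.polynomial.polyval(x, coeffs)
    return p**2 * norm.pdf(x)

# Normalizing constant so the density integrates to one.
c, _ = quad(snp_unnormalized, -np.inf, np.inf, args=(coeffs,))

def snp_density(x):
    return snp_unnormalized(x, coeffs) / c

print("integral:", quad(snp_density, -np.inf, np.inf)[0])  # ~1.0
print("f(0.5) =", snp_density(0.5))
```

The stepwise algorithm of the chapter would, in addition, estimate the coefficients from censored and truncated data and select the polynomial degree adaptively.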
Chapter 4 extends the work of Chapter 3 to regression modelling, i.e. the quantification of covariate effects on the CIF. A careful discussion of interpretational and identifiability issues which are intrinsic to models based on the mixture factorization is provided and the usage of the model is only recommended in settings with sufficient follow-up relative to the timing of the events. A simulation study demonstrates that the proposed approach is competitive compared to common statistical models for competing risks in terms of accuracy of parameter estimates and predictions. However, it also shows that “ad hoc” asymptotic inference is only valid if sample size is large. The chapter also provides a suggestion for model diagnostics of the proposed model, an area that has been somewhat neglected for competing risks data.
Chapter 5 discusses the analysis of composite endpoints. A common critique of traditional analyses of composite endpoints is that all disease events are equally weighted whereas their clinical relevance may differ substantially. This chapter addresses this by introducing a framework for the weighted analysis of composite endpoints that handles both binary and time-to-event data. To address the difficulty in selecting an exact set of weights, it proposes a method for constructing simultaneous confidence intervals and tests that protect the familywise type I error in the strong sense across families of weights which satisfy flexible inequality and order constraints, based on the theory of χ²-distributions. It is then demonstrated in several simulation scenarios as well as applications that the proposed method achieves the nominal simultaneous overall coverage rate with lower efficiency loss than the standard Scheffé procedure.
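The proposed construction is not reproduced here, but the comparator is easy to sketch: Scheffé's procedure yields simultaneous confidence intervals for every weighted combination w'θ of an asymptotically normal estimate θ with covariance Σ, using the square root of a χ² critical value (all numbers below are hypothetical):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical effect estimates for 3 component endpoints, with covariance.
theta = np.array([-0.30, -0.10, -0.20])
Sigma = np.diag([0.010, 0.020, 0.015])

def scheffe_ci(w, theta, Sigma, alpha=0.05):
    """Scheffé interval for w'theta, valid simultaneously over all weights w."""
    crit = np.sqrt(chi2.ppf(1 - alpha, df=len(theta)))
    center = w @ theta
    halfwidth = crit * np.sqrt(w @ Sigma @ w)
    return center - halfwidth, center + halfwidth

# Intervals for two candidate weightings of the composite endpoint.
print(scheffe_ci(np.array([0.5, 0.3, 0.2]), theta, Sigma))
print(scheffe_ci(np.array([1/3, 1/3, 1/3]), theta, Sigma))
```

Restricting the weight family through inequality and order constraints, as the chapter does, shortens such intervals while preserving simultaneous coverage.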
Final remarks are given in Chapter 6, together with an outlook on potential future research directions.
Multilevel modelling of refusal and noncontact nonresponse in household surveys: evidence from six UK government surveys
This paper analyses household unit nonresponse and interviewer effects in six major UK government surveys using a multilevel multinomial modelling approach. The models are guided by current conceptual frameworks and theories of survey participation. One key feature of the analysis is the investigation of survey-dependent and survey-independent effects of household and interviewer characteristics, providing an empirical exploration of leverage-salience theory. The analysis is based on the 2001 UK Census Link Study, a unique data source containing an unusually rich set of auxiliary variables, linking the response outcomes of six surveys to census data, interviewer observation data and interviewer information, available for both respondents and nonrespondents.
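In generic form (the paper's specification may include further levels, covariates and survey-specific terms), a multilevel multinomial logit for the outcome of household i approached by interviewer j contrasts each nonresponse category r ∈ {refusal, noncontact} with response:

$$\log \frac{\Pr(y_{ij} = r)}{\Pr(y_{ij} = \text{response})} = \mathbf{x}_{ij}^{\top} \boldsymbol{\beta}_r + u_j^{(r)}, \qquad u_j^{(r)} \sim N(0, \sigma_r^2),$$

where the random effects u_j^{(r)} capture interviewer-level variation in refusal and noncontact propensities.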