On the correct interpretation of p values and the importance of random variables
The p value is the probability, under the null hypothesis, of obtaining an experimental result that is at least as extreme as the one actually obtained. That probability plays a crucial role in frequentist statistical inference. But if we take the word ‘extreme’ to mean ‘improbable’, then we can show that this type of inference can be very problematic. In this paper, I argue that it is a mistake to make such an interpretation. Under minimal assumptions about the alternative hypothesis, I explain why ‘extreme’ means ‘outside the most precise predicted range of experimental outcomes for a given upper bound probability of error’. In doing so, I rebut recent formulations of recurrent criticisms against the frequentist approach in statistics and underscore the importance of random variables.
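The definition above ("at least as extreme as the result actually obtained") can be made concrete with a minimal sketch. The coin-toss setting and the numbers are illustrative, not from the paper: under a fair-coin null hypothesis, the one-sided p value for an observed count of heads is the exact binomial tail probability.

```python
from math import comb

def p_value_binomial(n, k_obs):
    """One-sided p value under H0: fair coin.
    Probability of observing at least k_obs heads in n tosses,
    i.e. a result at least as extreme as the one obtained."""
    return sum(comb(n, k) for k in range(k_obs, n + 1)) / 2 ** n

# Hypothetical experiment: 60 heads in 100 tosses.
p = p_value_binomial(100, 60)  # roughly 0.028
```

Note that this computes a tail probability of the null distribution; the paper's point is precisely that "extreme" should be read as a tail region of predicted outcomes, not as "improbable" per se.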
Random Forests: some methodological insights
This paper examines, from an experimental perspective, random forests: the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming known but sparse advice for using random forests, and at proposing some complementary remarks both for standard problems and for high-dimensional ones in which the number of variables hugely exceeds the sample size. The main contribution of the paper is twofold: to provide some insights about the behavior of the variable-importance index based on random forests, and to investigate two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and tries to design a good prediction model. The strategy involves a ranking of explanatory variables using the random-forests importance score, followed by a stepwise ascending variable introduction strategy.
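The two-step strategy (rank by importance, then introduce variables in ranked order) can be sketched in a few lines. This is a toy illustration, not the paper's procedure: a simple absolute-correlation score stands in for the random-forests importance index, and a held-out least-squares error stands in for the paper's prediction criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: only the first two of five variables matter.
n = 200
X = rng.normal(size=(n, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=n)

# Step 1: rank variables by an importance score (here: absolute
# correlation with the response, as a stand-in for RF importance).
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]

def holdout_error(cols):
    """Least-squares error on a held-out half, using the given columns."""
    half = n // 2
    A = np.column_stack([X[:half, cols], np.ones(half)])
    coef, *_ = np.linalg.lstsq(A, y[:half], rcond=None)
    B = np.column_stack([X[half:, cols], np.ones(n - half)])
    return np.mean((B @ coef - y[half:]) ** 2)

# Step 2: stepwise ascending introduction -- add variables in ranked
# order, keeping each one only if it improves the held-out error.
selected, best = [], np.inf
for j in ranking:
    err = holdout_error(selected + [j])
    if err < best:
        selected.append(j)
        best = err
```

On this toy data the procedure retains the two informative variables and discards the noise ones, which mirrors the "important variables for interpretation" goal described above.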
Modeling heterogeneity in ranked responses by nonparametric maximum likelihood: How do Europeans get their scientific knowledge?
This paper is motivated by a Eurobarometer survey on science knowledge. As part of the survey, respondents were asked to rank sources of science information in order of importance. The official statistical analysis of these data, however, failed to use the complete ranking information. We instead propose a method that treats ranked data as a set of paired comparisons, which places the problem in the standard framework of generalized linear models and also allows respondent covariates to be incorporated. An extension is proposed to allow for heterogeneity in the ranked responses. The resulting model uses a nonparametric formulation of the random-effects structure, fitted using the EM algorithm. Each mass point is multivalued, with a parameter for each item. The resulting model is equivalent to a covariate latent class model, where the latent class profiles are provided by the mass-point components and the covariates act on the class profiles. This provides an alternative interpretation of the fitted model. The approach is also suitable for paired comparison data.
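The core data transformation, treating a full ranking as a set of paired comparisons, is simple to state in code. A minimal sketch, with hypothetical item names (the actual survey items are not listed in the abstract):

```python
from itertools import combinations

def ranking_to_paired_comparisons(ranking):
    """Expand a ranking (most to least important) into the set of
    pairwise preferences it implies: the higher-ranked item 'beats'
    every item ranked below it."""
    return [(winner, loser) for winner, loser in combinations(ranking, 2)]

# Hypothetical respondent: TV ranked above press, press above internet.
pairs = ranking_to_paired_comparisons(["TV", "press", "internet"])
# -> [("TV", "press"), ("TV", "internet"), ("press", "internet")]
```

A ranking of k items yields k(k-1)/2 such pairs, and it is these binary outcomes that fit naturally into the generalized-linear-model framework mentioned above.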
EPR Paradox, Locality and Completeness of Quantum Theory
Quantum theory (QT) and new stochastic approaches have no deterministic prediction for a single measurement, or for a single time series of events observed for a trapped ion, electron or any other individual physical system. The predictions of QT, being of probabilistic character, apply to the statistical distribution of the results obtained in various experiments. The probability distribution is not an attribute of a die but a characteristic of a whole random experiment: "rolling a die". Likewise, statistical long-range correlations between two random variables X and Y are not a proof of any causal relation between these variables. Moreover, any probabilistic model used to describe a random experiment is consistent only with a specific protocol telling how the random experiment is to be performed. In this sense quantum theory is a statistical and contextual theory of phenomena. In this paper we discuss these important topics in some detail. We also discuss, in historical perspective, various prerequisites used in the proofs of the Bell and CHSH inequalities, concluding that the violation of these inequalities in spin polarization correlation experiments is a proof neither of the completeness of QT nor of its nonlocality. The question whether QT is predictably complete is still open, and it should be answered by a careful and unconventional analysis of the experimental data. It is sufficient to analyze the existing experimental data in more detail, using various non-parametric purity tests and other specific statistical tools invented to study the fine structure of time series. The correct understanding of the statistical and contextual character of QT has far-reaching consequences for quantum information and quantum computing.
Comment: 16 pages, 59 references; contribution to the conference QTRF-4 held in Vaxjo, Sweden, 11-16 June 2007. To be published in the Proceedings
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends highly on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.
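The kinds of defects the abstract names (redundancies, errors, missing values) map onto concrete pre-processing steps. A minimal sketch with hypothetical field names and cleaning rules, not taken from the paper:

```python
# Toy environmental records; fields and rules are illustrative only.
raw_records = [
    {"station": "A", "no2": 41.0, "wind": 3.2},
    {"station": "A", "no2": 41.0, "wind": 3.2},   # redundancy: exact duplicate
    {"station": "B", "no2": -5.0, "wind": 1.1},   # error: impossible negative reading
    {"station": "C", "no2": 38.5, "wind": None},  # uncertainty: missing wind value
]

def preprocess(records):
    """Drop exact duplicates and physically impossible readings."""
    seen, clean = set(), []
    for r in records:
        key = tuple(sorted(r.items()))        # canonical form for duplicate check
        if key in seen:
            continue
        seen.add(key)
        if r["no2"] is None or r["no2"] < 0:  # concentrations cannot be negative
            continue
        clean.append(r)
    return clean

clean = preprocess(raw_records)  # keeps one record each for stations A and C
```

Real KDD pipelines would add imputation, outlier detection and feature selection on top of such filters, but the principle is the same: correct analyses require data prepared in the right form first.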