Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation-intensive randomization approaches to estimate the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
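As an illustration of the principle (not the paper's own algorithm), a MaxEnt model over 0/1 matrices whose prior information consists of the expected row and column sums factorizes into independent Bernoulli cells and can be fit by plain gradient ascent on the log-likelihood. The function name and fitting schedule below are our own:

```python
import numpy as np

def fit_maxent_margins(data, iters=3000, lr=0.1):
    """Fit a MaxEnt model over 0/1 matrices constrained to match the
    expected row and column sums (one concrete instance of the MaxEnt
    principle; the paper's framework admits more general priors).
    The maximizing distribution factorizes into independent Bernoulli
    cells with success probability sigmoid(r_i + c_j)."""
    n, m = data.shape
    row_sums, col_sums = data.sum(axis=1), data.sum(axis=0)
    r, c = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(r[:, None] + c[None, :])))
        # gradient ascent on the log-likelihood: drive the expected
        # margins of the model toward the observed margins
        r += lr * (row_sums - p.sum(axis=1)) / m
        c += lr * (col_sums - p.sum(axis=0)) / n
    return 1.0 / (1.0 + np.exp(-(r[:, None] + c[None, :])))

D = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 0, 0]])
P = fit_maxent_margins(D)  # expected margins of P match those of D
```

Because the model is represented explicitly as a matrix of cell probabilities, quantities of interest can be computed directly rather than estimated by repeated sampling.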
Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when more than one
hypothesis must be tested simultaneously for statistical significance. This
is a very common situation in many data mining applications. For instance,
simultaneously assessing the significance of all frequent itemsets of a single
dataset entails a host of hypotheses, one for each itemset. A multiple
hypothesis testing method is needed to control the number of false positives
(Type I error). Our contribution in this paper is to extend the multiple
hypothesis framework to be used with a generic data mining algorithm. We
provide a method that provably controls the family-wise error rate (FWER, the
probability of at least one false positive) in the strong sense. We evaluate
the performance of our solution on both real and generated data. The results
show that our method controls the FWER while maintaining the power of the test.
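For comparison, the best-known procedure that controls the FWER in the strong sense under arbitrary dependence is Holm's step-down method. The sketch below implements that standard textbook procedure, not the paper's own algorithm-specific one:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm's step-down procedure: adjusted p-values that control the
    family-wise error rate in the strong sense for any dependence
    structure among the hypotheses.  Reject hypothesis i at level
    alpha iff the adjusted p-value is at most alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        # the k-th smallest p-value is multiplied by (m - k + 1),
        # and adjusted values are forced to be monotone
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

adj = holm_adjust([0.001, 0.04, 0.03, 0.2])
```

Holm's method is uniformly more powerful than the plain Bonferroni correction (which would multiply every p-value by m) while giving the same strong-sense FWER guarantee.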
The cross-frequency mediation mechanism of intracortical information transactions
In a seminal paper by von Stein and Sarnthein (2000), it was hypothesized
that "bottom-up" information processing of "content" elicits local, high
frequency (beta-gamma) oscillations, whereas "top-down" processing is
"contextual", characterized by large scale integration spanning distant
cortical regions, and implemented by slower frequency (theta-alpha)
oscillations. This corresponds to a mechanism of cortical information
transactions, where synchronization of beta-gamma oscillations between distant
cortical regions is mediated by widespread theta-alpha oscillations. It is the
aim of this paper to express this hypothesis quantitatively, in terms of a
model that will allow testing this type of information transaction mechanism.
The basic methodology used here corresponds to statistical mediation analysis,
originally developed by Baron and Kenny (1986). We generalize the classical
mediator model to the case of multivariate complex-valued data, consisting of
the discrete Fourier transform coefficients of signals of electric neuronal
activity, at different frequencies, and at different cortical locations. The
"mediation effect" is quantified here in a novel way, as the product of "dual
frequency RV-coupling coefficients", which were introduced by Pascual-Marqui
et al. (2016, http://arxiv.org/abs/1603.05343). Relevant statistical procedures are
presented for testing the cross-frequency mediation mechanism in general, and
in particular for testing the von Stein & Sarnthein hypothesis.
The smallest set of constraints that explains the data: a randomization approach
Randomization methods can be used to assess the statistical significance of data mining results. A randomization method typically consists of a sampler, which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than its values on the randomized data sets, we can reject the null hypothesis. However, it is often not immediately clear why the null hypothesis is rejected. For example, the cost of a clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is small. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle, any type of pattern can be used, as long as an appropriate randomization method exists. We show that the problem is NP-hard in its general form, but that an exact solution can be computed quickly in a special case, and we propose a greedy algorithm that solves the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and it is validated both theoretically and experimentally.
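The generic recipe described above, a sampler plus a test statistic yielding an empirical p-value, can be sketched as follows. The sampler shown is a placeholder standard-normal null, not one of the constraint-preserving samplers the abstract refers to:

```python
import numpy as np

def empirical_p(statistic, data, sampler, n_samples=999, rng=None):
    """Generic randomization test: the p-value is the fraction of
    randomized data sets whose statistic is at least as extreme
    (here: at least as large) as the observed one.  The +1 terms
    keep the p-value valid, i.e. never exactly zero."""
    rng = rng or np.random.default_rng()
    t_obs = statistic(data)
    exceed = sum(statistic(sampler(rng)) >= t_obs
                 for _ in range(n_samples))
    return (1 + exceed) / (1 + n_samples)

# toy example with a placeholder null: is the observed mean larger
# than the means of standard-normal samples of the same size?
data = np.full(20, 0.9)
p = empirical_p(np.mean, data,
                lambda rng: rng.standard_normal(20),
                rng=np.random.default_rng(2))
```

Any sampler and any statistic can be plugged in, which is exactly why the choice of the null distribution, and the constraints it preserves, determines what a rejection actually tells us.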
A Proposition for Fixing the Dimensionality of a Laplacian Low-rank Approximation of any Binary Data-matrix
Laplacian low-rank approximations are much appreciated in the context of graph spectral methods and Correspondence Analysis. We address here the problem of determining the dimensionality K* of the relevant eigenspace of a general binary data table by a statistically well-founded method. We propose 1) a general framework for graph adjacency matrices and any rectangular binary matrix, and 2) a randomization test for fixing K*. We illustrate with both artificial and real data.
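One hypothetical way to realize such a randomization test for K*, assuming a plain SVD spectrum of the column-centered matrix rather than the authors' Laplacian normalization, is to retain the leading dimensions whose singular values exceed a null quantile obtained by independently permuting each column:

```python
import numpy as np

def estimate_k(data, n_rand=200, q=0.95, rng=None):
    """Hypothetical randomization test for the dimensionality K*:
    keep the leading dimensions whose singular values (of the
    column-centered matrix) exceed the q-quantile of the spectra of
    matrices with independently permuted columns.  The paper's actual
    test is based on a Laplacian-normalized spectrum instead."""
    rng = rng or np.random.default_rng()

    def spectrum(X):
        return np.linalg.svd(X - X.mean(axis=0), compute_uv=False)

    sv = spectrum(data)
    null = np.array([spectrum(np.apply_along_axis(rng.permutation, 0, data))
                     for _ in range(n_rand)])
    thresh = np.quantile(null, q, axis=0)
    above = sv > thresh
    # K* = number of leading singular values above the null threshold
    return len(sv) if above.all() else int(np.argmax(~above))

# toy check: a two-block 0/1 matrix has a single significant
# dimension after centering
D = np.zeros((20, 20), dtype=int)
D[:10, :10] = 1
D[10:, 10:] = 1
```

Permuting each column independently preserves column sums while destroying the joint row structure, so dimensions that survive the test capture genuine association rather than margin effects.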
Randomization algorithms for assessing the significance of data mining results
Data mining is an interdisciplinary research area that develops general methods for finding interesting and useful knowledge from large collections of data. This thesis addresses, from a computational point of view, the problem of assessing whether the obtained data mining results are merely random artefacts in the data or something more interesting.
In randomization based significance testing, a result is compared with the results obtained on randomized data. The randomized data are assumed to share some basic properties with the original data. To apply the randomization approach, the first step is to define these properties. The next step is to develop algorithms that can produce such randomizations. Results on the real data that clearly differ from the results on the randomized data are not directly explained by the studied properties of the data.
In this thesis, new randomization methods are developed for four specific data mining scenarios. First, randomizing matrix data while preserving the distributions of values in rows and columns is studied. Next, a general randomization approach is introduced for iterative data mining. Randomization in multi-relational databases is also considered. Finally, a simple permutation method is given for assessing whether dependencies between features are exploited in classification.
The properties of the new randomization methods are analyzed theoretically, and extensive experiments are performed on real and artificial datasets. The randomization methods introduced in this thesis are useful in various data mining applications. The methods work well on different types of data, are easy to use, and provide meaningful information that helps to further improve and understand the data mining results.
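For the first scenario, a standard sampler that preserves the row and column sums of a 0/1 matrix exactly is swap randomization. A minimal sketch, noting that the thesis also treats real-valued matrices, where value distributions rather than sums are preserved:

```python
import numpy as np

def swap_randomize(data, n_swaps=10000, rng=None):
    """Swap randomization of a 0/1 matrix: repeatedly pick a 2x2
    submatrix of the form [[1,0],[0,1]] or its mirror image and flip
    it to the other form.  Each accepted swap preserves every row sum
    and every column sum exactly, so the result is a sample from a
    null model with fixed margins."""
    rng = rng or np.random.default_rng()
    M = data.copy()
    n, m = M.shape
    for _ in range(n_swaps):
        i, j = rng.integers(n, size=2)
        k, l = rng.integers(m, size=2)
        # flip only when the four cells form a swappable checkerboard
        if M[i, k] == 1 and M[j, l] == 1 and M[i, l] == 0 and M[j, k] == 0:
            M[i, k] = M[j, l] = 0
            M[i, l] = M[j, k] = 1
    return M

rng = np.random.default_rng(4)
D = (rng.random((10, 10)) < 0.4).astype(int)
R = swap_randomize(D, rng=rng)  # same row and column sums as D
```

Running the swap chain long enough between samples makes successive randomized matrices approximately independent draws from the fixed-margin null distribution.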