
    Explicit probabilistic models for databases and networks

    Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, against which the found patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation-intensive randomization approaches to estimate the statistical significance of data mining results. Here, we introduce a methodology to construct non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they also have the benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types; for concreteness, we demonstrate it in particular for databases and networks.
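
    As a concrete illustration of an explicit MaxEnt null model of the kind described above, the following sketch (not taken from the paper; the learning rate and iteration count are assumptions) fits the maximum entropy distribution over 0-1 matrices whose prior information is the expected row and column sums. Under these constraints the model factorizes into independent Bernoulli cells, so it can be written down explicitly and sampled from directly.

        import numpy as np

        def fit_maxent_margins(D, iters=2000, lr=0.05):
            """Fit row/column multipliers a, b so that the expected margins of the MaxEnt
            model p_ij = sigmoid(a_i + b_j) match those of the observed 0-1 matrix D."""
            n, m = D.shape
            row_sums, col_sums = D.sum(axis=1), D.sum(axis=0)
            a, b = np.zeros(n), np.zeros(m)
            for _ in range(iters):
                P = 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))  # cell probabilities
                a += lr * (row_sums - P.sum(axis=1))  # log-likelihood gradient w.r.t. a
                b += lr * (col_sums - P.sum(axis=0))  # log-likelihood gradient w.r.t. b
            return P

        # The null model is represented explicitly by P; randomized datasets are direct draws.
        rng = np.random.default_rng(0)
        D = (rng.random((30, 20)) < 0.3).astype(int)   # toy binary database
        D_null = (rng.random(D.shape) < fit_maxent_margins(D)).astype(int)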

    Multiple Hypothesis Testing in Pattern Discovery

    The problem of multiple hypothesis testing arises when more than one hypothesis must be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing simultaneously the significance of all frequent itemsets of a single dataset entails a host of hypotheses, one for each itemset. A multiple hypothesis testing method is needed to control the number of false positives (Type I errors). Our contribution in this paper is to extend the multiple hypothesis testing framework to be used with a generic data mining algorithm. We provide a method that provably controls the family-wise error rate (FWER, the probability of at least one false positive) in the strong sense. We evaluate the performance of our solution on both real and generated data. The results show that our method controls the FWER while maintaining the power of the test.
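
    A minimal sketch of one standard way to obtain FWER control in this kind of setting, a Westfall-Young-style max-statistic adjustment computed over randomized datasets; this is a generic illustration, not the authors' specific procedure:

        import numpy as np

        def fwer_adjusted_pvalues(stat_orig, stat_rand):
            """stat_orig: statistics of k patterns on the original data, shape (k,).
            stat_rand: the same statistics on R randomized datasets, shape (R, k).
            Returns p-values adjusted by the per-randomization maximum statistic."""
            max_per_round = stat_rand.max(axis=1)  # strongest pattern in each randomization
            R = stat_rand.shape[0]
            return np.array([(1 + np.sum(max_per_round >= s)) / (R + 1) for s in stat_orig])

        # Patterns whose adjusted p-value falls below alpha are reported; the probability
        # of reporting even one false positive is then kept at (approximately) alpha.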

    The cross-frequency mediation mechanism of intracortical information transactions

    In a seminal paper by von Stein and Sarnthein (2000), it was hypothesized that "bottom-up" information processing of "content" elicits local, high-frequency (beta-gamma) oscillations, whereas "top-down" processing is "contextual", characterized by large-scale integration spanning distant cortical regions and implemented by slower-frequency (theta-alpha) oscillations. This corresponds to a mechanism of cortical information transactions in which synchronization of beta-gamma oscillations between distant cortical regions is mediated by widespread theta-alpha oscillations. The aim of this paper is to express this hypothesis quantitatively, in terms of a model that allows testing this type of information transaction mechanism. The basic methodology used here corresponds to statistical mediation analysis, originally developed by Baron and Kenny (1986). We generalize the classical mediator model to the case of multivariate complex-valued data, consisting of the discrete Fourier transform coefficients of signals of electric neuronal activity, at different frequencies and at different cortical locations. The "mediation effect" is quantified here in a novel way, as the product of "dual frequency RV-coupling coefficients" introduced in Pascual-Marqui et al. (2016, http://arxiv.org/abs/1603.05343). Relevant statistical procedures are presented for testing the cross-frequency mediation mechanism in general, and in particular for testing the von Stein and Sarnthein hypothesis. (Preprint: https://doi.org/10.1101/119362, licensed CC BY-NC-ND 4.0, http://creativecommons.org/licenses/by-nc-nd/4.0)
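
    For orientation, the classical real-valued, univariate mediator model of Baron and Kenny, which the paper generalizes to multivariate complex-valued Fourier coefficients, can be sketched as follows; the variable names and simulated effect sizes are purely illustrative:

        import numpy as np

        def mediation_effect(X, M, Y):
            """Indirect (mediated) effect a*b from the regressions M ~ X and Y ~ X + M."""
            X, M, Y = (np.asarray(v, float) - np.mean(v) for v in (X, M, Y))
            a = X @ M / (X @ X)                                # path X -> M
            coef, *_ = np.linalg.lstsq(np.column_stack([X, M]), Y, rcond=None)
            b = coef[1]                                        # path M -> Y, controlling for X
            return a * b

        rng = np.random.default_rng(1)
        X = rng.standard_normal(500)                           # sender signal
        M = 0.8 * X + rng.standard_normal(500)                 # mediator signal
        Y = 0.5 * M + 0.1 * X + rng.standard_normal(500)       # receiver signal
        print(mediation_effect(X, M, Y))                       # close to 0.8 * 0.5 = 0.4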

    The smallest set of constraints that explains the data: a randomization approach

    Randomization methods can be used to assess the statistical significance of data mining results. A randomization method typically consists of a sampler, which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than its values on the randomized data sets, we can reject the null hypothesis. It is often not immediately clear why the null hypothesis is rejected. For example, the cost of clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is small. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle any type of pattern can be used, as long as an appropriate randomization method exists. We show that the problem is, in its general form, NP-hard, but that in a special case an exact solution can be computed fast, and we propose a greedy algorithm that solves the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and validated both theoretically and experimentally.
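
    A minimal sketch of the sampler-plus-statistic loop described above, using swap randomization of a 0-1 matrix (which preserves row and column sums) as one possible sampler; the number of swap attempts and the choice of statistic are assumptions:

        import numpy as np

        def swap_randomize(D, n_swaps, rng):
            """Draw from the null of 0-1 matrices with the same row and column sums as D
            by repeatedly swapping 2x2 'checkerboard' submatrices."""
            D = D.copy()
            n, m = D.shape
            for _ in range(n_swaps):
                r1, r2 = rng.integers(n, size=2)
                c1, c2 = rng.integers(m, size=2)
                if D[r1, c1] == D[r2, c2] == 1 and D[r1, c2] == D[r2, c1] == 0:
                    D[r1, c1] = D[r2, c2] = 0
                    D[r1, c2] = D[r2, c1] = 1
            return D

        def empirical_p_value(D, statistic, n_samples=1000, seed=0):
            """One-sided p-value: how often a randomized dataset is at least as extreme."""
            rng = np.random.default_rng(seed)
            t_obs = statistic(D)
            t_rand = [statistic(swap_randomize(D, 10 * int(D.sum()), rng))
                      for _ in range(n_samples)]
            return (1 + sum(t >= t_obs for t in t_rand)) / (n_samples + 1)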

    A Proposition for Fixing the Dimensionality of a Laplacian Low-rank Approximation of any Binary Data-matrix

    Laplacian low-rank approximations are much appreciated in the context of graph spectral methods and Correspondence Analysis. We address here the problem of determining the dimensionality K* of the relevant eigenspace of a general binary data table by a statistically well-founded method. We propose 1) a general framework for graph adjacency matrices and any rectangular binary matrix, and 2) a randomization test for fixing K*. We illustrate with both artificial and real data.
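
    A sketch in the spirit of the proposed randomization test (the CA-style normalization, the column-wise permutation null, and the 95th-percentile cutoff are assumptions, not the authors' exact procedure): K* is taken as the number of singular values of the normalized binary matrix that exceed what randomized copies of the matrix produce.

        import numpy as np

        def ca_singular_values(D):
            """Non-trivial singular values of the Correspondence-Analysis-normalized matrix."""
            P = D / D.sum()
            r, c = P.sum(axis=1), P.sum(axis=0)
            S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
            return np.linalg.svd(S, compute_uv=False)

        def estimate_K(D, n_rand=200, q=95, seed=0):
            """Keep the dimensions whose observed singular value exceeds the q-th percentile
            of the singular values obtained on independently column-shuffled matrices."""
            rng = np.random.default_rng(seed)
            sv_obs = ca_singular_values(D)
            sv_rand = np.array([ca_singular_values(rng.permuted(D, axis=0))
                                for _ in range(n_rand)])
            return int(np.sum(sv_obs > np.percentile(sv_rand, q, axis=0)))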

    Randomization algorithms for assessing the significance of data mining results

    Data mining is an interdisciplinary research area that develops general methods for finding interesting and useful knowledge from large collections of data. This thesis addresses, from the computational point of view, the problem of assessing whether the obtained data mining results are merely random artefacts in the data or something more interesting. In randomization-based significance testing, a result is compared with the results obtained on randomized data. The randomized data are assumed to share some basic properties with the original data. To apply the randomization approach, the first step is to define these properties; the next step is to develop algorithms that can produce such randomizations. Results on the real data that clearly differ from the results on the randomized data are not directly explained by the studied properties of the data. In this thesis, new randomization methods are developed for four specific data mining scenarios. First, randomizing matrix data while preserving the distributions of values in rows and columns is studied. Next, a general randomization approach is introduced for iterative data mining. Randomization in multi-relational databases is also considered. Finally, a simple permutation method is given for assessing whether dependencies between features are exploited in classification. The properties of the new randomization methods are analyzed theoretically, and extensive experiments are performed on real and artificial datasets. The randomization methods introduced in this thesis are useful in various data mining applications: they work well on different types of data, are easy to use, and provide meaningful information to further improve and understand the data mining results.
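
    The last scenario above can be sketched as follows (a generic illustration assuming scikit-learn and a random forest classifier; these specific choices are not from the thesis): each feature column is permuted independently within each class, which breaks the dependencies between features while keeping the class-conditional marginals, and the classifier's cross-validated accuracy on the original data is compared with its accuracy on the permuted data.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        def permute_within_classes(X, y, rng):
            """Shuffle each feature independently inside each class."""
            Xp = X.copy()
            for label in np.unique(y):
                idx = np.where(y == label)[0]
                for j in range(X.shape[1]):
                    Xp[idx, j] = X[rng.permutation(idx), j]
            return Xp

        def dependency_p_value(X, y, n_perm=100, seed=0):
            """Small p-value: the classifier exploits dependencies between features."""
            rng = np.random.default_rng(seed)
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            acc_obs = cross_val_score(clf, X, y, cv=5).mean()
            acc_perm = [cross_val_score(clf, permute_within_classes(X, y, rng), y, cv=5).mean()
                        for _ in range(n_perm)]
            return (1 + sum(a >= acc_obs for a in acc_perm)) / (n_perm + 1)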