Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation-intensive randomization approaches to estimate the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
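As an illustration of the principle (not the paper's own algorithm), a MaxEnt model over 0/1 matrices whose prior information consists of the expected row and column sums factorizes into independent Bernoulli cells and can be fit by plain gradient ascent on the log-likelihood. The function name and fitting schedule below are our own:

```python
import numpy as np

def fit_maxent_margins(data, iters=3000, lr=0.1):
    """Fit a MaxEnt model over 0/1 matrices constrained to match the
    expected row and column sums (one concrete instance of the MaxEnt
    principle; the paper's framework admits more general priors).
    The maximizing distribution factorizes into independent Bernoulli
    cells with success probability sigmoid(r_i + c_j)."""
    n, m = data.shape
    row_sums, col_sums = data.sum(axis=1), data.sum(axis=0)
    r, c = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(r[:, None] + c[None, :])))
        # gradient ascent on the log-likelihood: drive the expected
        # margins of the model toward the observed margins
        r += lr * (row_sums - p.sum(axis=1)) / m
        c += lr * (col_sums - p.sum(axis=0)) / n
    return 1.0 / (1.0 + np.exp(-(r[:, None] + c[None, :])))

D = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 0, 0]])
P = fit_maxent_margins(D)  # expected margins of P match those of D
```

Because the model is represented explicitly as a matrix of cell probabilities, quantities of interest can be computed directly rather than estimated by repeated sampling.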
Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when more than one
hypothesis must be tested simultaneously for statistical significance. This
is a very common situation in many data mining applications. For instance,
simultaneously assessing the significance of all frequent itemsets of a single
dataset entails a host of hypotheses, one for each itemset. A multiple
hypothesis testing method is needed to control the number of false positives
(Type I error). Our contribution in this paper is to extend the multiple
hypothesis framework to be used with a generic data mining algorithm. We
provide a method that provably controls the family-wise error rate (FWER, the
probability of at least one false positive) in the strong sense. We evaluate
the performance of our solution on both real and generated data. The results
show that our method controls the FWER while maintaining the power of the test.
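For comparison, the best-known procedure that controls the FWER in the strong sense under arbitrary dependence is Holm's step-down method. The sketch below implements that standard textbook procedure, not the paper's own algorithm-specific one:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm's step-down procedure: adjusted p-values that control the
    family-wise error rate in the strong sense for any dependence
    structure among the hypotheses.  Reject hypothesis i at level
    alpha iff the adjusted p-value is at most alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        # the k-th smallest p-value is multiplied by (m - k + 1),
        # and adjusted values are forced to be monotone
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

adj = holm_adjust([0.001, 0.04, 0.03, 0.2])
```

Holm's method is uniformly more powerful than the plain Bonferroni correction (which would multiply every p-value by m) while giving the same strong-sense FWER guarantee.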
The cross-frequency mediation mechanism of intracortical information transactions
In a seminal paper by von Stein and Sarnthein (2000), it was hypothesized
that "bottom-up" information processing of "content" elicits local, high
frequency (beta-gamma) oscillations, whereas "top-down" processing is
"contextual", characterized by large scale integration spanning distant
cortical regions, and implemented by slower frequency (theta-alpha)
oscillations. This corresponds to a mechanism of cortical information
transactions, where synchronization of beta-gamma oscillations between distant
cortical regions is mediated by widespread theta-alpha oscillations. It is the
aim of this paper to express this hypothesis quantitatively, in terms of a
model that will allow testing this type of information transaction mechanism.
The basic methodology used here corresponds to statistical mediation analysis,
originally developed by Baron and Kenny (1986). We generalize the classical
mediator model to the case of multivariate complex-valued data, consisting of
the discrete Fourier transform coefficients of signals of electric neuronal
activity, at different frequencies, and at different cortical locations. The
"mediation effect" is quantified here in a novel way, as the product of "dual
frequency RV-coupling coefficients", which were introduced by Pascual-Marqui
et al. (2016, http://arxiv.org/abs/1603.05343). Relevant statistical procedures are
presented for testing the cross-frequency mediation mechanism in general, and
in particular for testing the von Stein & Sarnthein hypothesis.
The smallest set of constraints that explains the data: a randomization approach
Randomization methods can be used to assess the statistical significance of data mining results. A randomization method typically consists of a sampler, which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than its values on the randomized data sets, we can reject the null hypothesis. However, it is often not immediately clear why the null hypothesis is rejected. For example, the cost of a clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is small. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle, any type of pattern can be used, as long as an appropriate randomization method exists. We show that the problem is NP-hard in its general form, but that an exact solution can be computed quickly in a special case, and we propose a greedy algorithm that solves the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and it is validated both theoretically and experimentally.
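The generic recipe described above, a sampler plus a test statistic yielding an empirical p-value, can be sketched as follows. The sampler shown is a placeholder standard-normal null, not one of the constraint-preserving samplers the abstract refers to:

```python
import numpy as np

def empirical_p(statistic, data, sampler, n_samples=999, rng=None):
    """Generic randomization test: the p-value is the fraction of
    randomized data sets whose statistic is at least as extreme
    (here: at least as large) as the observed one.  The +1 terms
    keep the p-value valid, i.e. never exactly zero."""
    rng = rng or np.random.default_rng()
    t_obs = statistic(data)
    exceed = sum(statistic(sampler(rng)) >= t_obs
                 for _ in range(n_samples))
    return (1 + exceed) / (1 + n_samples)

# toy example with a placeholder null: is the observed mean larger
# than the means of standard-normal samples of the same size?
data = np.full(20, 0.9)
p = empirical_p(np.mean, data,
                lambda rng: rng.standard_normal(20),
                rng=np.random.default_rng(2))
```

Any sampler and any statistic can be plugged in, which is exactly why the choice of the null distribution, and the constraints it preserves, determines what a rejection actually tells us.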
A Proposition for Fixing the Dimensionality of a Laplacian Low-rank Approximation of any Binary Data-matrix
Laplacian low-rank approximations are much appreciated in the context of graph spectral methods and Correspondence Analysis. We address here the problem of determining the dimensionality K* of the relevant eigenspace of a general binary data table by a statistically well-founded method. We propose 1) a general framework for graph adjacency matrices and any rectangular binary matrix, and 2) a randomization test for fixing K*. We illustrate with both artificial and real data.
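One hypothetical way to realize such a randomization test for K*, assuming a plain SVD spectrum of the column-centered matrix rather than the authors' Laplacian normalization, is to retain the leading dimensions whose singular values exceed a null quantile obtained by independently permuting each column:

```python
import numpy as np

def estimate_k(data, n_rand=200, q=0.95, rng=None):
    """Hypothetical randomization test for the dimensionality K*:
    keep the leading dimensions whose singular values (of the
    column-centered matrix) exceed the q-quantile of the spectra of
    matrices with independently permuted columns.  The paper's actual
    test is based on a Laplacian-normalized spectrum instead."""
    rng = rng or np.random.default_rng()

    def spectrum(X):
        return np.linalg.svd(X - X.mean(axis=0), compute_uv=False)

    sv = spectrum(data)
    null = np.array([spectrum(np.apply_along_axis(rng.permutation, 0, data))
                     for _ in range(n_rand)])
    thresh = np.quantile(null, q, axis=0)
    above = sv > thresh
    # K* = number of leading singular values above the null threshold
    return len(sv) if above.all() else int(np.argmax(~above))

# toy check: a two-block 0/1 matrix has a single significant
# dimension after centering
D = np.zeros((20, 20), dtype=int)
D[:10, :10] = 1
D[10:, 10:] = 1
```

Permuting each column independently preserves column sums while destroying the joint row structure, so dimensions that survive the test capture genuine association rather than margin effects.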
Randomization algorithms for assessing the significance of data mining results
Data mining is an interdisciplinary research area that develops general methods for finding interesting and useful knowledge from large collections of data. This thesis addresses, from a computational point of view, the problem of assessing whether the obtained data mining results are merely random artefacts in the data or something more interesting.
In randomization based significance testing, a result is compared with the results obtained on randomized data. The randomized data are assumed to share some basic properties with the original data. To apply the randomization approach, the first step is to define these properties. The next step is to develop algorithms that can produce such randomizations. Results on the real data that clearly differ from the results on the randomized data are not directly explained by the studied properties of the data.
In this thesis, new randomization methods are developed for four specific data mining scenarios. First, randomizing matrix data while preserving the distributions of values in rows and columns is studied. Next, a general randomization approach is introduced for iterative data mining. Randomization in multi-relational databases is also considered. Finally, a simple permutation method is given for assessing whether dependencies between features are exploited in classification.
The properties of the new randomization methods are analyzed theoretically, and extensive experiments are performed on real and artificial datasets. The randomization methods introduced in this thesis are useful in various data mining applications. The methods work well on different types of data, are easy to use, and provide meaningful information that helps to further improve and understand the data mining results.
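For the first scenario, a standard sampler that preserves the row and column sums of a 0/1 matrix exactly is swap randomization. A minimal sketch, noting that the thesis also treats real-valued matrices, where value distributions rather than sums are preserved:

```python
import numpy as np

def swap_randomize(data, n_swaps=10000, rng=None):
    """Swap randomization of a 0/1 matrix: repeatedly pick a 2x2
    submatrix of the form [[1,0],[0,1]] or its mirror image and flip
    it to the other form.  Each accepted swap preserves every row sum
    and every column sum exactly, so the result is a sample from a
    null model with fixed margins."""
    rng = rng or np.random.default_rng()
    M = data.copy()
    n, m = M.shape
    for _ in range(n_swaps):
        i, j = rng.integers(n, size=2)
        k, l = rng.integers(m, size=2)
        # flip only when the four cells form a swappable checkerboard
        if M[i, k] == 1 and M[j, l] == 1 and M[i, l] == 0 and M[j, k] == 0:
            M[i, k] = M[j, l] = 0
            M[i, l] = M[j, k] = 1
    return M

rng = np.random.default_rng(4)
D = (rng.random((10, 10)) < 0.4).astype(int)
R = swap_randomize(D, rng=rng)  # same row and column sums as D
```

Running the swap chain long enough between samples makes successive randomized matrices approximately independent draws from the fixed-margin null distribution.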