Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation-intensive randomization approaches in estimating the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when more than one
hypothesis must be tested simultaneously for statistical significance. This
is a very common situation in many data mining applications. For instance,
assessing simultaneously the significance of all frequent itemsets of a single
dataset entails a host of hypotheses, one for each itemset. A multiple
hypothesis testing method is needed to control the number of false positives
(Type I error). Our contribution in this paper is to extend the multiple
hypothesis framework to be used with a generic data mining algorithm. We
provide a method that provably controls the family-wise error rate (FWER, the
probability of at least one false positive) in the strong sense. We evaluate
the performance of our solution on both real and generated data. The results
show that our method controls the FWER while maintaining the power of the test.
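To make the notion of strong FWER control concrete, here is a minimal sketch of Holm's step-down procedure, a standard method that controls the FWER in the strong sense under arbitrary dependence. It is not the method proposed in the paper above, which targets generic data mining algorithms; the function name `holm_reject` and the example p-values are illustrative assumptions.

```python
def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where Holm's procedure rejects.

    Illustrative sketch only; this is the classical Holm step-down rule,
    not the procedure from the abstract.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # Stop at the first non-rejection; all larger p-values are kept.
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, e.g. one per candidate itemset.
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

With the hypothetical p-values above, only the two smallest fall below their step-down thresholds, so two of the four hypotheses are rejected.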
The smallest set of constraints that explains the data : a randomization approach
Randomization methods can be used to assess statistical significance of data mining results. A randomization method typically consists of a sampler which draws data sets from a null distribution, and a test statistic. If the value of the test statistic on the original data set is more extreme than the test statistic on randomized data sets we can reject the null hypothesis. It is often not immediately clear why the null hypothesis is rejected. For example, the cost of clustering can be significantly lower in the original data than in the randomized data, but usually we would also like to know why the cost is small. We introduce a methodology for finding the smallest possible set of constraints, or patterns, that explains the data. In principle any type of pattern can be used as long as there exists an appropriate randomization method. We show that the problem is, in its general form, NP-hard, but that in a special case an exact solution can be computed fast, and propose a greedy algorithm that solves the problem. The proposed approach is demonstrated on time series data as well as on frequent itemsets in 0-1 matrices, and validated theoretically and experimentally.
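The sampler-plus-test-statistic recipe described above can be sketched as follows. This is a generic randomization test, not the constraint-selection algorithm of the paper; the names `empirical_p_value`, `randomization_test`, `sampler`, and `statistic` are hypothetical, and the sketch assumes that small statistic values (such as a low clustering cost) count as extreme.

```python
import random


def empirical_p_value(observed, randomized):
    """One-sided empirical p-value where small statistic values are extreme.

    The +1 terms count the original data set as one draw from the null
    distribution, which keeps the p-value strictly positive.
    """
    at_least_as_extreme = sum(1 for t in randomized if t <= observed)
    return (at_least_as_extreme + 1) / (len(randomized) + 1)


def randomization_test(data, statistic, sampler, n_samples=999, seed=0):
    """Compare statistic(data) with its values on randomized data sets.

    `sampler(data, rng)` is an assumed user-supplied function that draws
    one data set from the null distribution.
    """
    rng = random.Random(seed)
    observed = statistic(data)
    randomized = [statistic(sampler(data, rng)) for _ in range(n_samples)]
    return empirical_p_value(observed, randomized)


if __name__ == "__main__":
    # Toy example: is the series unusually smooth compared with shuffles of itself?
    series = [1, 2, 3, 4, 5, 6, 7, 8]

    def roughness(xs):
        return sum(abs(a - b) for a, b in zip(xs, xs[1:]))

    def shuffle(xs, rng):
        ys = list(xs)
        rng.shuffle(ys)
        return ys

    print(randomization_test(series, roughness, shuffle))
```

A small p-value says only that the null hypothesis is rejected; it does not say which structure in the data is responsible, which is the gap the abstract addresses.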
Time series classification with ensembles of elastic distance measures
Several alternative distance measures for comparing time series have recently been proposed and evaluated on time series classification (TSC) problems. These include variants of dynamic time warping (DTW), such as weighted and derivative DTW, and edit distance-based measures, including longest common subsequence, edit distance with real penalty, time warp with edit, and move–split–merge. These measures have the common characteristic that they operate in the time domain and compensate for potential localised misalignment through some elastic adjustment. Our aim is to experimentally test two hypotheses related to these distance measures. Firstly, we test whether there is any significant difference in accuracy for TSC problems between nearest neighbour classifiers using these distance measures. Secondly, we test whether combining these elastic distance measures through simple ensemble schemes gives significantly better accuracy. We test these hypotheses by carrying out one of the largest experimental studies ever conducted into time series classification. Our first key finding is that there is no significant difference between the elastic distance measures in terms of classification accuracy on our data sets. Our second finding, and the major contribution of this work, is to define an ensemble classifier that significantly outperforms the individual classifiers. We also demonstrate that the ensemble is more accurate than approaches not based in the time domain. Nearly all TSC papers in the data mining literature cite DTW (with warping window set through cross validation) as the benchmark for comparison. We believe that our ensemble is the first ever classifier to significantly outperform DTW and as such raises the bar for future work in this area.
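As an illustration of the benchmark mentioned above, the following is a minimal sketch of DTW with a warping window (a Sakoe-Chiba band). The function name `dtw_distance`, the squared pointwise cost, and the final square root are assumptions made for this sketch; the paper's ensemble and the other elastic measures are not reproduced here.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between two numeric sequences.

    `window` limits how far the alignment may stray from the diagonal;
    None means no limit. Illustrative sketch with squared pointwise costs.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no valid alignment exists.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


if __name__ == "__main__":
    print(dtw_distance([0, 0, 1], [0, 1, 1]))            # unconstrained warping
    print(dtw_distance([0, 0, 1], [0, 1, 1], window=0))  # diagonal alignment only
```

Setting `window=0` on equal-length series reduces DTW to the Euclidean distance, which is why the window size is usually treated as a parameter to tune, for example by cross validation as the abstract notes.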
The Bootstrap for Testing the Equality of Two Multivariate Stochastic Processes with an Application to Financial Markets
The problem of testing the equality of generating processes of two multivariate time series is addressed in this work. To this end, we construct two tests based on a distance measure between stochastic processes. The metric is defined in terms of the quantile cross-spectral densities of both processes. A proper estimate of this dissimilarity is the cornerstone of the proposed tests. Both techniques are based on the bootstrap. Specifically, extensions of the moving block bootstrap and the stationary bootstrap are used for their construction. The approaches are assessed in a broad range of scenarios under the null and the alternative hypotheses. The results from the analyses show that the procedure based on the stationary bootstrap exhibits the best overall performance in terms of both size and power. The proposed techniques are used to answer the question regarding whether or not the dotcom bubble crash of the 2000s permanently impacted global market behavior.
This research has been supported by MINECO (MTM2017-82724-R and PID2020-113578RB-100), the Xunta de Galicia (ED431C-2020-14), and “CITIC” (ED431G 2019/01).
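For orientation, the basic moving block bootstrap that the tests above extend can be sketched as follows. This is the plain univariate version, not the multivariate extension or the quantile cross-spectral test from the paper; the function name `moving_block_bootstrap` and its parameters are assumptions, and the sketch requires 1 <= block_length <= len(series).

```python
import random


def moving_block_bootstrap(series, block_length, rng):
    """Resample a series by concatenating randomly chosen overlapping blocks.

    Keeping observations together in blocks preserves short-range
    dependence that resampling single points would destroy.
    Assumes 1 <= block_length <= len(series).
    """
    n = len(series)
    starts = range(n - block_length + 1)  # every admissible block start
    resampled = []
    while len(resampled) < n:
        s = rng.choice(starts)
        resampled.extend(series[s:s + block_length])
    return resampled[:n]  # trim the last block to the original length


if __name__ == "__main__":
    rng = random.Random(0)
    print(moving_block_bootstrap(list(range(10)), block_length=3, rng=rng))
```

The stationary bootstrap differs mainly in drawing a random length for each block rather than using a fixed one.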