
Publication II Markus Ojala. 2010. Assessing data mining results on matrices with randomization. In: Assessing Data Mining Results on Matrices with Randomization

Author: Bing Liu
Chengqi Zhang
Dimitrios Gunopulos
Geoffrey I Webb
Markus Ojala
Xindong Wu
Publication date: 24/04/2020

Abstract: Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered interesting if a result as good is unlikely to be obtained on random data that shares some basic properties with the original data. Recently, the randomization approach has been applied to assess data mining results on binary matrices and limited types of real-valued matrices. In these works, the row and column value distributions are approximately preserved in randomization. However, the previous approaches suffer from various technical and practical shortcomings. In this paper, we give solutions to these problems and introduce a new practical algorithm for randomizing various types of matrices while preserving the row and column value distributions more accurately. We propose a new approach for randomizing matrices containing features measured on different scales. Compared to previous work, our approach can be applied to assess data mining results on different types of real-life matrices containing dissimilar features, nominal values, non-Gaussian value distributions, missing values and sparse structure. We provide an easily usable implementation that requires no problematic manual tuning, because theoretically justified parameter values are given. We perform extensive experiments on various real-life datasets, showing that our approach produces reasonable results on practically all types of matrices while being easy and fast to use.
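
The general idea of randomization-based significance testing described in the abstract can be illustrated with a minimal sketch. This is not the algorithm from the paper: as a deliberately simpler stand-in, the null model below permutes each column independently, which preserves column value distributions exactly but does not preserve row distributions. The function names (shuffle_columns, cooccurrence, empirical_p_value), the choice of statistic, and the toy matrix are all hypothetical and chosen only for illustration.

```python
import random


def shuffle_columns(matrix, rng):
    """Return a copy of the matrix with each column permuted independently.

    Assumption for illustration: this simple null model keeps every column's
    value distribution intact but ignores row distributions, unlike the
    method the abstract describes.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        rng.shuffle(col)
        columns.append(col)
    # Transpose the shuffled columns back into a list of rows.
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


def cooccurrence(matrix):
    """Hypothetical statistic: rows where columns 0 and 1 are both nonzero."""
    return sum(1 for row in matrix if row[0] and row[1])


def empirical_p_value(matrix, statistic, n_samples=999, seed=0):
    """Fraction of randomized matrices scoring at least as high as the original.

    The +1 in numerator and denominator counts the original matrix itself,
    so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least_as_good = sum(
        1
        for _ in range(n_samples)
        if statistic(shuffle_columns(matrix, rng)) >= observed
    )
    return (at_least_as_good + 1) / (n_samples + 1)


# Toy binary matrix: columns 0 and 1 always occur together.
data = [[1, 1, 0]] * 10 + [[0, 0, 1]] * 10
p = empirical_p_value(data, cooccurrence)
print(f"empirical p-value: {p}")
```

A small p-value here means that the observed co-occurrence is rarely matched on randomized matrices with the same column contents, which is the sense in which the abstract calls a result interesting. A null model that also preserves row distributions, as the paper proposes, would give a stricter test.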

CiteSeerX