Rough sets theory for travel demand analysis in Malaysia
This study integrates rough sets theory into tourism demand analysis. Originating in the field of Artificial Intelligence, rough sets theory was introduced to disclose important structures in data and to classify objects. The rough sets methodology provides definitions and methods for finding which attributes separate one class or classification from another. Based on this theory, a formal framework for the automated transformation of data into knowledge can be proposed. This makes the rough sets approach a useful classification and pattern recognition technique. This study introduces a new rough sets approach for deriving rules from an information table of tourists in Malaysia. The induced rules were able to forecast changes in demand with a certain degree of accuracy
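The rule-induction machinery such a study builds on is standard rough sets theory: indiscernibility classes over condition attributes, and lower/upper approximations of a decision class, from which certain and possible rules are derived. A minimal sketch in Python, on a hypothetical toy table rather than the paper's Malaysian tourism data:

```python
# Minimal rough-set sketch: indiscernibility classes and the lower/upper
# approximations that underpin rule induction. The toy table below is a
# hypothetical stand-in, not the paper's tourism dataset.
from collections import defaultdict

def indiscernibility(table, attrs):
    """Group object ids by their values on the chosen attributes."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj_id)
    return list(classes.values())

def approximations(table, attrs, target):
    """Lower/upper approximation of a target set of objects."""
    lower, upper = set(), set()
    for block in indiscernibility(table, attrs):
        if block <= target:   # block entirely inside target: certain rule
            lower |= block
        if block & target:    # block overlaps target: possible rule
            upper |= block
    return lower, upper

table = {
    1: {"season": "peak", "origin": "asia"},
    2: {"season": "peak", "origin": "asia"},
    3: {"season": "off",  "origin": "europe"},
    4: {"season": "peak", "origin": "europe"},
}
high_demand = {1, 4}  # decision class
low, up = approximations(table, ["season", "origin"], high_demand)
print(low, up)        # {4} vs {1, 2, 4}: certain vs possible members
```

Objects in the lower approximation support certain rules; the boundary region (upper minus lower) supports only possible ones, which is why induced rules forecast with bounded rather than perfect accuracy.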
Network Model Selection Using Task-Focused Minimum Description Length
Networks are fundamental models for data used in practically every
application domain. In most instances, several implicit or explicit choices
about the network definition impact the translation of underlying data to a
network representation, and the subsequent question(s) about the underlying
system being represented. Users of downstream network data may not even be
aware of these choices or their impacts. We propose a task-focused network
model selection methodology which addresses several key challenges. Our
approach constructs network models from underlying data and uses minimum
description length (MDL) criteria for selection. Our methodology measures
efficiency, a general and comparable measure of the network's performance on a
local (i.e. node-level) predictive task of interest. Selection on efficiency
favors parsimonious (e.g. sparse) models to avoid overfitting and can be
applied across arbitrary tasks and representations. We show stability,
sensitivity, and significance testing in our methodology
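The efficiency measure itself is specific to the paper, but the two-part MDL trade-off behind the selection step is easy to illustrate: a candidate network model pays for its own encoding plus the encoding of the errors it makes on the local predictive task. A hedged, generic sketch (the edge-list and Bernoulli error codes below are simplifying assumptions, not the authors' construction):

```python
# Generic two-part MDL model selection: total bits = L(model) +
# L(errors | model); the candidate network model with the shortest
# code wins. The specific codes here are illustrative assumptions.
import math

def error_bits(num_errors, num_predictions):
    """Shannon code length for the task's error indicator sequence."""
    if num_errors in (0, num_predictions):
        return 0.0
    p = num_errors / num_predictions
    return -(num_errors * math.log2(p)
             + (num_predictions - num_errors) * math.log2(1 - p))

def mdl_score(num_edges, num_nodes, num_errors, num_predictions):
    """Model cost (naive edge-list code) plus data-given-model cost."""
    model_bits = num_edges * 2 * math.log2(max(num_nodes, 2))
    return model_bits + error_bits(num_errors, num_predictions)

# The dense model predicts slightly better but encodes far less cheaply.
candidates = {
    "sparse": mdl_score(num_edges=50, num_nodes=100,
                        num_errors=12, num_predictions=100),
    "dense": mdl_score(num_edges=900, num_nodes=100,
                       num_errors=9, num_predictions=100),
}
print(min(candidates, key=candidates.get))  # -> "sparse"
```

Selecting on total description length is what favors parsimony: extra edges must buy enough error reduction to pay for their own encoding.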
Automatically estimating iSAX parameters
The indexable Symbolic Aggregate approXimation (iSAX) is widely used in time
series data mining. Its popularity arises from the fact that it greatly reduces time
series size, is symbolic, allows lower bounding, and is space efficient. However,
it requires setting two parameters: the symbolic length and alphabet size, which
limits the applicability of the technique. The optimal parameter values are highly
application dependent. Typically, they are either set to a fixed value or experimentally
probed for the best configuration. In this work we propose an approach to
automatically estimate iSAX's parameters. The approach, AutoiSAX, not only
discovers the best parameter setting for each time series in the database, but also
finds the alphabet size for each iSAX symbol within the same word. It is based on
simple and intuitive ideas from time series complexity and statistics. The technique
can be smoothly embedded in existing data mining tasks as an efficient sub-routine.
We analyze its impact in visualization interpretability, classification accuracy and
motif mining. Our contribution aims to make iSAX a more general approach as it
evolves towards a parameter-free method
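For context, the baseline SAX symbolization that these two parameters control is short: z-normalize, average over segments (piecewise aggregate approximation), then quantize against Gaussian breakpoints. A minimal sketch of that step, with word_length and alphabet_size as the parameters AutoiSAX would estimate (the estimation procedure itself is not reproduced here):

```python
# Standard SAX-style symbolization: PAA plus equiprobable Gaussian
# breakpoints. AutoiSAX's automatic parameter estimation is not shown.
import numpy as np
from scipy.stats import norm

def sax_symbolize(series, word_length, alphabet_size):
    """Z-normalize, reduce with PAA, then map segments to symbol indices."""
    x = (series - series.mean()) / series.std()
    paa = np.array([seg.mean() for seg in np.array_split(x, word_length)])
    # Breakpoints that make symbols equiprobable under N(0, 1).
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return np.searchsorted(breakpoints, paa)

series = np.sin(np.linspace(0, 4 * np.pi, 128))
print(sax_symbolize(series, word_length=8, alphabet_size=4))
```

AutoiSAX's contribution is choosing these values (including a per-symbol alphabet size) from the series itself, e.g. from complexity statistics, instead of fixing them globally.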
Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework
When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental piece of information needed for this aim is the minimum number of parameters required to describe the data while minimizing information loss. This number, usually called the intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in recent decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators
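To make the task concrete, one simple estimator in this family, the two-nearest-neighbour (TwoNN) method of Facco et al. (which may or may not be among those the survey covers), fits in a few lines: ratios of second- to first-neighbour distances follow a Pareto law whose exponent is the intrinsic dimension. A minimal sketch, assuming scipy is available:

```python
# TwoNN intrinsic dimension estimate from nearest-neighbour distance
# ratios; the synthetic data lie in a 3-D subspace embedded in 10-D.
import numpy as np
from scipy.spatial import cKDTree

def twonn_intrinsic_dimension(points):
    """Maximum-likelihood ID estimate from r2 / r1 neighbour ratios."""
    # k=3 returns each point itself plus its two nearest neighbours.
    dists, _ = cKDTree(points).query(points, k=3)
    mu = dists[:, 2] / dists[:, 1]
    return len(points) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
latent = rng.standard_normal((2000, 3))
embedded = latent @ rng.standard_normal((3, 10))
print(twonn_intrinsic_dimension(embedded))  # close to 3
```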
Evolutionary computation-based feature selection for finding a stable set of features in high-dimensional data
Evolutionary Computation (EC) algorithms have proved to work well for feature selection because they are powerful search techniques and can produce multiple good solutions. However, they suffer from some limitations in real-world applications. Firstly, ECs require high computation time as they evaluate many solutions at each iteration. Secondly, a classifier is usually used as their fitness function, which causes the selected subset to perform well only on the utilised classifier (i.e. classifier bias). Lastly, ECs, as stochastic search methods, return a different final subset in different runs, which poses a problem for finding a stable set of features (i.e. the stability issue). To address the computation time and classifier-bias limitations, this thesis proposes a new two-stage selection approach called filter/filter, in which two filter feature selection algorithms are combined. In the first stage, a ranking algorithm forms a reduced dataset by selecting the most informative features from the original dataset. In the second stage, the reduced dataset is fed to a novel EC algorithm to select the final feature subset. This new EC algorithm is a Tabu search hybridised with an Asexual Genetic Algorithm, called TAGA. TAGA benefits from new search components and a solution representation that can effectively reduce computation time. To select a classifier-unbiased final subset, a statistical criterion is used as the fitness function, which evaluates the subset independently of any classifier. Experiments show that the proposed filter/filter approach requires an acceptable computation time and selects more classifier-unbiased features compared to state-of-the-art methods. To find a stable set of features, a novel Generalisation Power Index (GPI) is proposed to analyse the generalisation power of the final subsets of an EC over several runs. Generalisation power refers to the performance capability of a subset over a wide range of classifiers. Computational results confirm that GPI is able to find a stable set of features which achieves near-optimal accuracy when used to train various classifiers. To examine the suitability of the proposed methods for real-world applications, the filter/filter approach and GPI are integrated to select a stable set of features for the METABRIC breast cancer subtype classification problem. Experimental results show that this integration not only addresses the limitations of ECs for a real-world biomedical feature selection problem but also performs better than alternative methods
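TAGA and GPI are the thesis's own contributions, but the two-stage filter/filter skeleton can be sketched generically: a ranking filter prunes the feature space, then a classifier-independent search refines the subset. In this hedged Python sketch, mutual information stands in for both filter criteria and a plain greedy climber stands in for the evolutionary stage:

```python
# Generic two-stage filter/filter sketch. Mutual information and the
# greedy second stage are stand-ins, not the thesis's TAGA algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=8, random_state=0)

# Stage 1: ranking filter keeps the most informative features.
scores = mutual_info_classif(X, y, random_state=0)
reduced = np.argsort(scores)[::-1][:20]

# Stage 2: classifier-independent search over the reduced set; the
# fitness is a statistical criterion, not a classifier's accuracy.
def fitness(subset):
    """Summed relevance, penalized by size (0.02 is an arbitrary weight)."""
    return scores[list(subset)].sum() - 0.02 * len(subset)

subset = set(reduced[:5])
for f in reduced[5:]:
    if fitness(subset | {f}) > fitness(subset):
        subset.add(f)
print(sorted(subset))
```

Because the fitness never consults a classifier, the selected subset is not tuned to any one model, which is the thesis's route around classifier bias.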
The Minimum Description Length Principle for Pattern Mining: A Survey
This is about the Minimum Description Length (MDL) principle applied to
pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of
efficient enumeration, the selection of patterns constitutes a major challenge.
The MDL principle, a model selection method grounded in information theory, has
been applied to pattern mining with the aim of obtaining compact high-quality sets
of patterns. After giving an outline of relevant concepts from information
theory and coding, as well as of work on the theory behind the MDL and similar
principles, we review MDL-based methods for mining various types of data and
patterns. Finally, we open a discussion on some issues regarding these methods,
and highlight currently active related data analysis problems
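A worked miniature helps fix ideas. In Krimp-style pattern set mining, one of the best-known MDL approaches in this line of work, a candidate pattern enters the code table only if it shortens L(CT) + L(D | CT). A deliberately simplified sketch (the greedy cover and the crude per-item table cost are assumptions of this illustration, not any particular paper's encoding):

```python
# Toy two-part MDL score for itemset mining: encoded-data bits plus a
# crude code-table cost; a pattern is kept only if the total shrinks.
import math
from collections import Counter

def total_length(transactions, code_table):
    """Greedily cover each transaction, then price usages with Shannon codes."""
    usage = Counter()
    for t in transactions:
        remaining = set(t)
        for pattern in code_table:            # assumed ordered longest-first
            if pattern <= remaining:
                usage[pattern] += 1
                remaining -= pattern
        for item in remaining:                # fall back to singleton codes
            usage[frozenset([item])] += 1
    total = sum(usage.values())
    data_bits = sum(-u * math.log2(u / total) for u in usage.values())
    table_bits = sum(8 * len(p) for p in usage)   # crude per-item cost
    return data_bits + table_bits

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
base = total_length(transactions, [])
with_ab = total_length(transactions, [frozenset({"a", "b"})])
print(with_ab < base)   # True: the pattern {a, b} pays for itself
```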
An MDL-based wavelet scattering features selection for signal classification
Wavelet scattering is a redundant time-frequency transform that was shown to be a
powerful tool in signal classification. It shares the convolutional architecture with convolutional
neural networks, but it offers some advantages, including faster training and the ability to work with small training sets.
However, it introduces some redundancy along the frequency axis, especially for filters that have
a high degree of overlap. This naturally leads to a need for dimensionality reduction to further
increase its efficiency as a machine learning tool. In this paper, the Minimum Description Length (MDL)
principle is used to define an automatic procedure for optimizing the selection of the scattering features, even
in the frequency domain. The proposed study is limited to the class of uniform sampling models.
Experimental results show that the proposed method is able to automatically select the optimal
sampling step that guarantees the highest classification accuracy for fixed transform parameters,
when applied to audio/sound signals
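Since the study is limited to uniform sampling models, the selection loop reduces to scoring each candidate sampling step by a description length and keeping the minimum. A very rough generic sketch of that loop (the 12-bit per-coefficient budget and Gaussian residual code are placeholder assumptions, not the paper's coding scheme):

```python
# Score each uniform sampling step by a two-part code: bits for the kept
# coefficients plus bits for the reconstruction residual. All constants
# here are illustrative placeholders.
import math
import numpy as np

def mdl_for_step(features, step):
    kept = features[::step]
    model_bits = 12 * kept.size                    # fixed budget per value
    recon = np.repeat(kept, step)[: features.size] # piecewise-constant fill
    rss = float(np.sum((features - recon) ** 2))
    data_bits = 0.5 * features.size * math.log2(1 + rss / features.size)
    return model_bits + data_bits

rng = np.random.default_rng(1)
features = np.abs(np.fft.rfft(rng.standard_normal(512)))
best = min(range(1, 9), key=lambda s: mdl_for_step(features, s))
print(best)   # sampling step with the shortest total description
```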
MDL4BMF: Minimum Description Length for Boolean Matrix Factorization
Matrix factorizations, where a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the "model order selection problem" of determining where fine-grained structure stops and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF), where data, factors, and matrix product are Boolean, has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior
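In outline, the scoring compares factorizations of different rank k by the bits needed for both factor matrices plus the bits for the error matrix that patches the Boolean product back to the data; the rank with the shortest total wins. A simplified sketch using naive Bernoulli codes in place of the paper's data-to-model encoding:

```python
# Score a Boolean factorization by L(B) + L(C) + L(D xor (B o C)), with
# naive Bernoulli codes standing in for MDL4BMF's refined encoding.
import math
import numpy as np

def bernoulli_bits(ones, size):
    """Optimal code length for a binary matrix with `ones` set bits."""
    if ones in (0, size):
        return 0.0
    p = ones / size
    return -(ones * math.log2(p) + (size - ones) * math.log2(1 - p))

def description_length(D, B, C):
    product = (B.astype(int) @ C.astype(int)) > 0  # Boolean product
    error = D != product
    return (bernoulli_bits(int(B.sum()), B.size)
            + bernoulli_bits(int(C.sum()), C.size)
            + bernoulli_bits(int(error.sum()), D.size))

rng = np.random.default_rng(0)
B_true = rng.random((40, 2)) < 0.4                 # planted rank-2 structure
C_true = rng.random((2, 30)) < 0.4
D = ((B_true.astype(int) @ C_true.astype(int)) > 0) ^ (rng.random((40, 30)) < 0.02)
# A real BMF algorithm would fit B and C at each rank k; the random
# stand-ins here merely exercise the scoring function.
for k in (1, 2, 4):
    B, C = rng.random((40, k)) < 0.4, rng.random((k, 30)) < 0.4
    print(k, round(description_length(D, B, C), 1))
```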