A Survey on Multi-View Clustering
With advances in information acquisition technologies, multi-view data have become
ubiquitous. Multi-view learning has thus become more and more popular in
machine learning and data mining fields. Multi-view unsupervised or
semi-supervised learning, such as co-training and co-regularization, has gained
considerable attention. Although multi-view clustering (MVC) methods
have developed rapidly in recent years, there has not been a survey to summarize and
analyze the current progress. Therefore, this paper reviews the common
strategies for combining multiple views of data, and based on this summary we
propose a novel taxonomy of the MVC approaches. We further discuss the
relationships between MVC and multi-view representation, ensemble clustering,
multi-task clustering, and multi-view supervised and semi-supervised learning.
Several representative real-world applications are elaborated. To promote
future development of MVC, we envision several open problems that may require
further investigation and thorough examination. Comment: 17 pages, 4 figures
A review of heterogeneous data mining for brain disorders
With rapid advances in neuroimaging techniques, the research on brain
disorder identification has become an emerging area in the data mining
community. Brain disorder data poses many unique challenges for data mining
research. For example, the raw data generated by neuroimaging experiments is in
tensor representations, with typical characteristics of high dimensionality,
structural complexity and nonlinear separability. Furthermore, brain
connectivity networks can be constructed from the tensor data, embedding subtle
interactions between brain regions. Other clinical measures are usually
available reflecting the disease status from different perspectives. It is
expected that integrating complementary information in the tensor data and the
brain network data, and incorporating other clinical parameters will be
potentially transformative for investigating disease mechanisms and for
informing therapeutic interventions. Many research efforts have been devoted to
this area. They have achieved great success in various applications, such as
tensor-based modeling, subgraph pattern mining, and multi-view feature analysis. In
this paper, we review some recent data mining methods that are used for
analyzing brain disorders.
A feature construction framework based on outlier detection and discriminative pattern mining
No matter the expressive power and sophistication of supervised learning
algorithms, their effectiveness is restricted by the features describing the
data. This is not a new insight in ML and many methods for feature selection,
transformation, and construction have been developed. But while this is
on-going for general techniques for feature selection and transformation, i.e.
dimensionality reduction, work on feature construction, i.e. enriching the
data, is by now mainly the domain of image, particularly character,
recognition, and NLP.
In this work, we propose a new general framework for feature construction.
The need for feature construction in a data set is indicated by class outliers,
and discriminative pattern mining is used to derive features on their
k-neighborhoods. We instantiate the framework with LOF and C4.5-Rules, and
evaluate the usefulness of the derived features on a diverse collection of UCI
data sets. The derived features are more often useful than ones derived by
DC-Fringe, and our approach is much less likely to overfit. But while a weak
learner, Naive Bayes, benefits strongly from the feature construction, the
effect is less pronounced for C4.5, and almost vanishes for an SVM learner.
Keywords: feature construction, classification, outlier detection
Discriminative Subnetworks with Regularized Spectral Learning for Global-state Network Data
Data mining practitioners are facing challenges from data with network
structure. In this paper, we address a specific class of global-state networks,
which comprise a set of network instances sharing a similar structure yet
having different values at local nodes. Each instance is associated with a
global state which indicates the occurrence of an event. The objective is to
uncover a small set of discriminative subnetworks that can optimally classify
global network values. Unlike most existing studies which explore an
exponential subnetwork space, we address this difficult problem by adopting a
space transformation approach. Specifically, we present an algorithm that
optimizes a constrained dual-objective function to learn a low-dimensional
subspace that is capable of discriminating networks labelled by different
global states, while reconciling with the common network topology shared across
instances. Our algorithm takes an appealing approach from spectral graph
learning and we show that the globally optimum solution can be achieved via
matrix eigen-decomposition. Comment: manuscript for the ECML 2014 paper
Association Analysis Techniques for Bioinformatics Problems
Association analysis is one of the most popular analysis paradigms in data mining. Despite the solid foundation of association analysis and its potential applications, this group of techniques is not as widely used as classification and clustering, especially in the domain of bioinformatics and computational biology. In this paper, we present different types of association patterns and discuss some of their applications in bioinformatics. We present a case study showing the usefulness of association analysis-based techniques for pre-processing protein interaction networks for the task of protein function prediction. Finally, we discuss some of the challenges that need to be addressed to make association analysis-based techniques more applicable for a number of interesting problems in bioinformatics.
Feature Selection: A Data Perspective
Feature selection, as a data preprocessing strategy, has been proven to be
effective and efficient in preparing data (especially high-dimensional data)
for various data mining and machine learning problems. The objectives of
feature selection include: building simpler and more comprehensible models,
improving data mining performance, and preparing clean, understandable data.
The recent proliferation of big data has presented some substantial challenges
and opportunities to feature selection. In this survey, we provide a
comprehensive and structured overview of recent advances in feature selection
research. Motivated by current challenges and opportunities in the era of big
data, we revisit feature selection research from a data perspective and review
representative feature selection algorithms for conventional data, structured
data, heterogeneous data and streaming data. Methodologically, to emphasize the
differences and similarities of most existing feature selection algorithms for
conventional data, we categorize them into four main groups: similarity based,
information theoretical based, sparse learning based and statistical based
methods. To facilitate and promote the research in this community, we also
present an open-source feature selection repository that consists of most of
the popular feature selection algorithms
(http://featureselection.asu.edu/). We also use it as an example to show
how to evaluate feature selection algorithms. At the end of the survey, we
present a discussion about some open problems and challenges that require more
attention in future research.
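As a rough illustration of the "information theoretical based" group named in the abstract above, the following sketch ranks discrete features by their mutual information with the class label. It is not taken from the survey or its repository: the function names `mutual_information` and `rank_features`, the dictionary-of-columns input, and the restriction to discrete values are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # I(X;Y) = sum over observed (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def rank_features(columns, labels):
    """Return feature names sorted from most to least informative about labels.

    `columns` maps a (hypothetical) feature name to its list of discrete values.
    """
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    columns = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}
    print(rank_features(columns, labels))
```

Scoring each feature independently, as here, ignores redundancy between features; that is a simplification chosen for brevity, not a recommendation.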
Salient Object Detection: A Distinctive Feature Integration Model
We propose a novel method for salient object detection in different images.
Our method integrates spatial features for efficient and robust representation
to capture meaningful information about the salient objects. We then train a
conditional random field (CRF) using the integrated features. The trained CRF
model is then used to detect salient objects during the online testing stage.
We perform experiments on two standard datasets and compare the performance of
our method with different reference methods. Our experiments show that our
method outperforms the compared methods in terms of precision, recall, and
F-Measure.
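The three evaluation measures mentioned in the abstract above can be computed from a predicted binary mask and a ground-truth mask as in the following sketch. The abstract does not say how the F-Measure is weighted, so the `beta2` parameter (the squared beta of the F-beta score, defaulting to 1) and the flat-list mask format are assumptions made for this example.

```python
def precision_recall_f(pred, truth, beta2=1.0):
    """Precision, recall, and F-measure for two equal-length binary masks.

    `beta2` is beta squared in the F-beta formula; its default of 1.0 is an
    assumption, since the weighting used in the paper is not stated.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta2 * precision + recall
    f_measure = (1 + beta2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(precision_recall_f(predicted, ground_truth))
```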
Combining complex networks and data mining: why and how
The increasing power of computer technology does not dispense with the need
to extract meaningful information out of data sets of ever growing size, and
indeed typically exacerbates the complexity of this task. To tackle this
general problem, two methods have emerged, at chronologically different times,
that are now commonly used in the scientific community: data mining and complex
network theory. Not only do complex network analysis and data mining share the
same general goal, that of extracting information from complex systems to
ultimately create a new compact quantifiable representation, but they also
often address similar problems. In the face of that, a surprisingly low
number of researchers turn out to resort to both methodologies. One may then be
tempted to conclude that these two fields are either largely redundant or
totally antithetic. The starting point of this review is that this state of
affairs should be put down to contingent rather than conceptual differences,
and that these two fields can in fact advantageously be used in a synergistic
manner. An overview of both fields is first provided, some fundamental concepts
of which are illustrated. A variety of contexts in which complex network theory
and data mining have been used in a synergistic manner are then presented.
Contexts in which the appropriate integration of complex network metrics can
lead to improved classification rates with respect to classical data mining
algorithms and, conversely, contexts in which data mining can be used to tackle
important issues in complex network theory applications are illustrated.
Finally, ways to achieve a tighter integration between complex networks and
data mining, and open lines of research are discussed. Comment: 58 pages, 19 figures
Convex Formulation of Multiple Instance Learning from Positive and Unlabeled Bags
Multiple instance learning (MIL) is a variation of traditional supervised
learning problems where data (referred to as bags) are composed of sub-elements
(referred to as instances) and only bag labels are available. MIL has a variety
of applications such as content-based image retrieval, text categorization and
medical diagnosis. Most previous work on MIL assumes that the training
bags are fully labeled. However, it is often difficult to obtain a sufficient
number of labeled bags in practical situations, while many unlabeled bags are
available. A learning framework called PU learning (positive and unlabeled
learning) can address this problem. In this paper, we propose a convex PU
learning method to solve an MIL problem. We experimentally show that the
proposed method achieves better performance with significantly lower
computational costs than an existing method for PU-MIL.
Scalable Prototype Selection by Genetic Algorithms and Hashing
Classification in the dissimilarity space has become a very active research
area since it provides a possibility to learn from data given in the form of
pairwise non-metric dissimilarities, which otherwise would be difficult to cope
with. The selection of prototypes is a key step for the further creation of the
space. However, despite previous efforts to find good prototypes, how to select
the best representation set remains an open issue. In this paper we propose
scalable methods to select the set of prototypes out of very large datasets.
The methods are based on genetic algorithms, dissimilarity-based hashing, and
two different unsupervised and supervised scalable criteria. The unsupervised
criterion is based on the Minimum Spanning Tree of the graph created by the
prototypes as nodes and the dissimilarities as edges. The supervised criterion
is based on counting matching labels of objects and their closest prototypes.
The suitability of this type of algorithm is analyzed for the specific case
of dissimilarity representations. The experimental results show that the
methods select good prototypes taking advantage of the large datasets, and they
do so at low runtimes. Comment: 26 pages, 8 figures
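The supervised criterion described in the abstract above (counting matching labels of objects and their closest prototypes) might look like the following minimal sketch. It is not the authors' implementation: the function name `matching_label_score`, the full dissimilarity matrix as a list of lists, and the choice to count the prototypes themselves among the objects are assumptions made for this example.

```python
def matching_label_score(dissim, labels, prototypes):
    """Count objects whose label equals the label of their closest prototype.

    dissim     -- square matrix; dissim[i][j] is the dissimilarity of i and j
    labels     -- labels[i] is the class label of object i
    prototypes -- indices of the candidate prototype set (hypothetical input)
    """
    if not prototypes:
        raise ValueError("need at least one prototype")
    matches = 0
    for i, row in enumerate(dissim):
        # Closest prototype for object i; ties go to the earliest listed index.
        nearest = min(prototypes, key=lambda p: row[p])
        if labels[nearest] == labels[i]:
            matches += 1
    return matches


if __name__ == "__main__":
    dissim = [
        [0, 1, 5, 6],
        [1, 0, 4, 5],
        [5, 4, 0, 1],
        [6, 5, 1, 0],
    ]
    labels = ["a", "a", "b", "b"]
    print(matching_label_score(dissim, labels, [0, 2]))
    print(matching_label_score(dissim, labels, [0, 1]))
```

A search procedure such as a genetic algorithm could use a score like this as its fitness function, with higher counts indicating a prototype set that better reflects the class structure. Nothing here requires the dissimilarities to be metric, only comparable.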