A review of associative classification mining
Associative classification mining is a promising approach in data mining that utilizes association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted in this paper.
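The shared rule-discovery / rule-ranking / prediction pipeline can be illustrated with a deliberately minimal sketch. Single-item rules, the support/confidence thresholds, and the confidence-then-support ordering below are illustrative assumptions (roughly CBA-style), not any specific algorithm from the survey:

```python
from collections import Counter

def mine_rules(rows, min_sup=2, min_conf=0.6):
    """Mine single-item class association rules item -> label."""
    item_class, item_count = Counter(), Counter()
    for items, lbl in rows:
        for it in items:
            item_class[(it, lbl)] += 1
            item_count[it] += 1
    rules = []
    for (it, lbl), sup in item_class.items():
        conf = sup / item_count[it]
        if sup >= min_sup and conf >= min_conf:
            rules.append((it, lbl, sup, conf))
    # rank: higher confidence first, then higher support
    rules.sort(key=lambda r: (-r[3], -r[2]))
    return rules

def predict(rules, items, default=None):
    # first matching rule in the ranking wins
    for it, lbl, _, _ in rules:
        if it in items:
            return lbl
    return default

rows = [({'a', 'b'}, 'x'), ({'a'}, 'x'), ({'b', 'c'}, 'y'), ({'c'}, 'y')]
rules = mine_rules(rows)
```

Here `predict(rules, {'a'})` yields `'x'` and `predict(rules, {'c'})` yields `'y'`; real associative classifiers add multi-item rule discovery and pruning on top of this skeleton.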
Novelty Detection in MultiClass Scenarios with Incomplete Set of Class Labels
We address the problem of novelty detection in multiclass scenarios where some class labels are missing from the training set. Our method is based on the initial assignment of confidence values, which measure the affinity between a new test point and each known class. We first compare the values of the two top elements in this vector of confidence values. At the heart of our method lies the training of an ensemble of classifiers, each trained to discriminate known from novel classes based on some partition of the training data into presumed-known and presumed-novel classes. Our final novelty score is derived from the output of this ensemble of classifiers. We evaluated our method on two datasets of images containing a relatively large number of classes, Caltech-256 and CIFAR-100. We compared our method to four alternative methods representing commonly used approaches: the one-class SVM, novelty based on k-NN, novelty based on maximal confidence, and the recent KNFST method. The results show a clear and marked advantage for our method over all alternative methods in an experimental setup where class labels are missing during training.
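A toy illustration of the ensemble idea, assuming nearest-centroid discriminators and leave-one-class-out partitions (the paper's actual classifiers, confidence values, and partitioning scheme may differ):

```python
import itertools
import math

def centroid(pts):
    return [sum(c) / len(pts) for c in zip(*pts)]

def novelty_score(train, x, holdout_size=1):
    """Average 'novel-looking' score over all presumed-known/presumed-novel splits.

    train: dict mapping class label -> list of points.
    Each ensemble member holds out `holdout_size` classes as presumed-novel and
    scores x by how far it sits from the presumed-known centroids.
    """
    labels = sorted(train)
    scores = []
    for novel_set in itertools.combinations(labels, holdout_size):
        known_cents = [centroid(train[l]) for l in labels if l not in novel_set]
        novel_cents = [centroid(train[l]) for l in novel_set]
        d_known = min(math.dist(x, c) for c in known_cents)
        d_novel = min(math.dist(x, c) for c in novel_cents)
        scores.append(d_known / (d_known + d_novel))
    return sum(scores) / len(scores)

# three well-separated known classes
train = {'a': [(0, 0), (0.2, 0)],
         'b': [(10, 0), (10.2, 0)],
         'c': [(0, 10), (0.2, 10)]}
```

A point near class 'a', such as `(0.1, 0.1)`, receives a lower score than a point like `(10, 10)` that resembles none of the known classes.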
Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data
Most positive and unlabeled data is subject to selection biases. The labeled examples can, for example, be selected from the positive set because they are easier to obtain or more obviously positive. This paper investigates how learning can be enabled in this setting. We propose and theoretically analyze an empirical-risk-based method for incorporating the labeling mechanism. Additionally, we investigate under which assumptions learning is possible when the labeling mechanism is not fully understood, and propose a practical method to enable this. Our empirical analysis supports the theoretical results and shows that taking into account the possibility of a selection bias, even when the labeling mechanism is unknown, improves the trained classifiers.
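One standard way to fold a known labeling mechanism into the empirical risk is inverse-propensity weighting, common in biased positive-unlabeled learning. The sketch below is a generic version of that estimator, not necessarily the paper's exact construction; the constant propensity and squared loss in the usage line are illustrative:

```python
def propensity_weighted_risk(examples, propensity, loss, f):
    """Empirical PU risk when the labeling propensity e(x) is known.

    examples: pairs (x, s), where s=1 means labeled (hence positive)
    and s=0 means unlabeled (could be positive or negative).
    """
    total = 0.0
    for x, s in examples:
        if s == 1:
            w = 1.0 / propensity(x)  # inverse-propensity weight
            # one labeled example stands in for w positives and (1 - w) negatives
            total += w * loss(f(x), 1) + (1 - w) * loss(f(x), 0)
        else:
            total += loss(f(x), 0)  # unlabeled examples treated as negative
    return total / len(examples)

# illustrative usage: squared loss, constant propensity 0.5, constant classifier
sq = lambda p, y: (p - y) ** 2
risk = propensity_weighted_risk([(0, 1), (1, 0)], lambda x: 0.5, sq, lambda x: 1.0)
```

The weighting makes the labeled positives compensate for the positives hidden among the unlabeled examples, which is what removes the selection bias in expectation.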
Target contrastive pessimistic risk for robust domain adaptation
In domain adaptation, a classifier trained on a source domain is adapted to generalize to a target domain. However, an adaptive classifier can perform worse than a non-adaptive one due to invalid assumptions, increased sensitivity to estimation errors, or model misspecification. Our goal is to develop a domain-adaptive classifier that is robust in the sense that it does not rely on restrictive assumptions about how the source and target domains relate to each other, and that it does not perform worse than the non-adaptive classifier. We formulate a conservative parameter estimator that only deviates from the source classifier when a lower risk is guaranteed for all possible labellings of the given target samples. We derive the classical least-squares and discriminant analysis cases and show that these perform on par with state-of-the-art domain-adaptive classifiers in sample selection bias settings, while outperforming them in more general domain adaptation settings.
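The "only deviate when improvement is guaranteed" rule can be made concrete with a brute-force toy for one-dimensional least squares, enumerating all hard labelings (only feasible for a handful of target samples; the actual estimator optimizes over soft labelings):

```python
import itertools

def sq_risk(w, X, y):
    # mean squared error of the linear predictor w * x
    return sum((w * x - yi) ** 2 for x, yi in zip(X, y)) / len(X)

def tcpr_select(w_source, w_candidate, X_target, labels=(0.0, 1.0)):
    """Adopt the candidate only if it beats the source under EVERY labeling."""
    for y in itertools.product(labels, repeat=len(X_target)):
        if sq_risk(w_candidate, X_target, y) >= sq_risk(w_source, X_target, y):
            return w_source  # no guaranteed improvement: stay conservative
    return w_candidate
```

For example, `tcpr_select(5.0, 0.5, [1.0, 1.0])` returns `0.5`, since the smaller weight has lower risk under all four labelings, while `tcpr_select(0.5, 0.6, [1.0])` keeps the source weight `0.5` because the all-zeros labeling favors the source.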
Multi-Label Learning with Global and Local Label Correlation
It is well known that exploiting label correlations is important to multi-label learning. Existing approaches either assume that the label correlations are global and shared by all instances, or that they are local and shared only by a data subset. In real-world applications, both cases may occur: some label correlations are globally applicable and some are shared only within a local group of instances. Moreover, it is also common that only partial labels are observed, which makes the exploitation of label correlations much more difficult, since it is hard to estimate the label correlations when many labels are absent. In this paper, we propose a new multi-label approach, GLOCAL, which deals with both the full-label and the missing-label cases, exploiting global and local label correlations simultaneously by learning a latent label representation and optimizing label manifolds. Extensive experimental studies validate the effectiveness of our approach on both full-label and missing-label data.
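The "latent label representation" ingredient can be approximated by a plain low-rank factorization of the label matrix fitted only on observed entries. This is a sketch of the general idea only; GLOCAL additionally optimizes label manifolds and local correlations, which are omitted here:

```python
import random

def sq_loss(Y, U, V, observed):
    k = len(V)
    return sum((sum(U[i][r] * V[r][j] for r in range(k)) - Y[i][j]) ** 2
               for i, j in observed)

def lowrank_complete(Y, observed, k=2, steps=500, lr=0.05, seed=0):
    """Fit Y[i][j] ~ sum_r U[i][r] * V[r][j] using only observed (i, j) pairs."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        for i, j in observed:  # SGD on the squared error of each observed cell
            err = sum(U[i][r] * V[r][j] for r in range(k)) - Y[i][j]
            for r in range(k):
                u, v = U[i][r], V[r][j]
                U[i][r] -= lr * err * v
                V[r][j] -= lr * err * u
    return U, V

# toy label matrix (+1/-1) with one missing cell at (2, 2)
Y = [[1, -1, 1], [-1, 1, -1], [1, -1, 0]]
observed = [(i, j) for i in range(3) for j in range(3) if (i, j) != (2, 2)]
U, V = lowrank_complete(Y, observed)
```

The learned factors reconstruct the observed labels and yield a prediction for the missing cell via the same inner product.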
Classifier Selection with Permutation Tests
This work presents a content-based recommender system for machine learning classifier algorithms. Given a new data set, a recommendation of which classifier is likely to perform best is made based on classifier performance over similar known data sets. This similarity is measured according to a data set characterization that includes several state-of-the-art metrics taking into account physical structure, statistics, and information theory. A novelty with respect to prior work is the use of a robust approach based on permutation tests to directly assess whether a given learning algorithm is able to exploit the attributes in a data set to predict class labels, and to compare it to the more commonly used F-score metric for evaluating classifier performance. To evaluate our approach, we have conducted extensive experimentation including 8 of the main machine learning classification methods with varying configurations and 65 binary data sets, leading to over 2,331 experiments. Our results show that using the information from the permutation test clearly improves the quality of the recommendations.
Comment: 20th International Conference of the Catalan Association for Artificial Intelligence (CCIA 2017)
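The permutation-test idea — shuffle the class labels, retrain, and ask how often the shuffled score matches the real one — can be sketched as follows. The nearest-centroid learner and the two-group data are placeholders, not the paper's setup:

```python
import random

def nearest_centroid_score(X, y):
    """Train a 1-D nearest-centroid classifier; return its training accuracy."""
    c0 = sum(x for x, l in zip(X, y) if l == 0) / max(1, y.count(0))
    c1 = sum(x for x, l in zip(X, y) if l == 1) / max(1, y.count(1))
    pred = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in X]
    return sum(p == l for p, l in zip(pred, y)) / len(y)

def permutation_pvalue(score, X, y, n_perm=200, seed=0):
    """Fraction of label shuffles scoring at least as well as the real labels."""
    rng = random.Random(seed)
    observed = score(X, y)
    hits = sum(score(X, rng.sample(y, len(y))) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)  # add-one correction

X = [0.0] * 10 + [1.0] * 10   # two well-separated groups
y = [0] * 10 + [1] * 10       # labels aligned with the groups
p = permutation_pvalue(nearest_centroid_score, X, y)
```

A small p-value indicates the learner genuinely exploits the attributes rather than fitting label noise; on randomly labeled data the same test yields large p-values.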
A Comparative Study for Predicting Heart Diseases Using Data Mining Classification Methods
Improving the precision of heart disease detection has been investigated by many researchers in the literature. Such improvement is motivated by overwhelming health care expenditures and erroneous diagnoses. As a result, various methodologies have been proposed to analyze disease factors, aiming to decrease variation in physicians' practice and to reduce medical costs and errors. In this paper, our main motivation is to develop an effective intelligent medical decision support system based on data mining techniques. In this context, five data mining classification algorithms (Naïve Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine) have been applied to large datasets to assess and analyze the risk factors statistically related to heart diseases and to compare the performance of the implemented classifiers. To underscore the practical viability of our approach, the selected classifiers have been implemented in MATLAB with two datasets. Results of the conducted experiments showed that all classification algorithms are predictive and give relatively accurate answers. However, the decision tree outperforms the other classifiers with an accuracy rate of 99.0%, followed by Random Forest. This is because both share essentially the same mechanism, although Random Forest builds an ensemble of decision trees. Although ensemble learning has been shown to produce superior results, in our case the decision tree outperformed its ensemble version.
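A comparison harness of the kind described — train several classifiers on the same split and rank them by held-out accuracy — reduces to a few lines. The two toy classifiers below (a majority-vote baseline and a one-feature decision stump standing in for a tree) and the synthetic "risk factor" data are illustrative only, not the paper's MATLAB setup:

```python
def majority(train):
    """Baseline: always predict the most frequent training label."""
    labels = [l for _, l in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def stump(train):
    """One-split 'decision tree': best threshold on the single feature."""
    best = None
    for t, _ in train:
        for lo, hi in ((0, 1), (1, 0)):
            acc = sum((lo if x <= t else hi) == l for x, l in train)
            if best is None or acc > best[0]:
                best = (acc, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi

def evaluate(fit, train, test):
    clf = fit(train)
    return sum(clf(x) == l for x, l in test) / len(test)

# toy data: one numeric risk factor, binary disease label
train = [(1, 0), (2, 0), (3, 0), (4, 0), (7, 1), (8, 1), (9, 1)]
test = [(2.5, 0), (3.5, 0), (7.5, 1), (8.5, 1)]
ranking = sorted(((evaluate(f, train, test), f.__name__)
                  for f in (majority, stump)), reverse=True)
```

On this separable data the stump scores 1.0 and tops the ranking, mirroring how the paper ranks its five classifiers by accuracy.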
BoostClean: Automated Error Detection and Repair for Machine Learning
Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of sources, each susceptible to different types of inconsistencies, and as new data stream in at prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies is domain value violations, which occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains. Our evaluation on a collection of 12 datasets from Kaggle, the UCI repository, real-world data analyses, and production datasets shows that BoostClean can increase absolute prediction accuracy by up to 9% over the best non-ensembled alternatives. Our optimizations, including parallelism, materialization, and indexing techniques, show a 22.2x end-to-end speedup on a 16-core machine.
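The selection loop at the core of this approach — greedily add the (detector, repair) pair that most improves accuracy on cleanly labeled validation data — can be sketched with one hand-written pair and a nearest-mean model. The sentinel-value violation, the mean-imputation repair, and the model are all stand-ins for the paper's library:

```python
def fit_nearest_mean(rows):
    """rows: list of (value, label); returns label -> mean value."""
    means = {}
    for lbl in {l for _, l in rows}:
        vals = [v for v, l in rows if l == lbl]
        means[lbl] = sum(vals) / len(vals)
    return means

def accuracy(means, rows):
    correct = sum(1 for v, l in rows
                  if min(means, key=lambda m: abs(v - means[m])) == l)
    return correct / len(rows)

# one (detector, repair) pair: flag sentinel values, impute the mean of valid ones
def detect_negative(rows):
    return {i for i, (v, _) in enumerate(rows) if v < 0}

def repair_mean(rows, flagged):
    valid = [v for i, (v, _) in enumerate(rows) if i not in flagged]
    mean = sum(valid) / len(valid)
    return [(mean if i in flagged else v, l) for i, (v, l) in enumerate(rows)]

def apply_pairs(pairs, rows):
    for detect, repair in pairs:
        rows = repair(rows, detect(rows))
    return rows

def boostclean_select(pairs, train, val, rounds=3):
    """Greedily keep the repairs that raise validation accuracy."""
    chosen, best_acc = [], accuracy(fit_nearest_mean(train), val)
    for _ in range(rounds):
        best_pair = None
        for pair in pairs:
            if pair in chosen:
                continue
            cleaned = apply_pairs(chosen + [pair], train)
            acc = accuracy(fit_nearest_mean(cleaned), val)
            if acc > best_acc:
                best_acc, best_pair = acc, pair
        if best_pair is None:
            break
        chosen.append(best_pair)
    return chosen, best_acc

# dirty training set: two class-1 ages recorded as the sentinel -1
train = [(20, 0), (22, 0), (24, 0), (60, 1), (-1, 1), (-1, 1)]
val = [(21, 0), (23, 0), (61, 1), (63, 1)]
chosen, acc = boostclean_select([(detect_negative, repair_mean)], train, val)
```

The sentinel values drag the class-1 mean below the class-0 mean, so the raw model scores 0.5 on validation; the selected repair restores a sensible decision boundary and the cleaned model reaches 1.0.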
Negative Link Prediction in Social Media
Signed network analysis has attracted increasing attention in recent years.
This is in part because research on signed network analysis suggests that
negative links have added value in the analytical process. A major impediment
in their effective use is that most social media sites do not enable users to
specify them explicitly. In other words, a gap exists between the importance of
negative links and their availability in real data sets. Therefore, it is
natural to explore whether one can predict negative links automatically from
the commonly available social network data. In this paper, we investigate the
novel problem of negative link prediction with only positive links and
content-centric interactions in social media. We make a number of important
observations about negative links, and propose a principled framework NeLP,
which can exploit positive links and content-centric interactions to predict
negative links. Our experimental results on real-world social networks
demonstrate that the proposed NeLP framework can accurately predict negative
links with positive links and content-centric interactions. Our detailed
experiments also illustrate the relative importance of various factors to the
effectiveness of the proposed framework.