Combining Classification and Clustering for Tweet Sentiment Analysis
The goal of sentiment analysis is to determine the opinions, emotions, and attitudes presented in source material. In tweet sentiment analysis, opinions in messages can typically be categorized as positive or negative. To classify them, researchers have been using traditional classifiers such as Naive Bayes, Maximum Entropy, and Support Vector Machines (SVM). In this paper, we show that an SVM classifier combined with a cluster ensemble can offer better classification accuracy than a stand-alone SVM. In our study, we employed an algorithm, named C³E-SL, capable of combining classifier and cluster ensembles. This algorithm can refine tweet classifications using additional information provided by clusterers, under the assumption that similar instances from the same clusters are more likely to share the same class label. The resulting classifier has been shown to be competitive with the best results found so far in the literature, suggesting that the studied approach is promising for tweet sentiment classification. Funding: Capes (Proc. DS-7253238/D), CNPq (Proc. 303348/2013-5), FAPESP (Proc. 2013/07375-0 and 2010/20830-0).
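The refinement idea above can be sketched roughly as follows. This is a simplified illustration of the principle only, not the actual C³E-SL optimization: a hypothetical blend weight `alpha` pulls each SVM class distribution toward the average distribution of the instances that frequently co-cluster with it, and the toy data, cluster counts, and weight are all placeholder assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Toy data standing in for tweet feature vectors (hypothetical).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, y_tr, X_te = X[:100], y[:100], X[100:]

# 1) Base classifier: an SVM with probability estimates.
clf = SVC(probability=True, random_state=0).fit(X_tr, y_tr)
pi = clf.predict_proba(X_te)           # initial class distributions

# 2) Cluster ensemble on the test set: average co-association matrix.
S = np.zeros((len(X_te), len(X_te)))
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=k).fit_predict(X_te)
    S += (labels[:, None] == labels[None, :]).astype(float)
S /= 4.0

# 3) Refinement: blend each instance's distribution with the average
#    distribution of instances that often share its clusters.
alpha = 0.5                            # blend weight (assumption)
neighbor_avg = (S @ pi) / S.sum(axis=1, keepdims=True)
refined = (1 - alpha) * pi + alpha * neighbor_avg
pred = refined.argmax(axis=1)
```

The refined rows remain valid probability distributions because both blended terms are themselves distributions.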
Knowledge transfer using latent variable models
In several applications, scarcity of labeled data is a challenging problem that hinders the predictive capabilities of machine learning algorithms. Additionally, the distribution of the data changes over time, rendering models trained with older data less capable of discovering useful structure from the newly available data. Transfer learning is a convenient framework to overcome such problems, where the learning of a model specific to a domain can benefit the learning of other models in other domains, through either simultaneous training of domains or sequential transfer of knowledge from one domain to the others. This thesis explores the opportunities of knowledge transfer in the context of a few applications pertaining to object recognition from images, text analysis, network modeling, and recommender systems, using probabilistic latent variable models as building blocks. Both simultaneous and sequential knowledge transfer are achieved through the latent variables, either by sharing these across multiple related domains (for simultaneous learning) or by adapting their distributions to fit data from a new domain (for sequential learning).
Electrical and Computer Engineering
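A minimal sketch of both transfer styles, using non-negative matrix factorization as a stand-in latent variable model (the thesis uses probabilistic latent variable models; the data and component count here are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Two related domains (hypothetical document-term count matrices).
X_source = rng.poisson(3.0, size=(100, 50)).astype(float)
X_target = rng.poisson(3.0, size=(30, 50)).astype(float)

# Simultaneous transfer: learn one set of latent components from
# both domains stacked together, so the factors are shared.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
H_shared = nmf.fit(np.vstack([X_source, X_target])).components_

# Sequential transfer: reuse the shared components to encode new
# target-domain data without re-learning the factors.
codes = nmf.transform(X_target)        # low-dimensional representation
```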
Hybrid Automated Machine Learning System for Big Data
Many machine learning (ML) models and algorithms exist, and in designing classification systems it is often a challenge to find and select the best-performing ML algorithm(s) for a dataset in a short period of time. Often, one must learn thoroughly about the dataset's structure and content, decide whether to use a supervised, semi-supervised, or unsupervised learning strategy, and then investigate, select, or design via trial and error a classification or clustering algorithm that would work most accurately for that specific dataset. This can be a time-consuming and tedious process. Additionally, a classification algorithm may not perform as well on a dataset as a clustering algorithm would. Meta-learning (learning to learn) and automated ML (autoML) are data-mining-based formalisms for modelling evolving conventional ML functions and toolkit systems. The concept of modelling a decision-tree-based combination of both formalisms as a Hybrid-AutoML toolkit extends that of traditional complex autoML systems.
In hybrid-autoML, single or multiple predictive models are built by combining a three-layered decision learning architecture for automatic learning mode and model selection, engaging formalisms for selecting from a variety of supervised or unsupervised ML algorithms and generic meta-information obtained from varying multi-datasets. The work presented in this thesis aims to study, conceptualize, design, and develop this hybrid-autoML toolkit by extending, in the simplest form, some existing methodologies for the model-training aspect of autoML systems. The theoretical and experimental development focuses on the extension of autoWeka and the use of existing meta-learning, algorithm selection, and decision tree concepts. It addresses the issues of efficient ML mode (supervised or unsupervised) and model selection for varying multi-datasets, learning-method representations of practical alternative use cases, the structuring of a layered decision ML unfolding, and algorithms for constructing the unfolding. The implementation aims to develop tools for hybrid-autoML-based model visualization and evaluation, use-case simulations, and analysis on single or multiple varying datasets. An open-source tool called hybrid-autoML has been developed to support these functionalities. Hybrid-autoML provides a user-friendly graphical interface that facilitates entry of single or multiple varying datasets, supports automatic learning mode or strategy selection and automatic model selection on single or multi-varying datasets, supports predictive testing, and allows the automatic visualization and use of a set of analytical tools for model evaluation. It is highly extensible and saves a lot of time.
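The mode-then-model selection layer described above might be sketched as follows; the candidate algorithms, scoring choices, and decision rule are simplifying assumptions for illustration, not the toolkit's actual architecture:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

def auto_select(X, y=None):
    """Pick a learning mode from label availability, then the
    best-scoring model within that mode (simplified sketch)."""
    if y is not None:                   # supervised branch
        candidates = [DecisionTreeClassifier(random_state=0), GaussianNB()]
        scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]
    else:                               # unsupervised branch
        candidates = [KMeans(n_clusters=3, n_init=10, random_state=0),
                      AgglomerativeClustering(n_clusters=3)]
        scores = [silhouette_score(X, m.fit_predict(X)) for m in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

X, y = load_iris(return_X_y=True)
model, score = auto_select(X, y)        # labels given: supervised mode
```

Passing `y=None` instead routes the same data through the clustering branch, scored by silhouette rather than cross-validated accuracy.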
Semi-supervised learning using multiple clusterings with limited labeled data
Supervised classification consists in learning a predictive model from a set of labeled samples. It is accepted that a predictive model's accuracy usually increases as more labeled samples become available. Labeled samples are generally difficult to obtain, as the labeling step is often performed manually. On the contrary, unlabeled samples are easily available. As the labeling task is tedious and time-consuming, users generally provide a very limited number of labeled objects. However, designing approaches able to work efficiently with a very limited number of labeled samples is highly challenging. In this context, semi-supervised approaches have been proposed to leverage both labeled and unlabeled data.
In this paper, we focus on cases where the number of labeled samples is very limited. We review and formalize eight semi-supervised learning algorithms and introduce a new method that combines supervised and unsupervised learning in order to use both labeled and unlabeled data. The main idea of this method is to produce new features derived from a first step of data clustering. These features are then used to enrich the description of the input data, leading to better use of the data distribution. The efficiency of all the methods is compared on various artificial and UCI datasets, and on the classification of a very high resolution remote sensing image. The experiments reveal that our method performs well, especially when the number of labeled samples is very limited. They also confirm that combining labeled and unlabeled data is very useful in pattern recognition.
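A minimal sketch of the cluster-derived-feature idea, assuming k-means as the clusterer and one-hot cluster memberships as the new features (the paper's actual choice of clusterings and derived features may differ):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=50, replace=False)  # very few labels

# Step 1: cluster ALL data (labeled + unlabeled) several times.
memberships = np.column_stack([
    KMeans(n_clusters=k, n_init=10, random_state=k).fit_predict(X)
    for k in (10, 20, 30)
])

# Step 2: enrich the original features with one-hot cluster indicators.
extra = OneHotEncoder().fit_transform(memberships).toarray()
X_aug = np.hstack([X, extra])

# Step 3: train on the few labeled samples only, using the
# enriched description of the input data.
clf = LogisticRegression(max_iter=2000).fit(X_aug[labeled], y[labeled])
acc = clf.score(X_aug, y)
```

The clustering step uses the unlabeled data's distribution, so the classifier trained on only 50 labels still sees structure learned from the full dataset.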
Heterogeneous information fusion: combination of multiple supervised and unsupervised classification methods based on belief functions
In real-life machine learning applications, a common problem is that raw data (e.g. remote sensing data) is sometimes inaccessible due to the confidentiality and privacy constraints of corporations, making supervised classification methods difficult to apply. Moreover, even when raw data is accessible, limited labeled samples can also seriously hamper supervised methods. Recently, supervised classification and unsupervised classification (clustering) results related to specific applications have been published by more and more organizations. Therefore, the combination of supervised classification and clustering results has gained increasing attention as a way to improve the accuracy of supervised predictions. Incorporating clustering results with supervised classifications at the output level can help to lessen the reliance on information at the raw-data level, which is pertinent for improving accuracy in applications where raw data is inaccessible or training samples are limited. We focus on the combination of multiple supervised classification and clustering results at the output level based on belief functions, for three purposes: (1) to improve the accuracy of classification when raw data is inaccessible or training samples are highly limited; (2) to reduce uncertain and imprecise information in the supervised results; and (3) to study how supervised classification and clustering results affect the combination at the output level. Our contributions consist of a transformation method to transfer heterogeneous information into the same frame, and an iterative fusion strategy to retain most of the trustworthy information in multiple supervised classification and clustering results.
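As one concrete building block, output-level fusion of two mass functions can be illustrated with Dempster's rule of combination. The frame and the masses below are invented, and this is not the paper's transformation method or iterative strategy, just the underlying belief-function calculus:

```python
def dempster(m1, m2):
    """Dempster's rule of combination for two mass functions,
    each given as a dict mapping frozenset -> mass."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:                       # compatible focal elements
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:                           # conflicting mass
                conflict += ma * mb
    norm = 1.0 - conflict                   # renormalize over agreement
    return {s: v / norm for s, v in combined.items()}

# One supervised output and one (already aligned) clustering output
# over the same frame {a, b}; the masses are illustrative only.
m_clf = {frozenset({"a"}): 0.7, frozenset({"b"}): 0.2,
         frozenset({"a", "b"}): 0.1}
m_clu = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.5}
fused = dempster(m_clf, m_clu)
```

When both sources favor the same singleton, the fused mass on it exceeds either input's, which is the reinforcement effect the combination exploits.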
Tree-based Density Estimation: Algorithms and Applications
Data mining can be seen as an extension of statistics. It comprises the preparation of data and the process of gathering new knowledge from it. The extraction of new knowledge is supported by various machine learning methods. Many of the algorithms are based on probabilistic principles or use density estimation for their computations. Density estimation has been practised in the field of statistics for several centuries. In the simplest case, a histogram estimator, like the simple equal-width histogram, can be used for this task and has been shown to be a practical tool for representing the distribution of data visually and for computation. Like other nonparametric approaches, it can provide a flexible solution. However, flexibility in existing approaches is generally restricted because the size of the bins is fixed: either the width of the bins or the number of values in them. Attempts have been made to generate histograms with a variable bin width and a variable number of values per interval, but the computational approaches in these methods have proven too difficult and too slow even with modern computer technology.
In this thesis, new flexible histogram estimation methods are developed and tested as part of various machine learning tasks, namely discretization, naive Bayes classification, clustering, and multiple-instance learning. Not only are the new density estimation methods applied to machine learning tasks, they also borrow design principles from algorithms that are ubiquitous in artificial intelligence: divide-and-conquer methods are a well-known way to tackle large problems by dividing them into small subproblems. Decision trees, used for machine learning classification, successfully apply this approach. This thesis presents algorithms that build density estimators using a binary split tree to cut a range of values into subranges of varying length. No class values are required for this splitting process, making it an unsupervised method. The result is a histogram estimator that adapts well even to complex density functions: a novel density estimation method with flexible density estimation ability and good computational behaviour.
Algorithms are presented for both univariate and multivariate data. The univariate histogram estimator is applied to discretization for density estimation and is also used as the density estimator inside a naive Bayes classifier. The multivariate histogram, used as the basis for a clustering method, is applied to improve the runtime behaviour of a well-known algorithm for multiple-instance classification. Performance in these applications is evaluated by comparing the new approaches with existing methods.
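A toy version of the binary split tree idea for a univariate, variable-width histogram. The median split and the stopping rules (`min_count`, `max_depth`) are illustrative assumptions; the thesis's actual algorithms differ:

```python
import numpy as np

def split_tree_histogram(values, min_count=20, depth=0, max_depth=6):
    """Recursively cut a range of values into sub-ranges at the
    median, yielding variable-width bins (simplified sketch of a
    binary split tree density estimator). No class labels are used,
    so the procedure is unsupervised."""
    lo, hi = values.min(), values.max()
    if len(values) <= min_count or depth >= max_depth or lo == hi:
        width = max(hi - lo, 1e-12)
        return [(lo, hi, len(values) / width)]  # (start, end, raw density)
    cut = np.median(values)
    left = values[values <= cut]
    right = values[values > cut]
    if len(left) == 0 or len(right) == 0:       # degenerate split: stop
        width = max(hi - lo, 1e-12)
        return [(lo, hi, len(values) / width)]
    return (split_tree_histogram(left, min_count, depth + 1, max_depth) +
            split_tree_histogram(right, min_count, depth + 1, max_depth))

rng = np.random.default_rng(0)
# Bimodal sample: the estimator should place narrow bins near modes.
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 0.5, 500)])
bins = split_tree_histogram(data)
```

Dividing each bin's raw density by the total sample count turns the output into a proper density estimate; the widths come out narrower where the data is dense, which is the adaptivity the fixed-width histogram lacks.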