904 research outputs found

    App Review Analysis via Active Learning: Reducing Supervision Effort Without Compromising Classification Accuracy

    Automated app review analysis is an important avenue for extracting a variety of requirements-related information. Typically, a first step toward performing such analysis is preparing a training dataset, where developers (experts) identify a set of reviews and manually annotate them according to a given task. Having sufficiently large training data is important both for achieving high prediction accuracy and for avoiding over-fitting. Given millions of reviews, preparing a training set is laborious. We propose to incorporate active learning, a machine learning paradigm, in order to reduce the human effort involved in app review analysis. Our app review classification framework exploits three active learning strategies based on uncertainty sampling. We apply these strategies to an existing dataset of 4,400 app reviews for classifying app reviews as features, bugs, rating, and user experience. We find that active learning, compared to a training dataset chosen randomly, yields a significantly higher prediction accuracy under multiple scenarios.
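
To make the idea of uncertainty sampling concrete, here is a minimal sketch. The abstract does not name its three strategies, so the three scores below (least confidence, margin, and entropy) are an assumption: they are common uncertainty measures, not necessarily the ones the authors used. The function names and the probability values in `pool` are hypothetical and purely illustrative.

```python
import math

def least_confidence(probs):
    """1 minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)

def margin_uncertainty(probs):
    """1 minus the gap between the two most likely classes; higher means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)

def entropy(probs):
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, score, k):
    """Return the indices of the k pool items with the highest uncertainty score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical classifier outputs for four unlabeled reviews over four classes
# (feature, bug, rating, user experience). These numbers are made up.
pool = [
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.35, 0.15, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.20, 0.05, 0.05],
]

# Indices of the two reviews an annotator would be asked to label next.
queries = select_queries(pool, entropy, k=2)
```

In an active learning loop, the selected reviews would be labeled by an expert, added to the training set, and the classifier retrained before the next round of selection.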

    Active Learning for Text Classification

    Text classification approaches are used extensively to solve real-world challenges. The success or failure of text classification systems hangs on the datasets used to train them: without a good dataset it is impossible to build a quality system. This thesis examines the applicability of active learning in text classification for the rapid and economical creation of labelled training data. Four main contributions are made in this thesis. First, we present two novel selection strategies to choose the most informative examples for manual labelling. One is an approach that uses an advanced aggregated confidence measurement, instead of the direct output of classifiers, to measure the confidence of the prediction and choose the examples with the least confidence for querying. The other is a simple but effective exploration-guided active learning selection strategy that uses only the notions of density and diversity, based on similarity. Second, we propose new methods of using deterministic clustering algorithms to help bootstrap the active learning process. We first illustrate the problems of using non-deterministic clustering for selecting initial training sets, showing how non-deterministic clustering methods can result in inconsistent behaviour in the active learning process. We then compare various deterministic clustering techniques with commonly used non-deterministic ones, and show that deterministic clustering algorithms are as good as non-deterministic clustering algorithms at selecting initial training examples for the active learning process. More importantly, we show that the use of deterministic approaches stabilises the active learning process. Our third direction is in the area of visualising the active learning process. We demonstrate the use of an existing visualisation technique to show that it can lead to a better understanding of active learning selection strategies. Finally, to evaluate the practicality and usefulness of active learning as a general dataset labelling methodology, it is desirable that actively labelled datasets can be reused more widely rather than being limited to one particular classifier. We compare the reusability of popular active learning methods for text classification and identify the best classifiers to use in active learning for text classification. This thesis is concerned with using active learning methods to label large unlabelled textual datasets. Our domain of interest is text classification, but most of the methods proposed are quite general and so are applicable to other domains having large collections of data with high dimensionality.

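    Tax Fraud Detection in Tourist Apartment Rentals Using Positive-Unlabeled Classification Techniques (Detección de fraude fiscal en alquiler de pisos turísticos mediante técnicas de clasificación positive-unlabeled)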

    The main objective of this master's thesis is to identify fraudulent tourist accommodations from data extracted from tourist accommodation websites. This is a semi-supervised classification problem or, more specifically, a problem of learning from positive and unlabeled data. In addition to a model capable of detecting tax fraud, a reliable method of evaluating the model is also needed for this particular type of classification.
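
Evaluation is hard in the positive-unlabeled setting because no confirmed negatives exist, so precision cannot be measured directly. The abstract does not say which evaluation method the thesis adopts. As an illustration only, the sketch below computes one criterion from the positive-unlabeled literature (proposed by Lee and Liu): recall on the known positives, squared, divided by the fraction of all examples predicted positive. The function name and the toy data are hypothetical.

```python
def pu_score(predictions, is_labeled_positive):
    """Compute recall^2 / Pr(prediction = 1) on a validation set.

    predictions: 0/1 model outputs for every validation example.
    is_labeled_positive: 1 if the example is a known positive, 0 if unlabeled.
    Recall is estimated on the known positives only, since unlabeled
    examples may be either positive or negative.
    """
    if len(predictions) != len(is_labeled_positive):
        raise ValueError("inputs must have the same length")
    n_pos = sum(is_labeled_positive)
    if n_pos == 0:
        raise ValueError("need at least one labeled positive")
    hits = sum(p for p, s in zip(predictions, is_labeled_positive) if s)
    recall = hits / n_pos
    frac_predicted_positive = sum(predictions) / len(predictions)
    if frac_predicted_positive == 0:
        return 0.0
    return recall ** 2 / frac_predicted_positive

# Toy validation set (made-up values): three known fraudulent listings
# followed by five unlabeled listings.
preds = [1, 1, 0, 1, 0, 0, 1, 0]
labeled = [1, 1, 1, 0, 0, 0, 0, 0]
score = pu_score(preds, labeled)
```

A higher score rewards models that recover the known positives without flagging a large share of the whole dataset, which makes it usable for comparing models even though true precision is unknown.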

    Approaching Sentiment Analysis by Using Semi-supervised Learning of Multidimensional Classifiers

    Sentiment Analysis is defined as the computational study of opinions, sentiments and emotions expressed in text. Within this broad field, most of the work has been focused on either Sentiment Polarity classification, where a text is classified as having positive or negative sentiment, or Subjectivity classification, in which a text is classified as being subjective or objective. In this paper, we instead consider a real-world problem in which the attitude of the author is characterised by three different (but related) target variables: Subjectivity, Sentiment Polarity, and Will to Influence. This contrasts with the two problems above, where there is only a single variable to be predicted. For that reason, the common (uni-dimensional) approaches used in this area yield suboptimal solutions to this problem. To bridge this gap, we propose, for the first time, the use of the novel multi-dimensional classification paradigm in the Sentiment Analysis domain. This methodology can combine the different target variables in the same classification task so as to take advantage of the potential statistical relations between them. In addition, to exploit the huge amount of unlabelled information available nowadays in this context, we propose extending the multi-dimensional classification framework to the semi-supervised domain. Experimental results show that our semi-supervised multi-dimensional approach outperforms the most common Sentiment Analysis approaches. We conclude that our approach improves recognition rates for this problem and, by extension, could be considered for future Sentiment Analysis problems.

    Multilingual Twitter Sentiment Classification: The Role of Human Annotators

    What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered
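
The abstract refers to "various annotator agreement measures" without naming them. As a hedged illustration, the sketch below computes Cohen's kappa, one widely used agreement measure for two annotators; choosing it here is an assumption, and the label sequences are made up. Kappa compares the observed agreement with the agreement expected by chance given each annotator's label frequencies.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical sentiment labels from two annotators for six tweets.
annotator_1 = ["neg", "neg", "neu", "pos", "pos", "neu"]
annotator_2 = ["neg", "neu", "neu", "pos", "pos", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Self-agreement can be measured the same way by passing two labeling passes from the same annotator over the same tweets. Note that Cohen's kappa treats the classes as unordered, so it does not reflect the ordering of negative, neutral, and positive that the abstract reports.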

    Theoretical and methodological advances in semi-supervised learning and the class-imbalance problem.

    201 p. This work focuses on the theoretical and practical generalization of two well-known, challenging situations in machine learning to classification problems in which the assumption of a single binary class does not hold. Semi-supervised learning is a technique that uses large amounts of unlabeled data to improve the performance of supervised learning when the labeled dataset is very small. Specifically, this work contributes powerful and computationally efficient methodologies for learning, in a semi-supervised way, classifiers for multiple class variables. The fundamental limits of semi-supervised learning in multi-class problems are also investigated theoretically. The class-imbalance problem arises when the target variables have a probability distribution imbalanced enough to distort the solutions proposed by traditional supervised learning algorithms. In this project, a theoretical framework is proposed to separate the distortion caused by class imbalance from other factors that affect classifier accuracy. This framework is used mainly to recommend classifier evaluation metrics for this situation. Finally, a measure of the degree of class imbalance in a dataset, correlated with the resulting loss of accuracy, is also proposed. Intelligent Systems Grou

    Theoretical and Methodological Advances in Semi-supervised Learning and the Class-Imbalance Problem

    This paper focuses on the theoretical and practical generalization of two known and challenging situations from the field of machine learning to classification problems in which the assumption of having a single binary class is not fulfilled. Semi-supervised learning is a technique that uses large amounts of unlabeled data to improve the performance of supervised learning when the labeled data set is very limited. Specifically, this work contributes powerful and computationally efficient methodologies to learn, in a semi-supervised way, classifiers for multiple class variables. The fundamental limits of semi-supervised learning in multi-class problems are also investigated theoretically. The problem of class imbalance appears when the target variables present a probability distribution imbalanced enough to distort the solutions proposed by traditional supervised learning algorithms. In this project, a theoretical framework is proposed to separate the distortion produced by class imbalance from other factors that affect the accuracy of classifiers. This framework is mainly used to recommend classifier assessment metrics for this situation. Finally, a measure of the degree of class imbalance in a data set, correlated with the resulting loss of accuracy, is also proposed.