Active learning in annotating micro-blogs dealing with e-reputation
Elections unleash strong political views on Twitter, but what do people
really think about politics? Opinion and trend mining on micro blogs dealing
with politics has recently attracted researchers in several fields including
Information Retrieval and Machine Learning (ML). Since the performance of ML
and Natural Language Processing (NLP) approaches is limited by the amount and
quality of data available, one promising alternative for some tasks is the
automatic propagation of expert annotations. This paper intends to develop a
so-called active learning process for automatically annotating French language
tweets that deal with the image (i.e., representation, web reputation) of
politicians. Our main focus is on the methodology followed to build an original
annotated dataset expressing opinions about two French politicians over time. We
therefore review state of the art NLP-based ML algorithms to automatically
annotate tweets using a manual initiation step as bootstrap. This paper focuses
on key issues in active learning when building a large annotated data set
from noisy data, where the noise is introduced by human annotators, the abundance
of data, and the label distribution across data and entities. In turn, we show that Twitter
characteristics such as the author's name or hashtags can be considered as the
bearing point to not only improve automatic systems for Opinion Mining (OM) and
Topic Classification but also to reduce noise in human annotations. However, a
later thorough analysis shows that reducing noise might induce the loss of
crucial information.
Comment: Journal of Interdisciplinary Methodologies and Issues in Science -
Vol 3 - Contextualisation digitale - 201
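The bootstrap-then-query loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes pool-based uncertainty sampling, a toy nearest-centroid model over a single numeric feature, and a hypothetical `oracle` callable standing in for the human annotator. All function names are illustrative.

```python
def train_centroids(labeled):
    """Toy model (assumption): the mean feature value of each label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def margin(centroids, x):
    """Gap between the two nearest centroids; a small gap means an uncertain item."""
    distances = sorted(abs(x - c) for c in centroids.values())
    return distances[1] - distances[0]


def predict(centroids, x):
    """Label of the nearest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def active_learning(seed, pool, oracle, budget):
    """Repeatedly ask the oracle to label the item the current model is least sure of.

    seed   -- manually labeled (feature, label) pairs; must cover at least two labels
    pool   -- unlabeled feature values
    oracle -- hypothetical stand-in for a human annotator: feature -> label
    budget -- maximum number of oracle queries
    """
    labeled = list(seed)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        centroids = train_centroids(labeled)
        query = min(pool, key=lambda x: margin(centroids, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return train_centroids(labeled), labeled
```

In a real setting the scalar feature would be replaced by text features of the tweets, and the oracle by the expert annotation step; the noise issues the abstract raises would show up as an oracle that sometimes returns the wrong label.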
Asymmetric Pruning for Learning Cascade Detectors
Cascade classifiers are one of the most important contributions to real-time
object detection. Nonetheless, there are many challenging problems arising in
training cascade detectors. One common issue is that the node classifier is
trained with a symmetric classifier. Having a low misclassification error rate
does not guarantee an optimal node learning goal in cascade classifiers, i.e.,
an extremely high detection rate with a moderate false positive rate. In this
work, we present a new approach to train an effective node classifier in a
cascade detector. The algorithm is based on two key observations: 1) Redundant
weak classifiers can be safely discarded; 2) The final detector should satisfy
the asymmetric learning objective of the cascade architecture. To achieve this,
we separate the classifier training into two steps: finding a pool of
discriminative weak classifiers/features and training the final classifier by
pruning weak classifiers which contribute little to the asymmetric learning
criterion (asymmetric classifier construction). Our model reduction approach
helps accelerate the learning time while achieving the pre-determined learning
objective. Experimental results on both face and car data sets verify the
effectiveness of the proposed algorithm. On the FDDB face data set, our
approach achieves state-of-the-art performance, which demonstrates the
advantage of our approach.
Comment: 14 page
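The pruning step of this abstract can be sketched as a greedy backward elimination. This is only an illustration under stated assumptions, not the paper's algorithm: the pool of weak classifiers is taken as given (each one reduced to its +1/-1 votes on a validation set), the node classifier is an unweighted vote sum, and the asymmetric criterion is a hypothetical weighted error in which a missed detection costs `fn_weight` times a false alarm.

```python
def asymmetric_cost(votes, labels, active, fn_weight=10.0):
    """Weighted error of the summed vote over the active weak classifiers.

    votes[j][i] is the +1/-1 vote of weak classifier j on sample i.
    A missed positive costs fn_weight, a false alarm costs 1 (assumed criterion).
    """
    cost = 0.0
    for i, y in enumerate(labels):
        score = sum(votes[j][i] for j in active)
        pred = 1 if score >= 0 else -1  # ties go to "positive" to favor detection
        if y == 1 and pred == -1:
            cost += fn_weight
        elif y == -1 and pred == 1:
            cost += 1.0
    return cost


def prune(votes, labels, keep, fn_weight=10.0):
    """Greedily drop weak classifiers until only `keep` remain.

    At each step, remove the classifier whose removal leaves the lowest
    asymmetric cost, i.e. the one contributing least to the criterion.
    """
    active = list(range(len(votes)))
    while len(active) > keep:
        drop = min(
            active,
            key=lambda j: asymmetric_cost(
                votes, labels, [k for k in active if k != j], fn_weight
            ),
        )
        active.remove(drop)
    return active
```

The large `fn_weight` encodes the node learning goal stated above: a very high detection rate matters more than a low false positive rate, so a classifier that causes misses is pruned before one that only causes false alarms.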
Semi-Supervised Learning For Identifying Opinions In Web Content
Thesis (Ph.D.) - Indiana University, Information Science, 2011
Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: Opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; but SSL has been applied in only a few opinion detection studies. This study investigates application of four different SSL algorithms in three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content.
Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain--the blogosphere--when a domain transfer-based SSL strategy was implemented.
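The abstract does not name the four SSL algorithms it evaluates. As a generic illustration of how limited labeled data can be used to label unlabeled data, the sketch below shows self-training, one common SSL scheme. It assumes a toy nearest-centroid model over a single numeric feature and at least two seed labels; the function names and the `threshold` parameter are illustrative, not taken from the thesis.

```python
def fit(labeled):
    """Toy model (assumption): the mean feature value of each label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def classify(model, x):
    """Return (nearest label, margin over the runner-up label)."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best, runner_up = ranked[0], ranked[1]
    return best, abs(x - model[runner_up]) - abs(x - model[best])


def self_train(labeled, unlabeled, threshold, max_rounds=10):
    """Grow the labeled set with the model's own confident predictions.

    Each round fits the model, pseudo-labels every unlabeled item whose
    margin reaches `threshold`, and stops once no item qualifies.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit(labeled)
        remaining, added = [], 0
        for x in unlabeled:
            label, margin = classify(model, x)
            if margin >= threshold:
                labeled.append((x, label))
                added += 1
            else:
                remaining.append(x)
        unlabeled = remaining
        if added == 0:
            break
    return fit(labeled), labeled, unlabeled
```

Items that never reach the confidence threshold stay unlabeled, which is the usual guard against the model reinforcing its own mistakes.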
Sentiment analysis on online social network
A large amount of data is maintained on every social networking site. The volume of data constantly gathered on these sites makes it difficult for methods such as field agents, clipping services, and ad hoc research to keep up with social media data. This paper discusses the previous research on sentiment analysis.