Aspect term extraction for sentiment analysis in large movie reviews using Gini Index feature selection method and SVM classifier
With the rapid development of the World Wide Web, electronic word-of-mouth interaction has made consumers active participants. The large number of reviews that consumers now post on the Web provides valuable information to other consumers. This information is valuable not only for prospective consumers making decisions but also for businesses predicting success and sustainability. In this paper, a Gini Index based feature selection method with a Support Vector Machine (SVM) classifier is proposed for sentiment classification of a large movie review data set. The results show that our Gini Index method achieves better classification performance, with a lower error rate and higher accuracy.
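As a rough illustration of Gini-based term selection, the sketch below scores each term by how much splitting the reviews on its presence reduces the Gini impurity of the sentiment labels. This impurity-reduction formulation is an assumption (the abstract does not say which Gini Index variant is used), and the toy reviews and the helper names `gini_impurity`, `gini_gain`, and `select_terms` are hypothetical. The selected terms would then be passed to an SVM classifier, which is omitted here.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(docs, labels, term):
    """Impurity reduction from splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    weighted = (len(with_term) / n) * gini_impurity(with_term) \
        + (len(without_term) / n) * gini_impurity(without_term)
    return gini_impurity(labels) - weighted

def select_terms(docs, labels, k):
    """Return the k terms with the largest impurity reduction."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: gini_gain(docs, labels, t), reverse=True)
    return ranked[:k]

# Hypothetical toy data: each review is a set of terms.
docs = [{"great", "movie"}, {"great", "plot"}, {"boring", "movie"}, {"boring", "plot"}]
labels = ["pos", "pos", "neg", "neg"]
top_terms = select_terms(docs, labels, 2)
```

On this toy data the sentiment-bearing terms separate the classes perfectly, while "movie" and "plot" occur equally in both classes and score zero.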
Demographic Inference and Representative Population Estimates from Multilingual Social Media Data
Social media provide access to behavioural data at an unprecedented scale and
granularity. However, using these data to understand phenomena in a broader
population is difficult due to their non-representativeness and the bias of
statistical inference tools towards dominant languages and groups. While
demographic attribute inference could be used to mitigate such bias, current
techniques are almost entirely monolingual and fail to work in a global
environment. We address these challenges by combining multilingual demographic
inference with post-stratification to create a more representative population
sample. To learn demographic attributes, we create a new multimodal deep neural
architecture for joint classification of age, gender, and organization-status
of social media users that operates in 32 languages. This method substantially
outperforms current state of the art while also reducing algorithmic bias. To
correct for sampling biases, we propose fully interpretable multilevel
regression methods that estimate inclusion probabilities from inferred joint
population counts and ground-truth population counts. In a large experiment
over multilingual heterogeneous European regions, we show that our demographic
inference and bias correction together allow for more accurate estimates of
populations and make a significant step towards representative social sensing
in downstream applications with multilingual social media.
Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19)
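The basic idea behind post-stratification can be sketched with simple cell weights: each demographic cell is reweighted by the ratio of its population share to its sample share. This is a deliberately simplified stand-in, not the multilevel regression method described in the abstract, and the cell definitions, counts, and the function name `poststratification_weights` are hypothetical.

```python
def poststratification_weights(sample_counts, population_counts):
    """Weight per cell = population share / sample share."""
    sample_total = sum(sample_counts.values())
    population_total = sum(population_counts.values())
    weights = {}
    for cell, n_sample in sample_counts.items():
        if n_sample == 0:
            continue  # an empty cell cannot be reweighted
        population_share = population_counts[cell] / population_total
        sample_share = n_sample / sample_total
        weights[cell] = population_share / sample_share
    return weights

# Hypothetical counts: younger users are over-represented in the sample.
sample_counts = {
    ("18-29", "F"): 400, ("18-29", "M"): 400,
    ("30+", "F"): 100, ("30+", "M"): 100,
}
population_counts = {
    ("18-29", "F"): 250, ("18-29", "M"): 250,
    ("30+", "F"): 250, ("30+", "M"): 250,
}
weights = poststratification_weights(sample_counts, population_counts)
```

Over-represented cells receive weights below 1 and under-represented cells receive weights above 1, so the weighted sample matches the population shares.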
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements since this improves the training datasets and
consequently the model performance. Finally, we show that there is strong
evidence that humans perceive the sentiment classes (negative, neutral, and
positive) as ordered.
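One widely used annotator agreement measure is Cohen's kappa, which corrects the observed agreement between two annotators for agreement expected by chance. The abstract does not name the measures it applies, so the choice of kappa here is an assumption, and the labels below are hypothetical. Kappa treats the classes as unordered, so it does not capture the ordering of sentiment classes that the abstract reports.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    classes = set(labels_a) | set(labels_b)
    # Chance agreement: probability both annotators pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations of six tweets by two annotators.
annotator_a = ["neg", "neg", "neu", "pos", "pos", "pos"]
annotator_b = ["neg", "neu", "neu", "pos", "pos", "neg"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on four of six tweets (observed agreement 2/3) against a chance agreement of 1/3, which gives a kappa of 0.5.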
Semi-supervised learning and fairness-aware learning under class imbalance
With the advent of Web 2.0 and rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information. Rather, the quality of the data (the veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms rely solely on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or missing class labels, and skewed data distributions. The last of these, the so-called class-imbalance problem, is long-standing and still dramatically affects machine learning algorithms. Class imbalance causes classification models to learn one particular class (the majority) effectively while ignoring other classes (the minority). As an extension of this issue, machine learning models applied in domains of high societal impact have become biased against groups of people or individuals who are not well represented in the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgent need to mitigate discriminatory outcomes from machine learning algorithms.
In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance, and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions:
• Tackling class-imbalance in semi-supervised learning – The class-imbalance problem is very often encountered in classification. A variety of methods tackle this problem; however, methods that deal with class imbalance in semi-supervised learning are lacking. We address this problem by employing data augmentation in the semi-supervised learning process in order to equalize class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process.
• Mitigating unfairness in supervised models – Fairness in supervised learning has received a lot of attention in recent years. A growing body of pre-, in- and post-processing approaches has been proposed to mitigate algorithmic bias; however, these methods use the overall error rate as the performance measure of the machine learning algorithm, which leads to high error rates on the under-represented class. To deal with this problem, we propose approaches that operate at the pre-, in- and post-processing stages while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of predictive performance while mitigating unfair outcomes.
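The point about overall error rate hiding poor minority-class performance can be seen in a small sketch. It compares the overall error rate with the per-class and balanced error rates for a classifier that always predicts the majority class. The data and function names are hypothetical, and the balanced error rate is used only as one example of a measure that accounts for all classes, not necessarily the one used in the thesis.

```python
def error_rate(y_true, y_pred):
    """Fraction of all instances that are misclassified."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_error(y_true, y_pred):
    """Error rate computed separately for each true class."""
    errors = {}
    for cls in set(y_true):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == cls]
        errors[cls] = sum(t != p for t, p in pairs) / len(pairs)
    return errors

def balanced_error_rate(y_true, y_pred):
    """Unweighted mean of the per-class error rates."""
    errors = per_class_error(y_true, y_pred)
    return sum(errors.values()) / len(errors)

# Hypothetical imbalanced data: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a classifier that ignores the minority class

overall = error_rate(y_true, y_pred)
by_class = per_class_error(y_true, y_pred)
balanced = balanced_error_rate(y_true, y_pred)
```

The overall error rate is only 0.05, yet every minority instance is misclassified, which the balanced error rate of 0.5 makes visible.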
Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview
An increasing number of works in natural language processing have addressed
the effect of bias on the predicted outcomes, introducing mitigation techniques
that act on different parts of the standard NLP pipeline (data and models).
However, these works have been conducted in isolation, without a unifying
framework to organize efforts within the field. This leads to repetitive
approaches, and puts an undue focus on the effects of bias, rather than on
their origins. Research focused on bias symptoms rather than the underlying
origins could limit the development of effective countermeasures. In this
paper, we propose a unifying conceptualization: the predictive bias framework
for NLP. We summarize the NLP literature and propose a general mathematical
definition of predictive bias in NLP along with a conceptual framework,
differentiating four main origins of biases: label bias, selection bias, model
overamplification, and semantic bias. We discuss how past work has countered
each bias origin. Our framework serves to guide an introductory overview of
predictive bias in NLP, integrating existing work into a single structure and
opening avenues for future research.
Comment: 9 pages excluding references, 1 figure, 3 pages for appendix
Classification of drugs reviews using W-LRSVM model
Patients have had few opportunities to discuss their experiences with drugs, which has made opinion mining of drug reviews difficult. Recent findings show that online reviews and blogs on drugs are important for patients, marketers and industry. Collecting drug information from websites and analyzing it is a challenge. A model is designed around an algorithm that crawls the web for drug reviews and analyzes them. Reviews of five different drugs were crawled using the algorithm. The W-Bayesian Logistic Regression and Support Vector Machine (W-LRSVM) model was trained with different split ratios and reached an accuracy of 97.46%. Experimental results on reviews of the five drugs showed that the proposed model gave better results than other classifiers.