39 research outputs found
Short Text Classification using Contextual Analysis
Peer reviewed
Sentiment Analysis of COVID-19 Vaccines in Indonesia on Twitter Using Pre-Trained and Self-Training Word Embeddings
Sentiment about the COVID-19 vaccine can be gathered from social media, where users commonly express their opinions. Twitter is one of the platforms that Indonesians use most often for this purpose. This research uses a Bidirectional LSTM combined with word embeddings, testing fastText and GloVe. We created eight test scenarios to inspect the performance of the word embeddings, using both pre-trained and self-trained embedding vectors. The dataset gathered from Twitter was prepared in a stemmed and an unstemmed version. In the GloVe scenario group, the highest accuracy, 92.5%, came from the model that used self-trained GloVe and was trained on the unstemmed dataset. In the fastText scenario group, the highest accuracy, 92.3%, came from the model that used self-trained fastText and was trained on the stemmed dataset. Scenarios that used pre-trained embedding vectors had lower accuracy than those that used self-trained vectors, because the pre-trained embeddings were trained on the Wikipedia corpus, which contains standard, well-structured language, while the dataset in this study came from Twitter, which contains non-standard sentences. Even after the dataset was processed with stemming and a slang-word dictionary, the pre-trained embeddings still could not recognize several words from our dataset
DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus
This paper focuses on how to extract opinions from Persian sentence-level
text. Deep learning models provide a new way to boost the quality of the
output. However, these architectures need large annotated datasets as well as
an accurate design. To the best of our knowledge, the field lacks not only a
well-annotated Persian sentiment corpus but also a novel model that classifies
Persian opinions in both the multi-class and the binary setting. In this work,
we first propose two novel deep learning architectures composed of
bidirectional LSTM and CNN. They are part of a precisely designed deep
hierarchy and can classify sentences in both settings. Second, we suggest three
data augmentation techniques for the low-resource Persian sentiment corpus. Our
comprehensive experiments on three baselines and two different neural word
embedding methods show that our data augmentation methods and proposed models
successfully address the aims of the
research
Unionization method for changing opinion in sentiment classification using machine learning
Sentiment classification aims to determine whether an opinionated text expresses a positive, negative or neutral opinion. Most existing sentiment classification approaches have focused on supervised text classification techniques. One critical problem of sentiment classification is that a text collection may contain tens or hundreds of thousands of features, i.e. high dimensionality, which can be addressed by a dimension reduction approach. Nonetheless, although feature selection as a dimension reduction method can produce a reduced feature subset, the size of that subset commonly requires further reduction. In this research, a novel dimension reduction approach called feature unionization is proposed to construct a more reduced feature subset. This approach combines several features to create a single, more informative feature. Another challenge of sentiment classification is handling the concept drift problem in the learning step. Users’ opinions change over time as target entities evolve. However, existing sentiment classification approaches do not consider the evolution of users’ opinions. They assume that instances are independent, identically distributed and generated from a stationary distribution, even though they are generated from a stream distribution. In this study, a stream sentiment classification method is proposed to deal with changing opinion and imbalanced data distribution using ensemble learning and instance selection methods. In relation to the concept drift problem, another important issue is the handling of feature drift in sentiment classification. To handle feature drift, relevant features need to be detected to update classifiers. Since the proposed feature unionization method is very effective at constructing more relevant features, it is further used to handle feature drift. Thus, a method to deal with concept and feature drifts for stream sentiment classification is proposed.
The effectiveness of the feature unionization method was compared with the feature selection method over fourteen publicly available datasets in the sentiment classification domain using three typical classifiers. The experimental results showed that the proposed approach is more effective than current feature selection approaches. In addition, the experimental results showed the effectiveness of the proposed stream sentiment classification method in comparison to static sentiment classification. The experiments conducted on four datasets showed that the proposed algorithm achieved better results, demonstrating the effectiveness of the proposed method
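The abstract describes feature unionization only as combining several features into one more informative feature. A minimal sketch of that idea follows, assuming binary term-presence features and a logical OR over hand-picked term groups. The function name `unionize`, the group labels `POS` and `NEG`, and the toy documents are hypothetical and are not taken from the thesis, which may build its groups quite differently.

```python
from typing import Dict, List, Set


def unionize(docs: List[Set[str]], groups: Dict[str, Set[str]]) -> List[Set[str]]:
    """Replace every term that belongs to a group with the group's name.

    Each document is a set of present terms (binary features), so replacing
    the members of a group by one label is a logical OR over those features.
    Terms outside every group are kept unchanged.
    """
    # Reverse lookup: term -> name of the unionized feature that absorbs it.
    term_to_group = {term: name for name, terms in groups.items() for term in terms}
    return [{term_to_group.get(term, term) for term in doc} for doc in docs]


# Hypothetical toy data, for illustration only.
docs = [{"great", "phone"}, {"awesome", "battery"}, {"awful", "phone"}]
groups = {"POS": {"great", "awesome"}, "NEG": {"awful", "terrible"}}

reduced = unionize(docs, groups)
vocab_before = set().union(*docs)
vocab_after = set().union(*reduced)
print(len(vocab_before), len(vocab_after))
```

In this toy case the vocabulary shrinks because "great" and "awesome" collapse into one feature, which is the kind of reduction beyond plain feature selection that the abstract refers to.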
Research and analysis of hate and other emotions in social media
Bachelor's Final Projects in Computer Engineering, Faculty of Mathematics, Universitat de Barcelona, Year: 2022, Supervisor: Maria Salamó Llorente
In the course of just a few years, with the massive introduction of social media, people have dramatically changed the way they communicate and share experiences. The global scale that this topic has reached, combined with its rapid expansion, is a historic landmark. However, what do social networks represent in our day-to-day lifestyles? The answer is a double life. Since their launch, a digital pseudo-reality has been created in which thoughts, emotions and privacy can be expressed in detail. This leads us to dump each of society’s concerns into community applications, and if you add the factor of anonymity behind a screen, the result is incendiary.
Through this work, it is intended to identify, study and analyze the high level of emotions, mostly negative, that has been flooding social media thanks to the aforementioned anonymity. This process will be carried out by entering the Natural Language Processing field. For this purpose, a study of Hate Speech, Toxicity, Offensiveness and other emotions will be carried out on four datasets, each one with one of these tasks respectively. Using these datasets, three language models, based on Transformers and Deep Learning, will be trained and validated for their future comparison.
All of this is performed with the aim of finding the ideal framework for each of the featured tasks, which are based on true-to-life situations. Furthermore, it is intended to find the causes of the inconveniences that the models may present, in a concise and intuitive way for the reader
Multilingual Twitter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition
Existing research on fairness evaluation of document classification models
mainly uses synthetic monolingual data without ground truth for author
demographic attributes. In this work, we assemble and publish a multilingual
Twitter corpus for the task of hate speech detection with four inferred author
demographic factors: age, country, gender and race/ethnicity. The corpus covers
five languages: English, Italian, Polish, Portuguese and Spanish. We evaluate
the inferred demographic labels with a crowdsourcing platform, Figure Eight. To
examine factors that can cause biases, we conduct an empirical analysis of
demographic predictability on the English corpus. We measure the performance of
four popular document classifiers and evaluate the fairness and bias of the
baseline classifiers on the author-level demographic attributes. Comment: Accepted at LREC 202
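One common way to quantify the kind of group-level bias described above is to compare an error rate across demographic groups. The sketch below computes the false positive rate per group and the largest gap between groups. The abstract does not name the fairness metrics used in the paper, so the choice of metric, the function name `false_positive_rates`, and the records are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def false_positive_rates(records: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute the false positive rate for each demographic group.

    Each record is (group, true_label, predicted_label), where label 1 means
    "hate speech" and 0 means "not hate speech". Groups with no negative
    examples are omitted because their rate is undefined.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:
            negatives[group] += 1
            if y_pred == 1:
                false_positives[group] += 1
    return {group: false_positives[group] / count for group, count in negatives.items()}


# Hypothetical predictions, not data from the corpus.
records = [
    ("group_a", 0, 0), ("group_a", 0, 1), ("group_a", 1, 1),
    ("group_a", 0, 0), ("group_a", 0, 0),
    ("group_b", 0, 1), ("group_b", 0, 1), ("group_b", 0, 0),
    ("group_b", 1, 1), ("group_b", 0, 0),
]

rates = false_positive_rates(records)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A gap of zero would mean that non-hateful tweets are flagged at the same rate for every group. A larger gap suggests that the classifier treats authors from some groups more harshly.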
Mining Behavior of Citizen Sensor Communities to Improve Cooperation with Organizational Actors
Web 2.0 (social media) provides a natural platform for dynamic emergence of citizen (as) sensor communities, where the citizens generate content for sharing information and engaging in discussions. Such a citizen sensor community (CSC) has stated or implied goals that are helpful in the work of formal organizations, such as an emergency management unit, for prioritizing their response needs. This research addresses questions related to design of a cooperative system of organizations and citizens in CSC. Prior research by social scientists in a limited offline and online environment has provided a foundation for research on cooperative behavior challenges, including 'articulation' and 'awareness', but Web 2.0 supported CSC offers new challenges as well as opportunities. A CSC presents information overload for the organizational actors, especially in finding reliable information providers (for awareness), and finding actionable information from the data generated by citizens (for articulation). Also, we note three data level challenges: ambiguity in interpreting unconstrained natural language text, sparsity of user behaviors, and diversity of user demographics. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues. I present a novel web information-processing framework, called the Identify-Match-Engage (IME) framework. IME allows operationalizing computation in design problems of awareness and articulation of the cooperative system between citizens and organizations, by addressing data problems of group engagement modeling and intent mining. The IME framework includes: a.) Identification of cooperation-assistive intent (seeking-offering) from short, unstructured messages using a classification model with declarative, social and contrast pattern knowledge, b.) Facilitation of coordination modeling using bipartite matching of complementary intent (seeking-offering), and c.) Identification of user groups to prioritize for engagement by defining a content-driven measure of 'group discussion divergence'. The use of prior knowledge and interplay of features of users, content, and network structures efficiently captures context for computing cooperation-assistive behavior (intent and engagement) from unstructured social data in the online socio-technical systems. Our evaluation of a use-case of the crisis response domain shows improvement in performance for both intent classification and group engagement prioritization. Real world applications of this work include use of the engagement interface tool during various recent crises including the 2014 Jammu and Kashmir floods, and intent classification as a service integrated by the crisis mapping pioneer Ushahidi's CrisisNET project for broader impact
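The bipartite matching of complementary seeking and offering intent mentioned in step b.) can be illustrated with a standard maximum bipartite matching routine. The sketch below uses the augmenting-path method (Kuhn's algorithm). It assumes that a compatibility list between seeking and offering messages is already available. The function name `match_intents` and the message labels are hypothetical, and the dissertation may use a different, for example weighted, matching formulation.

```python
from typing import Dict, List, Set


def match_intents(compatible: Dict[str, List[str]]) -> Dict[str, str]:
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).

    `compatible` maps each seeking message to the offering messages that
    could satisfy it. Returns a dict mapping seeker -> matched offer.
    """
    offer_to_seeker: Dict[str, str] = {}

    def try_assign(seeker: str, visited: Set[str]) -> bool:
        for offer in compatible.get(seeker, []):
            if offer in visited:
                continue
            visited.add(offer)
            current = offer_to_seeker.get(offer)
            # Take the offer if it is free, or if its current seeker can be
            # moved to another compatible offer.
            if current is None or try_assign(current, visited):
                offer_to_seeker[offer] = seeker
                return True
        return False

    for seeker in compatible:
        try_assign(seeker, set())
    return {seeker: offer for offer, seeker in offer_to_seeker.items()}


# Hypothetical messages, for illustration only.
compatible = {
    "need_supplies": ["offer_water", "offer_blankets"],
    "need_water": ["offer_water"],
}
matching = match_intents(compatible)
print(matching)
```

In this example the first seeker initially takes the water offer and is then reassigned to blankets so that the second seeker, who can only use water, is also served.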
Rumor Stance Classification in Online Social Networks: A Survey on the State-of-the-Art, Prospects, and Future Challenges
The emergence of the Internet as a ubiquitous technology has facilitated the
rapid evolution of social media as the leading virtual platform for
communication, content sharing, and information dissemination. Despite
revolutionizing the way news is delivered to people, this technology
has also brought inevitable drawbacks. One such drawback is
the spread of rumors facilitated by social media platforms, which may provoke
doubt and fear among people. Therefore, debunking rumors before they spread
widely has become all the more essential. Over the years, many studies
have been conducted to develop effective rumor verification systems. One aspect
of such studies focuses on rumor stance classification, which concerns the task
of utilizing users' viewpoints about a rumorous post to better predict the
veracity of a rumor. Relying on users' stances in the rumor verification task
has gained great importance, for it has shown significant improvements in model
performance. In this paper, we conduct a comprehensive literature review on
rumor stance classification in complex social networks. In particular, we
present a thorough description of the approaches and mark the top performances.
Moreover, we introduce multiple datasets available for this purpose and
highlight their limitations. Finally, some challenges and future directions are
discussed to stimulate further relevant research efforts. Comment: 13 pages, 2 figures, journa
Objective-Based Hierarchical Clustering of Deep Embedding Vectors
We initiate a comprehensive experimental study of objective-based
hierarchical clustering methods on massive datasets consisting of deep
embedding vectors from computer vision and NLP applications. This includes a
large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word
embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from
several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our
study includes datasets with up to million entries with embedding
dimensions up to .
In order to address the challenge of scaling up hierarchical clustering to
such large datasets we propose a new practical hierarchical clustering
algorithm B++&C. It gives a 5%/20% improvement on average for the popular
Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared
to a wide range of classic methods and recent heuristics. We also introduce a
theoretical algorithm B2SAT&C which achieves a -approximation for the
CKMM objective in polynomial time. This is the first substantial improvement
over the trivial -approximation achieved by a random binary tree. Prior to
this work, the best poly-time approximation of was due
to Charikar et al. (SODA'19)
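
For readers unfamiliar with the objectives named above, the Moseley-Wang objective scores a hierarchy T over n points with pairwise similarities w_ij as the sum over pairs of w_ij * (n - |leaves(T[i v j])|), where T[i v j] is the subtree rooted at the lowest common ancestor of i and j. The sketch below evaluates this unnormalized score for a tiny binary tree. It assumes trees encoded as nested 2-tuples of integer leaf labels, and the similarity values are hypothetical. It does not implement B++&C or B2SAT&C, nor the normalization mentioned in the abstract.

```python
from typing import Dict, FrozenSet, List


def leaves(tree) -> List[int]:
    """Return the leaf labels of a binary tree given as nested 2-tuples of ints."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def moseley_wang(tree, sim: Dict[FrozenSet[int], float], n: int) -> float:
    """Unnormalized Moseley-Wang score of `tree` over `n` points.

    A pair (i, j) is separated at the node where i and j fall into different
    children, so that node is their lowest common ancestor.
    """
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    subtree_size = len(left_leaves) + len(right_leaves)
    # Total similarity of the pairs whose lowest common ancestor is this node.
    cross = sum(sim[frozenset((i, j))] for i in left_leaves for j in right_leaves)
    return (cross * (n - subtree_size)
            + moseley_wang(left, sim, n)
            + moseley_wang(right, sim, n))


# Hypothetical similarities: points 0 and 1 are close, point 2 is far from both.
sim = {frozenset((0, 1)): 4.0, frozenset((0, 2)): 1.0, frozenset((1, 2)): 1.0}
good_tree = ((0, 1), 2)  # merges the similar pair first
poor_tree = ((0, 2), 1)  # merges a dissimilar pair first

good_score = moseley_wang(good_tree, sim, n=3)
poor_score = moseley_wang(poor_tree, sim, n=3)
print(good_score, poor_score)
```

The tree that merges the most similar pair first gets the higher score, which is the behavior the objective is designed to reward.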