Data Sets: Word Embeddings Learned from Tweets and General Data
A word embedding is a low-dimensional, dense and real-valued vector
representation of a word. Word embeddings have been used in many NLP tasks.
They are usually generated from a large text corpus. The embedding of a word
captures both its syntactic and semantic aspects. Tweets are short, noisy and
have unique lexical and semantic features that are different from other types
of text. Therefore, it is necessary to have word embeddings learned
specifically from tweets. In this paper, we present ten word embedding data
sets. In addition to the data sets learned from just tweet data, we also built
embedding sets from the general data and the combination of tweets with the
general data. The general data consist of news articles, Wikipedia data and
other web data. These ten embedding models were learned from about 400 million
tweets and 7 billion words from the general text. In this paper, we also
present two experiments demonstrating how to use the data sets in some NLP
tasks, such as tweet sentiment analysis and tweet topic classification.
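As a rough illustration of what "dense, real-valued vector representation" means in practice, the sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors and the vocabulary are made up for illustration only; they are not taken from the released embedding sets, which would be far higher-dimensional.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumption: real ones are loaded from a file).
toy_embeddings = {
    "good": [0.9, 0.1, 0.2],
    "gr8": [0.8, 0.2, 0.1],
    "bad": [-0.7, 0.3, 0.1],
}

# Words used in similar contexts are expected to have a higher similarity.
similar = cosine_similarity(toy_embeddings["good"], toy_embeddings["gr8"])
dissimilar = cosine_similarity(toy_embeddings["good"], toy_embeddings["bad"])
print(similar > dissimilar)
```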
Search-based Optimisation of LLM Learning Shots for Story Point Estimation
One of the ways Large Language Models (LLMs) are used to perform machine
learning tasks is to provide them with a few examples before asking them to
produce a prediction. This is a meta-learning process known as few-shot
learning. In this paper, we use available Search-Based methods to optimise the
number and combination of examples that can improve an LLM's estimation
performance, when it is used to estimate story points for new agile tasks. Our
preliminary results show that our SBSE technique improves the estimation
performance of the LLM by 59.34% on average (in terms of mean absolute error of
the estimation) over three datasets against a zero-shot setting.
Comment: 6 pages, accepted at SSBSE'23 NIER Track
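The reported gain is a relative reduction in mean absolute error (MAE). A minimal sketch of how such a percentage could be computed for a single dataset follows; the story-point values are invented for illustration, and how the paper aggregates across its three datasets is not stated here.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def relative_improvement(baseline_mae, improved_mae):
    """Percentage reduction in MAE relative to the baseline."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return 100.0 * (baseline_mae - improved_mae) / baseline_mae

# Hypothetical story points: ground truth, zero-shot and few-shot estimates.
actual = [3, 5, 8, 2]
zero_shot = [8, 1, 3, 5]
few_shot = [4, 5, 6, 3]

baseline = mean_absolute_error(actual, zero_shot)  # 4.25
improved = mean_absolute_error(actual, few_shot)   # 1.0
print(f"{relative_improvement(baseline, improved):.2f}%")  # prints 76.47%
```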
Unsupervised Domain Adaptation using Lexical Transformations and Label Injection for Twitter Data
Domain adaptation is an important and widely studied problem in natural
language processing. A large body of literature tries to solve this problem by
adapting models trained on the source domain to the target domain. In this
paper, we instead solve this problem from a dataset perspective. We modify the
source domain dataset with simple lexical transformations to reduce the domain
shift between the source dataset distribution and the target dataset
distribution. We find that models trained on the transformed source domain
dataset perform significantly better than zero-shot models. Using our proposed
transformations to convert standard English to tweets, we reach an unsupervised
part-of-speech (POS) tagging accuracy of 92.14% (from 81.54% zero-shot
accuracy), which is only slightly below the supervised performance of 94.45%.
We also use our proposed transformations to synthetically generate tweets and
augment the Twitter dataset to achieve state-of-the-art performance for POS
tagging.
Comment: Accepted at WASSA at ACL 202
A Knowledge Adoption Model Based Framework for Finding Helpful User-Generated Contents in Online Communities
Many online communities allow their members to provide information helpfulness judgments that can be used to guide other users to useful content quickly. However, it is a serious challenge to solicit enough user participation in providing feedback in online communities. Existing studies on assessing the helpfulness of user-generated content are mainly based on heuristics and lack a unifying theoretical framework. In this article, we propose a text classification framework for finding helpful user-generated content in online knowledge-sharing communities. The objective of our framework is to help a knowledge seeker find helpful information that can potentially be adopted. The framework is built on the Knowledge Adoption Model, which considers both content-based argument quality and information source credibility. We identify 6 argument quality dimensions and 3 source credibility dimensions based on information quality and psychological theories. Using data extracted from a popular online community, our empirical evaluations show that all the dimensions improve the performance over a traditional text classification technique that considers word-based lexical features only.
Modulating Scalable Gaussian Processes for Expressive Statistical Learning
For a learning task, the Gaussian process (GP) is of interest for learning the
statistical relationship between inputs and outputs, since it offers not only
the prediction mean but also the associated variability. The vanilla GP, however,
struggles to learn complicated distributions with properties such as
heteroscedastic noise, multi-modality and non-stationarity from massive data,
due to its Gaussian marginal and cubic complexity. To this end, this
article studies new scalable GP paradigms including the non-stationary
heteroscedastic GP, the mixture of GPs and the latent GP, which introduce
additional latent variables to modulate the outputs or inputs in order to learn
richer, non-Gaussian statistical representations. We further resort to different
variational inference strategies to arrive at analytical or tighter evidence
lower bounds (ELBOs) of the marginal likelihood for efficient and effective
model training. Extensive numerical experiments against state-of-the-art GP and
neural network (NN) counterparts on various tasks verify the superiority of
these scalable modulated GPs, especially the scalable latent GP, for learning
diverse data distributions.
Comment: 31 pages, 9 figures, 4 tables, preprint under review
AIR-JPMC@SMM4H'22: Classifying Self-Reported Intimate Partner Violence in Tweets with Multiple BERT-based Models
This paper presents our submission for the SMM4H 2022 Shared Task on the
classification of self-reported intimate partner violence on Twitter (in
English). The goal of this task was to accurately determine if the contents of
a given tweet demonstrated someone reporting their own experience with intimate
partner violence. The submitted system is an ensemble of five RoBERTa models,
each weighted by its F1-score on the validation dataset. This
system performed 13% better than the baseline and was the best performing
system overall for this shared task.
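One way an F1-weighted ensemble could work is sketched below. This is an illustration under stated assumptions, not the submitted system: the abstract does not say whether the models' probabilities or their hard votes are combined, so the sketch averages per-model positive-class probabilities with weights proportional to validation F1. The 0.5 decision threshold and all numbers are hypothetical.

```python
def weighted_ensemble(probabilities, f1_scores, threshold=0.5):
    """Combine per-model positive-class probabilities using F1-based weights.

    probabilities: one list per model, each holding one probability per tweet.
    f1_scores: validation F1-score of each model, in the same order.
    threshold: assumed decision cut-off (not specified in the source).
    """
    if len(probabilities) != len(f1_scores) or not probabilities:
        raise ValueError("need one F1-score per model")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1-scores must sum to a positive value")
    weights = [f1 / total for f1 in f1_scores]  # normalise so weights sum to 1
    n_items = len(probabilities[0])
    combined = [
        sum(w * model_probs[i] for w, model_probs in zip(weights, probabilities))
        for i in range(n_items)
    ]
    labels = [1 if score >= threshold else 0 for score in combined]
    return combined, labels

# Hypothetical outputs of five models for three tweets.
model_probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.5],
    [0.9, 0.2, 0.7],
    [0.6, 0.4, 0.3],
]
validation_f1 = [0.78, 0.75, 0.74, 0.80, 0.70]  # made-up scores

scores, predictions = weighted_ensemble(model_probs, validation_f1)
print(predictions)
```

Weighting by validation F1 gives stronger models more influence while still letting weaker ones contribute to borderline cases.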