
    Data Sets: Word Embeddings Learned from Tweets and General Data

    A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that differ from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from tweet data alone, we also built embedding sets from general data and from the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words of general text. We also present two experiments demonstrating how to use the data sets in NLP tasks such as tweet sentiment analysis and tweet topic classification.
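
    A minimal sketch of how such an embedding set could be consumed, assuming a word2vec-style text file ("word v1 v2 ... vD" per line); the file name, dimensionality and toy tweet are placeholders, not the released data sets' actual format:

```python
# Load word2vec-style text embeddings and build an averaged tweet vector
# that a downstream sentiment or topic classifier can consume.
import numpy as np

def load_embeddings(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:          # skip a possible "count dim" header line
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def tweet_vector(tokens, vectors, dim=300):
    # Average the embeddings of the in-vocabulary tokens.
    hits = [vectors[t] for t in tokens if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)

emb = load_embeddings("tweet_embeddings.txt")   # hypothetical file name
x = tweet_vector("this phone is great".split(), emb)
```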

    Search-based Optimisation of LLM Learning Shots for Story Point Estimation

    One of the ways Large Language Models (LLMs) are used to perform machine learning tasks is to provide them with a few examples before asking them to produce a prediction. This is a meta-learning process known as few-shot learning. In this paper, we use available Search-Based methods to optimise the number and combination of examples that can improve an LLM's estimation performance when it is used to estimate story points for new agile tasks. Our preliminary results show that our SBSE technique improves the estimation performance of the LLM by 59.34% on average (in terms of mean absolute error of the estimation) over three datasets against a zero-shot setting. Comment: 6 pages, Accepted at SSBSE'23 NIER Track
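
    A minimal sketch of the underlying idea, with a simple random search standing in for the paper's SBSE technique and `llm_estimate` as a hypothetical placeholder for the actual LLM call:

```python
# Search over which (and how many) solved story-point examples to place in
# the prompt, scoring each candidate shot set by mean absolute error (MAE)
# on a validation split of agile tasks.
import random

def mae(shots, val_set, llm_estimate):
    errs = [abs(llm_estimate(shots, task) - points) for task, points in val_set]
    return sum(errs) / len(errs)

def search_shots(pool, val_set, llm_estimate, max_shots=8, iters=50, seed=0):
    rng = random.Random(seed)
    best, best_err = [], mae([], val_set, llm_estimate)   # zero-shot baseline
    for _ in range(iters):
        k = rng.randint(1, min(max_shots, len(pool)))
        cand = rng.sample(pool, k)                 # candidate shot combination
        err = mae(cand, val_set, llm_estimate)
        if err < best_err:                         # keep the best shot set found
            best, best_err = cand, err
    return best, best_err
```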

    Unsupervised Domain Adaptation using Lexical Transformations and Label Injection for Twitter Data

    Domain adaptation is an important and widely studied problem in natural language processing. A large body of literature tries to solve this problem by adapting models trained on the source domain to the target domain. In this paper, we instead solve this problem from a dataset perspective. We modify the source domain dataset with simple lexical transformations to reduce the domain shift between the source dataset distribution and the target dataset distribution. We find that models trained on the transformed source domain dataset perform significantly better than zero-shot models. Using our proposed transformations to convert standard English to tweets, we reach an unsupervised part-of-speech (POS) tagging accuracy of 92.14% (from 81.54% zero-shot accuracy), which is only slightly below the supervised performance of 94.45%. We also use our proposed transformations to synthetically generate tweets and augment the Twitter dataset to achieve state-of-the-art performance for POS tagging. Comment: Accepted at WASSA at ACL 2023
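
    A minimal sketch of such dataset-side lexical transformations, keeping one token per POS tag so the original tag sequence still aligns; the rewrite rules are illustrative assumptions, not the paper's exact transformation list:

```python
# Nudge standard English training sentences toward tweet style before
# training a POS tagger on them.
SLANG = {"you": "u", "are": "r", "your": "ur", "be": "b", "to": "2"}

def tweetify_tokens(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()                 # tweets are largely lower-cased
        out.append(SLANG.get(low, low))   # swap in common informal spellings
    return out

# One transformed training sentence keeps its original tag sequence.
print(tweetify_tokens(["Are", "you", "going", "to", "the", "game", "?"]))
```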

    A Knowledge Adoption Model Based Framework for Finding Helpful User-Generated Contents in Online Communities

    Many online communities allow their members to provide information helpfulness judgments that can be used to guide other users to useful content quickly. However, it is a serious challenge to solicit enough user participation in providing feedback in online communities. Existing studies on assessing the helpfulness of user-generated contents are mainly based on heuristics and lack a unifying theoretical framework. In this article we propose a text classification framework for finding helpful user-generated contents in online knowledge-sharing communities. The objective of our framework is to help a knowledge seeker find helpful information that can be potentially adopted. The framework is built on the Knowledge Adoption Model, which considers both content-based argument quality and information source credibility. We identify six argument quality dimensions and three source credibility dimensions based on information quality and psychological theories. Using data extracted from a popular online community, our empirical evaluations show that all the dimensions improve the performance over a traditional text classification technique that considers word-based lexical features only.
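
    A minimal sketch of the framework's feature-combination step, using scikit-learn as a stand-in classifier; the dimension values, post texts and labels below are toy assumptions for illustration:

```python
# Augment word-based lexical features with numeric argument-quality and
# source-credibility scores before classifying a post as helpful or not.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

posts = ["Detailed answer citing the official docs ...", "try rebooting lol"]
labels = [1, 0]                                   # 1 = helpful, 0 = not helpful
# e.g. columns: accuracy, completeness, timeliness, author reputation
kam_features = np.array([[0.9, 0.8, 0.7, 0.9],
                         [0.2, 0.3, 0.1, 0.4]])

tfidf = TfidfVectorizer()
X_words = tfidf.fit_transform(posts)              # word-based lexical features
X = hstack([X_words, csr_matrix(kam_features)])   # + KAM dimensions

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```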

    Modulating Scalable Gaussian Processes for Expressive Statistical Learning

    For a learning task, a Gaussian process (GP) learns the statistical relationship between inputs and outputs, offering not only the predictive mean but also the associated variability. The vanilla GP, however, struggles to learn complicated distributions with properties such as heteroscedastic noise, multi-modality and non-stationarity from massive data, due to its Gaussian marginals and cubic complexity. To this end, this article studies new scalable GP paradigms including the non-stationary heteroscedastic GP, the mixture of GPs and the latent GP, which introduce additional latent variables to modulate the outputs or inputs in order to learn richer, non-Gaussian statistical representations. We further resort to different variational inference strategies to arrive at analytical or tighter evidence lower bounds (ELBOs) of the marginal likelihood for efficient and effective model training. Extensive numerical experiments against state-of-the-art GP and neural network (NN) counterparts on various tasks verify the superiority of these scalable modulated GPs, especially the scalable latent GP, for learning diverse data distributions. Comment: 31 pages, 9 figures, 4 tables, preprint under review
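
    A minimal sketch of the output-modulation idea in its simplest form: an exact GP whose noise variance varies with the input, rather than the scalable variational models developed in the paper; the kernel, data and noise schedule are assumptions for illustration:

```python
# Toy heteroscedastic GP regression: a per-point noise variance "modulates"
# the outputs instead of a single homoscedastic noise level.
import numpy as np

def rbf(X1, X2, ls=1.0, var=1.0):
    d = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return var * np.exp(-0.5 * d / ls**2)

def hetero_gp_predict(X, y, Xs, noise_var):
    # noise_var: one noise variance per training point (the modulation).
    K = rbf(X, X) + np.diag(noise_var)
    Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    return mean, np.diag(Kss - v.T @ v)   # predictive mean and variance

X = np.linspace(0, 5, 40)[:, None]
scale = 0.05 + 0.2 * X[:, 0] / 5                     # input-dependent noise scale
y = np.sin(X[:, 0]) + np.random.default_rng(0).normal(0, scale)
mu, var = hetero_gp_predict(X, y, np.array([[2.5]]), scale**2)
print(mu, var)
```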

    AIR-JPMC@SMM4H'22: Classifying Self-Reported Intimate Partner Violence in Tweets with Multiple BERT-based Models

    This paper presents our submission for the SMM4H 2022 Shared Task on the classification of self-reported intimate partner violence on Twitter (in English). The goal of this task was to accurately determine whether the contents of a given tweet demonstrated someone reporting their own experience with intimate partner violence. The submitted system is an ensemble of five RoBERTa models, each weighted by its F1-score on the validation dataset. This system performed 13% better than the baseline and was the best-performing system overall for this shared task.
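
    A minimal sketch of the F1-weighted ensembling step; the probabilities and validation scores below are made-up placeholders, not outputs of the actual fine-tuned RoBERTa models:

```python
# Combine per-model class probabilities with weights proportional to each
# model's validation F1-score, then take the argmax as the final label.
import numpy as np

# Shape (n_models, n_tweets, n_classes): softmax outputs from 5 models.
probs = np.random.default_rng(0).dirichlet(np.ones(2), size=(5, 4))
val_f1 = np.array([0.81, 0.78, 0.84, 0.80, 0.79])

weights = val_f1 / val_f1.sum()                    # normalise F1 weights
ensemble = np.einsum("m,mtc->tc", weights, probs)  # weighted average
pred = ensemble.argmax(axis=1)                     # final IPV / not-IPV label
print(pred)
```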