38 research outputs found

    Data Sets: Word Embeddings Learned from Tweets and General Data

    A word embedding is a low-dimensional, dense, real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy, and have unique lexical and semantic features that differ from those of other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from tweet data alone, we also built embedding sets from general data and from the combination of tweets with general data. The general data consist of news articles, Wikipedia data, and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words of general text. We also present two experiments demonstrating how to use the data sets in NLP tasks such as tweet sentiment analysis and tweet topic classification.
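    The basic use of such embedding sets can be sketched as a lookup plus cosine similarity. This is a minimal illustration with a hypothetical three-dimensional toy table, not the actual data sets from the paper (which are learned from ~400 million tweets and 7 billion words of general text):

```python
import math

# Hypothetical miniature embedding table; the paper's data sets map each
# word to a low-dimensional, dense, real-valued vector like these.
embeddings = {
    "happy": [0.8, 0.1, 0.3],
    "glad":  [0.7, 0.2, 0.4],
    "table": [0.1, 0.9, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically related words should score higher than unrelated ones.
sim_related = cosine(embeddings["happy"], embeddings["glad"])
sim_unrelated = cosine(embeddings["happy"], embeddings["table"])
```

    In a downstream task such as tweet sentiment analysis, these vectors would typically serve as input features in place of one-hot word indicators.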

    Unsupervised Domain Adaptation using Lexical Transformations and Label Injection for Twitter Data

    Domain adaptation is an important and widely studied problem in natural language processing. A large body of literature tries to solve this problem by adapting models trained on the source domain to the target domain. In this paper, we instead solve this problem from a dataset perspective. We modify the source domain dataset with simple lexical transformations to reduce the domain shift between the source dataset distribution and the target dataset distribution. We find that models trained on the transformed source domain dataset perform significantly better than zero-shot models. Using our proposed transformations to convert standard English to tweets, we reach an unsupervised part-of-speech (POS) tagging accuracy of 92.14% (from 81.54% zero-shot accuracy), which is only slightly below the supervised performance of 94.45%. We also use our proposed transformations to synthetically generate tweets and augment the Twitter dataset to achieve state-of-the-art performance for POS tagging. Comment: Accepted at WASSA at ACL 202
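    The kind of rule-based rewriting described above can be sketched as a small set of regex substitutions. The specific rules below are illustrative assumptions, not the paper's actual transformation set:

```python
import re

# Hypothetical lexical rewrite rules mapping standard English toward
# tweet-style text, to shrink the source/target domain gap.
RULES = [
    (r"\byou\b", "u"),   # common tweet shorthand
    (r"\bare\b", "r"),
    (r"\bto\b", "2"),
]

def to_tweet_style(sentence: str) -> str:
    """Apply lexical rewrite rules to a standard-English sentence."""
    out = sentence.lower()
    for pattern, repl in RULES:
        out = re.sub(pattern, repl, out)
    return out

print(to_tweet_style("Are you going to the game?"))  # r u going 2 the game?
```

    Training data transformed this way keeps its original POS annotations, which is what makes the approach usable for unsupervised tagging of the target domain.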

    A Knowledge Adoption Model Based Framework for Finding Helpful User-Generated Contents in Online Communities

    Many online communities allow their members to provide information helpfulness judgments that can be used to guide other users to useful content quickly. However, it is a serious challenge to solicit enough user participation in providing such feedback in online communities. Existing studies on assessing the helpfulness of user-generated contents are mainly based on heuristics and lack a unifying theoretical framework. In this article we propose a text classification framework for finding helpful user-generated contents in online knowledge-sharing communities. The objective of our framework is to help a knowledge seeker find helpful information that can be potentially adopted. The framework is built on the Knowledge Adoption Model, which considers both content-based argument quality and information source credibility. We identify 6 argument quality dimensions and 3 source credibility dimensions based on information quality and psychological theories. Using data extracted from a popular online community, our empirical evaluations show that all the dimensions improve the performance over a traditional text classification technique that considers word-based lexical features only.
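    The feature view behind such a framework can be sketched as augmenting a lexical feature vector with argument-quality and source-credibility scores. The dimension names and the post schema below are illustrative assumptions, not the paper's actual 6+3 dimensions:

```python
# Sketch of a Knowledge-Adoption-Model feature builder: a classifier sees
# content-quality and source-credibility signals alongside lexical features.
def kam_features(post: dict) -> dict:
    """Combine a lexical baseline with KAM-style dimension scores.
    Missing dimensions default to 0.0 (neutral)."""
    text = post["text"]
    return {
        # lexical baseline (word count stands in for a bag-of-words model)
        "length": len(text.split()),
        # hypothetical argument-quality dimensions
        "timeliness": post.get("timeliness", 0.0),
        "completeness": post.get("completeness", 0.0),
        # hypothetical source-credibility dimension
        "author_reputation": post.get("author_reputation", 0.0),
    }

features = kam_features({"text": "try reinstalling the driver first",
                         "author_reputation": 0.9})
```

    A downstream classifier trained on such vectors is what the evaluation compares against the lexical-features-only baseline.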

    Modulating Scalable Gaussian Processes for Expressive Statistical Learning

    For a learning task, a Gaussian process (GP) learns the statistical relationship between inputs and outputs, offering not only the prediction mean but also the associated variability. The vanilla GP, however, struggles to learn complicated distributions with properties such as heteroscedastic noise, multi-modality, and non-stationarity from massive data, due to its Gaussian marginal and cubic complexity. To this end, this article studies new scalable GP paradigms, including the non-stationary heteroscedastic GP, the mixture of GPs, and the latent GP, which introduce additional latent variables to modulate the outputs or inputs in order to learn richer, non-Gaussian statistical representations. We further resort to different variational inference strategies to arrive at analytical or tighter evidence lower bounds (ELBOs) on the marginal likelihood for efficient and effective model training. Extensive numerical experiments against state-of-the-art GP and neural network (NN) counterparts on various tasks verify the superiority of these scalable modulated GPs, especially the scalable latent GP, for learning diverse data distributions. Comment: 31 pages, 9 figures, 4 tables, preprint under review
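    The output-modulation idea can be illustrated with a toy heteroscedastic model y ~ N(f(x), g(x)^2), where a second latent function g scales the observation noise with the input. The functions below are fixed stand-ins, not real GP posteriors:

```python
import math
import random

def f(x):
    """Mean function (stand-in for a GP posterior mean)."""
    return math.sin(x)

def g(x):
    """Latent noise-scale function: variance grows with |x|,
    giving input-dependent (heteroscedastic) noise."""
    return 0.1 + 0.5 * abs(x)

def sample_y(x, rng):
    """Draw y ~ N(f(x), g(x)^2)."""
    return f(x) + g(x) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
ys_near = [sample_y(0.1, rng) for _ in range(2000)]  # low-noise region
ys_far = [sample_y(3.0, rng) for _ in range(2000)]   # high-noise region
```

    A vanilla GP with a single constant noise parameter cannot capture this spread pattern, which is why the paper attaches latent modulating variables and trains them through variational ELBOs.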

    AIR-JPMC@SMM4H'22: Classifying Self-Reported Intimate Partner Violence in Tweets with Multiple BERT-based Models

    This paper presents our submission for the SMM4H 2022 Shared Task on the classification of self-reported intimate partner violence on Twitter (in English). The goal of this task was to accurately determine whether the contents of a given tweet demonstrated someone reporting their own experience with intimate partner violence. The submitted system is an ensemble of five RoBERTa models, each weighted by its respective F1-score on the validation dataset. This system performed 13% better than the baseline and was the best-performing system overall for this shared task.
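    An F1-weighted ensemble of this kind can be sketched as a weighted average of per-model probabilities. The probabilities and F1 weights below are hypothetical stand-ins, not outputs of the actual RoBERTa models:

```python
def weighted_vote(probs, f1_scores, threshold=0.5):
    """Combine per-model positive-class probabilities, weighting each
    model by its validation F1-score; return the binary ensemble label."""
    total = sum(f1_scores)
    score = sum(p * w for p, w in zip(probs, f1_scores)) / total
    return int(score >= threshold)

# Five hypothetical model probabilities and validation F1 weights.
probs = [0.9, 0.8, 0.4, 0.7, 0.6]
f1s = [0.85, 0.82, 0.60, 0.80, 0.78]
label = weighted_vote(probs, f1s)  # 1: the ensemble flags the tweet
```

    Weighting by validation F1 lets stronger members dominate the vote without discarding the weaker ones entirely.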