701 research outputs found
Sentiment analysis in SemEval: a review of sentiment identification approaches
ocial media platforms are becoming the foundations of social interactions including messaging and opinion expression. In this regard, sentiment analysis techniques focus on providing solutions to ensure the retrieval and analysis of generated data including sentiments, emotions, and discussed topics. International competitions such as the International Workshop on Semantic Evaluation (SemEval) have attracted many researchers and practitioners with a special research interest in building sentiment analysis systems. In our work, we study top-ranking systems for each SemEval edition during the 2013-2021 period, a total of 658 teams participated in these editions with increasing interest over years. We analyze the proposed systems marking the evolution of research trends with a focus on the main components of sentiment analysis systems including data acquisition, preprocessing, and classification. Our study shows an active use of preprocessing techniques, an evolution of features engineering and word representation from lexicon-based approaches to word embeddings, and the dominance of neural networks and transformers over the classification phasefostering the use of ready-to-use models. Moreover, we provide researchers with insights based on experimented systems which will allow rapid prototyping of new systems and help practitioners build for future SemEval editions
URL-BERT: Training Webpage Representations via Social Media Engagements
Understanding and representing webpages is crucial to online social networks
where users may share and engage with URLs. Common language model (LM) encoders
such as BERT can be used to understand and represent the textual content of
webpages. However, these representations may not model thematic information of
web domains and URLs or accurately capture their appeal to social media users.
In this work, we introduce a new pre-training objective that can be used to
adapt LMs to understand URLs and webpages. Our proposed framework consists of
two steps: (1) scalable graph embeddings to learn shallow representations of
URLs based on user engagement on social media and (2) a contrastive objective
that aligns LM representations with the aforementioned graph-based
representation. We apply our framework to the multilingual version of BERT to
obtain the model URL-BERT. We experimentally demonstrate that our continued
pre-training approach improves webpage understanding on a variety of tasks and
Twitter internal and external benchmarks
ORCA: A Challenging Benchmark for Arabic Language Understanding
Due to their crucial role in all NLP, several benchmarks have been proposed
to evaluate pretrained language models. In spite of these efforts, no public
benchmark of diverse nature currently exists for evaluation of Arabic. This
makes it challenging to measure progress for both Arabic and multilingual
language models. This challenge is compounded by the fact that any benchmark
targeting Arabic needs to take into account the fact that Arabic is not a
single language but rather a collection of languages and varieties. In this
work, we introduce ORCA, a publicly available benchmark for Arabic language
understanding evaluation. ORCA is carefully constructed to cover diverse Arabic
varieties and a wide range of challenging Arabic understanding tasks exploiting
60 different datasets across seven NLU task clusters. To measure current
progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between
18 multilingual and Arabic language models. We also provide a public
leaderboard with a unified single-number evaluation metric (ORCA score) to
facilitate future research.Comment: All authors contributed equally. Accepted at ACL 2023, Toronto,
Canad
A Survey on Awesome Korean NLP Datasets
English based datasets are commonly available from Kaggle, GitHub, or
recently published papers. Although benchmark tests with English datasets are
sufficient to show off the performances of new models and methods, still a
researcher need to train and validate the models on Korean based datasets to
produce a technology or product, suitable for Korean processing. This paper
introduces 15 popular Korean based NLP datasets with summarized details such as
volume, license, repositories, and other research results inspired by the
datasets. Also, I provide high-resolution instructions with sample or
statistics of datasets. The main characteristics of datasets are presented on a
single table to provide a rapid summarization of datasets for researchers.Comment: 11 pages, 1 horizontal page for large tabl
GLM-130B: An Open Bilingual Pre-trained Model
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language
model with 130 billion parameters. It is an attempt to open-source a 100B-scale
model at least as good as GPT-3 (davinci) and unveil how models of such a scale
can be successfully pre-trained. Over the course of this effort, we face
numerous unexpected technical and engineering challenges, particularly on loss
spikes and divergence. In this paper, we introduce the training process of
GLM-130B including its design choices, training strategies for both efficiency
and stability, and engineering efforts. The resultant GLM-130B model offers
significant outperformance over GPT-3 175B (davinci) on a wide range of popular
English benchmarks while the performance advantage is not observed in OPT-175B
and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN
3.0 260B -- the largest Chinese language model -- across related benchmarks.
Finally, we leverage a unique scaling property of GLM-130B to reach INT4
quantization without post training, with almost no performance loss, making it
the first among 100B-scale models and more importantly, allowing its effective
inference on 4RTX 3090 (24G) or 8RTX 2080 Ti (11G) GPUs, the
most affordable GPUs required for using 100B-scale models. The GLM-130B model
weights are publicly accessible and its code, training logs, related toolkit,
and lessons learned are open-sourced at
\url{https://github.com/THUDM/GLM-130B/}.Comment: Accepted to ICLR 202
- …