3,399 research outputs found
Impact of Tokenization on Language Models: An Analysis for Turkish
Tokenization is an important text preprocessing step to prepare input tokens
for deep language models. WordPiece and BPE are de facto methods employed by
important models, such as BERT and GPT. However, the impact of tokenization can
be different for morphologically rich languages, such as Turkic languages,
where many words can be generated by adding prefixes and suffixes. We compare
five tokenizers at different granularity levels, i.e., their outputs vary from
the smallest pieces (characters) to the surface forms of words, and include a
Morphological-level tokenizer. We train these tokenizers and pretrain
medium-sized language models using RoBERTa pretraining procedure on the Turkish
split of the OSCAR corpus. We then fine-tune our models on six downstream
tasks. Our experiments, supported by statistical tests, reveal that the
Morphological-level tokenizer performs competitively with the de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves
the performance of Morphological and Word-level tokenizers more than that of de
facto tokenizers. The ratio of the number of vocabulary parameters to the total
number of model parameters can be empirically chosen as 20% for de facto
tokenizers and 40% for other tokenizers to obtain a reasonable trade-off
between model size and performance.
Comment: submitted to ACM TALLI
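As a rough illustration of these granularity levels and of the vocabulary-to-model parameter ratio, consider the following sketch. The Turkish word, its morpheme segmentation, and the model sizes below are illustrative assumptions, not the paper's data:

```python
# Illustrative sketch (not the paper's code): how tokenizer granularity
# changes the token sequence for a morphologically rich Turkish word,
# and how the vocabulary-to-model parameter ratio can be estimated.
# The morpheme segmentation below is a hand-written example.

word = "evlerimizden"  # "from our houses"

char_level = list(word)                      # finest granularity
morph_level = ["ev", "ler", "imiz", "den"]   # stem + suffixes (hypothetical output)
word_level = [word]                          # surface form, coarsest granularity

print(char_level)   # ['e', 'v', 'l', ...]
print(morph_level)
print(word_level)

# Vocabulary parameters vs. total model parameters: the embedding table
# alone contributes vocab_size * hidden_size parameters.
def vocab_param_ratio(vocab_size, hidden_size, total_params):
    return vocab_size * hidden_size / total_params

# e.g. a hypothetical medium-sized model with ~84M parameters and
# 768-dimensional embeddings:
ratio = vocab_param_ratio(vocab_size=22_000, hidden_size=768, total_params=84_000_000)
print(f"{ratio:.0%}")  # roughly the 20% regime suggested for de facto tokenizers
```

Under these assumed sizes, hitting the 40% regime suggested for Morphological and Word-level tokenizers would mean roughly doubling the vocabulary.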
Preparation of Improved Turkish DataSet for Sentiment Analysis in Social Media
A public dataset, with a variety of properties suitable for sentiment
analysis [1], event prediction, trend detection and other text mining
applications, is needed in order to be able to successfully perform analysis
studies. The vast majority of data on social media is text-based, and it is
not possible to apply machine learning algorithms directly to these raw data,
since several preprocessing steps are required before the algorithms can be
run. For example, different misspellings of the same word enlarge the word
vector space unnecessarily, which reduces the success of the algorithm and
increases the computational power requirement.
This paper presents an improved Turkish dataset with an effective spelling
correction algorithm based on Hadoop [2]. The collected data is recorded on the
Hadoop Distributed File System and the text based data is processed by
MapReduce programming model. This method is suitable for the storage and
processing of large sized text based social media data. In this study, movie
reviews have been automatically recorded with Apache ManifoldCF (MCF) [3] and
data clusters have been created. Various methods, such as Levenshtein distance
and fuzzy string matching, have been compared to create a public dataset from
the collected data. Experimental results show that the proposed algorithm
detects and corrects spelling errors successfully, and that the resulting
dataset can be used as an open-source resource in sentiment analysis studies.
Comment: Presented at CMES201
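The kind of edit-distance-based correction the abstract describes can be sketched as follows. The dictionary and misspelling are invented examples, and this single-machine sketch stands in for, rather than reproduces, the paper's Hadoop/MapReduce implementation:

```python
# Minimal sketch of Levenshtein-based spelling correction: replace a word
# with its nearest dictionary entry by edit distance.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(word: str, dictionary: list[str]) -> str:
    """Map a (possibly misspelled) word to its closest dictionary entry."""
    return min(dictionary, key=lambda w: levenshtein(word, w))

dictionary = ["güzel", "film", "harika"]   # made-up Turkish vocabulary
print(correct("guzel", dictionary))        # -> "güzel"
```

In a MapReduce setting, a correction like this would typically run inside the map phase, normalizing each token before the downstream counting or feature-extraction steps.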
Semantic Sentiment Analysis of Twitter Data
The Internet and the proliferation of smart mobile devices have changed the way
information is created, shared, and spreads, e.g., microblogs such as Twitter,
weblogs such as LiveJournal, social networks such as Facebook, and instant
messengers such as Skype and WhatsApp are now commonly used to share thoughts
and opinions about anything in the surrounding world. This has resulted in the
proliferation of social media content, thus creating new opportunities to study
public opinion at a scale that was never possible before. Naturally, this
abundance of data has quickly attracted business and research interest from
various fields including marketing, political science, and social studies,
among many others, which are interested in questions like these: Do people like
the new Apple Watch? Do Americans support ObamaCare? How do the Scottish feel
about Brexit? Answering these questions requires studying the sentiment of
opinions people express in social media, which has given rise to the fast
growth of the field of sentiment analysis in social media, with Twitter being
especially popular for research due to its scale, representativeness, variety
of topics discussed, as well as ease of public access to its messages. Here we
present an overview of work on sentiment analysis on Twitter.
Comment: Microblog sentiment analysis; Twitter opinion mining; in the
Encyclopedia of Social Network Analysis and Mining (ESNAM), Second edition.
201
Digital watermarking: a state-of-the-art review
Digital watermarking is the art of embedding data, called a
watermark, into a multimedia object such that the watermark can be detected or
extracted later without impairing the object. The concealment of secret
messages inside natural language, known as steganography, has existed since at
least the 16th century. However, the increase in electronic/digital
information transmission and
distribution has resulted in the spread of watermarking from ordinary text to
multimedia transmission. In this paper, we review various approaches and methods
that have been used to conceal and preserve messages. Examples of real-world
applications are also discussed.
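The core idea of embedding a mark without impairing the host object can be illustrated with a least-significant-bit (LSB) sketch. This is a deliberately simplified toy, not one of the schemes reviewed in the paper, and the "pixel" buffer is made up:

```python
# Toy LSB watermarking: hide one watermark bit in the least-significant
# bit of each byte of a carrier (e.g. raw pixel values), then read it back.

def embed(carrier: bytes, bits: list[int]) -> bytes:
    """Overwrite the LSB of the first len(bits) bytes with watermark bits."""
    assert len(bits) <= len(carrier)
    out = bytearray(carrier)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit   # clear the LSB, then set it
    return bytes(out)

def extract(carrier: bytes, n: int) -> list[int]:
    """Recover the first n watermark bits from the LSBs."""
    return [b & 1 for b in carrier[:n]]

pixels = bytes([200, 113, 54, 91, 168, 37, 240, 15])  # fake image data
mark = [1, 0, 1, 1]
stamped = embed(pixels, mark)
print(extract(stamped, 4))   # -> [1, 0, 1, 1]

# Each byte changes by at most 1, so the host signal is barely altered:
print(max(abs(a - b) for a, b in zip(pixels, stamped)))  # -> 1
```

Real watermarking schemes are far more elaborate (transform-domain embedding, robustness to compression and cropping), but the embed/extract contract is the same.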
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
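The first of the three matrix classes, the term-document matrix, can be sketched in a few lines; the toy corpus below is invented for illustration:

```python
# Toy term-document VSM: rows are terms, columns are documents, and
# document similarity is the cosine between matrix columns.
from math import sqrt

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vocab = sorted({w for d in docs for w in d.split()})

# entries = raw term frequency of each term in each document
matrix = [[d.split().count(term) for d in docs] for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

col = lambda j: [row[j] for row in matrix]   # extract one document vector
print(round(cosine(col(0), col(1)), 2))  # docs 1 and 2 share most words
print(round(cosine(col(0), col(2)), 2))  # doc 3 shares no exact words
```

Real systems weight the raw frequencies (e.g. tf-idf) and often reduce dimensionality, but the term-document structure is the same; word-context and pair-pattern matrices differ only in what the rows and columns denote.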
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
We present the results and main findings of SemEval-2020 Task 12 on
Multilingual Offensive Language Identification in Social Media (OffensEval
2020). The task involves three subtasks corresponding to the hierarchical
taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The
task featured five languages: English, Arabic, Danish, Greek, and Turkish for
Subtask A. In addition, English also featured Subtasks B and C. OffensEval 2020
was one of the most popular tasks at SemEval-2020 attracting a large number of
participants across all subtasks and also across all languages. A total of 528
teams signed up to participate in the task, 145 teams submitted systems during
the evaluation period, and 70 submitted system description papers.
Comment: Proceedings of the International Workshop on Semantic Evaluation
(SemEval-2020