Sentiment Analysis for micro-blogging platforms in Arabic
Sentiment Analysis (SA) concerns the automatic extraction and classification of
sentiments conveyed in a given text, i.e. labelling a text instance as positive, negative
or neutral. SA research has attracted increasing interest in the past few years due
to its numerous real-world applications. The recent interest in SA is also fuelled
by the growing popularity of social media platforms (e.g. Twitter), as they provide
large amounts of freely available and highly subjective content that can be readily
crawled.
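The polarity-labelling task described above can be illustrated with a minimal lexicon-based sketch. This is purely illustrative (the word lists are invented for the example); the thesis itself trains statistical classifiers rather than using a fixed lexicon.

```python
# Minimal lexicon-based sentiment labelling sketch (illustrative only;
# the thesis trains classifiers rather than using a fixed word list).
POSITIVE = {"great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "hate", "terrible"}

def label_sentiment(text: str) -> str:
    """Label a text instance as positive, negative, or neutral."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```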
Most previous SA work has focused on English with considerable success. In
this work, we focus on studying SA in Arabic, as a less-resourced language. This
work reports on a wide set of investigations for SA in Arabic tweets, systematically
comparing three existing approaches that have been shown successful in English.
Specifically, we report experiments evaluating fully-supervised (SL), distant-supervision-based (DS), and machine-translation-based (MT) approaches for SA.
The investigations cover training SA models on manually-labelled (i.e. in SL methods)
and automatically-labelled (i.e. in DS methods) data-sets. In addition, we
explored an MT-based approach that utilises existing off-the-shelf SA systems for
English with no need for training data, assessing the impact of translation errors on
the performance of SA models, which has not been previously addressed for Arabic
tweets. Unlike previous work, we benchmark the trained models against an independent
test-set of >3.5k instances collected at different points in time to account
for topic-shift issues in the Twitter stream. Despite the challenging, noisy medium
of Twitter and the mixed use of Dialectal and Standard forms of Arabic, we show
that our SA systems are able to attain performance scores on Arabic tweets that
are comparable to the state-of-the-art SA systems for English tweets.
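The distant-supervision (DS) approach mentioned above typically auto-labels tweets using surface markers such as emoticons as noisy sentiment labels, so no manual annotation is needed. A hedged sketch of that labelling step (the marker lists are illustrative, not the thesis's exact heuristics):

```python
# Distant-supervision labelling sketch: emoticons act as noisy sentiment
# labels, so training data can be collected without manual annotation.
# The marker sets are illustrative examples only.
POS_MARKERS = (":)", ":-)", ":D")
NEG_MARKERS = (":(", ":-(", ":'(")

def distant_label(tweet: str):
    has_pos = any(m in tweet for m in POS_MARKERS)
    has_neg = any(m in tweet for m in NEG_MARKERS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # ambiguous or unmarked -> discarded from training

def build_training_set(tweets):
    labelled = []
    for t in tweets:
        y = distant_label(t)
        if y is not None:
            # strip the marker so the classifier cannot simply memorise it
            x = t
            for m in POS_MARKERS + NEG_MARKERS:
                x = x.replace(m, "")
            labelled.append((x.strip(), y))
    return labelled
```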
The thesis also investigates the role of a wide set of features, including syntactic,
semantic, morphological, language-style and Twitter-specific features. We introduce
a set of affective-cues/social-signals features that capture information about the
presence of contextual cues (e.g. prayers, laughter, etc.) to correlate them with the
sentiment conveyed in an instance. Our investigations reveal a generally positive
impact for utilising these features for SA in Arabic. Specifically, we show that a rich set of morphological features, not previously used for this task and extracted using a publicly-available morphological analyser for Arabic, can significantly improve the performance of SA classifiers. We also demonstrate the usefulness of language-independent features (e.g. Twitter-specific ones) for SA. Our feature-sets outperform results reported in previous work on a previously built data-set.
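Language-independent, Twitter-specific features of the kind discussed here are simple to extract. A minimal sketch, assuming an illustrative feature set (the thesis's actual features are richer):

```python
import re

# Sketch of Twitter-specific, language-independent features of the kind
# reported as useful for SA (this exact feature set is illustrative).
def twitter_features(tweet: str) -> dict:
    return {
        "n_hashtags": tweet.count("#"),
        "n_mentions": tweet.count("@"),
        "has_url": int("http" in tweet),
        "n_exclaim": tweet.count("!"),
        # character elongation, e.g. "sooo", common in tweets
        "has_elongation": int(bool(re.search(r"(.)\1{2,}", tweet))),
    }
```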
Arabic Educational Neural Network Chatbot
Chatbots (machine-based conversational systems) have grown in popularity in recent years. Chatbots powered by artificial intelligence (AI) are sophisticated technologies that replicate human communication in a range of natural languages. A chatbot's primary purpose is to interpret user inquiries and give relevant, contextual responses. Chatbot success has been extensively reported in a number of widely spoken languages; nonetheless, chatbots have not yet reached the predicted degree of success in Arabic. In recent years, several academics have worked to solve the challenges of creating Arabic chatbots. Furthermore, the development of Arabic chatbots is critical to our attempts to increase the use of the language in academic contexts. Our objective is to design and deploy an Arabic chatbot that supports the Arabic language in the area of education. To begin implementing the chatbot, we collected datasets from Arabic educational websites and preprocessed the data using NLP methods. We then used this data to train the system with a neural network model, creating an Arabic neural-network chatbot. Furthermore, we surveyed relevant research and compared earlier findings by searching Google Scholar and following the linked references. The data were gathered and saved in a JSON file. Finally, we programmed the chatbot and the models in Python. As a result, the Arabic chatbot answers questions about educational regulations in the United Arab Emirates.
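The pipeline described above (intent data stored in a JSON file, a model trained to answer queries) can be sketched minimally. The intents, tags, and responses below are invented for illustration, and simple token overlap stands in for the neural classifier the abstract describes:

```python
import json

# Sketch of an intent-style chatbot pipeline: training pairs live in a
# JSON file and user queries are matched to an intent.  A real system
# would train a neural classifier; token overlap stands in for it here.
# All tags, patterns, and responses are illustrative.
INTENTS_JSON = """
{"intents": [
  {"tag": "admission",
   "patterns": ["how do I apply", "admission requirements"],
   "response": "Applications are submitted through the university portal."},
  {"tag": "fees",
   "patterns": ["tuition fees", "how much does it cost"],
   "response": "Fee schedules are published each academic year."}
]}
"""

def answer(query: str) -> str:
    intents = json.loads(INTENTS_JSON)["intents"]
    q = set(query.lower().split())
    # pick the intent whose best pattern shares the most tokens with the query
    best = max(intents,
               key=lambda it: max(len(q & set(p.lower().split()))
                                  for p in it["patterns"]))
    return best["response"]
```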
Satirical News Detection and Analysis using Attention Mechanism and Linguistic Features
Satirical news is considered to be entertainment, but it is potentially
deceptive and harmful. Although the satirical genre is embedded in the article,
not everyone can recognize the satirical cues, and some readers therefore
believe the news to be true.
We observe that satirical cues are often reflected in certain paragraphs rather
than the whole document. Existing work considers only document-level features
to detect satire, which can be limiting. We consider paragraph-level
linguistic features to unveil the satire by incorporating a neural network and an
attention mechanism. We investigate the difference between paragraph-level
features and document-level features, and analyze them on a large satirical
news dataset. The evaluation shows that the proposed model detects satirical
news effectively and reveals what features are important at which level.Comment: EMNLP 2017, 11 page
NewsPercept: A Simple Data Science Pipeline for Online News Perception Mining
This master's thesis presents research in which sentiment analysis is used to extract knowledge about a company or organisation from a news article. Furthermore, it presents an approach to identifying the organisations by name and by the number of mentions in the text. (Master's thesis in Information Science.)
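The organisation-identification step (matching names and counting mentions) might be sketched as follows. The organisation names and the exact-match rule are illustrative assumptions; a real pipeline would likely use named-entity recognition:

```python
from collections import Counter

# Sketch of the mention-counting step: organisations are identified by
# name and ranked by how often they appear.  The name list and the
# exact string match are illustrative; a real pipeline would use NER.
KNOWN_ORGS = ["Equinor", "Telenor", "DNB"]

def count_org_mentions(articles):
    counts = Counter()
    for text in articles:
        for org in KNOWN_ORGS:
            counts[org] += text.count(org)
    return counts
```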
Exploiting word embeddings for modeling bilexical relations
There has been an exponential surge of text data in recent years. As a consequence, unsupervised methods that make use of this data have been steadily growing in the field of natural language processing (NLP). Word embeddings are low-dimensional vectors obtained by applying unsupervised techniques to large unlabelled corpora, mapping words from the vocabulary to vectors of real numbers. Word embeddings aim to capture syntactic and semantic properties of words.
In NLP, many tasks involve computing the compatibility between lexical items under some linguistic relation. We call this type of relation a bilexical relation. Our thesis defines statistical models for bilexical relations
that centrally make use of word embeddings. Our principal aim is that the word embeddings will favor generalization to words not seen during the training of the model.
The thesis is structured in four parts. In the first part of this thesis, we present a bilinear model over word embeddings that leverages a small supervised dataset for a binary linguistic relation. Our learning algorithm exploits low-rank bilinear forms and induces a low-dimensional embedding tailored for a target linguistic relation. This results in compressed task-specific embeddings.
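The low-rank bilinear form described above scores a word pair (x, y) as s = xᵀWy with W = UVᵀ of rank k ≪ d, so each word is effectively projected to a small task-specific embedding before the dot product. A sketch under illustrative dimensions, with random parameters standing in for the learned ones:

```python
import numpy as np

# Low-rank bilinear scoring sketch: compatibility of a word pair (x, y)
# under a relation is s = x^T W y with W = U V^T of rank k << d.  The
# projections x^T U and V^T y are the compressed task-specific embeddings.
# Dimensions and random parameters here are illustrative.
rng = np.random.default_rng(1)
d, k = 50, 5                       # embedding dimension, bilinear rank
U = rng.normal(size=(d, k))
V = rng.normal(size=(d, k))

def score(x, y):
    # (x^T U)(V^T y): project both words to k dims, then take a dot product
    return float((x @ U) @ (V.T @ y))

x, y = rng.normal(size=d), rng.normal(size=d)
```

Factoring W this way means scoring costs O(dk) per pair instead of O(d²), and the learned projections can be reused as compact relation-specific embeddings.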
In the second part of our thesis, we extend our bilinear model to a ternary
setting and propose a framework for resolving prepositional phrase attachment ambiguity using word embeddings. Our models perform competitively with state-of-the-art models. In addition, our method obtains significant improvements on out-of-domain tests by simply using word-embeddings induced from source and target domains.
In the third part of this thesis, we further extend the bilinear models for expanding vocabulary in the context of statistical phrase-based machine translation. Our model obtains a probabilistic list of possible translations of target language words, given a word in the source language. We do this by projecting pre-trained embeddings into a common subspace using a log-bilinear model. We empirically notice a significant improvement on an out-of-domain test set.
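The cross-lingual projection step described here can be sketched as mapping source-language embeddings into the target space with a learned matrix and scoring translation candidates with a softmax over similarities. The vocabulary, dimensions, and random projection below are illustrative; in the thesis the projection is fitted with a log-bilinear model:

```python
import numpy as np

# Sketch of cross-lingual embedding projection for vocabulary expansion:
# a source embedding is mapped into the target space by a matrix M, and
# translation candidates are scored with a softmax over similarities.
# M is random here; the thesis fits it with a log-bilinear model.
rng = np.random.default_rng(2)
d = 20
M = rng.normal(size=(d, d))                  # source -> target projection
target_vocab = ["casa", "gato", "perro"]     # illustrative target words
E_target = rng.normal(size=(3, d))           # their target-space embeddings

def translation_probs(src_vec):
    """Probabilistic list of candidate translations for one source word."""
    sims = E_target @ (M @ src_vec)          # similarity to each candidate
    p = np.exp(sims - sims.max())
    p /= p.sum()                             # softmax -> probabilities
    return dict(zip(target_vocab, p))
```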
In the final part of our thesis, we propose a non-linear model that maps initial word embeddings to task-tuned word embeddings, in the context of a neural network dependency parser. We demonstrate its use for improved dependency parsing, especially for sentences with unseen words. We also show downstream improvements on a sentiment analysis task.
When linguistics meets web technologies. Recent advances in modelling linguistic linked data
This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds upon and complements previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital humanities with LLD. Next, we give an overview of some of the best-known vocabularies and models in LLD. After this we look at some of the latest developments in community standards and initiatives, such as OntoLex-Lemon, as well as recent work which has been carried out on corpora and annotation in LLD, including a discussion of the LLD metadata vocabularies META-SHARE and lime, and of language identifiers. In the following part of the paper we look at work which has been realised in a number of recent projects and which has a significant impact on LLD vocabularies and models.