10 research outputs found
Detecting and ordering adjectival scalemates
This paper presents a pattern-based method that can be used to infer
adjectival scales from a corpus. Specifically,
the proposed method uses lexical patterns to automatically identify and order
pairs of scalemates, followed by a filtering phase in which unrelated pairs are
discarded. For the filtering phase, several different similarity measures are
implemented and compared. The model presented in this paper is evaluated using
the current standard, along with a novel evaluation set, and shown to be at
least as good as the current state-of-the-art. Comment: Paper presented at MAPLEX 2015, February 9-10, Yamagata, Japan
(http://lang.cs.tut.ac.jp/maplex2015/)
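The pattern-based identification and ordering step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the three lexical patterns, the names `PATTERNS` and `ordered_pairs`, and the sample sentences are all assumptions, and the filtering phase with similarity measures is not shown. Each pattern is assumed to capture a (weaker, stronger) pair; nothing here checks that the captured words are adjectives.

```python
import re
from collections import Counter

# Hypothetical lexical patterns (not taken from the paper).
# Each one captures a pair in the order (weaker, stronger).
PATTERNS = [
    re.compile(r"\b(\w+),? but not (\w+)\b"),
    re.compile(r"\b(\w+),? if not (\w+)\b"),
    re.compile(r"\bnot just (\w+) but (\w+)\b"),
]

def ordered_pairs(sentences):
    """Count how often each (weaker, stronger) pair is matched by a pattern."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for pattern in PATTERNS:
            for weaker, stronger in pattern.findall(lowered):
                counts[(weaker, stronger)] += 1
    return counts

# Hypothetical corpus
corpus = [
    "The soup was warm but not hot.",
    "It was good, if not great.",
    "The tea is warm, but not hot.",
]
for pair, count in ordered_pairs(corpus).items():
    print(pair, count)
```

Pairs with higher counts are stronger candidates for scalemates; a later filtering step would still be needed to discard unrelated pairs.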
“Was it good? It was provocative.” Learning the meaning of scalar adjectives
Texts and dialogues often express information indirectly. For instance, speakers’ answers to yes/no questions do not always straightforwardly convey a “yes” or “no” answer. The intended reply is clear in some cases (Was it good? It was great!) but uncertain in others (Was it acceptable? It was unprecedented.). In this paper, we present methods for interpreting the answers to questions like these which involve scalar modifiers. We show how to ground scalar modifier meaning based on data collected from the Web. We learn scales between modifiers and infer the extent to which a given answer conveys “yes” or “no”. To evaluate the methods, we collected examples of question–answer pairs involving scalar modifiers from CNN transcripts and the Dialog Act corpus and use response distributions from Mechanical Turk workers to assess the degree to which each answer conveys “yes” or “no”. Our experimental results closely match the Turkers’ response data, demonstrating that meanings can be learned from Web data and that such meanings can drive pragmatic inference.
Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers
Numeracy is the ability to understand and work with numbers. It is a
necessary skill for composing and understanding documents in clinical,
scientific, and other technical domains. In this paper, we explore different
strategies for modelling numerals with language models, such as memorisation
and digit-by-digit composition, and propose a novel neural architecture that
uses a continuous probability density function to model numerals from an open
vocabulary. Our evaluation on clinical and scientific datasets shows that using
hierarchical models to distinguish numerals from words improves a perplexity
metric on the subset of numerals by 2 and 4 orders of magnitude, respectively,
over non-hierarchical models. A combination of strategies can further improve
perplexity. Our continuous probability density function model reduces mean
absolute percentage errors by 18% and 54% in comparison to the second best
strategy for each dataset, respectively. Comment: accepted at ACL 201
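For readers unfamiliar with the mean absolute percentage error reported in this abstract, the sketch below shows how the metric is commonly computed. The function name `mape` and the sample values are hypothetical and are not drawn from the paper or its datasets.

```python
def mape(targets, predictions):
    """Mean absolute percentage error, expressed in percent.

    Every target must be nonzero, because each error is divided by its target.
    """
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for target, prediction in zip(targets, predictions):
        if target == 0:
            raise ValueError("MAPE is undefined for a zero target")
        total += abs((target - prediction) / target)
    return 100.0 * total / len(targets)

# Hypothetical true numerals and model predictions
targets = [10.0, 200.0, 4.0]
predictions = [12.0, 150.0, 4.0]
print(mape(targets, predictions))
```

Because each error is scaled by its target, the metric weights a miss on a small numeral as heavily as a proportionally equal miss on a large one.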
The value of numbers in clinical text classification
Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. This study also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on such representations. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that lets machine learning algorithms use them effectively as features. Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.
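The idea of turning pattern-matching rules into numerical features can be illustrated with a small sketch. The two rules below (blood pressure and age) are hypothetical examples chosen for illustration; the abstract does not list the study's actual rules, and the names `BP_RULE`, `AGE_RULE`, and `extract_numeric_features` are assumptions.

```python
import re

# Hypothetical rules; the study's real pattern-matching rules are not given.
BP_RULE = re.compile(r"(\d{2,3})\s*/\s*(\d{2,3})\s*mmHg")
AGE_RULE = re.compile(r"\b(\d{1,3})[- ]year[- ]old\b")

def extract_numeric_features(text):
    """Return a dict of named numerical features found in a clinical note."""
    features = {}
    bp = BP_RULE.search(text)
    if bp:
        features["systolic"] = float(bp.group(1))
        features["diastolic"] = float(bp.group(2))
    age = AGE_RULE.search(text)
    if age:
        features["age"] = float(age.group(1))
    return features

# Hypothetical note
note = "67-year-old male, BP 140/90 mmHg on admission."
print(extract_numeric_features(note))
```

The resulting dictionary could then be appended to a bag-of-words or embedding-based document vector, so that a classifier sees the numbers as ordered quantities and not as opaque tokens.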
MULDASA: Multifactor Lexical Sentiment Analysis of Social-Media Content in Nonstandard Arabic Social Media
The semantic complexity of the Arabic vocabulary and the shortage of available techniques and skills for capturing Arabic emotions from text hinder Arabic sentiment analysis (ASA). Evaluating Arabic idioms that do not follow a conventional linguistic framework, such as Modern Standard Arabic (MSA), further complicates an already difficult procedure. Here, we define a novel lexical sentiment analysis approach for studying Arabic-language tweets (TTs) from specialized digital media platforms. Many elements, including emoji, intensifiers, negations, and other nonstandard expressions such as supplications, proverbs, and interjections, are incorporated into the MULDASA algorithm to enhance the precision of opinion classifications. Root words in the multidialectal sentiment lexicon (LX) are associated with emotions found in the content under study via a simple stemming procedure. Furthermore, a feature–sentiment correlation procedure is incorporated into the proposed technique to exclude expressed viewpoints that seem irrelevant to the area of concern. As part of our research into Saudi Arabian employability, we compiled a large sample of TTs in 6 different Arabic dialects. This research shows that this sentiment categorization method is useful, and that using all of the characteristics listed earlier improves the ability to accurately classify people’s feelings. The classification accuracy of the proposed algorithm improved from 83.84% to 89.80%. Our approach also outperformed two existing research projects that employed a lexical approach for the sentiment analysis of Saudi dialect.
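To make the role of intensifiers and negations in a lexicon-based classifier concrete, here is a toy sketch. It is not the MULDASA algorithm: the English word lists, the weights, and the rule that a modifier only affects a sentiment word immediately following it are all assumptions made for illustration, and emoji, supplications, stemming, and the feature–sentiment correlation step are omitted.

```python
# Hypothetical toy resources (the paper uses a multidialectal Arabic lexicon).
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
NEGATIONS = {"not", "never"}

def score(tokens):
    """Sum lexicon polarities, applying directly preceding modifiers."""
    total = 0.0
    weight = 1.0    # product of pending intensifier weights
    negate = False  # True if a pending negation precedes the next word
    for token in tokens:
        token = token.lower()
        if token in NEGATIONS:
            negate = True
        elif token in INTENSIFIERS:
            weight *= INTENSIFIERS[token]
        else:
            if token in LEXICON:
                value = LEXICON[token] * weight
                total += -value if negate else value
            # Any non-modifier token clears the pending modifiers.
            weight, negate = 1.0, False
    return total

def classify(tokens):
    """Map the summed score to a coarse sentiment label."""
    s = score(tokens)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"

print(classify("the service was very good".split()))
print(classify("not good".split()))
```

Simple sign-flipping negation is a deliberate simplification; real systems often dampen polarity under negation instead of inverting it.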
HiER 2015. Proceedings des 9. Hildesheimer Evaluierungs- und Retrievalworkshop
Digitalization is shaping our information environments. Disruptive technologies are entering our everyday lives more extensively and ever faster, changing how we handle information and communicate. Information markets are changing. The 9th Hildesheim Evaluation and Retrieval Workshop, HIER 2015, addresses the design and evaluation of information systems against the backdrop of accelerating digitalization. The focus is on the following topics: digital humanities, internet search and online marketing, information seeking and user-centered development, and e-learning.
HiER 2015 - Proceedings des 9. Hildesheimer Evaluierungs- und Retrievalworkshop
This volume collects the talks given at the 9th Hildesheim Evaluation and Retrieval Workshop (HIER), which took place on 9 and 10 July 2015 at the University of Hildesheim. The HIER workshop series began in 2001 with the aim of presenting and discussing the research results of information science at Hildesheim. Cooperation partners from other institutions now take part regularly, which we very much welcome. HIER also provides a forum for system presentations and practice-oriented contributions.
Sentiment analysis of dialectical Arabic social media content using a hybrid linguistic-machine learning approach
Despite the enormous increase in the number of Arabic posts on social networks, the sentiment analysis research into extracting opinions from these posts lags behind that for the English language. This is largely attributed to the challenges in processing the morphologically complex Arabic natural language and the scarcity of Arabic NLP tools and resources. This complex task is further exacerbated when analysing dialectal Arabic, which does not abide by the formal grammatical structure. Based on the semantic modelling of the target domain’s knowledge and multi-factor lexicon-based sentiment analysis, the intent of this research is to use a hybrid approach, integrating linguistic and machine learning methods for sentiment analysis classification of dialectal Arabic. First, a dataset of dialectal Arabic tweets was collected focusing on the unemployment domain, which was annotated manually. The tweets cover different Arabic dialects in Saudi Arabia, for which a comprehensive Arabic sentiment lexicon was constructed. This approach to sentiment analysis also integrated a novel light stemming mechanism towards improved Saudi dialectal Arabic stemming. Subsequently, a novel multi-factor lexicon-based sentiment analysis algorithm was developed for domain-specific social media posts written in dialectal Arabic. The algorithm considers several factors (emoji, intensifiers, negations, supplications) to improve the accuracy of the classifications. Applying this model to a central problem of sentiment analysis in dialectical Arabic, these operational techniques were deployed in order to assess analytical performance across social media channels which are vulnerable to semantic and colloquial variations.
Finally, this study presented a new hybrid approach to sentiment analysis where domain knowledge is utilised in two methods to combine computational linguistics and machine learning; the first method integrates the problem domain semantic knowledgebase in the machine learning training features set, while the second uses the outcome of the lexicon-based sentiment classification in the training of the machine learning methods. By integrating these techniques into a single, hybridised solution, a greater degree of accuracy and consistency was achieved than applying each approach independently, confirming a pragmatic solution to sentiment classification in dialectical Arabic text.