Inferring Affective Meanings of Words from Word Embedding
Affective lexicons are among the most important resources in affective computing for text. Manually constructed affective lexicons have limited scale and thus only limited use in practical systems. In this work, we propose a regression-based method to automatically infer multi-dimensional affective representations of words from their word embeddings, based on a set of seed words. This method can exploit the rich semantic information captured by word embeddings to extract meanings in a specific semantic space. It rests on the assumption that different features of a word embedding contribute differently to a particular affective dimension, and that a particular feature contributes differently to different affective dimensions. Evaluation on various affective lexicons shows that our method outperforms state-of-the-art methods on all the lexicons, under different evaluation metrics, by large margins. We also explore different regression models and conclude that the Ridge regression model, the Bayesian Ridge regression model and Support Vector Regression with a linear kernel are the most suitable models. Compared to other state-of-the-art methods, our method also has a computational advantage. Experiments on a sentiment analysis task show that the lexicons extended by our method achieve better results than publicly available sentiment lexicons on eight sentiment corpora. The extended lexicons are publicly available.
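A minimal sketch of the seed-word idea described above, assuming toy 3-dimensional embeddings and hand-picked valence scores (real systems use pretrained embeddings with hundreds of dimensions); the Ridge regression is fit here by plain gradient descent to keep the example dependency-free:

```python
# Toy sketch: infer a word's valence from its embedding by fitting a
# ridge regression on a handful of seed words. All embeddings and
# scores below are invented for illustration, not real data.

def ridge_fit(X, y, lam=0.01, lr=0.2, steps=5000):
    """Minimize ||Xw - y||^2 / n + lam * ||w||^2 by gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [2.0 * lam * w[j] for j in range(d)]
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += 2.0 * err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

# Toy seed lexicon: embedding -> valence on a 1-9 scale.
seeds = [
    ([0.9, 0.1, 0.2], 8.2),   # "joy"
    ([0.8, 0.2, 0.1], 7.9),   # "delight"
    ([0.1, 0.9, 0.3], 1.8),   # "grief"
    ([0.2, 0.8, 0.4], 2.1),   # "sorrow"
]
w = ridge_fit([e for e, _ in seeds], [v for _, v in seeds])

# Score unseen words by their embeddings alone.
happy_like = predict(w, [0.85, 0.15, 0.2])
sad_like = predict(w, [0.15, 0.85, 0.35])
```

One regression per affective dimension (valence, arousal, dominance) suffices, since the assumption above is precisely that each dimension weights the embedding features differently.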
Text-based Sentiment Analysis and Music Emotion Recognition
Nowadays, with the expansion of social media, large amounts of user-generated
texts like tweets, blog posts or product reviews are shared online. Sentiment polarity
analysis of such texts has become highly attractive and is utilized in recommender
systems, market predictions, business intelligence and more. We also witness deep
learning techniques becoming top performers on those types of tasks. There are
however several problems that need to be solved for efficient use of deep neural
networks on text mining and text polarity analysis.
First of all, deep neural networks are data hungry. They need to be fed with
datasets that are big in size, cleaned and preprocessed as well as properly labeled.
Second, the modern natural language processing concept of word embeddings as a
dense and distributed text feature representation solves sparsity and dimensionality
problems of the traditional bag-of-words model. Still, there are various uncertainties
regarding the use of word vectors: should they be generated from the same dataset
that is used to train the model, or is it better to source them from large and popular
collections that serve as generic text feature representations? Third, it is not easy for
practitioners to find a simple and highly effective deep learning setup for various
document lengths and types. Recurrent neural networks are weak with longer texts
and optimal convolution-pooling combinations are not easily conceived. It is thus
convenient to have generic neural network architectures that are effective and can
adapt to various texts, encapsulating much of design complexity.
This thesis addresses the above problems to provide methodological and practical
insights for utilizing neural networks on sentiment analysis of texts and achieving
state-of-the-art results. Regarding the first problem, the effectiveness of various
crowdsourcing alternatives is explored and two medium-sized and emotion-labeled
song datasets are created utilizing social tags. One of the research interests of Telecom
Italia was the exploration of relations between music emotional stimulation and
driving style. Consequently, a context-aware music recommender system that aims
to enhance driving comfort and safety was also designed. To address the second
problem, a series of experiments with large text collections of various contents and
domains were conducted. Word embeddings with different parameters were evaluated,
and the results revealed that their quality is influenced (mostly, but not only) by the
size of the text collections they were created from. When working with small text datasets, it is
thus important to source word features from popular and generic word embedding
collections. Regarding the third problem, a series of experiments involving convolutional
and max-pooling neural layers were conducted. Various patterns relating
text properties and network parameters with optimal classification accuracy were
observed. Combining convolutions of words, bigrams, and trigrams with regional
max-pooling layers in a couple of stacks produced the best results. The derived
architecture achieves competitive performance on sentiment polarity analysis of
movie, business and product reviews.
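The convolution-plus-regional-max-pooling combination described above can be sketched as follows; the 2-dimensional toy word vectors, the single filter per width, and the two pooling regions are illustrative assumptions, not the thesis's actual hyperparameters:

```python
# Hedged sketch of multi-width convolutions (word, bigram, trigram)
# followed by regional max-pooling, as described in the text.

def conv1d(seq, filt):
    """Slide a filter over windows of len(filt) word vectors,
    returning one activation per window (valid convolution)."""
    width = len(filt)
    acts = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        acts.append(sum(fw * xw
                        for fv, xv in zip(filt, window)
                        for fw, xw in zip(fv, xv)))
    return acts

def regional_max_pool(acts, regions=2):
    """Split the activations into `regions` chunks, keep each chunk's max."""
    size = max(1, len(acts) // regions)
    return [max(acts[i:i + size]) for i in range(0, len(acts), size)][:regions]

# Toy 2-dimensional word vectors standing in for a short document.
doc = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4], [0.9, 0.1], [0.2, 0.7]]

features = []
for width in (1, 2, 3):            # word, bigram, trigram convolutions
    filt = [[1.0, -1.0]] * width   # one illustrative filter per width
    acts = conv1d(doc, filt)
    features.extend(regional_max_pool(acts))

# `features` would feed a dense classification layer in the full model.
```

Regional pooling (rather than pooling over the whole document) preserves coarse word order, which is part of why the stacked combination handles longer reviews better than global max-pooling alone.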
Given that labeled data are becoming the bottleneck of the current deep learning
systems, a future research direction could be the exploration of various data programming
possibilities for constructing even bigger labeled datasets. Investigation
of feature-level or decision-level ensemble techniques in the context of deep neural
networks could also be fruitful. Different feature types usually represent complementary
characteristics of data. Combining word embeddings and traditional text
features, or utilizing recurrent networks on document splits and then aggregating the
predictions, could further increase the prediction accuracy of such models.
Unsupervised and knowledge-poor approaches to sentiment analysis
Sentiment analysis focuses on automatic classification of a document's sentiment (and, more generally, extraction of opinion from text). Ways of expressing sentiment have been
shown to be dependent on what a document is about (domain-dependency). This complicates supervised methods for sentiment analysis, which rely on extensive use of training data or linguistic resources that are usually either domain-specific or generic. Both kinds of resources prevent classifiers from performing well across a range of domains, as this requires appropriate in-domain (domain-specific) data.
This thesis presents a novel unsupervised, knowledge-poor approach to sentiment analysis aimed at creating a domain-independent and multilingual sentiment analysis system.
The approach extracts domain-specific resources from the documents to be processed, and uses them for sentiment analysis. It does not require any training corpora, large sets of rules or generic sentiment lexicons, which makes it domain- and language-independent while still able to utilise domain- and language-specific information.
The thesis describes and tests the approach, which is applied to different data, including customer reviews of various types of products, reviews of films and books, and news items; and to four languages: Chinese, English, Russian and Japanese. The approach is applied not only to binary sentiment classification, but also to three-way sentiment classification (positive, negative and neutral), subjectivity classification of documents and sentences, and to the extraction of opinion holders and opinion targets. Experimental results suggest that the approach is often a viable alternative to supervised systems, especially when applied to large document collections.
Making the FTC ☺: An Approach to Material Connections Disclosures in the Emoji Age
Given the rise of influencer marketing and emoji’s concurrent surge in popularity, it naturally follows that emoji should be incorporated into the FTC’s required disclosures for sponsored posts across social media platforms. While the disclosure methods the FTC currently recommends are easily jumbled or lost in other text, using emoji to disclose material connections would streamline disclosure requirements, leveraging an already-popular method of communication to better reach consumers. This Note proposes that the FTC adopt an emoji as a preferred method of disclosure for influencer marketing on social media. Part I discusses the rise of influencer marketing, the FTC and its history of regulating sponsored content, and the current state of regulation. Part II explores the proliferation of emoji as a method of communication, and the role of the Unicode Consortium in regulating the adoption of new emoji. Part III makes the case for incorporating emoji as a method of disclosure to bridge compliance gaps, and offers additional recommendations to increase compliance with existing regulations.
Rhetorical outcomes: A genre analysis of student service-learning writing
Service-learning continues to be a popular pedagogical approach within composition studies. Despite a number of studies that document a range of positive impacts on students, faculty, institutions, and community members, the relationship between service-learning and student writing outcomes is not well understood. This study presents the results of a genre analysis of student-authored ethnographies composed in four distinct sections of a service-learning-based intermediate writing course at a Midwestern urban research university. Results of the analysis are then used to develop a contextualized writing assessment framework to evaluate student writing outcomes and to consider the implications of using contemporary genre theory for both service-learning and writing program assessment.
Semi-Supervised Learning For Identifying Opinions In Web Content
Thesis (Ph.D.) - Indiana University, Information Science, 2011.
Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; but SSL has been applied in only a few opinion detection studies. This study investigates the application of four different SSL algorithms in three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content.
Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain--the blogosphere--when a domain transfer-based SSL strategy was implemented
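One common SSL algorithm, self-training, is easy to illustrate; the word-overlap "classifier" and the confidence threshold below are deliberately tiny stand-ins invented for the example, not the dissertation's actual models or data:

```python
from collections import Counter

def train(docs):
    """Toy model: per-class word counts ('op' = opinion, 'fact' = factual)."""
    counts = {"op": Counter(), "fact": Counter()}
    for text, label in docs:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Return (label, confidence) from class word-count overlap."""
    words = text.lower().split()
    s_op = sum(counts["op"][w] for w in words)
    s_fact = sum(counts["fact"][w] for w in words)
    total = s_op + s_fact
    if total == 0:
        return "fact", 0.0
    label = "op" if s_op >= s_fact else "fact"
    return label, abs(s_op - s_fact) / total

def self_train(labeled, unlabeled, threshold=0.5, rounds=3):
    """Repeatedly pseudo-label high-confidence unlabeled texts and
    fold them into the labeled pool before retraining."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        keep = []
        for text in pool:
            label, conf = classify(model, text)
            if conf >= threshold:
                labeled.append((text, label))  # accept pseudo-label
            else:
                keep.append(text)              # stay unlabeled for now
        pool = keep
    return train(labeled)

seed = [("i love this movie", "op"),
        ("the movie opens friday", "fact")]
unlabeled = ["i love this director",
             "this opens in cinemas friday",
             "love love love it"]
model = self_train(seed, unlabeled)
label, _ = classify(model, "i love it")
```

The domain transfer problem noted above shows up here too: pseudo-labels are only as good as the seed model, so a confidence threshold (and, in the study, domain-aware strategies) is needed to keep label noise from compounding across rounds.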
Algorithms for analyzing the alignment of job market offers
The adoption of digital technologies promises to accelerate the transformation
and the agility of processes, work activities and revenue models. Yet,
the promised gains come together with dramatic needs for qualified professionals
who can effectively leverage the technology potential. Job contexts
are being reshaped as new models for the interaction and integration of
humans and technologies take shape. To increase the readiness of the job
market in fast-changing contexts, all stakeholders (companies, professionals,
policymakers) must be aware of the job market dynamics and needs. These
dynamics can be observed from the collection of job announcements, but
its high volume requires effective tools for analyzing and simplifying it to
draw timely and correct conclusions.
As job announcements have distinct formulations for similar roles, depending
on the hiring company, this raises the necessity of establishing a common
ground for comparing the job offers. In this work, an attempt at mapping
job offers to ESCO (European Skills/Competences, qualifications and Occupations)
occupations is made. ESCO is an ontology published by the
European Union and its occupations are job positions with the mandatory
and optional skills associated. ELK (Elasticsearch, Logstash, Kibana) stack
was used for dealing with the high volume of job announcements. ELK is
a stable tool that can manage large quantities of data and has an effective
text search algorithm; the Kibana layer enables the rapid exploration of
data and the creation of visualization dashboards. Results show that the ELK
stack is a suitable tool for providing a visual interpretation of the job market
dynamics.
Several strategies were tested to align real job offerings with ESCO occupations
and the best one revealed an f1 score of over 0.8 in mapping job offers
to level 1 ESCO occupations and an accuracy of 63.75% when trying to
predict the level 5 occupation. These results are comparable to the state-of-
the-art and are very promising, especially when compared to the baseline
of 40%, and show that ESCO is a good candidate as common ground to
enable the comparison of job market dynamics across distinct environments.
Master's in Informatics Engineering.
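For illustration, an alignment query of the kind the ELK-based approach implies might look like the following; the index schema and field names (`preferredLabel`, `altLabels`, `description`) are assumptions modeled on ESCO's vocabulary, not the thesis's actual configuration:

```python
import json

# Hypothetical Elasticsearch request body: match a job announcement's
# text against indexed ESCO occupation labels, boosting the preferred
# label over alternative labels.

job_offer = {
    "title": "Backend software developer",
    "description": "Develop and maintain REST APIs in Java.",
}

query = {
    "query": {
        "multi_match": {
            "query": job_offer["title"] + " " + job_offer["description"],
            # '^2' doubles the weight of hits on the preferred label.
            "fields": ["preferredLabel^2", "altLabels", "description"],
        }
    },
    "size": 5,  # top-5 candidate ESCO occupations
}

# The JSON Elasticsearch would receive, e.g. via
# POST /esco_occupations/_search
body = json.dumps(query)
```

The top-ranked hit (or a vote over the top k) then serves as the predicted occupation, which is the kind of mapping scored against ESCO levels 1 and 5 above.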
The Mapmaker’s Dilemma in Evaluating High-End Inequality
The last thirty years have witnessed rising income and wealth concentration among the top 0.1% of the population, leading to intense political debate regarding how, if at all, policymakers should respond. Often, this debate emphasizes the tools of public economics, and in particular optimal income taxation. However, while these tools can help us in evaluating the issues raised by high-end inequality, their extreme reductionism—which, in other settings, often offers significant analytic payoffs—here proves to have serious drawbacks. This Article addresses what we do and don’t learn from the optimal income tax literature regarding high-end inequality, and what other inputs might be needed to help one evaluate the relevant issues.
FINE-GRAINED EMOTION DETECTION IN MICROBLOG TEXT
Automatic emotion detection in text is concerned with using natural language processing techniques to recognize emotions expressed in written discourse. Endowing computers with the ability to recognize emotions in a particular kind of text, microblogs, has important applications in sentiment analysis and affective computing. In order to build computational models that can recognize the emotions represented in tweets, we need to identify a set of suitable emotion categories. Prior work has mainly focused on building computational models for only a small set of six basic emotions (happiness, sadness, fear, anger, disgust, and surprise). This thesis describes a taxonomy of 28 emotion categories, an expansion of these six basic emotions, developed inductively from data. This set of 28 emotion categories represents a set of fine-grained emotion categories that are representative of the range of emotions expressed in tweets, microblog posts on Twitter.
The ability of humans to recognize these fine-grained emotion categories is characterized using inter-annotator reliability measures based on annotations provided by expert and novice annotators. A set of 15,553 human-annotated tweets form a gold standard corpus, EmoTweet-28. For each emotion category, we have extracted a set of linguistic cues (i.e., punctuation marks, emoticons, emojis, abbreviated forms, interjections, lemmas, hashtags and collocations) that can serve as salient indicators for that emotion category.
We evaluated the performance of automatic classification techniques on the set of 28 emotion categories through a series of experiments using several classifier and feature combinations. Our results show that it is feasible to extend machine learning classification to fine-grained emotion detection in tweets (i.e., as many as 28 emotion categories) with results that are comparable to state-of-the-art classifiers that detect six to eight basic emotions in text. Classifiers using features extracted from the linguistic cues associated with each category equal or exceed the performance of conventional corpus-based and lexicon-based features for fine-grained emotion classification.
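The cue-based features can be illustrated with a minimal extractor; the emoticon and interjection inventories below are tiny samples invented for the example, not EmoTweet-28's actual cue sets:

```python
import re

# Toy extractor for a few of the linguistic cue types listed above:
# punctuation, emoticons, hashtags, and interjections.

EMOTICONS = {":)", ":(", ":D", ";)"}
INTERJECTIONS = {"wow", "ugh", "yay", "ouch"}

def extract_cues(tweet):
    # Detach '!' so interjections like "yay!" still match the inventory,
    # while leaving emoticon tokens such as ":)" intact.
    tokens = tweet.lower().replace("!", " ").split()
    return {
        "exclamations": tweet.count("!"),
        "emoticons": [t for t in tokens if t in EMOTICONS],
        "hashtags": re.findall(r"#\w+", tweet.lower()),
        "interjections": [t for t in tokens if t in INTERJECTIONS],
    }

cues = extract_cues("Yay! Passed my exam :) #happy #blessed")
```

In a full classifier, counts of such cues per emotion category would be concatenated with (or substituted for) bag-of-words or lexicon features.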
This thesis makes an important theoretical contribution in the development of a taxonomy of emotion in text. In addition, this research also makes several practical contributions, particularly in the creation of language resources (i.e., corpus and lexicon) and machine learning models for fine-grained emotion detection in text.