1,931 research outputs found

    Comparing Supervised Machine Learning Strategies and Linguistic Features to Search for Very Negative Opinions

    Get PDF
    In this paper, we examine the performance of several classifiers in the process of searching for very negative opinions. More precisely, we do an empirical study that analyzes the influence of three types of linguistic features (n-grams, word embeddings, and polarity lexicons) and their combinations when they are used to feed different supervised machine learning classifiers: Naive Bayes (NB), Decision Tree (DT), and Support Vector Machine (SVM). The experiments we have carried out show that SVM clearly outperforms NB and DT in all datasets by taking into account all features individually as well as their combinationsThis research was funded by project TelePares (MINECO, ref:FFI2014-51978-C2-1-R), and the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund (ERDF)S

    Sentiment Analysis of Text Guided by Semantics and Structure

    Get PDF
    As moods and opinions play a pivotal role in various business and economic processes, keeping track of one's stakeholders' sentiment can be of crucial importance to decision makers. Today's abundance of user-generated content allows for the automated monitoring of the opinions of many stakeholders, like consumers. One challenge for such automated sentiment analysis systems is to identify whether pieces of natural language text are positive or negative. Typical methods of identifying this polarity involve low-level linguistic analysis. Existing systems predominantly use morphological, lexical, and syntactic cues for polarity, like a text's words, their parts-of-speech, and negation or amplification of the conveyed sentiment. This dissertation argues that the polarity of text can be analysed more accurately when additionally accounting for semantics and structure. Polarity classification performance can benefit from exploiting the interactions that emoticons have on a semantic level with words – emoticons can express, stress, or disambiguate sentiment. Furthermore, semantic relations between and within languages can help identify meaningful cues for sentiment in multi-lingual polarity classification. An even better understanding of a text's conveyed sentiment can be obtained by guiding automated sentiment analysis by the rhetorical structure of the text, or at least of its most sentiment-carrying segments. Thus, the sentiment in, e.g., conclusions can be treated differently from the sentiment in background information. The findings of this dissertation suggest that the polarity of natural language text should not be determined solely based on what is said. Instead, one should account for how this message is conveyed as well

    Sentiment Analysis Using Maximum Entropy on Application Reviews (Study Case: Shopee on Google Play)

    Get PDF
    Shopee was one of the e-commerce application that could found on Google Play. The amount of Shopee application reviews on Google Play continues to grow over time. These make the company trying to get the overall information from all reviews because it would take a long time to read each of the reviews on Google Play. Therefore analysis was used using text mining. One part of text mining was sentiment analysis that applied the maximum entropy method to classification. Based on the results of the analysis found an accuracy of 97.32%. By using the maximum entropy method it could be concluded that word association obtained related to “application”, “promo”, “satisfy”, and “discount” for positive sentiment. Meanwhile for negative sentiment, the reviewers of Shopee application on Google Play were related to “problematic”, “login”, “old”, “verification”, and “expensive”. The results of this research in Indonesian

    Feature-based Transfer of Multilingual Sentence Representations to Cross-lingual Tasks

    Get PDF
    Universella meningsrepresentationer och flerspråkig språkmodellering är heta ämnen inom språkteknologi, specifikt området som berör förståelse för naturligt språk (natural language understanding). En meningsinbäddning (sentence embedding) är en numerisk skildring av en följd ord som motsvaras av en hel fras eller mening, speficikt som ett resultat av en omkodare (encoder) inom maskininlärning. Dessa representationer behövs för automatiska uppgifter inom språkteknologi som kräver förståelse för betydelsen av en hel mening, till skillnad från kombinationer av enskilda ords betydelser. Till sådana uppgifter kan räknas till exempel inferens (huruvida ett par satser är logiskt anknutna, natural language inference) samt åsiktsanalys (sentiment analysis). Med universalitet avses kodad betydelse som är tillräckligt allmän för att gynna andra relaterade uppgifter, som till exempel klassificering. Det efterfrågas tydligare samförstånd kring strategier som används för att bedöma kvaliteten på dessa inbäddningar, antingen genom att direkt undersöka deras lingvistiska egenskaper eller genom att använda dem som oberoende variabler (features) i relaterade modeller. På grund av att det är kostsamt att skapa resurser av hög kvalitet och upprätthålla sofistikerade system på alla språk som används i världen finns det även ett stort intresse för uppskalering av moderna system till språk med knappa resurser. Tanken med detta är så kallad överföring (transfer) av kunskap inte bara mellan olika uppgifter, utan även mellan olika språk. Trots att behovet av tvärspråkiga överföringsmetoder erkänns i forskningssamhället är utvärderingsverktyg och riktmärken fortfarande i ett tidigt skede. SentEval är ett existerande verktyg för utvärdering av meningsinbäddningar med speciell betoning på deras universalitet. Syftet med detta avhandlingsprojekt är ett försök att utvidga detta verktyg att stödja samtidig bedömning på nya uppgifter som omfattar flera olika språk. Bedömningssättet bygger på strategin att låta kodade meningar fungera som variabler i så kallade downstream-uppgifter och observera huruvida resultaten förbättras. En modern mångspråkig modell baserad på så kallad transformers-arkitektur utvärderas på en etablerad inferensuppgift såväl som en ny känsloanalyssuppgift (emotion detection), av vilka båda omfattar data på en mängd olika språk. Även om det praktiska genomförandet i stor utsträckning förblev experimentellt rapporteras vissa tentativa resultat i denna avhandling

    Comparing sentiment analysis tools on gitHub project discussions

    Get PDF
    Mestrado de dupla diplomação com a UTFPR - Universidade Tecnológica Federal do ParanáThe context of this work is situated in the rapidly evolving sphere of Natural Language Processing (NLP) within the scope of software engineering, focusing on sentiment analysis in software repositories. Sentiment analysis, a subfield of NLP, provides a potent method to parse, understand, and categorize these sentiments expressed in text. By applying sentiment analysis to software repositories, we can decode developers’ opinions and sentiments, providing key insights into team dynamics, project health, and potential areas of conflict or collaboration. However, the application of sentiment analysis in software engineering comes with its unique set of challenges. Technical jargon, code-specific ambiguities, and the brevity of software-related communications demand tailored NLP tools for effective analysis. The study unfolds in two primary phases. In the initial phase, we embarked on a meticulous investigation into the impacts of expanding the training sets of two prominent sentiment analysis tools, namely, SentiCR and SentiSW. The objective was to delineate the correlation between the size of the training set and the resulting tool performance, thereby revealing any potential enhancements in performance. The subsequent phase of the research encapsulates a practical application of the enhanced tools. We employed these tools to categorize discussions drawn from issue tickets within a varied array of Open-Source projects. These projects span an extensive range, from relatively small repositories to large, well-established repositories, thus providing a rich and diverse sampling ground.O contexto deste trabalho situa-se na esfera em rápida evolução do Processamento de Linguagem Natural (PLN) no âmbito da engenharia de software, com foco na análise de sentimentos em repositórios de software. A análise de sentimentos, um subcampo do PLN, fornece um método poderoso para analisar, compreender e categorizar os sentimentos expressos em texto. Ao aplicar a análise de sentimentos aos repositórios de software, podemos decifrar as opiniões e sentimentos dos desenvolvedores, fornecendo informações importantes sobre a dinâmica da equipe, a saúde do projeto e áreas potenciais de conflito ou colaboração. No entanto, a aplicação da análise de sentimentos na engenharia de software apresenta desafios únicos. Jargão técnico, ambiguidades específicas do código e a breviedade das comunicações relacionadas ao software exigem ferramentas de PLN personalizadas para uma análise eficaz. O estudo se desenvolve em duas fases principais. Na fase inicial, embarcamos em uma investigação meticulosa sobre os impactos da expansão dos conjuntos de treinamento de duas ferramentas proeminentes de análise de sentimentos, nomeadamente, SentiCR e SentiSW. O objetivo foi delinear a correlação entre o tamanho do conjunto de treinamento e o desempenho da ferramenta resultante, revelando assim possíveis aprimoramentos no desempenho. A fase subsequente da pesquisa engloba uma aplicação prática das ferramentas aprimoradas. Utilizamos essas ferramentas para categorizar discussões retiradas de bilhetes de problemas em uma variedade diversificada de projetos de código aberto. Esses projetos abrangem uma ampla gama, desde repositórios relativamente pequenos até repositórios grandes e bem estabelecidos, fornecendo assim um campo de amostragem rico e diversificado

    Distributed Representations for Compositional Semantics

    Full text link
    The mathematical representation of semantics is a key issue for Natural Language Processing (NLP). A lot of research has been devoted to finding ways of representing the semantics of individual words in vector spaces. Distributional approaches --- meaning distributed representations that exploit co-occurrence statistics of large corpora --- have proved popular and successful across a number of tasks. However, natural language usually comes in structures beyond the word level, with meaning arising not only from the individual words but also the structure they are contained in at the phrasal or sentential level. Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is an equally fundamental task of NLP. This dissertation explores methods for learning distributed semantic representations and models for composing these into representations for larger linguistic units. Our underlying hypothesis is that neural models are a suitable vehicle for learning semantically rich representations and that such representations in turn are suitable vehicles for solving important tasks in natural language processing. The contribution of this thesis is a thorough evaluation of our hypothesis, as part of which we introduce several new approaches to representation learning and compositional semantics, as well as multiple state-of-the-art models which apply distributed semantic representations to various tasks in NLP.Comment: DPhil Thesis, University of Oxford, Submitted and accepted in 201

    Learning Representations of Social Media Users

    Get PDF
    User representations are routinely used in recommendation systems by platform developers, targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly; how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, make more accurate recommendations in a recommendation system) but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them at three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.Comment: PhD thesi

    Learning Representations of Social Media Users

    Get PDF
    User representations are routinely used in recommendation systems by platform developers, targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly; how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, make more accurate recommendations in a recommendation system) but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them at three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.Comment: PhD thesi

    Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

    Get PDF
    This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises the cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. Such a problem-oriented taxonomy has allowed us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. The comprehensive problem-oriented review of the advances in transfer learning with respect to the problem has not only revealed the challenges in transfer learning for visual recognition, but also the problems (e.g. eight of the seventeen problems) that have been scarcely studied. This survey not only presents an up-to-date technical review for researchers, but also a systematic approach and a reference for a machine learning practitioner to categorise a real problem and to look up for a possible solution accordingly

    Analyzing social media discourse - an approach using semi-supervised learning

    Get PDF
    The ability to handle large amounts of unstructured information, to optimize strategic business opportunities, and to identify fundamental lessons among competitors through benchmarking, are essential skills of every business sector. Currently, there are dozens of social media analytics’ applications aiming at providing organizations with informed decision making tools. However, these applications rely on providing quantitative information, rather than qualitative information that is relevant and intelligible for managers. In order to address these aspects, we propose a semi-supervised learning procedure that discovers and compiles information taken from online social media, organizing it in a scheme that can be strategically relevant. We illustrate our procedure using a case study where we collected and analysed the social media discourse of 43 organizations operating on the Higher Public Polytechnic Education Sector. During the analysis we created an “editorial model” that character izes the posts in the area. We describe in detail the training and the execution of an ensemble of classifying algorithms. In this study we focus on the techniques used to increase the accuracy and stability of the classifiers.info:eu-repo/semantics/publishedVersio
    corecore