1,393 research outputs found

    Multilingual opinion mining

    Get PDF
    170 p.Cada día se genera gran cantidad de texto en diferentes medios online. Gran parte de ese texto contiene opiniones acerca de multitud de entidades, productos, servicios, etc. Dada la creciente necesidad de disponer de medios automatizados para analizar, procesar y explotar esa información, las técnicas de análisis de sentimiento han recibido gran cantidad de atención por parte de la industria y la comunidad científica durante la última década y media. No obstante, muchas de las técnicas empleadas suelen requerir de entrenamiento supervisado utilizando para ello ejemplos anotados manualmente, u otros recursos lingüísticos relacionados con un idioma o dominio de aplicación específicos. Esto limita la aplicación de este tipo de técnicas, ya que dicho recursos y ejemplos anotados no son sencillos de obtener. En esta tesis se explora una serie de métodos para realizar diversos análisis automáticos de texto en el marco del análisis de sentimiento, incluyendo la obtención automática de términos de un dominio, palabras que expresan opinión, polaridad del sentimiento de dichas palabras (positivas o negativas), etc. Finalmente se propone y se evalúa un método que combina representación continua de palabras (continuous word embeddings) y topic-modelling inspirado en la técnica de Latent Dirichlet Allocation (LDA), para obtener un sistema de análisis de sentimiento basado en aspectos (ABSA), que sólo necesita unas pocas palabras semilla para procesar textos de un idioma o dominio determinados. De este modo, la adaptación a otro idioma o dominio se reduce a la traducción de las palabras semilla correspondientes

    Deriving query suggestions for site search

    Get PDF
    Modern search engines have been moving away from simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features are now integral parts of web search engines. However, generating good query modification suggestions remains a challenging issue. Query log analysis is one of the major strands of work in this direction. Although much research has been performed on query logs collected on the web as a whole, query log analysis to enhance search on smaller and more focused collections has attracted less attention, despite its increasing practical importance. In this article, we report on a systematic study of different query modification methods applied to a substantial query log collected on a local website that already uses an interactive search engine. We conducted experiments in which we asked users to assess the relevance of potential query modification suggestions that have been constructed using a range of log analysis methods and different baseline approaches. The experimental results demonstrate the usefulness of log analysis to extract query modification suggestions. Furthermore, our experiments demonstrate that a more fine-grained approach than grouping search requests into sessions allows for extraction of better refinement terms from query log files. © 2013 ASIS&T

    Methods for constructing an opinion network for politically controversial topics

    Get PDF
    The US presidential race, the re-election of President Hugo Chavez, and the economic crisis in Greece and other European countries are some of the controversial topics being played on the news everyday. To understand the landscape of opinions on political controversies, it would be helpful to know which politician or other stakeholder takes which position - support or opposition - on specific aspects of these topics. The work described in this thesis aims to automatically derive a map of the opinions-people network from news and other Web docu- ments. The focus is on acquiring opinions held by various stakeholders on politi- cally controversial topics. This opinions-people network serves as a knowledge- base of opinions in the form of (opinion holder) (opinion) (topic) triples. Our system to build this knowledge-base makes use of online news sources in order to extract opinions from text snippets. These sources come with a set of unique challenges. For example, processing text snippets involves not just iden- tifying the topic and the opinion, but also attributing that opinion to a specific opinion holder. This requires making use of deep parsing and analyzing the parse tree. Moreover, in order to ensure uniformity, both the topic as well the opinion holder should be mapped to canonical strings, and the topics should also be organized into a hierarchy. Our system relies on two main components: i) acquiring opinions which uses a combination of techniques to extract opinions from online news sources, and ii) organizing topics which crawls and extracts de- bates from online sources, and organizes these debates in a hierarchy of political controversial topics. We present systematic evaluations of the different compo- nents of our system, and show their high accuracies. We also present some of the different kinds of applications that require political analysis. We present some application requires political analysis such as identifying flip-floppers, political bias, and dissenters. Such applications can make use of the knowledge-base of opinions.Kontroverse Themen wie das US-Präsidentschaftsrennen, die Wiederwahl von Präsident Hugo Chavez, die Wirtschaftskrise in Griechenland sowie in anderen europäischen Ländern werden täglich in den Nachrichten diskutiert. Um die Bandbreite verschiedener Meinungen zu politischen Kontroversen zu verstehen, ist es hilfreich herauszufinden, welcher Politiker bzw. Interessenvertreter welchen Standpunkt (Pro oder Contra) bezüglich spezifischer Aspekte dieser Themen einnimmt. Diese Dissertation beschreibt ein Verfahren, welches automatisch eine Übersicht des Meinung-Mensch-Netzwerks aus aktuellen Nachrichten und anderen Web-Dokumenten ableitet. Der Fokus liegt hierbei auf dem Erfassen von Meinungen verschiedener Interessenvertreter bezüglich politisch kontroverser Themen. Dieses Meinung-Mensch-Netzwerk dient als Wissensbasis von Meinungen in Form von Tripeln: (Meinungsvertreter) (Meinung) (Thema). Um diese Wissensbasis aufzubauen, nutzt unser System Online-Nachrichten und extrahiert Meinungen aus Textausschnitten. Quellen von Online-Nachrichten stellen eine Reihe von besonderen Anforderungen an unser System. Zum Beispiel umfasst die Verarbeitung von Textausschnitten nicht nur die Identifikation des Themas und der geschilderten Meinung, sondern auch die Zuordnung der Stellungnahme zu einem spezifischen Meinungsvertreter.Dies erfordert eine tiefgründige Analyse sowie eine genaue Untersuchung des Syntaxbaumes. Um die Einheitlichkeit zu gewährleisten, müssen darüber hinaus Thema sowie Meinungsvertreter auf ein kanonisches Format abgebildet und die Themen hierarchisch angeordnet werden. Unser System beruht im Wesentlichen auf zwei Komponenten: i) Erkennen von Meinungen, welches verschiedene Techniken zur Extraktion von Meinungen aus Online-Nachrichten beinhaltet, und ii) Erkennen von Beziehungen zwischen Themen, welches das Crawling und Extrahieren von Debatten aus Online-Quellen sowie das Organisieren dieser Debatten in einer Hierarchie von politisch kontroversen Themen umfasst. Wir präsentieren eine systematische Evaluierung der verschiedenen Systemkomponenten, welche die hohe Genauigkeit der von uns entwickelten Techniken zeigt. Wir diskutieren außerdem verschiedene Arten von Anwendungen, die eine politische Analyse erfordern, wie zum Beispiel die Erkennung von Opportunisten, politische Voreingenommenheit und Dissidenten. All diese Anwendungen können durch die Wissensbasis von Meinungen umfangreich profitieren

    Plague Dot Text:Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

    Get PDF
    The design of models that govern diseases in population is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibit their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952) evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, our approach to structure the narratives and identify relevant entities in the reports. The structured corpus is made available via Solr enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.Comment: Journal of Data Mining & Digital Humanities 202

    Grammatical Triples Extraction for the Distant Reading of Textual Corpora

    Get PDF
    Grammatical triples extraction has become increasingly important for the analysis of large, textual corpora. By providing insight into the sentence-level linguistic features of a corpus, extracted triples have supported interpretations of some of the most relevant problems of our time. The growing importance of triples extraction for analyzing large corpora has put the quality of extracted triples under new scrutiny, however. Triples outputs are known to have large amounts of erroneous triples. The extraction of erroneous triples poses a risk for understanding a textual corpus because erroneous triples can be nonfactual and even analogous to misinformation. Disciplines such as the social sciences, history, and literature rely on accurate representations of events. In some cases, misrepresentations of language can be as problematic as describing a historical event that never occurred. The present research proposes a method of triples extraction that has been designed to meet the increasing need for high-accuracy triples outputs for the analysis of text. We propose a solution aimed at reducing errors related to: a) ungrammatical extractions; b) double counting; and c) the missed detection of triples. To improve the accuracy of triples extraction, we implement a series of 12 linguistic rules that leverage syntactic dependency parsing. For its case studies, this dissertation draws upon three data sets: a) Wikipedia; b) the 19th-century British Parliamentary debates, also known as Hansard; and c) half a year of online news articles (Aug. 2021 - Dec. 2021) from FOX News and NPR. In its final chapter, this dissertation offers a pedagogical piece that applies triples extraction to teach concepts related to data analysis. Extracted triples are thus evaluated through two means: a) in Chapter 1, precision and recall is used to vet the accuracy of the present method and b) in chapters 2 and 3, we use human observation to show how the present method of triples extraction can give an accurate and insightful perspective into textual corpora that rivals and, in some cases, exceeds existing methods

    Combining granularity-based topic-dependent and topic-independent evidences for opinion detection

    Get PDF
    Fouille des opinion, une sous-discipline dans la recherche d'information (IR) et la linguistique computationnelle, fait référence aux techniques de calcul pour l'extraction, la classification, la compréhension et l'évaluation des opinions exprimées par diverses sources de nouvelles en ligne, social commentaires des médias, et tout autre contenu généré par l'utilisateur. Il est également connu par de nombreux autres termes comme trouver l'opinion, la détection d'opinion, l'analyse des sentiments, la classification sentiment, de détection de polarité, etc. Définition dans le contexte plus spécifique et plus simple, fouille des opinion est la tâche de récupération des opinions contre son besoin aussi exprimé par l'utilisateur sous la forme d'une requête. Il y a de nombreux problèmes et défis liés à l'activité fouille des opinion. Dans cette thèse, nous nous concentrons sur quelques problèmes d'analyse d'opinion. L'un des défis majeurs de fouille des opinion est de trouver des opinions concernant spécifiquement le sujet donné (requête). Un document peut contenir des informations sur de nombreux sujets à la fois et il est possible qu'elle contienne opiniâtre texte sur chacun des sujet ou sur seulement quelques-uns. Par conséquent, il devient très important de choisir les segments du document pertinentes à sujet avec leurs opinions correspondantes. Nous abordons ce problème sur deux niveaux de granularité, des phrases et des passages. Dans notre première approche de niveau de phrase, nous utilisons des relations sémantiques de WordNet pour trouver cette association entre sujet et opinion. Dans notre deuxième approche pour le niveau de passage, nous utilisons plus robuste modèle de RI i.e. la language modèle de se concentrer sur ce problème. L'idée de base derrière les deux contributions pour l'association d'opinion-sujet est que si un document contient plus segments textuels (phrases ou passages) opiniâtre et pertinentes à sujet, il est plus opiniâtre qu'un document avec moins segments textuels opiniâtre et pertinentes. La plupart des approches d'apprentissage-machine basée à fouille des opinion sont dépendants du domaine i.e. leurs performances varient d'un domaine à d'autre. D'autre part, une approche indépendant de domaine ou un sujet est plus généralisée et peut maintenir son efficacité dans différents domaines. Cependant, les approches indépendant de domaine souffrent de mauvaises performances en général. C'est un grand défi dans le domaine de fouille des opinion à développer une approche qui est plus efficace et généralisé. Nos contributions de cette thèse incluent le développement d'une approche qui utilise de simples fonctions heuristiques pour trouver des documents opiniâtre. Fouille des opinion basée entité devient très populaire parmi les chercheurs de la communauté IR. Il vise à identifier les entités pertinentes pour un sujet donné et d'en extraire les opinions qui leur sont associées à partir d'un ensemble de documents textuels. Toutefois, l'identification et la détermination de la pertinence des entités est déjà une tâche difficile. Nous proposons un système qui prend en compte à la fois l'information de l'article de nouvelles en cours ainsi que des articles antérieurs pertinents afin de détecter les entités les plus importantes dans les nouvelles actuelles. En plus de cela, nous présentons également notre cadre d'analyse d'opinion et tâches relieés. Ce cadre est basée sur les évidences contents et les évidences sociales de la blogosphère pour les tâches de trouver des opinions, de prévision et d'avis de classement multidimensionnel. Cette contribution d'prématurée pose les bases pour nos travaux futurs. L'évaluation de nos méthodes comprennent l'utilisation de TREC 2006 Blog collection et de TREC Novelty track 2004 collection. La plupart des évaluations ont été réalisées dans le cadre de TREC Blog track.Opinion mining is a sub-discipline within Information Retrieval (IR) and Computational Linguistics. It refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in various online sources like news articles, social media comments, and other user-generated content. It is also known by many other terms like opinion finding, opinion detection, sentiment analysis, sentiment classification, polarity detection, etc. Defining in more specific and simpler context, opinion mining is the task of retrieving opinions on an issue as expressed by the user in the form of a query. There are many problems and challenges associated with the field of opinion mining. In this thesis, we focus on some major problems of opinion mining

    SARE: A sentiment analysis research environment

    Get PDF
    Sentiment analysis is an important learning problem with a broad scope of applications. The meteoric rise of online social media and the increasing significance of public opinion expressed therein have opened doors to many challenges as well as opportunities for this research. The challenges have been articulated in the literature through a growing list of sentiment analysis problems and tasks, while the opportunities are constantly being availed with the introduction of new algorithms and techniques for solving them. However, these approaches often remain out of the direct reach of other researchers, who have to either rely on benchmark datasets, which are not always available, or be inventive with their comparisons. This thesis presents Sentiment Analysis Research Environment (SARE), an extendable and publicly-accessible system designed with the goal of integrating baseline and state of- the-art approaches to solving sentiment analysis problems. Since covering the entire breadth of the field is beyond the scope of this work, the usefulness of this environment is demonstrated by integrating solutions for certain facets of the aspect-based sentiment analysis problem. Currently, the system provides a semi-automatic method to support building gold-standard lexica, an automatic baseline method for extracting aspect expressions, and a pre-existing baseline sentiment analysis engine. Users are assisted in creating gold-standard lexica by applying our proposed set cover approximation algorithm, which finds a significantly reduced set of documents needed to create a lexicon. We also suggest a baseline semi-supervised aspect expression extraction algorithm based on a Support Vector Machine (SVM) classifier to automatically extract aspect expressions

    Application of Common Sense Computing for the Development of a Novel Knowledge-Based Opinion Mining Engine

    Get PDF
    The ways people express their opinions and sentiments have radically changed in the past few years thanks to the advent of social networks, web communities, blogs, wikis and other online collaborative media. The distillation of knowledge from this huge amount of unstructured information can be a key factor for marketers who want to create an image or identity in the minds of their customers for their product, brand, or organisation. These online social data, however, remain hardly accessible to computers, as they are specifically meant for human consumption. The automatic analysis of online opinions, in fact, involves a deep understanding of natural language text by machines, from which we are still very far. Hitherto, online information retrieval has been mainly based on algorithms relying on the textual representation of web-pages. Such algorithms are very good at retrieving texts, splitting them into parts, checking the spelling and counting their words. But when it comes to interpreting sentences and extracting meaningful information, their capabilities are known to be very limited. Existing approaches to opinion mining and sentiment analysis, in particular, can be grouped into three main categories: keyword spotting, in which text is classified into categories based on the presence of fairly unambiguous affect words; lexical affinity, which assigns arbitrary words a probabilistic affinity for a particular emotion; statistical methods, which calculate the valence of affective keywords and word co-occurrence frequencies on the base of a large training corpus. Early works aimed to classify entire documents as containing overall positive or negative polarity, or rating scores of reviews. Such systems were mainly based on supervised approaches relying on manually labelled samples, such as movie or product reviews where the opinionist’s overall positive or negative attitude was explicitly indicated. However, opinions and sentiments do not occur only at document level, nor they are limited to a single valence or target. Contrary or complementary attitudes toward the same topic or multiple topics can be present across the span of a document. In more recent works, text analysis granularity has been taken down to segment and sentence level, e.g., by using presence of opinion-bearing lexical items (single words or n-grams) to detect subjective sentences, or by exploiting association rule mining for a feature-based analysis of product reviews. These approaches, however, are still far from being able to infer the cognitive and affective information associated with natural language as they mainly rely on knowledge bases that are still too limited to efficiently process text at sentence level. In this thesis, common sense computing techniques are further developed and applied to bridge the semantic gap between word-level natural language data and the concept-level opinions conveyed by these. In particular, the ensemble application of graph mining and multi-dimensionality reduction techniques on two common sense knowledge bases was exploited to develop a novel intelligent engine for open-domain opinion mining and sentiment analysis. The proposed approach, termed sentic computing, performs a clause-level semantic analysis of text, which allows the inference of both the conceptual and emotional information associated with natural language opinions and, hence, a more efficient passage from (unstructured) textual information to (structured) machine-processable data. The engine was tested on three different resources, namely a Twitter hashtag repository, a LiveJournal database and a PatientOpinion dataset, and its performance compared both with results obtained using standard sentiment analysis techniques and using different state-of-the-art knowledge bases such as Princeton’s WordNet, MIT’s ConceptNet and Microsoft’s Probase. Differently from most currently available opinion mining services, the developed engine does not base its analysis on a limited set of affect words and their co-occurrence frequencies, but rather on common sense concepts and the cognitive and affective valence conveyed by these. This allows the engine to be domain-independent and, hence, to be embedded in any opinion mining system for the development of intelligent applications in multiple fields such as Social Web, HCI and e-health. Looking ahead, the combined novel use of different knowledge bases and of common sense reasoning techniques for opinion mining proposed in this work, will, eventually, pave the way for development of more bio-inspired approaches to the design of natural language processing systems capable of handling knowledge, retrieving it when necessary, making analogies and learning from experience
    • …
    corecore