10 research outputs found

    Recognizing Biographical Sections in Wikipedia

    Get PDF
    Wikipedia is the largest collection of encyclopedic data ever written in the history of humanity. Thanks to its coverage and its availability in machine-readable format, it has become a primary resource for large-scale research in historical and cultural studies. In this work, we focus on the subset of pages describing persons, and we investigate the task of recognizing biographical sections in them: given a person's page, we identify the list of sections where information about her/his life is present. We model this as a sequence classification problem and propose a supervised setting in which the training data are acquired automatically. In addition, we show that six simple features extracted only from the section titles are very informative and yield good results well above a strong baseline.
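
    A minimal sketch of the title-feature idea, assuming hypothetical features (the abstract does not list the actual six) and a plain per-section classifier in place of the paper's sequence model:

        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Hypothetical title-based features; the paper's six features are not
        # specified in the abstract, so these are illustrative stand-ins.
        def title_features(title, position, n_sections):
            t = title.lower()
            return {
                "title": t,                                  # lexicalized title
                "position": position / max(1, n_sections),   # relative position in the page
                "has_digit": any(ch.isdigit() for ch in t),  # e.g. "1990-2001"
                "n_tokens": len(t.split()),
                "life_word": any(w in t for w in ("life", "career", "biography", "death")),
                "is_first": position == 0,
            }

        # Toy training data: section titles of two person pages, 1 = biographical.
        pages = [
            (["Early life", "Career", "Discography", "References"], [1, 1, 0, 0]),
            (["Biography", "Selected works", "Awards", "External links"], [1, 0, 0, 0]),
        ]
        X = [title_features(t, i, len(ts)) for ts, _ in pages for i, t in enumerate(ts)]
        y = [lab for _, labels in pages for lab in labels]

        clf = make_pipeline(DictVectorizer(), LogisticRegression())
        clf.fit(X, y)
        print(clf.predict([title_features("Personal life", 2, 5)]))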

    Multi-lingual Opinion Mining on YouTube

    Get PDF
    In order to successfully apply opinion mining (OM) to the large amounts of user-generated content produced every day, we need robust models that can handle noisy input well yet can easily be adapted to a new domain or language. Here we focus on opinion mining for YouTube by (i) modeling classifiers that predict the type of a comment and its polarity, while distinguishing whether the polarity is directed towards the product or the video; (ii) proposing a robust shallow syntactic structure (STRUCT) that adapts well when tested across domains; and (iii) evaluating the effectiveness of the proposed structure on two languages, English and Italian. We rely on tree kernels to automatically extract and learn features with better generalization power than traditionally used bag-of-words models. Our extensive empirical evaluation shows that (i) STRUCT outperforms the bag-of-words model within the same domain (up to 2.6% and 3% absolute improvement for Italian and English, respectively); (ii) it is particularly useful when tested across domains (more than 4% absolute improvement for both languages), especially when little training data is available (up to 10% absolute improvement); and (iii) the proposed structure is also effective in a lower-resource language scenario, where only less accurate linguistic processing tools are available.
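
    For illustration, a sketch of the bag-of-words baseline that STRUCT is compared against; the tree-kernel models themselves require a kernel-capable learner over parse trees and are not reproduced here. The comments and labels below are invented toy data:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Toy YouTube-style comments with polarity labels.
        comments = [
            "this phone is amazing, great camera",
            "terrible audio quality in this video",
            "love the new features on this phone",
            "the video is blurry and badly edited",
        ]
        polarity = ["pos", "neg", "pos", "neg"]

        # Bag-of-words (here: unigrams and bigrams) with a linear SVM.
        clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
        clf.fit(comments, polarity)
        print(clf.predict(["the camera on this phone is great"]))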

    Sentiment Propagation via Implicature Constraints

    Full text link
    Opinions may be expressed implicitly via inference over explicit sentiments and events that positively or negatively affect entities (goodFor/badFor events). We investigate how such inferences may be exploited to improve sentiment analysis, given goodFor/badFor event information. We apply Loopy Belief Propagation to propagate sentiments among entities. The graph-based model improves over explicit sentiment classification by 10 points in precision and, in an evaluation of the model itself, we find it has an 89% chance of propagating sentiments correctly.
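
    A minimal sketch of pairwise loopy belief propagation over an entity graph, assuming toy priors and a simple agree/disagree compatibility matrix for goodFor/badFor edges (the paper's implicature constraints are richer than this):

        import numpy as np

        # States: index 0 = negative, index 1 = positive sentiment toward an entity.
        # Toy priors from an explicit sentiment classifier; edges from goodFor/badFor
        # events.  "goodFor" edges prefer agreeing polarities, "badFor" edges prefer
        # opposite polarities -- a simplification of the paper's implicature rules.
        priors = {
            "companyA": np.array([0.3, 0.7]),   # explicitly praised
            "productX": np.array([0.5, 0.5]),   # no explicit sentiment
            "bill":     np.array([0.6, 0.4]),   # explicitly criticized
        }
        AGREE = np.array([[0.9, 0.1], [0.1, 0.9]])      # goodFor
        DISAGREE = np.array([[0.1, 0.9], [0.9, 0.1]])   # badFor
        edges = [("companyA", "productX", AGREE),       # companyA improves productX
                 ("bill", "productX", DISAGREE)]        # the bill harms productX

        def loopy_bp(priors, edges, iters=20):
            # msgs[(i, j)] = message from node i to node j (length-2 vector)
            msgs = {(i, j): np.ones(2) for a, b, _ in edges for i, j in ((a, b), (b, a))}
            pots = {}
            for a, b, pot in edges:
                pots[(a, b)] = pot        # rows index a's states, columns index b's
                pots[(b, a)] = pot.T
            for _ in range(iters):
                new = {}
                for (i, j) in msgs:
                    # product of i's prior and all incoming messages except the one from j
                    belief = priors[i].copy()
                    for (k, l), m in msgs.items():
                        if l == i and k != j:
                            belief = belief * m
                    out = pots[(i, j)].T @ belief     # marginalize over i's states
                    new[(i, j)] = out / out.sum()
                msgs = new
            beliefs = {}
            for n in priors:
                b = priors[n].copy()
                for (k, l), m in msgs.items():
                    if l == n:
                        b = b * m
                beliefs[n] = b / b.sum()
            return beliefs

        for entity, belief in loopy_bp(priors, edges).items():
            print(entity, belief.round(2))

    In this toy graph, the positive sentiment toward companyA and the negative sentiment toward the harmful bill both push productX toward a positive belief.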

    Paris and Stanford at EPE 2017: Downstream Evaluation of Graph-based Dependency Representations

    Get PDF
    We describe the STANFORD-PARIS and PARIS-STANFORD submissions to the 2017 Extrinsic Parser Evaluation (EPE) Shared Task. The purpose of this shared task was to evaluate dependency graphs on three downstream tasks. Through our submissions, we evaluated the usability of several representations derived from English Universal Dependencies (UD), as well as the Stanford Dependencies (SD), Predicate Argument Structure (PAS), and DM representations. We further compared two parsing strategies: directly parsing to graph-based dependency representations, and a two-stage process of first parsing to surface syntax trees and then applying rule-based augmentations to obtain the final graphs. Overall, our systems performed very well and our submissions ranked first and third. In our analysis, we find that the two-stage parsing process leads to better downstream performance, and that enhanced UD, a graph-based representation, consistently outperforms basic UD, a strict surface syntax representation, suggesting an advantage of enriched representations for downstream tasks.
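
    As a toy illustration of the two-stage idea, the sketch below applies a single enhanced-UD-style augmentation rule (propagating a shared subject to a conjoined verb) to a basic dependency tree given as (head, relation, dependent) triples; the submitted systems use full rule sets, not this one rule:

        def propagate_conjuncts(edges):
            """edges: (head_index, relation, dependent_index) triples of a basic UD tree."""
            augmented = list(edges)
            for head, rel, dep in edges:
                if rel != "conj":
                    continue
                # share the first conjunct's subject/object with the other conjunct
                for h2, rel2, dep2 in edges:
                    if h2 == head and rel2 in ("nsubj", "obj"):
                        augmented.append((dep, rel2, dep2))
            return augmented

        # "Sue reads and writes": tokens 1=Sue 2=reads 3=and 4=writes
        tree = [(0, "root", 2), (2, "nsubj", 1), (2, "cc", 3), (2, "conj", 4)]
        print(propagate_conjuncts(tree))  # adds (4, 'nsubj', 1): Sue is also the subject of "writes"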

    Sentiment Analysis of Short Informal Texts

    Get PDF
    We describe a state-of-the-art sentiment analysis system that detects (a) the sentiment of short informal textual messages such as tweets and SMS (message-level task) and (b) the sentiment of a word or a phrase within a message (term-level task). The system is based on a supervised statistical text classification approach leveraging a variety of surface-form, semantic, and sentiment features. The sentiment features are primarily derived from novel high-coverage tweet-specific sentiment lexicons. These lexicons are automatically generated from tweets with sentiment-word hashtags and from tweets with emoticons. To adequately capture the sentiment of words in negated contexts, a separate sentiment lexicon is generated for negated words. The system ranked first in the SemEval-2013 shared task 'Sentiment Analysis in Twitter' (Task 2), obtaining an F-score of 69.02 in the message-level task and 88.93 in the term-level task. Post-competition improvements boost the performance to an F-score of 70.45 (message-level task) and 89.50 (term-level task). The system also obtains state-of-the-art performance on two additional datasets: the SemEval-2013 SMS test set and a corpus of movie review excerpts. The ablation experiments demonstrate that the use of the automatically generated lexicons results in performance gains of up to 6.5 absolute percentage points.
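
    A minimal sketch of building such a lexicon from hashtag-labeled tweets, assuming a PMI-style association score (the exact scoring used for the paper's lexicons may differ); the example tweets are invented:

        import math
        from collections import Counter

        def build_lexicon(labeled_tweets):
            """labeled_tweets: iterable of (token_list, 'pos' | 'neg') pairs."""
            word_label = Counter()   # (word, label) co-occurrence counts
            label_count = Counter()  # total word occurrences per label
            words = set()
            for tokens, label in labeled_tweets:
                for w in set(tokens):            # count each word once per tweet
                    word_label[(w, label)] += 1
                    label_count[label] += 1
                    words.add(w)
            lexicon = {}
            for w in words:
                # add-1 smoothing keeps unseen (word, label) pairs finite
                p_w_pos = (word_label[(w, "pos")] + 1) / (label_count["pos"] + 1)
                p_w_neg = (word_label[(w, "neg")] + 1) / (label_count["neg"] + 1)
                # PMI(w, pos) - PMI(w, neg) reduces to this conditional-probability ratio
                lexicon[w] = math.log(p_w_pos / p_w_neg)
            return lexicon

        tweets = [("great day #happy".split(), "pos"),
                  ("worst service ever #sad".split(), "neg"),
                  ("great food #happy".split(), "pos")]
        lexicon = build_lexicon(tweets)
        print(sorted(lexicon.items(), key=lambda kv: -kv[1])[:3])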

    Extracting and Attributing Quotes in Text and Assessing them as Opinions

    Get PDF
    News articles often report on the opinions that salient people have about important issues. While it is possible to infer an opinion from a person's actions, it is much more common to demonstrate that a person holds an opinion by reporting on what they have said. These instances of speech are called reported speech, and in this thesis we set out to detect instances of reported speech, attribute them to their speaker, and identify which instances provide evidence of an opinion. We first focus on extracting reported speech, which involves finding all acts of communication that are reported in an article. Previous work has approached this task with rule-based methods; however, there are several factors that confound these approaches. To demonstrate this, we build a corpus of 965 news articles in which we mark all instances of speech. We then show that a supervised token-based approach outperforms all of our rule-based alternatives, even in extracting direct quotes. Next, we examine the problem of finding the speaker of each quote. For this task we annotate the same 965 news articles with links from each quote to its speaker. Using this, and three other corpora, we develop new methods and features for quote attribution, which achieve state-of-the-art accuracy on our corpus and strong results on the others. Having extracted quotes and determined who spoke them, we move on to the opinion mining part of our work. Most of the task definitions in opinion mining do not easily apply to opinions in news, so we define a new task, in which the aim is to classify whether quotes demonstrate support, neutrality, or opposition to a given position statement. This formulation improved annotator agreement when compared to our earlier annotation schemes. Using it, we build an opinion corpus of 700 news documents covering 7 topics. In this thesis we do not attempt this full task, but we do present preliminary results.
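
    For illustration, a toy rule-based baseline of the kind the thesis compares against (not its supervised model): extract the text between quotation marks and attribute it to the nearest capitalized token next to a speech verb:

        import re

        SPEECH_VERBS = {"said", "says", "told", "added", "claimed"}

        def extract_direct_quotes(text):
            quotes = []
            for m in re.finditer(r'"([^"]+)"', text):
                # look at a small window of text around the quote for a speech verb
                window = text[max(0, m.start() - 60):m.start()] + text[m.end():m.end() + 60]
                tokens = window.split()
                speaker = None
                for i, tok in enumerate(tokens):
                    if tok.lower().strip(",.") in SPEECH_VERBS:
                        # take the nearest capitalized token before or after the verb
                        for cand in tokens[i - 1:i] + tokens[i + 1:i + 2]:
                            if cand and cand[0].isupper():
                                speaker = cand.strip(",.")
                                break
                        if speaker:
                            break
                quotes.append((m.group(1), speaker))
            return quotes

        article = 'Smith said, "We will appeal the decision." Observers were unconvinced.'
        print(extract_direct_quotes(article))  # [('We will appeal the decision.', 'Smith')]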

    Genre and Domain Dependencies in Sentiment Analysis

    Get PDF
    Genre and domain influence an author's style of writing and therefore a text's characteristics. Natural language processing is prone to such variations in textual characteristics: it is said to be genre and domain dependent. This thesis investigates genre and domain dependencies in sentiment analysis. Its goal is to support the development of robust sentiment analysis approaches that work well and in a predictable manner under different conditions, i.e. for different genres and domains. Initially, we show that a prototypical approach to sentiment analysis -- viz. a supervised machine learning model based on word n-gram features -- performs differently on gold standards that originate from differing genres and domains, but performs similarly on gold standards that originate from resembling genres and domains. We show that these gold standards differ in certain textual characteristics, viz. their domain complexity. We find a strong linear relation between our approach's accuracy on a particular gold standard and its domain complexity, which we then use to estimate our approach's accuracy. Subsequently, we use certain textual characteristics -- viz. domain complexity, domain similarity, and readability -- in a variety of applications. Domain complexity and domain similarity measures are used to determine parameter settings in two tasks. Domain complexity guides us in model selection for in-domain polarity classification, viz. in decisions regarding word n-gram model order and word n-gram feature selection. Domain complexity and domain similarity guide us in domain adaptation. We propose a novel domain adaptation scheme and apply it to cross-domain polarity classification in semi- and unsupervised domain adaptation scenarios. Readability is used for feature engineering. We propose to adopt readability gradings, readability indicators, as well as word and syntax distributions as features for subjectivity classification. Moreover, we generalize a framework for modeling and representing negation in machine learning-based sentiment analysis. This framework is applied to in-domain and cross-domain polarity classification. We investigate the relation between implicit and explicit negation modeling, the influence of negation scope detection methods, and the efficiency of the framework in different domains. Finally, we carry out a case study in which we transfer the core methods of our thesis -- viz. domain complexity-based accuracy estimation, domain complexity-based model selection, and negation modeling -- to a gold standard that originates from a genre and domain hitherto not used in this thesis.
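
    As a sketch of how a corpus-level domain measure can be computed, the snippet below uses Jensen-Shannon divergence between unigram distributions as a stand-in for the thesis's actual domain similarity and complexity measures; the two toy corpora are invented:

        import math
        from collections import Counter

        def unigram_dist(docs, vocab):
            counts = Counter(w for doc in docs for w in doc.lower().split())
            total = sum(counts[w] + 1 for w in vocab)          # add-1 smoothing
            return {w: (counts[w] + 1) / total for w in vocab}

        def js_divergence(p, q):
            def kl(a, b):
                return sum(a[w] * math.log(a[w] / b[w], 2) for w in a)
            m = {w: 0.5 * (p[w] + q[w]) for w in p}
            return 0.5 * kl(p, m) + 0.5 * kl(q, m)

        books = ["a gripping story with wonderful characters", "dull plot and flat prose"]
        phones = ["battery life is great", "screen cracked after a week"]
        vocab = set(w for doc in books + phones for w in doc.lower().split())
        p, q = unigram_dist(books, vocab), unigram_dist(phones, vocab)
        print(round(js_divergence(p, q), 3))   # 0 = identical domains, 1 = disjoint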

    Information models in sentiment analysis based on linguistic resources

    Get PDF
    The beginning of the new millennium was marked by the rapid development of social networks, cloud-based internet technologies, and applications of artificial intelligence tools on the web. The extremely rapid growth in the number of texts on the Internet (blogs, e-commerce websites, forums, discussion groups, systems for transmitting short messages, social networks, and news portals) has increased the need for methods of rapid, comprehensive, and accurate text analysis. This has driven the development of language technologies whose primary tasks are document classification, document clustering, information retrieval, word-sense disambiguation, text extraction, machine translation, computer speech recognition, natural language generation, sentiment analysis, etc. In computational linguistics, several different names are in use for the area concerned with processing emotions in text: sentiment classification, opinion mining, sentiment analysis, and sentiment extraction. By its nature and the methods it uses, sentiment analysis of text belongs to the field of computational linguistics that deals with text classification. In the process of analysing emotions, we generally speak of three kinds of text classification:...

    Structurally informed methods for improved sentiment analysis

    Get PDF
    Sentiment analysis deals with methods to automatically analyze opinions in natural language texts, e.g., product reviews. Such reviews contain a large number of fine-grained opinions, but to automatically extract detailed information it is necessary to handle a wide variety of verbalizations of opinions. The goal of this thesis is to develop robust, structurally informed models for sentiment analysis which address challenges that arise from structurally complex verbalizations of opinions. In this thesis, we look at two examples of such verbalizations that benefit from including structural information in the analysis: negation and comparisons. Negation directly influences the polarity of sentiment expressions, e.g., while "good" is positive, "not good" expresses a negative opinion. We propose a machine learning approach that uses information from dependency parse trees to determine whether a sentiment word is in the scope of a negation expression. Comparisons like "X is better than Y" are the main topic of this thesis. We present a machine learning system for the task of detecting the individual components of comparisons: the anchor or predicate of the comparison, the entities that are compared, the aspect in which they are compared, and which entity is preferred. Again, we use structural context from a dependency parse tree to improve the performance of our system. We discuss two ways of addressing the limited availability of training data for our system. First, we create a manually annotated corpus of comparisons in product reviews, the largest such resource available to date. Second, we use the semi-supervised method of structural alignment to expand a small seed set of labeled sentences with similar sentences from a large set of unlabeled sentences. Finally, we work on the task of producing a ranked list of products that complements the isolated prediction of ratings and supports the user in a process of decision making. We demonstrate how we can use the information from comparisons to rank products, and we evaluate the result against two conceptually different external gold-standard rankings.
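
    A toy illustration of dependency-based negation scope detection in the spirit of the negation component (not the thesis's actual feature set): a sentiment word counts as negated if a negation cue attaches to it or to its head in the parse, given as (head, relation, dependent) triples:

        NEGATION_CUES = {"not", "never", "no", "n't"}

        def is_negated(sentiment_idx, edges, tokens):
            """edges: (head_index, relation, dependent_index) triples, 1-based token indices."""
            head_of = {dep: head for head, _, dep in edges}
            for head, rel, dep in edges:
                cue = tokens[dep - 1].lower() in NEGATION_CUES
                if cue and head in (sentiment_idx, head_of.get(sentiment_idx)):
                    return True
            return False

        # "The camera is not good": tokens 1=The 2=camera 3=is 4=not 5=good
        tokens = ["The", "camera", "is", "not", "good"]
        edges = [(0, "root", 5), (5, "nsubj", 2), (2, "det", 1), (5, "cop", 3), (5, "advmod", 4)]
        print(is_negated(5, edges, tokens))   # True: "not" attaches to "good"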