83 research outputs found

    Grounding event references in news

    Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference, so computational processing proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better considered through explicit references to background events. In this context, we propose the event linking task, which, analogous to named entity linking or disambiguation, models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking thus aims to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation.
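    As a rough illustration of the retrieval approach this abstract describes, the sketch below ranks archive articles for an event reference by combining a term-overlap feature with a temporal-proximity feature under learned weights. The feature definitions, decay function, and weight values are illustrative assumptions, not the thesis system.

```python
from dataclasses import dataclass
from datetime import date

# Sketch: score archive articles as link targets for an event reference,
# combining term overlap with temporal proximity. Illustrative only.

@dataclass
class Article:
    doc_id: str
    published: date
    tokens: set[str]

def term_overlap(ref_tokens: set[str], art: Article) -> float:
    """Jaccard overlap between the reference context and the article."""
    if not ref_tokens or not art.tokens:
        return 0.0
    return len(ref_tokens & art.tokens) / len(ref_tokens | art.tokens)

def temporal_feature(ref_date: date, art: Article) -> float:
    """Decays with the gap between a date mentioned near the reference
    and the article's publication date; a preference for the earliest
    report could be added, since the task links to the first report."""
    days = abs((ref_date - art.published).days)
    return 1.0 / (1.0 + days)

def score(ref_tokens, ref_date, art, w_term=0.7, w_time=0.3):
    # Weights would be learned from the hyperlink corpora; these values
    # are placeholders.
    return w_term * term_overlap(ref_tokens, art) + w_time * temporal_feature(ref_date, art)

def link(ref_tokens, ref_date, archive):
    """Return the highest-scoring archive article for one reference."""
    return max(archive, key=lambda a: score(ref_tokens, ref_date, a))
```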


    Tipping the scales: exploring the added value of deep semantic processing on readability prediction and sentiment analysis

    Applications which make use of natural language processing (NLP) are said to benefit more from incorporating a rich model of text meaning than from a basic bag-of-words representation. This thesis set out to explore the added value of incorporating deep semantic information in two end-user applications that normally rely mostly on superficial and lexical information, viz. readability prediction and aspect-based sentiment analysis. For both applications we apply supervised machine learning techniques and focus on the incorporation of coreference and semantic role information. To this end, we adapted a Dutch coreference resolution system and developed a semantic role labeler for Dutch. We tested the cross-genre robustness of both systems and in a next phase retrained them on a large corpus comprising a variety of text genres. For the readability prediction task, we first built a general-purpose corpus consisting of a large variety of text genres, which was then assessed on readability. Moreover, we proposed an assessment technique which had not previously been used in readability assessment, namely crowdsourcing, and revealed that crowdsourcing is a viable alternative to the more traditional technique of having experts assign labels. We built the first state-of-the-art classification-based readability prediction system relying on a rich feature space of traditional, lexical, syntactic and shallow semantic features. Furthermore, we enriched this tool by introducing new features based on coreference resolution and semantic role labeling. We then explored the added value of incorporating this deep semantic information in two rounds of experiments. In the first round these features were manually included or excluded, and in the second round joint optimization experiments were performed using a wrapper-based feature selection system based on genetic algorithms. In both setups, we investigated whether there was a difference in performance when these features were derived from gold-standard information compared to when they were automatically generated, which allowed us to assess the true upper bound of incorporating this type of information. Our results revealed that readability classification definitely benefits from the incorporation of semantic information in the form of coreference and semantic role features. More precisely, we found that the best results were achieved after jointly optimizing the hyperparameters and semantic features using genetic algorithms. Contrary to our expectations, we observed that our system achieved its best performance when relying on the automatically predicted deep semantic features. This is an interesting result, as our ultimate goal is to predict readability based exclusively on automatically derived information sources.
    For the aspect-based sentiment analysis task, we developed the first Dutch end-to-end system. We therefore collected a corpus of Dutch restaurant reviews and annotated each review with aspect term expressions and polarity. For the creation of our system, we distinguished three individual subtasks: aspect term extraction, aspect category classification and aspect polarity classification. We then investigated the added value of our two semantic information layers in the second subtask of aspect category classification. In a first setup, we focused on investigating the added value of performing coreference resolution prior to classification in order to derive which implicit aspect terms (anaphors) could be linked to which explicit aspect terms (antecedents). In these experiments, we explored how the performance of a baseline classifier relying on lexical information alone would benefit from additional semantic information in the form of lexical-semantic and semantic role features. We hypothesized that if coreference resolution was performed prior to classification, more of this semantic information could be derived for the implicit aspect terms, which would result in a better performance. In this respect, we optimized our classifier using a wrapper-based approach to feature selection and compared a setting where we relied on gold-standard anaphor-antecedent pairs to a setting where these had been predicted. Our results revealed a very moderate performance gain and underlined that incorporating coreference information only proves useful when integrating gold-standard coreference annotations. When coreference relations were derived automatically, this led to an overall decrease in performance because of semantic mismatches. When comparing the semantic role to the lexical-semantic features, it seemed that especially the latter allowed for a better performance. In a second setup, we investigated how to resolve implicit aspect terms. We compared a setting where gold-standard coreference resolution was used for this purpose to a setting where the implicit aspects were derived from a simple subjectivity heuristic. Our results revealed that this heuristic yields better coverage and performance, which means that, overall, it was difficult to find an added value in resolving coreference first. Does deep semantic information help tip the scales on performance? For Dutch readability prediction, we found that it does, when integrated in a state-of-the-art classifier. By using such information for Dutch aspect-based sentiment analysis, we found that this approach adds weight to the scales, but cannot make them tip.
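    The joint optimization this abstract refers to can be pictured with a minimal wrapper-based feature selection sketch: a genetic algorithm evolves bitmasks over feature columns, scoring each mask by cross-validated accuracy. The classifier, population size, and operators here are illustrative assumptions, not the thesis configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Wrapper fitness: cross-validated accuracy of the masked features."""
    if not mask.any():
        return 0.0
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

def evolve(X, y, pop_size=20, generations=10, p_mut=0.05):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5           # random initial bitmasks
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]       # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut          # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])        # elitist replacement
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()]                     # best feature mask found
```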

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Information extraction from unstructured and semi-structured data sources

    Thesis objective: in the context of recently developed large-scale knowledge sources (general ontologies), investigate possible new approaches to major areas of Information Extraction (IE) and related fields. The thesis overviews the field of Information Extraction and focuses on the task of entity recognition in natural language texts, a required step for any IE system. Given the availability of large knowledge resources in the form of semantic graphs, an approach that treats the sub-tasks of Word Sense Disambiguation (WSD) and Named Entity Recognition (NER) in a unified manner becomes possible. The first system implemented in this thesis recognizes entities (words, both common and proper nouns) in free text and assigns them ontological classes, effectively disambiguating them. A second implemented system, inspired by the semantic information contained in the ontologies, also attempts a new approach to the classic problem of text classification, showing good results.
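    To make the unified treatment of WSD and NER concrete, here is a minimal Lesk-style sketch: candidate ontology entries are retrieved by surface form and disambiguated by gloss-context overlap. The toy ontology and the overlap heuristic are stand-ins for the large knowledge source and the thesis's actual method.

```python
# Tiny stand-in ontology: each surface form maps to candidate entries,
# each with an identifier and a set of gloss words.
ONTOLOGY = {
    "bank": [
        {"id": "bank/finance", "gloss": {"money", "account", "loan"}},
        {"id": "bank/river", "gloss": {"river", "shore", "water"}},
    ],
}

def disambiguate(mention: str, context: set[str]) -> str | None:
    """Assign the ontology entry whose gloss best overlaps the context."""
    candidates = ONTOLOGY.get(mention.lower(), [])
    if not candidates:
        return None
    best = max(candidates, key=lambda c: len(c["gloss"] & context))
    return best["id"]

print(disambiguate("bank", {"she", "opened", "an", "account"}))
# -> bank/finance
```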

    A pilot study in an application of text mining to learning system evaluation

    Text mining concerns discovering and extracting knowledge from unstructured data. It transforms textual data into a usable, intelligible format that facilitates classifying documents, finding explicit relationships or associations between documents, and clustering documents into categories. Given a collection of survey comments evaluating the civil engineering learning system, a text mining technique is applied to discover and extract knowledge from the comments. This research focuses on a systematic way to apply a software tool, SAS Enterprise Miner, to the survey data. The purpose is to categorize the comments into different groups in an attempt to identify the major concerns of the users or students. Each group is associated with a set of key terms. This can assist the evaluators of the learning system in obtaining ideas from the summarized terms without needing to go through a potentially huge amount of data --Abstract, page iii.
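    An open-source analogue of the described workflow (the study itself used SAS Enterprise Miner) might look like the following sketch: vectorize the comments, cluster them, and report key terms per cluster. The sample comments and parameter choices are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical survey comments standing in for the real evaluation data.
comments = [
    "the simulations helped me understand beam loading",
    "navigation between modules is confusing",
    "more worked examples on beam loading please",
    "the menu layout makes modules hard to find",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(comments)                       # comments -> TF-IDF

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Report the highest-weight terms per cluster centroid, mirroring the
# "each group is associated with a set of key terms" step.
terms = vec.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```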

    A distributional and syntactic approach to fine-grained opinion mining

    This thesis contributes to a larger social science research program of analyzing the diffusion of IT innovations. We show how to automatically discriminate portions of text dealing with opinions about innovations by finding {source, target, opinion} triples in text. In this context, we can discern a list of innovations as targets from the domain itself. We can then use this list as an anchor for finding the other two members of the triple at a "fine-grained" level: paragraph contexts or less. We first demonstrate a vector space model for finding opinionated contexts in which the innovation targets are mentioned. We can find paragraph-level contexts by searching for an "expresses-an-opinion-about" relation between sources and targets using a supervised SVM model with features derived from a general-purpose subjectivity lexicon and a corpus indexing tool. We show that our algorithm correctly filters the domain-relevant subset of subjectivity terms so that they are more highly valued. We then turn to identifying the opinion. Typically, opinions in opinion mining are taken to be positive or negative. We discuss a crowdsourcing technique developed to create the seed data describing human perception of opinion-bearing language needed for our supervised learning algorithm. Our user interface successfully limited the meta-subjectivity inherent in the task ("What is an opinion?") while reliably retrieving relevant opinionated words using labour that was not expert in the domain. Finally, we developed a new data structure and modeling technique for connecting targets with the correct within-sentence opinionated language. Syntactic relatedness tries (SRTs) contain all paths from a dependency graph of a sentence that connect a target expression to a candidate opinionated word. We use factor graphs to model how far a path through the SRT must be followed in order to connect the right targets to the right words. It turns out that we can correctly label significant portions of these tries with very rudimentary features, such as part-of-speech tags and dependency labels, with minimal processing. This technique uses the data from our crowdsourcing technique as training data. We conclude by placing our work in the context of a larger sentiment classification pipeline and by describing a model for learning from the data structures produced by our work. This work contributes to computational linguistics by proposing and verifying new data-gathering techniques and applying recent developments in machine learning to inference over grammatical structures for highly subjective purposes. It applies a suffix-tree-based data structure to model opinion in a specific domain by imposing a restriction on the order in which the data is stored in the structure.
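    The paths an SRT stores can be illustrated with a small sketch: a breadth-first search over an undirected view of a dependency graph recovers the path connecting a target token to a candidate opinion word, which can then be rendered with rudimentary features such as part-of-speech tags. The toy sentence and edge list are assumptions standing in for real parser output.

```python
from collections import deque

# "The new framework simplifies deployment" -- (head, dependent, label)
edges = [(2, 0, "det"), (2, 1, "amod"), (3, 2, "nsubj"), (3, 4, "obj")]
pos = ["DET", "ADJ", "NOUN", "VERB", "NOUN"]

# Build an undirected adjacency view of the dependency graph.
adj = {}
for head, dep, label in edges:
    adj.setdefault(head, []).append((dep, label))
    adj.setdefault(dep, []).append((head, label))

def path(src: int, dst: int):
    """BFS: return the token path connecting src to dst, if any."""
    seen, queue = {src}, deque([(src, [src])])
    while queue:
        node, p = queue.popleft()
        if node == dst:
            return p
        for nxt, _ in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, p + [nxt]))
    return None

# Path from target "framework" (2) to opinion candidate "simplifies" (3),
# rendered with part-of-speech tags:
print([pos[i] for i in path(2, 3)])   # -> ['NOUN', 'VERB']
```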

    Opinion Mining of Sociopolitical Comments from Social Media


    Word Clustering in Biased Contexts Using Dimensionality Reduction

    Master's thesis (M.A.) -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, August 2020. Advisor: 신효필 (Hyopil Shin). Bias can be defined as disproportionate weight in favor of or against one thing, person, or group compared with another. Recently, the issue of bias in machine learning and how to de-bias natural language processing has been a topic of increasing interest. This research examines bias in language, the effect of context on biased judgements, and the clustering of biased- and neutral-judged words taken from biased contexts. The data for this study comes from the Wikipedia Neutrality Corpus (WNC), and its representation as word embeddings is from the bias-neutralizing modular model of Pryzant et al. (2019). Visualization of the embeddings is done using K-means clustering to compare the embeddings before and after the addition of the v vector, which holds bias information. Principal Component Analysis (PCA) is also used in an attempt to boost clustering performance. This study finds that, just as word embeddings cluster according to linguistic features, biased words also cluster according to bias type: epistemological bias, framing bias, and demographic bias. It also presents evidence that the word embeddings, once combined with the modular model's unique v vector, contain discrete linguistic information that helps not only in the task of detecting and neutralizing bias, but also in recognizing context.
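    The analysis pipeline this abstract describes (appending the v vector, reducing with PCA, clustering with K-means) can be sketched as follows; the random matrices stand in for the WNC embeddings and the modular model's v vector.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 300))          # placeholder word vectors
v = rng.normal(size=(500, 1))                     # placeholder v (bias) vector
with_bias = np.hstack([embeddings, v])            # embeddings + bias info

# Reduce dimensionality, then cluster the reduced representations.
reduced = PCA(n_components=2).fit_transform(with_bias)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

for k in range(3):
    print(f"cluster {k}: {np.sum(labels == k)} words")
```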