14 research outputs found

    A discursive grid approach to model local coherence in multi-document summaries

    Multi-document summarization is an important area of Natural Language Processing (NLP) because of the huge amount of data on the web. People want ever more information, and this information must be coherently organized and summarized. This paper focuses on the coherence of multi-document summaries. To that end, a model was developed that uses discursive information to automatically evaluate local coherence in multi-document summaries. The model achieves 92.69% accuracy in distinguishing coherent from incoherent summaries, outperforming the state of the art in the area. Funding: CAPES, FAPESP, University of Goiá

    Cross-lingual RST Discourse Parsing

    Discourse parsing is an integral part of understanding information flow and argumentative structure in documents. Most previous research has focused on inducing and evaluating models from the English RST Discourse Treebank. However, discourse treebanks for other languages exist, including Spanish, German, Basque, Dutch and Brazilian Portuguese. The treebanks share the same underlying linguistic theory, but differ slightly in the way documents are annotated. In this paper, we present (a) a new discourse parser which is simpler than, yet competitive with, the state of the art for English (significantly better on 2 of 3 metrics), (b) a harmonization of discourse treebanks across languages, enabling us to present (c) what to the best of our knowledge are the first experiments on cross-lingual discourse parsing. Comment: To be published in EACL 2017, 13 page

    Semi-supervised never-ending learning in rhetorical relation identification

    Some languages do not have enough labeled data to obtain good discourse parsing, especially in the relation identification step, and the additional use of unlabeled data is a plausible solution. A workflow is presented that uses a semi-supervised learning approach. Instead of relying only on a pre-defined additional set of unlabeled data, texts obtained from the web are continuously added. This workflow obtains near-human performance (0.79) in intrasentential rhetorical relation identification. An experiment for English also shows improvement using a similar workflow. Funding: São Paulo Research Foundation (FAPESP) (grant #2014/11632), Natural Sciences and Engineering Research Council of Canada, University of Toront

    Linguistic Tests for Discourse Relations

    Discourse structure and discourse relations are important ingredients in systems for the analysis of text that go beyond the boundary of single clauses. Discourse relations often indicate important additional information about the connection between two clauses, such as causality, and are widely believed to influence aspects of reference resolution. In this article, we first present the general design choices to be made when designing an annotation scheme for discourse structure and discourse relations. In the second part, we present the scheme used in our annotation of selected articles from the TüBa-D/Z treebank of German (Telljohann et al., 2009). The scheme is theory-neutral, but informed by more detailed linguistic knowledge in the form of linguistic tests that can help disambiguate between several plausible relations.

    Aprendizagem à distância de anáfora em inglês e espanhol como línguas estrangeiras [Distance learning of anaphora in English and Spanish as foreign languages]

    This doctoral thesis investigates the distance learning of anaphora in English and Spanish as foreign languages. It analyses how native speakers of Portuguese who are learners of English or Spanish understand and produce anaphora with nominal antecedents in written texts, and how different distance learning modalities can contribute to the learning of this discursive mechanism. In total, 11 articles were written, distributed across 4 sections. The first section investigates ambiguity resolution based on an online questionnaire distributed to learners and native speakers of Portuguese, English, and Spanish. The first paper presents a pilot study conducted in Portugal, the second includes data from Brazil, and the third was written after data collection was completed. The questionnaires made it possible to control several variables in order to analyse how speakers resolved anaphoric ambiguity. The second section reviews the literature on the teaching and learning of anaphora, the theories and methods of language teaching, and the different teaching modalities. These studies supported the conceptual design of the experiment carried out later. The third section presents that experiment, which consisted of offering a course on anaphora in synchronous and asynchronous distance learning modalities, with monitoring of learning over time. The first article explains how the course was planned; the second presents the groups' results in the comprehension tests; and the third evaluates the course qualitatively. The fourth section presents the new learner corpora, BRANEN and BRANES, and the analysis of the anaphoric relations produced by the students over four tests (a pre-test, an intermediate test, an immediate final test, and a retention test after one month). The thesis ends with a synopsis of the results obtained, their discussion, and a conclusion looking towards future lines of research.

    Supervision distante pour l'apprentissage de structures discursives dans les conversations multi-locuteurs [Distant supervision for learning discourse structures in multi-party conversations]

    The main objective of this thesis is to improve the automatic capture of semantic information with the goal of modeling and understanding human communication. We have advanced the state of the art in discourse parsing, in particular in the retrieval of discourse structure from chat, in order to implement, at the industrial level, tools to help explore conversations. These include the production of automatic summaries, recommendations, dialogue act detection, and the identification of decisions, planning and semantic relations between dialogue acts in order to understand dialogues. In multi-party conversations it is important to understand not only the meaning of a participant's utterance and to whom it is addressed, but also the semantic relations that tie it to other utterances in the conversation and give rise to different conversation threads. An answer must be recognized as an answer to a particular question; an argument, as an argument for or against a proposal under discussion; a disagreement, as the expression of a point of view contrasted with another idea already expressed.
    Unfortunately, capturing such information with traditional supervised machine learning methods requires quality hand-annotated discourse data, which is costly and time-consuming to produce, and we do not have nearly enough of it to train these machine learning models, much less deep learning models. Another problem is that, arguably, no amount of data will be sufficient for machine learning models to learn the semantic characteristics of discourse relations without some expert guidance; the data are simply too sparse. Long-distance relations, in which an utterance is semantically connected not to the immediately preceding utterance but to another utterance from further back in the conversation, are particularly difficult and rare, though often central to comprehension. It is therefore necessary to find a more efficient way to retrieve discourse structures from large corpora of multi-party conversations, such as meeting transcripts or chats. This is one goal this thesis achieves. In addition, we wanted not only to design a model that predicts discourse structure for multi-party conversation without requiring large amounts of hand-annotated data, but also to develop an approach that is transparent and explainable so that it can be modified and improved by experts. The method detailed in this thesis achieves this goal as well.