
    EusEduSeg: Un Segmentador Discursivo para el Euskera Basado en Dependencias

    We present EusEduSeg, the first discourse segmenter for Basque, implemented with heuristics based on syntactic dependencies and linguistic rules. Preliminary experiments on the Basque RST TreeBank show F1 values of more than 85% for automatic EDU segmentation.
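    The abstract does not reproduce EusEduSeg's actual Basque dependency rules, so the following is only a toy illustration of what rule-based EDU segmentation looks like: split a sentence at commas that are followed by a known discourse marker. The marker list and the English example are invented stand-ins.

```python
# Illustrative stand-ins for the cues a rule-based segmenter consults;
# EusEduSeg's real rules operate on Basque syntactic dependencies.
BOUNDARY_MARKERS = {"because", "although", "while", "when", "if"}

def segment_edus(sentence):
    """Split a sentence into candidate EDUs at commas followed by a
    known discourse marker -- a toy version of rule-based segmentation."""
    tokens = sentence.split()
    edus, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        # Boundary rule: a comma immediately before a discourse marker.
        if tok.endswith(",") and nxt in BOUNDARY_MARKERS:
            edus.append(" ".join(current))
            current = []
    if current:
        edus.append(" ".join(current))
    return edus

print(segment_edus("The plan failed, because funding was cut, although staff objected."))
# → ['The plan failed,', 'because funding was cut,', 'although staff objected.']
```

    A real segmenter would consult the dependency parse (clause heads, subordination arcs) rather than surface punctuation, which is what lets EusEduSeg reach F1 above 85%.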

    Argument Compound Mining in Technical Texts: linguistic structures, implementation and annotation schemas

    In this paper, we motivate and develop the linguistic characteristics of argument compounds. The discourse structures that refine or elaborate arguments are analysed, and their cognitive impact on argumentation is developed. An implementation is then presented, carried out in Dislog on the TextCoop platform. Dislog allows high-level specifications in logic for fast and easy prototyping at a high level of linguistic adequacy. Elements of an indicative evaluation are provided.

    The annotation scheme of the Turkish Discourse Bank and an evaluation of inconsistent annotations

    In this paper, we report on the annotation procedures we developed for annotating the Turkish Discourse Bank (TDB), an effort that extends the Penn Discourse TreeBank (PDTB) annotation style by using it for annotating Turkish discourse. After a brief introduction to the TDB, we describe the annotation cycle and the annotation scheme we developed, defining which parts of the scheme are an extension of the PDTB and which parts are different. We provide inter-coder reliability calculations on the first and second arguments of some connectives and discuss the most important sources of disagreement among annotators.

    Annotating Social Media Data From Vulnerable Populations: Evaluating Disagreement Between Domain Experts and Graduate Student Annotators

    Researchers in computer science have spent considerable time developing methods to increase the accuracy and richness of annotations. However, there is a dearth of research that examines the positionality of annotators, how they are trained, and what we can learn from disagreements between different groups of annotators. In this study, we use qualitative analysis and statistical and computational methods to compare annotations between Chicago-based domain experts and graduate students who annotated a total of 1,851 tweets with images that are part of a larger corpus associated with the Chicago Gang Intervention Study, which aims to develop a computational system that detects aggression and loss among gang-involved youth in Chicago. We found evidence to support the study of disagreement between annotators and underscore the need for domain expertise when reviewing Twitter data from vulnerable populations. Implications for annotation and content moderation are discussed.

    Discourse structure analysis for requirement mining

    In this work, we first introduce two main approaches to writing requirements and then propose a method based on Natural Language Processing to improve requirement authoring and the overall coherence, cohesion and organization of requirement documents. We investigate the structure of requirement kernels, and then the discourse structure associated with those kernels. This enables the system to accurately extract requirements and their related contexts from texts (a task called requirement mining). Finally, we report a first experiment on requirement mining based on texts from seven companies, and conclude with an evaluation that compares those results against manually annotated corpora of documents.

    Identification of Justification Types and Discourse Markers in Turkish Language Teacher Candidates’ Argumentative Texts

    The purpose of this research is to identify the discourse markers used in justification types in Turkish language teacher candidates' argumentative texts. A survey model was used, since the aim was to determine the categories into which support and refutation justifications are split and to identify the discourse markers that express these categories. It is a descriptive field study in which qualitative data analysis techniques were employed. The data pool consists of texts written by 3rd- and 4th-year students (N=100) in the Turkish Language Teaching Department at Mustafa Kemal University during the 2014-2015 academic year. The "Argumentative Text Writing Form" and "Justification Type Identification Form" developed by the researcher were used as data collection tools. The texts were analysed using the content analysis method, and descriptive statistics (frequency (f), percentage (%)) were applied to the collected data. According to the results, a total of 275 justifications were presented in the students' texts: support justifications (f=225) and refutation justifications (f=50). The justification types most widely used for support justifications are reasoning (19.11%), addition (14.22%), exemplification (13.77%) and opposition (11.55%); the least used are distinction (2.66%), condition (2.22%) and sequencing (1.77%). Opposition (34%) is the most widely used justification type for refutation justifications, whereas conclusion, distinction and sequencing (2% each) are the least used. Discourse markers (n=62) were identified across 11 justification types, and sample sentences illustrating their use are presented in the study.

    Discours, corpus, traitements automatiques

    This chapter concerns the application of the principles and methods of corpus linguistics to the study of text/discourse organisation. On the basis of the literature and the author's own research, it examines the specific corpus requirements and analytical difficulties of the discourse level. Discourse studies tend to be analyst-dependent and small-scale, which makes them difficult to reproduce and their results difficult to generalise. The chapter goes on to look at connections between discourse studies, corpus analysis and language technology via applications such as automatic text summarization and aids to textual navigation. The quantitative techniques used in such systems deserve to be further explored in linguistic studies of text/discourse organisation. Another important direction for discourse research is the development of sharable resources, in particular corpora annotated with discourse structures and relations.

    Inter-Coder Agreement for Computational Linguistics

    This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks, although their use makes the interpretation of the value of the coefficient even harder.
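    All the coefficients surveyed share one idea: compare observed agreement with the agreement two annotators would reach by chance. As a concrete illustration, Cohen's kappa can be computed in a few lines; the two annotators' labels below are invented for the example.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators over the same items.

    a, b: equal-length sequences of category labels.
    """
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: probability both pick the same label by chance,
    # given each annotator's own label distribution (this per-annotator
    # distribution is what distinguishes kappa from Scott's pi).
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labelling ten discourse relations.
ann1 = ["causal", "causal", "contrast", "causal", "contrast",
        "causal", "contrast", "causal", "causal", "contrast"]
ann2 = ["causal", "causal", "contrast", "contrast", "contrast",
        "causal", "causal", "causal", "causal", "contrast"]
print(round(cohen_kappa(ann1, ann2), 3))  # → 0.583
```

    Here observed agreement is 0.8 but chance agreement is 0.52, so kappa corrects the raw 80% down to about 0.58. Weighted, alpha-like coefficients additionally let partial disagreements (e.g. between related categories) count less than total ones.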

    The biomedical discourse relation bank

    Background: Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource.
    Results: We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57).
    Conclusion: Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more annotated data. The poor performance of a classifier trained in the open domain and tested in the biomedical domain suggests significant differences in the semantic usage of connectives across these domains, and provides robust evidence for a biomedical sublanguage for discourse and the need to develop a specialized biomedical discourse annotated corpus. The results of our cross-domain experiments are consistent with related work on identifying connectives in BioDRB.
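    The connective-as-feature result can be approximated by a simple majority-sense baseline: for each connective, predict its most frequent sense in the training data. The sketch below illustrates the idea with hypothetical training pairs, not actual BioDRB annotations.

```python
from collections import Counter, defaultdict

class ConnectiveMajorityClassifier:
    """Coarse sense classifier using only the connective string as a
    feature: predict each connective's most frequent training sense."""

    def fit(self, connectives, senses):
        by_conn = defaultdict(Counter)
        for c, s in zip(connectives, senses):
            by_conn[c.lower()][s] += 1
        # Majority sense per connective.
        self.majority = {c: cnt.most_common(1)[0][0]
                         for c, cnt in by_conn.items()}
        # Unseen connectives fall back to the overall majority sense.
        self.fallback = Counter(senses).most_common(1)[0][0]
        return self

    def predict(self, connectives):
        return [self.majority.get(c.lower(), self.fallback)
                for c in connectives]

# Hypothetical training pairs, not drawn from the BioDRB data.
train = [("because", "Causal"), ("since", "Causal"), ("but", "Contrast"),
         ("however", "Contrast"), ("since", "Temporal"), ("since", "Causal")]
clf = ConnectiveMajorityClassifier().fit(*zip(*train))
print(clf.predict(["because", "since", "nevertheless"]))
# → ['Causal', 'Causal', 'Causal']
```

    Such a baseline works well for coarse senses precisely because most connectives are strongly biased toward one sense; the refined-sense degradation reported above arises when ambiguous connectives like "since" must be resolved from sparse data.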