19 research outputs found

    La ressource ANNODIS, un corpus enrichi d'annotations discursives

    This paper describes the ANNODIS resource, a corpus of written French enriched with several layers of markup, including a manual annotation of discourse structures. The resource, which stems from a project funded by the ANR, is original in that it offers a diversified corpus representing several text types, and two annotations based on different approaches to discourse organisation. As well as a description of the resource (annotated objects, composition of the corpus), the paper presents the theoretical underpinnings of the annotation models and the methodological choices underlying corpus preparation and annotation. It also sketches the potential contribution of such a resource for linguistics and NLP, and describes initial results of its exploitation.

    An empirical resource for discovering cognitive principles of discourse organisation: the ANNODIS corpus

    This paper describes the ANNODIS resource, a discourse-level annotated corpus for French. The corpus combines two perspectives on discourse: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level discourse structures. The corpus is composed of texts that are diversified with respect to genre, length and type of discursive organisation. The methodology followed here involves an iterative design of annotation guidelines in order to reach satisfactory inter-annotator agreement levels. This allows us to raise a few issues relevant for the comparison of such complex objects as discourse structures. The corpus also serves as a source of empirical evidence for discourse theories. We present here two initial analyses taking advantage of this new annotated corpus: one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.

    Exploiting naive vs expert discourse annotations: an experiment using lexical cohesion to predict Elaboration / Entity-Elaboration confusions


    Evaluation in Discourse: a Corpus-Based Study

    This paper describes the CASOAR corpus, the first manually annotated corpus that explores the impact of discourse structure on sentiment analysis with a study of movie reviews in French and in English as well as letters to the editor in French. While annotating opinions at the expression, the sentence or the document level is a well-established task and relatively straightforward, discourse annotation remains difficult, especially for non-experts. Therefore, combining both annotations poses several methodological problems that we address here. We propose a multi-layered annotation scheme that includes: the complete discourse structure according to the Segmented Discourse Representation Theory, the opinion orientation of elementary discourse units and opinion expressions, and their associated features. We detail each layer, explore the interactions between them and discuss our results. In particular, we examine the correlation between discourse and semantic category of opinion expressions, the impact of discourse relations on both subjectivity and polarity analysis and the impact of discourse on the determination of the overall opinion of a document. Our results demonstrate that discourse is an important cue for sentiment analysis, at least for the corpus genres we have studied

    An empirical approach to the signalling of enumerative structures

    This paper presents a data-intensive study of the signalling of enumerative structures. In contrast with semasiological studies of specific markers, the approach described here takes as its starting point annotated structures and cues, seeking to identify recurrent patterns in these data. To do so, it exploits a new resource for French, the ANNODIS resource, a large corpus of written texts manually annotated at discourse level. The data analysed (first quantitatively with large populations, then qualitatively on selected examples) allow the authors to illustrate how cues involved in signalling text organisation combine in complex ways metadiscourse and propositional content, or the textual and ideational metafunctions.

    Izrada OWL ontologije za prikaz, povezivanje i pretraživanje SemAF diskursnih oznaka

    Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. Linguistic Linked Open Data (LLOD) technologies provide a powerful instrument for representing and interpreting language phenomena on a web scale. The main objective of this paper is to demonstrate how LLOD technologies can be applied to represent and annotate a corpus composed of multiword discourse markers, and what the effects of this are. In particular, it is our aim to apply semantic web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We present a novel scheme for discourse annotation that combines ISO standards describing discourse relations and dialogue acts, ISO DR-Core (ISO 24617-8) and ISO-Dialogue Acts (ISO 24617-2), in 9 languages (cf. Silvano and Damova 2022; Silvano et al. 2022). We develop an OWL ontology to formalize that scheme, provide a newly annotated dataset and link its RDF edition with the ontology. Consequently, we describe the conjoint querying of the ontology and the annotations by means of SPARQL, the standard query language for the web of data. The ultimate result is that we are able to perform queries over multiple, interlinked datasets with complex internal structure without the need for any specialized software. Instead, off-the-shelf technologies based on web standards are used, which can be ported with little effort across operating systems, databases and programming languages. This is a first but essential step in developing novel, powerful, and groundbreaking means for the corpus-based study of multilingual discourse, communication analysis, or attitudes discovery.

    Learning Explicit and Implicit Arabic Discourse Relations.

    We propose in this paper a supervised learning approach to identify discourse relations in Arabic texts. To our knowledge, this work represents the first attempt to focus on both explicit and implicit relations that link adjacent as well as non-adjacent Elementary Discourse Units (EDUs) within the Segmented Discourse Representation Theory (SDRT). We use the Discourse Arabic Treebank corpus (D-ATB), which is composed of newspaper documents extracted from the syntactically annotated Arabic Treebank v3.2 part3, where each document is associated with a complete discourse graph according to the cognitive principles of SDRT. Our list of discourse relations is composed of a three-level hierarchy of 24 relations grouped into 4 top-level classes. To automatically learn them, we use state-of-the-art features whose efficiency has been empirically proved. We investigate how each feature contributes to the learning process. We report our experiments on identifying fine-grained discourse relations, mid-level classes and also top-level classes. We compare our approach with three baselines that are based on the most frequent relation, discourse connectives and the features used by Al-Saif and Markert (2011). Our results are very encouraging and outperform all the baselines with an F-score of 78.1% and an accuracy of 80.6%.

    Unsupervised extraction of semantic relations using discourse information

    Natural language understanding often relies on common-sense reasoning, for which knowledge about semantic relations, especially between verbal predicates, may be required. This thesis addresses the challenge of using a distributional method to automatically extract the necessary semantic information for common-sense inference. Typical associations between pairs of predicates and a targeted set of semantic relations (causal, temporal, similarity, opposition, part/whole) are extracted from large corpora, by exploiting the presence of discourse connectives which typically signal these semantic relations. In order to appraise these associations, we provide several significance measures inspired by the literature as well as a novel measure specifically designed to evaluate the strength of the link between the two predicates and the relation. The relevance of these measures is evaluated by computing their correlations with human judgments, based on a sample of verb pairs annotated in context. The application of this methodology to French and English corpora leads to the construction of a freely available resource, Lecsie (Linked Events Collection for Semantic Information Extraction), which consists of triples: pairs of event predicates associated with a relation; each triple is assigned significance scores based on our measures. From this resource, vector-based representations of pairs of predicates can be induced and used as lexical semantic features to build models for external applications. We assess the potential of these representations for several applications. Regarding discourse analysis, the tasks of predicting attachment of discourse units, as well as predicting the specific discourse relation linking them, are investigated. Using only features from our resource, we obtain significant improvements for both tasks in comparison to several baselines, including ones using other representations of the pairs of predicates. We also propose to define optimal sets of connectives better suited for large corpus applications by performing a dimension reduction in the space of the connectives, instead of using manually composed groups of connectives corresponding to predefined relations. Another promising application pursued in this thesis concerns relations between semantic frames (e.g. FrameNet): the resource can be used to enrich this sparse structure by providing candidate relations between verbal frames, based on associations between their verbs. These diverse applications aim to demonstrate the promising contributions provided by our approach, namely allowing the unsupervised extraction of typed semantic relations.