4 research outputs found

    Méthodes neuronales pour l'extraction d'événements

    No full text
    With the increasing amount of data and the exploding number of data sources, the extraction of information about events, whether from the perspective of acquiring knowledge or from a more directly operational perspective, is an increasingly obvious need. This extraction nevertheless comes up against a recurring difficulty: most of the information is present in documents in textual form, and is thus unstructured and difficult for machines to process. From the point of view of Natural Language Processing (NLP), the extraction of events from texts is the most complex form of Information Extraction (IE), which more generally encompasses the extraction of named entities and of the relations that bind them in texts. The event extraction task can be represented as a complex combination of relations linked to a set of empirical observations from texts. Compared to relations involving only two entities, there is therefore a new dimension that often requires going beyond the scope of the sentence, which constitutes an additional difficulty. In practice, an event is described by a trigger and a set of participants in that event whose values are text excerpts. While IE research has benefited significantly from manually annotated datasets to learn patterns for text analysis, the availability of these resources remains a significant problem. These datasets are often obtained through the sustained efforts of research communities, potentially complemented by crowdsourcing. In addition, many machine learning-based IE approaches rely on the ability to extract large sets of manually defined features from text using sophisticated NLP tools. As a result, adaptation to a new domain is an additional challenge. This thesis presents several strategies for improving the performance of an Event Extraction (EE) system using neural-based approaches exploiting morphological, syntactic, and semantic properties of word embeddings. 
These have the advantage of not requiring a priori modeling of domain knowledge and of automatically generating a much larger set of features to learn a model. More specifically, we propose different deep learning models for two sub-tasks related to EE: event detection, and argument detection and classification. Event Detection (ED) is considered an important sub-task of event extraction since the detection of arguments depends very directly on its outcome. ED specifically involves identifying instances of events in texts and classifying them into specific event types. Typically, the same event may appear as different expressions, and these expressions may themselves represent different events in different contexts, hence the difficulty of the task. The detection of arguments builds on the detection of the expression considered as triggering the event and ensures the recognition of the participants of the event. Among the difficulties to take into account, it should be noted that an argument can be common to several events and that it does not necessarily correspond to an easily recognizable named entity. As a preliminary to the introduction of our proposed models, we begin by presenting in detail a state-of-the-art model which constitutes the baseline. In-depth experiments are conducted on the use of different types of word embeddings and the influence of the different hyperparameters of the model using the ACE 2005 evaluation framework, a standard evaluation for this task. We then propose two new models to improve an event detection system. One increases the context taken into account when predicting an event instance by using a sentential context, while the other exploits the internal structure of words by taking advantage of seemingly less obvious but in practice important morphological knowledge. 
We also reconsider the detection of arguments as a high-order relation extraction and we analyze the dependence of arguments on the ED task. From the point of view of natural language processing (NLP), extracting events from texts is the most complex form of information extraction, which more generally covers the extraction of named entities and of the relations that link them in texts. The case of events is particularly difficult because an event can be viewed as an n-ary relation or as a configuration of relations. While research in information extraction has benefited greatly from manually annotated datasets for learning models for text analysis, the availability of these resources remains a significant problem. In addition, many machine learning-based information extraction approaches rely on the ability to extract large sets of manually defined features from texts using sophisticated NLP tools. As a result, adapting to a new domain is an additional challenge. This thesis presents several strategies for improving the performance of an event extraction system using neural network-based approaches that exploit the morphological, syntactic, and semantic properties of word embeddings. These have the advantage of not requiring a priori modeling of domain knowledge and of automatically generating a much larger set of features for learning a model. More specifically, we proposed different deep learning models for the two sub-tasks of event extraction: event detection and argument detection. 
Event detection is considered an important sub-task of event extraction insofar as argument detection depends very directly on its outcome. More precisely, event detection consists of identifying instances of events in texts and classifying them into specific event types. Before introducing our new models, we begin by presenting in detail the state-of-the-art model on which they are based. In-depth experiments are conducted on the use of different types of word embeddings and on the influence of the model's various hyperparameters, relying on the ACE 2005 evaluation framework, a standard evaluation for this task. We then propose two new models for improving an event detection system. One increases the context taken into account when predicting an event instance (event trigger) by using a sentence-level context, while the other exploits the internal structure of words by drawing on morphological knowledge that appears less necessary but is in practice important. Finally, we propose reconsidering argument detection as a higher-order relation extraction, and we analyze the dependence of this detection on event detection.

    Impact Analysis of Document Digitization on Event Extraction

    No full text
    This paper tackles the epidemiological event extraction task applied to digitized documents. Event extraction is an information extraction task that focuses on identifying event mentions in textual data. In the context of event-based health surveillance from digitized documents, several key issues remain challenging in spite of great efforts. First, image documents are indexed through their digitized version and thus may contain numerous errors, e.g. misspellings. Second, it is important to address international news, which implies the inclusion of multilingual data. To clarify these important aspects of how to extract epidemic-related events, it remains necessary to maximize the use of digitized data. In this paper, we investigate the impact of working with digitized multilingual documents with different levels of synthetic noise on the performance of an event extraction system. This type of analysis, to our knowledge, has not been addressed in previous research.

    Automatic page classification in a large collection of manuscripts based on the International Image Interoperability Framework

    No full text
    In heritage institutions such as libraries and archives, the valorization of the vast number of documents that have been recently digitized is still a challenge. Most of these documents are freely accessible as images, but their textual content remains largely unreachable and unknown. Research projects dedicated to specific collections allow the creation of metadata or even transcriptions obtained through volunteers or crowdsourcing. But the vast majority of the documents cannot be manually transcribed or indexed: automatic large-scale indexing processes are needed. The increasing adoption of the International Image Interoperability Framework (IIIF) by heritage institutions is a technological enabler for the development of such services. Images are accessible with a single protocol across institutions, and both images and data can be presented with standard tools. In this paper, we describe an architecture for the automatic processing of historical documents owned by different institutions but processed and presented using IIIF. We implemented this architecture and processed a large collection of books of hours with a page classifier trained on an annotated sample. The result is freely distributed and can be viewed with any IIIF-compatible viewer.

    A comparison of sequential and combined approaches for named entity recognition in a corpus of handwritten medieval charters

    No full text
    This paper introduces a new corpus of multilingual medieval handwritten charter images, annotated with full transcription and named entities. The corpus is used to compare two approaches for named entity recognition in historical document images in several languages: on the one hand, a sequential approach, more commonly used, that sequentially applies handwritten text recognition (HTR) and named entity recognition (NER); on the other hand, a combined approach that simultaneously transcribes the image text line and extracts the entities. Experiments conducted on the charter corpus in Latin, Early New High German, and Old Czech for name, date, and location recognition demonstrate a superior performance of the combined approach.