
    Generating a training corpus for OCR post-correction using encoder-decoder model

    In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
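The core idea above, scoring text with a character-level language model, can be illustrated with a much simpler stand-in. The sketch below is not the paper's biLSTM; it is a hypothetical add-alpha-smoothed character bigram model in Python, used only to show how candidate OCR corrections might be ranked by language-model score (all names and data are invented for illustration).

```python
from collections import defaultdict
import math

def train_char_bigram(corpus, alpha=1.0):
    """Train an add-alpha smoothed character bigram LM; return a log-prob scorer."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        padded = "^" + text + "$"          # ^ and $ mark start/end of string
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    vocab = {c for text in corpus for c in "^" + text + "$"}
    V = len(vocab)

    def logprob(text):
        padded = "^" + text + "$"
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            row = counts[a]
            total += math.log((row[b] + alpha) / (sum(row.values()) + alpha * V))
        return total

    return logprob

# Rank OCR correction candidates by language-model score:
# the candidate with fewer implausible character transitions wins.
lm = train_char_bigram(["the cat sat", "the hat", "that cat"])
candidates = ["the cat", "tne cat"]       # "tne" is a typical OCR confusion
best = max(candidates, key=lm)
```

A real post-correction system would combine such a score with a candidate-generation step (e.g. edits over OCR confusion pairs); here only the ranking idea is shown.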

    A SAS Program Combining R Functionalities to Implement Pattern-Mixture Models

    Pattern-mixture models have gained considerable interest in recent years. Pattern-mixture modeling allows the analysis of incomplete longitudinal outcomes under a variety of missingness mechanisms. In this manuscript, we describe a SAS program which combines R functionalities to fit pattern-mixture models, considering cases in which the missingness mechanism is at random or not at random. Patterns are defined based on missingness at every time point, and parameter estimation is based on a full group-by-time interaction. The program implements a multiple imputation method under so-called identifying restrictions. The code is illustrated using data from a placebo-controlled clinical trial. This manuscript and the program are directed to SAS users with minimal knowledge of the R language.
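To make the pattern definition concrete: the program described above defines patterns from missingness at every time point. The hypothetical Python sketch below (an illustration of the idea, not the SAS/R code itself; subject ids and values are made up) shows one way such patterns can be derived and used to group subjects.

```python
from collections import defaultdict

def dropout_pattern(outcomes):
    """Tuple of per-time-point indicators (1 = observed, 0 = missing)
    that identifies the subject's missingness pattern."""
    return tuple(0 if y is None else 1 for y in outcomes)

def group_by_pattern(subjects):
    """Group subject ids by their missingness pattern."""
    groups = defaultdict(list)
    for sid, outcomes in subjects.items():
        groups[dropout_pattern(outcomes)].append(sid)
    return dict(groups)

# Three scheduled visits; None marks a missing measurement.
subjects = {
    "s1": [5.0, 4.2, 3.9],    # completer
    "s2": [5.1, 4.0, None],   # drops out after visit 2
    "s3": [4.8, None, None],  # drops out after visit 1
    "s4": [5.3, 4.5, None],   # same pattern as s2
}
patterns = group_by_pattern(subjects)
```

In an actual pattern-mixture analysis, each such group would then get its own parameters (the full group-by-time interaction mentioned above), with identifying restrictions borrowing information across patterns.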

    Proposal for an Extension of Traditional Named Entities: from Guidelines to Evaluation, an Overview

    Within the framework of the construction of a fact database, we defined guidelines to extract named entities, using a taxonomy based on an extension of the usual named entities definition. We thus defined new types of entities with broader coverage, including substantive-based expressions. These extended named entities are hierarchical (with types and components) and compositional (with recursive type inclusion and metonymy annotation). Human annotators used these guidelines to annotate a 1.3M-word broadcast news corpus in French. This article presents the definition and novelty of extended named entity annotation guidelines, the human annotation of a global corpus and of a mini reference corpus, and the evaluation of annotations through the computation of inter-annotator agreement. Finally, we discuss our approach and the computed results, and outline further work.
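Inter-annotator agreement of the kind mentioned above is commonly computed with Cohen's kappa, which corrects raw agreement for chance. The following Python sketch is a generic illustration, not the authors' evaluation code; the two label sequences are invented.

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(ann1) == len(ann2) and len(ann1) > 0
    n = len(ann1)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected agreement under chance, from each annotator's label distribution.
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical entity labels from two annotators on eight tokens.
a1 = ["pers", "loc", "org", "loc", "O", "O", "pers", "loc"]
a2 = ["pers", "loc", "org", "org", "O", "O", "pers", "O"]
kappa = cohens_kappa(a1, a2)
```

For hierarchical and compositional annotations such as the extended named entities described here, agreement is typically computed per level or per component rather than on a single flat label sequence.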

    NLP Community Perspectives on Replicability.

    With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises how the NLP community views the topic of replicability in general. Using a survey in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance, and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations that successful reproducibility requires more than access to code and data. Additionally, the results show that the topic has to be tackled from the authors', reviewers', and community's side.

    Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers

    This paper compares the reference annotation of structured named entities in two corpora with different origins and properties. It addresses two questions linked to such a comparison. On the one hand, what specific issues were raised by reusing the same annotation scheme on a corpus that differs from the first in terms of media and that predates it by more than a century? On the other hand, what contrasts were observed in the resulting annotations across the two corpora?

    A corpus for studying full answer justification

    Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered reliable by users, a QA system must provide elements to evaluate the answer. This notion of answer justification can also be useful when developing a QA system, in order to give criteria for selecting correct answers. An answer justification can be found in a sentence, in a passage made of several consecutive sentences, or in several passages of one or several documents. Thus, we are interested in pinpointing the set of information that allows verifying the correctness of the answer in a candidate passage, along with the question elements that are missing from this passage. Moreover, the relevant information is often given in texts in a form different from the question's: anaphora, paraphrases, synonyms. In order to have a better idea of the importance of all the phenomena we underlined, and to provide enough examples for QA developers to study them, we decided to build an annotated corpus.
