
    Towards Comprehensive Computational Representations of Arabic Multiword Expressions

    A successful computational treatment of multiword expressions (MWEs) in natural languages leads to a robust NLP system that addresses the long-standing problem of language ambiguity caused primarily by this complex linguistic phenomenon. The first step in addressing this challenge is building an extensive, reliable MWE language resource (LR) with comprehensive computational representations across all linguistic levels. This forms the cornerstone for understanding the heterogeneous linguistic behaviour of MWEs in their various manifestations. This paper presents a detailed framework for computational representations of Arabic MWEs (ArMWEs) across all linguistic levels, based on the state-of-the-art Lexical Markup Framework (LMF) with the modifications necessary to suit the distinctive properties of Modern Standard Arabic (MSA). This work forms part of a larger project (the JOMAL project) that aims to develop a comprehensive computational lexicon of ArMWEs for NLP and language pedagogy (LP).
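
    The paper itself contains no code; purely as an illustration of what an LMF-style entry for an Arabic MWE might look like, the following Python sketch serializes a hypothetical entry. The element names (LexicalEntry, Lemma, feat, ListOfComponents) follow common LMF conventions, but the specific feature values and the example MWE are assumptions, not taken from the paper.

```python
# Minimal sketch of an LMF-style lexical entry for an Arabic MWE.
# Element names follow LMF conventions; feature values are illustrative.
import xml.etree.ElementTree as ET

def make_mwe_entry(lemma: str, components: list[str], pos: str) -> ET.Element:
    entry = ET.Element("LexicalEntry")
    ET.SubElement(entry, "feat", att="partOfSpeech", val=pos)
    lemma_el = ET.SubElement(entry, "Lemma")
    ET.SubElement(lemma_el, "feat", att="writtenForm", val=lemma)
    # LMF models an MWE as an ordered list of component words.
    comps = ET.SubElement(entry, "ListOfComponents")
    for word in components:
        comp = ET.SubElement(comps, "Component")
        ET.SubElement(comp, "feat", att="writtenForm", val=word)
    return entry

# Hypothetical example: the Arabic idiom "قطع شوطا" (roughly, "made headway").
entry = make_mwe_entry("قطع شوطا", ["قطع", "شوطا"], "verbNounIdiom")
print(ET.tostring(entry, encoding="unicode"))
```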

    MLIF: A Metamodel to Represent and Exchange Multilingual Textual Information

    The fast evolution of language technology has produced pressing needs for standardization. The multiplicity of representation levels for language resources and the specialization of these representations make interaction between linguistic resources and the components that manipulate them difficult. In this paper, we describe the MultiLingual Information Framework (MLIF, ISO CD 24616). MLIF is a metamodel that allows the representation and exchange of multilingual textual information. This generic metamodel is designed to provide a common platform for all the tools developed around the existing multilingual data-exchange formats. This is work in progress within ISO TC37 to define a new ISO standard.
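
    MLIF's actual metamodel is defined in the ISO document and is not reproduced here; purely to illustrate the kind of structure such a metamodel describes, this sketch groups one segment with its aligned translations. All class and field names are assumptions, not MLIF terminology.

```python
# Illustrative sketch of a multilingual content unit: one source segment
# grouped with aligned monolingual variants. Names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class MonolingualUnit:
    lang: str   # e.g. an ISO 639 language code
    text: str

@dataclass
class MultilingualUnit:
    unit_id: str
    variants: list[MonolingualUnit] = field(default_factory=list)

    def text_for(self, lang: str) -> str | None:
        for v in self.variants:
            if v.lang == lang:
                return v.text
        return None

unit = MultilingualUnit("u1", [
    MonolingualUnit("en", "Press the red button."),
    MonolingualUnit("fr", "Appuyez sur le bouton rouge."),
])
print(unit.text_for("fr"))
```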

    Multilingual resources for NLP in the Lexical Markup Framework (LMF)

    Optimizing the production, maintenance and extension of lexical resources is one of the crucial aspects affecting Natural Language Processing (NLP). A second aspect involves optimizing the process leading to their integration into applications. In this respect, we believe that a consensual specification for monolingual, bilingual and multilingual lexicons can be a useful aid for the various NLP actors. Within ISO, one purpose of the Lexical Markup Framework (LMF, ISO 24613) is to define a standard for lexicons that covers multilingual lexical data.
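
    For multilingual data, LMF links the senses of monolingual lexicons through an interlingual pivot (the SenseAxis of LMF's multilingual extension). Below is a minimal Python sketch of that idea; only the SenseAxis notion comes from LMF, while the class layout and the example entries are assumptions.

```python
# Sketch of LMF-style multilingual linking: senses from monolingual
# lexicons are connected through a shared interlingual axis.
from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    lang: str
    lemma: str
    sense_id: str

@dataclass
class SenseAxis:
    axis_id: str
    senses: list[Sense]

# Hypothetical entries linking English, French and Arabic senses.
axis = SenseAxis("ax-river-bank", [
    Sense("en", "bank", "bank-2"),   # the river bank, not the institution
    Sense("fr", "rive", "rive-1"),
    Sense("ar", "ضفة", "difa-1"),
])

def translations(axis: SenseAxis, lang: str) -> list[Sense]:
    """All senses on the axis except those in the given language."""
    return [s for s in axis.senses if s.lang != lang]

print([s.lemma for s in translations(axis, "en")])
```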

    Évaluer SynLex

    SYNLEX is a syntactic lexicon extracted semi-automatically from the LADL tables. Like the other syntactic lexicons for French that are both available and usable for NLP (LEFFF, DICOVALENCE), it is incomplete, and its recall and precision with respect to a gold standard are unknown. We present an approach that goes some way towards addressing these shortcomings. The approach draws on methods used for the automatic acquisition of syntactic lexicons. First, a new syntactic lexicon is acquired from an 82-million-word corpus. This lexicon is then used to validate and extend SYNLEX. Finally, the recall and precision of the extended version of SYNLEX are computed against a gold standard extracted from DICOVALENCE.
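
    The abstract does not give the evaluation formulas; assuming the standard set-based definitions over lexicon entries (e.g. verb/subcategorization-frame pairs), precision and recall against a gold standard could be computed as in this sketch. The toy data is invented for illustration.

```python
# Precision/recall of a lexicon against a gold standard, treating both
# as sets of entries, e.g. (verb, subcategorization-frame) pairs.
def precision_recall(lexicon: set, gold: set) -> tuple[float, float]:
    true_positives = len(lexicon & gold)
    precision = true_positives / len(lexicon) if lexicon else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: verb/frame pairs.
acquired = {("donner", "SUJ:NP,OBJ:NP,AOBJ:PP"), ("dormir", "SUJ:NP"),
            ("manger", "SUJ:NP,OBJ:NP")}
gold = {("donner", "SUJ:NP,OBJ:NP,AOBJ:PP"), ("dormir", "SUJ:NP"),
        ("partir", "SUJ:NP")}
p, r = precision_recall(acquired, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```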

    A Metadata Schema for the Description of Language Resources (LRs)

    This paper presents the metadata schema for describing language resources (LRs) currently under development for the needs of META-SHARE, an open distributed facility for the exchange and sharing of LRs. An essential ingredient in its setup is the existence of formal and standardized LR descriptions, a cornerstone of the interoperability layer of any such initiative. The description of LRs is granular and abstractive, combining a taxonomy of LRs with an inventory of a structured set of descriptive elements, of which only a minimal subset is obligatory; the schema additionally proposes recommended and optional elements. Moreover, the schema includes a set of relations catering for the appropriate inter-linking of resources. The current paper presents the main principles and features of the metadata schema, focusing on the description of text corpora and lexical/conceptual resources.
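
    The schema itself is distributed as XML Schema definitions; purely as a sketch of its obligatory-plus-recommended/optional element design, a minimal resource description might be modeled as below. Element names here are invented for illustration, not the actual META-SHARE vocabulary.

```python
# Illustrative sketch of a metadata record with an obligatory minimal
# subset plus recommended/optional elements; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ResourceDescription:
    # Obligatory minimal subset.
    resource_name: str
    resource_type: str            # e.g. "corpus" or "lexicalConceptualResource"
    # Recommended / optional elements.
    languages: list[str] = field(default_factory=list)
    licence: str | None = None
    # Relations for inter-linking resources, e.g. {"isPartOf": "..."}.
    relations: dict[str, str] = field(default_factory=dict)

record = ResourceDescription(
    resource_name="Example Text Corpus",
    resource_type="corpus",
    languages=["en", "fr"],
    licence="CC-BY-4.0",
)
print(record)
```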

    The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing

    This paper introduces the NLP4NLP corpus, which contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing approximately 270 million words. Most of these publications are in English; some are in French, German, or Russian. Some are open access; others have been provided by the publishers. In order to constitute and analyze this corpus, several tools have been used or developed. Many of them use natural language processing methods that have been published in the corpus itself, hence its name. The paper presents the corpus and some findings regarding its content (evolution over time of the number of articles and authors, collaborations between authors, citations between papers and authors), in the context of a global or comparative analysis across sources. Numerous manual corrections were necessary, which demonstrated the importance of establishing standards for uniquely identifying authors, articles, and publications.
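
    The paper's own analysis tools are not reproduced here; as a toy illustration of one of the measurements it mentions (collaborations between authors), this sketch counts co-authorship pairs across a document collection. The data and field names are hypothetical.

```python
# Toy co-authorship count: for each document, every unordered pair of
# authors counts as one collaboration. Data is invented for illustration.
from itertools import combinations
from collections import Counter

documents = [
    {"title": "Paper A", "authors": ["Alice", "Bob"]},
    {"title": "Paper B", "authors": ["Alice", "Bob", "Carol"]},
    {"title": "Paper C", "authors": ["Carol"]},
]

pairs = Counter()
for doc in documents:
    for a, b in combinations(sorted(doc["authors"]), 2):
        pairs[(a, b)] += 1

for (a, b), n in pairs.most_common():
    print(f"{a} & {b}: {n} joint paper(s)")
```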

    NLP4NLP+5: The Deep (R)evolution in Speech and Language Processing

    This paper analyzes the changes in the fields of speech and natural language processing over the past 5 years (2016–2020). It continues a series of two papers that we published in 2019 analyzing the NLP4NLP corpus, which contained articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015) and was analyzed with methods developed in the field of NLP, hence its name. The extended NLP4NLP+5 corpus now covers 55 years, comprising close to 90,000 documents (+30% compared with NLP4NLP: as many articles were published in the single year 2020 as over the first 25 years, 1965–1989), 67,000 authors (+40%), 590,000 references (+80%), and approximately 380 million words (+40%). These analyses are conducted globally or comparatively among sources, and also against the general scientific literature, with a focus on the past 5 years. The paper concludes by identifying profound changes in research topics, the emergence of a new generation of authors, and the appearance of new publications around artificial intelligence, neural networks, machine learning, and word embeddings.

    Standards going concrete: from LMF to Morphalou

    This paper describes the application of the ISO standard LMF to the French CNRS lexicon Morphalou. LMF is the ISO standard for NLP lexicons (ISO 24613).

    The relevance of standards for research infrastructures

    This paper discusses the importance of standards as an essential aspect of any research infrastructure in the humanities. The ISO Data Category Registry is designed within ISO TC37.

    Documentation and User Manual of the META-SHARE Metadata Model

    This deliverable presents the META-SHARE metadata schema v1.0, as implemented in the META-SHARE XSDs v1.0 released to META-NET and PSP partners in July 2011 for text corpora and lexical/conceptual resources, together with its supplement for audio corpora, tools and language descriptions (a simplified/refactored version) as implemented in November. It is meant to act as a user manual, providing explanations of the model's contents for LR providers and LR curators who wish to describe their resources in accordance with it. Work on the schema is ongoing and changes/updates to the model are constantly being made; where appropriate, some changes that are already under way are documented in this deliverable.