
519 research outputs found

Le lexique-grammaire est-il exploitable pour le traitement des langues ?

Author: Laporte Eric
Publication venue: Presses universitaires de Louvain
Publication date: 01/01/2010

The Lexicon-Grammar of French is a dictionary containing structured syntactic-semantic information. To assess its exploitability in language processing, we survey four criteria: readability, degree of formalisation, degree of validity of the information content, and richness in information. We contribute concrete examples to inform this discussion. We weigh the significance of these criteria in order to evaluate the validity of the priorities retained and of the compromises adopted in the course of the construction of the Lexicon-Grammar.

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Acquisition de connaissances lexicales à partir de corpus : la sous-catégorisation verbale en français

Author: Gábor Kata
Messiant Cédric
Poibeau Thierry
Publication venue: ATALA (Association pour le Traitement Automatique des Langues)
Publication date: 15/11/2010

This article addresses the automatic acquisition of subcategorization frames for French verbs and the automatic classification of verbs.

INRIA a CCSD electronic archive server

HAL-Paris 13

GLÀFF, un Gros Lexique À tout Faire du Français

Author: Calderone Basilio
Hathout Nabil
Sajous Franck
Publication venue: HAL CCSD
Publication date: 17/06/2013

This paper introduces GLÀFF, a large-scale versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLÀFF contains, for each entry, a morphosyntactic description and a phonetic transcription. It distinguishes itself from the other available lexicons mainly by its size, its potential for constant updating, and its copylefted license, which makes it available for use, modification and redistribution. We explain how we built GLÀFF and compare it to other known resources. We show that its size and quality are strong assets that could allow GLÀFF to become a reference lexicon for NLP, linguistics and psycholinguistics.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

Traduction automatique statistique et adaptation à un domaine spécialisé

Author: LANGLAIS Philippe
LEFEVRE Philippe
LINARES Georges
RUBINO Raphaël
Publication date: 01/01/2011

Recent years have seen the development of statistical approaches to machine translation. Nevertheless, the intrinsic variability of natural language affects the quality of statistical models. Studies have shown that in-domain corpora contain words that can also occur in out-of-domain corpora (common words), as well as domain-specific words. This particularity can be handled by terminological resources such as bilingual lexicons. However, if the vocabulary differs between out-of-domain and in-domain data, the syntactic and semantic content may also vary. In our work, we consider the task of domain adaptation for statistical machine translation along two major axes: bilingual lexicon acquisition and post-editing of machine translation outputs. We evaluate our approaches on the medical domain. The quality of automatic translations in the medical domain is improved, and the results are compared to other work in this field. Oracle evaluations tend to show that further gains are still possible.

OpenGrey Repository

Ortolang

Author: Pierrel Jean-Marie
Publication venue: OpenEdition
Publication date: 02/05/2017

This paper presents the Equipex Ortolang infrastructure (Open Resources and Tools for LANGuage / Outils et Ressources pour un Traitement Optimisé de la LANGue: www.ortolang.fr), which is currently being developed as part of the French government's Investments for the Future programme (Programme d'Investissements d'Avenir, PIA). Drawing on existing resource centres such as the CNRTL (Centre National de Ressources Textuelles et Lexicales: www.cnrtl.fr) and the SLDR (Speech and Language Data Repository: http://sldr.org/), the infrastructure is designed for the long-term management, sharing and dissemination of linguistic resources, including corpora, dictionaries, lexicons and language processing tools, with a particular focus on French and the other languages of France. The paper briefly presents the rationale behind such an original and ground-breaking project, then describes the main characteristics and goals of Ortolang, the platform hardware and software, and the means available, before concluding with planned future developments and an invitation to contribute to the project.

OpenEdition

Vers la création d'un Verbnet du français

Author: Danlos Laurence
Nakamura Takuya
Pradet Quentin
Publication venue: HAL CCSD
Publication date: 01/07/2014

VerbNet is an English lexical resource that has proven useful for NLP due to its high coverage and coherent classification. Such a resource does not exist for French, despite some (mostly automatic and unsupervised) attempts. We show how to semi-automatically adapt VerbNet using existing lexical resources, namely LVF (Les Verbes Français) and LG (Lexique-Grammaire). Keywords: VerbNet, subcategorization frames, semantic roles

Hal-Diderot

Valoriser le patrimoine documentaire des entreprises par le prisme des métiers

Author: Djambian Caroline
Publication venue: Gresec UGA
Publication date: 05/04/2011

Firms have often accumulated their documentary heritage without being able to keep pace with ICT developments. Collective memory, which is produced continuously, keeps growing and has become scattered and heterogeneous, and many companies facing cross-cutting problems now struggle to mobilize their knowledge operationally. We present the specific case of the Nuclear Engineering Division (Division Ingénierie Nucléaire, DIN) of EDF and the need to enhance the value of its information heritage. We explain why upstream contextualization work is essential in such cases, in order to situate the operation of the information system within a broader structural problem; the technical aspects must quickly be set aside in order to consider the organization as a whole. In this context, where micro and macro issues mingle, the company's core businesses emerge as the basis for any reflection. The documentation they produce and use conveys the company's technical knowledge, which is expressed through concepts specific to those businesses. Their terminology is the key to enhancing knowledge and better managing the documentary heritage through which it passes. Through the example of the DIN, we present our resolutely empirical and qualitative approach to evolving the existing system toward a knowledge base centered on the "business meaning" of the organization. Keywords: documentary heritage, technical document, knowledge management, knowledge base, ontology, terminology

HAL AMU

Quelques expériences de TAL sur le discours radiophonique : le cas de la revue de presse

Author: Bilhaut Frédérik
Jackiewicz Agata
Publication venue: OpenEdition
Publication date: 01/01/2016

We present a series of experiments in computational linguistics applied to press reviews broadcast by France Inter (716 texts, May 2005 to June 2011). The corpus has been automatically annotated along several semantic axes: sources and information channels (kind of publication, periodicity, columnists, etc.), factual contents (named entities, facts, topic markers, etc.), reported speech, and markers of subjectivity reflecting various attitudes, notably emotional ones (enthusiasm, anxiety, etc.) or axiological ones (agreement, validity, etc.). The study is divided into three parts: (i) corpus analysis and building of the analytical framework; (ii) establishment of operational language resources; (iii) implementation and analysis of results.

OpenEdition

Collecte orientée sur le Web pour la recherche d'information spécialisée

Author: DE GROC Clément
TANNIER Xavier
ZWEIGENBAUM Pierre
Publication date: 01/01/2013

Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs. In this thesis, we tackle the first inevitable step of any topical search engine: focused document gathering from the Web. A thorough study of the state of the art leads us to consider two strategies to gather topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling). The first part of our research is dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine and downloading the top-ranked documents. After empirically evaluating this approach over 340 topics from the OpenDirectory, we propose to enhance it in two ways. Upstream of the search engine, we aim at formulating more relevant queries in order to increase the precision of the top retrieved documents. To do so, we define a metric based on a co-occurrence graph and a random walk algorithm, which aims at predicting the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the quality of the document collection. We do so by modeling our gathering process as a tripartite graph and applying a random walk with restart algorithm so as to simultaneously order by relevance the documents and the terms appearing in our corpus. In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation, which was designed to scale horizontally. Then, we consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler. Such an ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the number of non-relevant ones. We propose to apply learning-to-rank algorithms to efficiently order the crawl frontier, and define a method to learn a ranking function from existing crawls.

OpenGrey Repository

: Application au Parc national des calanques de Marseille Cassis La Ciotat

Author: DEBOUDT PHILIPPE
Fan Siqi
Fraisse Amel
Kergosien Eric
Publication venue: HAL CCSD
Publication date: 21/03/2018

This paper presents the objectives, methodology and initial results of an interdisciplinary research project (geography, information and communication sciences) based on the site of the Calanques National Park. The project is funded by the LabEx DRIIHM-CNRS and the OHM Littoral méditerranéen. We present a semi-automatic methodology for identifying and analysing descriptors related to the territory of the Calanques National Park from the Twitter social network.