7 research outputs found

L'utilisation des POMDP pour les résumés multi-documents orientés par une thématique

Authors: Chali Yllias; Hasan Sadid A.; Mojahid Mustapha
Publication venue: HAL CCSD
Publication date: 01/01/2013

National audience. The main goal of query-focused multi-document summarization is to generate a summary from source documents in response to a query formulated by the user. This task is difficult because there is no effective method for measuring user satisfaction, which introduces uncertainty into the summary generation process. In this paper, we propose to model this uncertainty by formulating our summarization system as a partially observable Markov decision process (POMDP), since POMDPs have been shown to handle uncertainty effectively in many domains. Extensive experiments on the DUC benchmark datasets demonstrated the effectiveness of our approach.
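The abstract does not specify the states, actions or observations of the model, so the following is not the authors' system. It is only a minimal sketch of the generic POMDP belief update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s), which is how a POMDP keeps track of an unobservable quantity such as user satisfaction. The state names, the action name, the observation names and all probabilities are hypothetical.

```python
def belief_update(belief, transition, observation, action, obs):
    """Return the posterior belief after taking `action` and observing `obs`.

    belief:      {state: probability}
    transition:  transition[action][state][next_state] = T(next_state | state, action)
    observation: observation[action][next_state][obs] = O(obs | next_state, action)
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Prediction step: probability of reaching s_next under the current belief.
        predicted = sum(transition[action][s][s_next] * belief[s] for s in states)
        # Correction step: weight by the likelihood of the observation.
        unnormalized[s_next] = observation[action][s_next][obs] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical toy model: the hidden state is whether the user is satisfied.
transition = {
    "add_sentence": {
        "satisfied": {"satisfied": 0.9, "unsatisfied": 0.1},
        "unsatisfied": {"satisfied": 0.3, "unsatisfied": 0.7},
    }
}
observation = {
    "add_sentence": {
        "satisfied": {"good": 0.8, "bad": 0.2},
        "unsatisfied": {"good": 0.4, "bad": 0.6},
    }
}
prior = {"satisfied": 0.5, "unsatisfied": 0.5}
posterior = belief_update(prior, transition, observation, "add_sentence", "good")
print(posterior)
```

A policy would then choose the next action from the posterior belief rather than from a known state, which is the sense in which a POMDP handles uncertainty.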

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

Authors: Calderone Basilio; Hathout Nabil; Sajous Franck
Publication venue: HAL CCSD
Publication date: 01/01/2014

International audience. In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale, versatile lexicon that can be used in Natural Language Processing applications and linguistic studies. The latter, based on GLÀFF, is a lexicon specifically designed for psycholinguistic research. With more than 1.4 million entries, GLÀFF is of unprecedented size. It provides lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information on formal aspects of the lexicon and its distribution. It contains about 340,000 corpus-attested entries (120,000 lemmas). We explain how the resources were created and compare them with other known resources in terms of coverage and quality. For PsychoGLÀFF, the comparison shows an exceptionally large repertoire at comparable quality.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

Tagging Occitan using French and Castillan Tree Tagger

Author: Vergez-Couret Marianne
Publication venue: HAL CCSD
Publication date: 07/12/2013

International audience. Part-of-speech (POS) tagging, including tokenization and sentence splitting, is the first step in any Natural Language Processing chain. It usually requires substantial effort to annotate corpora and produce lexicons. However, when these language resources are missing, as is the case for Occitan, methods have been devised to adapt existing taggers for resource-rich languages rather than concentrating effort on creating the resources. To work, these methods exploit the etymological proximity between the under-resourced language and a resource-rich language. In this article, we focus on Occitan, which shares similarities with several Romance languages, including French and Castilian. The method consists in running existing morpho-syntactic tools, here TreeTagger, on Occitan texts after first translating the frequent words into a resource-rich language. We performed two distinct experiments, one exploiting similarities between Occitan and French and the other exploiting similarities between Occitan and Castilian. This method only requires listing the 300 most frequent words (based on a corpus) to construct two bilingual lexicons (Occitan/French and Occitan/Castilian). Our results are better than those obtained with the Apertium tagger, which uses a larger lexicon.
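The pre-processing step described in this abstract, replacing frequent Occitan words with their equivalents in a resource-rich language before tagging, can be sketched as follows. This is a minimal illustration and not the author's code: the function name, the four-entry lexicon and the example sentence are hypothetical, and the call to the tagger itself is left out.

```python
def substitute_frequent_words(tokens, lexicon):
    """Replace each token found in the bilingual lexicon, keep the others.

    Returns (original, substituted) pairs so that tags produced for the
    substituted tokens can later be aligned with the original words.
    Lookup is case-insensitive; the substituted form is lowercased.
    """
    return [(tok, lexicon.get(tok.lower(), tok)) for tok in tokens]


# Hypothetical excerpt of an Occitan/French lexicon of frequent words
# (the abstract mentions 300 entries; these four are illustrative only).
oc_fr_lexicon = {"lo": "le", "e": "et", "es": "est", "una": "une"}

occitan_tokens = ["Lo", "gat", "es", "negre"]
pairs = substitute_frequent_words(occitan_tokens, oc_fr_lexicon)
tagger_input = [substituted for _, substituted in pairs]
print(tagger_input)
```

The substituted token list would then be passed to the French (or Castilian) tagger, and each resulting tag attached to the original Occitan token in the same position.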

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

LITL at CLEF eHealth2016: recognizing entities in French biomedical documents

Authors: Grauby Céline; Heu Mby Aurore; Ho-Dac Lydia-Mai; Malosse Justine; Rivière Laura; Tanguy Ludovic; Veltz-Mauclair Amélie; Wauquier Marine
Publication venue: HAL CCSD
Publication date: 01/01/2016

International audience. This paper describes the participation of master's students (LITL programme, University of Toulouse) and their teachers in the CLEF eHealth 2016 campaign. Two runs were submitted for task 2 (multilingual information extraction), which consisted in the recognition and categorization of medical entities in French biomedical documents. The system consists of a CRF classifier based on a number of different features (POS tagging, generic word lists and syntactic parsing). In addition, several patterns were applied to the CRF's output in order to extract more complex entities. The best run achieved high precision (0.64/0.78) but lower recall (0.32/0.40), with an overall F1-measure of 0.43/0.53.
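As a consistency check on the reported scores, the F1-measure can be recomputed as the harmonic mean of precision and recall. The sketch below assumes the figures in the abstract are two precision/recall pairs, (0.64, 0.32) and (0.78, 0.40); that pairing is an assumption about the abstract's formatting, not something the abstract states.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed pairing of the precision and recall values given in the abstract.
reported_pairs = [(0.64, 0.32), (0.78, 0.40)]
f1_values = [round(f1_score(p, r), 2) for p, r in reported_pairs]
print(f1_values)  # → [0.43, 0.53]
```

Under that pairing the recomputed values match the two F1 figures given in the abstract.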

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

Énumération et structuration discursive

Authors: Péry-Woodley Marie-Paule; Rebeyrolle Josette
Publication venue: EDP Sciences
Publication date: 01/01/2014

International audience. In this article, the enumerative structure is considered, from a discourse perspective, as a text-organizing device that forms a functional whole. After specifying our approach and making its grounding explicit, our first objective is to illustrate the diversity of realizations of enumerative structures while clearly bringing out what unifies them: the parallel presentation of items and the expression (or inferability) of the interpretive criterion underlying this parallelism. We show that whatever the realization of the structure (varied cues, potentially distributed across its different components), it must be perceptible, since this perception conditions the reader's understanding of the underlying intention. We describe the cues and the way they combine to make the structure visible. We also examine the diversity of contexts in which the structure occurs and of the discourse roles it can play, and we present initial results on correlations between types of realization and function. To this end, we focus on its "margins" (the trigger, which links it to the preceding text and announces the enumeration, and the closure, the final segment that links it to the following text) in order to highlight the need to consider the structure as a functional whole.

Crossref

Scientific Publications of the University of Toulouse II Le Mirail

EDP Sciences OAI-PMH repository (1.2.0)

HAL Descartes

Du terme prédicatif au cadre sémantique : méthodologie de compilation d'une ressource terminologique pour les termes arabes de l'informatique

Author: Ghazzawi Nizar
Publication date: 01/08/2016

The description of terms in traditional terminological resources is limited to certain details, such as the term (usually a noun), its definition, and its equivalent in a foreign language. This description seldom includes other details that can be highly useful to users, especially if they consult resources to deepen their knowledge of a specialized domain, to master professional writing, or to find contexts in which the term is realized. Useful information of this kind includes the description of the actantial structure of terms, contexts from authentic sources, and the inclusion of other parts of speech such as verbs. Verbs and deverbal nouns, or predicative terminological units (PTUs), which are often ignored by traditional terminology, are of great importance for expressing actions, processes or events. However, describing these units requires a model of terminological description that takes their special features into account. Several terminologists (Condamines 1993, Mathieu-Colas 2002, Gross and Mathieu-Colas 2001, and L’Homme 2012, 2015) have proposed description models based on different theoretical frameworks. Our research proposes a methodology for the terminological description of PTUs in Arabic, in particular Modern Standard Arabic (MSA), according to Fillmore's theory of Frame Semantics (1976, 1977, 1982, 1985) and its application, the FrameNet project (Ruppenhofer et al. 2010). The specialized domain of interest is computing. We rely on a corpus collected from the web and draw on an existing online terminological resource, the DiCoInfo (L’Homme 2008), to compile our own resource. Our objectives are as follows.
First, we aim to lay the foundations of an MSA version of this resource. This version has its own features: 1) we target specific units, namely verbal and deverbal PTUs; 2) the methodology developed for compiling the original DiCoInfo must be adapted to take a Semitic language into account. Next, we aim to create a framed version of this resource, in which we group the PTUs into semantic frames following the FrameNet model. Since this part of the work has a multilingual dimension, we add English and French PTUs to the resource. Our methodology consists of automatically extracting verbal and nominal terminological units (VTUs and NTUs), such as Ham~ala (حمل) (to download) and taHmiyl (تحميل) (download). To do this, we adapted an existing automatic extractor, TermoStat (Drouin 2004), to MSA. Then, using terminological validation criteria (L’Homme 2004), we validate the terminological status of some of the candidates. After validation, we create a terminological record with an XML editor for each VTU and NTU retained. These records contain elements such as the actantial structure of the PTUs and up to 20 annotated contexts. The last step consists of creating semantic frames from the MSA PTUs. We also associate English and French PTUs with the frames created. This association resulted in a second terminological resource called “DiCoInfo: A Framed Version”. In this resource, PTUs that share the same semantic properties and actantial structures are grouped into semantic frames. For example, the semantic frame Product_development groups PTUs such as Taw~ara (طور) (to develop), to develop and développer.
As a result of these steps, we obtained a total of 106 MSA PTUs compiled in the MSA version of the DiCoInfo and 57 semantic frames associated with these units in the framed version. Our research shows that MSA can be described using the methodology we developed.
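The grouping of PTUs into semantic frames described in this abstract can be pictured with a small data-structure sketch. The class name, field names and language codes below are hypothetical and do not reflect the actual XML schema of the DiCoInfo; only the Product_development example and its three units come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PTU:
    """A predicative terminological unit (hypothetical minimal record)."""
    lemma: str
    language: str  # assumed codes: "ar", "en", "fr"
    frame: str     # name of the semantic frame the unit evokes


def group_by_frame(units):
    """Group PTUs of any language under the semantic frame they share."""
    frames = defaultdict(list)
    for unit in units:
        frames[unit.frame].append(unit)
    return dict(frames)


# The one frame and three units cited as an example in the abstract.
units = [
    PTU("Taw~ara (طور)", "ar", "Product_development"),
    PTU("to develop", "en", "Product_development"),
    PTU("développer", "fr", "Product_development"),
]
frames = group_by_frame(units)
for name, members in frames.items():
    print(name, [(u.language, u.lemma) for u in members])
```

Keying the units by frame rather than by language is what gives the framed version its multilingual dimension: equivalents across Arabic, English and French end up in the same group.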

Dépôt Institutionnel Numérique