110 research outputs found

    Genetic diversity of Mayetiola destructor and Mayetiola hordei (Diptera: Cecidomyiidae) by inter-simple sequence repeats (ISSRs)

    Inter-simple sequence repeat (ISSR) polymorphism was used to reveal genetic variability and phylogenetic relationships within and between three haplotypes of Mayetiola species. A set of 14 ISSR primers representing di-, tri-, tetra- and penta-nucleotide repeats was screened; 10 generated scorable bands and three were able to distinguish one of the three haplotypes. The consensus tree constructed from binary data derived from the ISSR-PCR banding patterns clustered the two Mayetiola species according to their mitochondrial haplotype. Moreover, genetic diversity estimated by the coefficient of variation indicates high intra- and inter-haplotype polymorphism. Our results indicate that ISSRs can be useful as DNA-based molecular markers for studying the genetic diversity and phylogenetic relationships of Mayetiola haplotypes. African Journal of Biotechnology Vol. 4 (7), pp. 601-606, 200
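
    As a rough illustration of this kind of banding-pattern analysis (a sketch, not the study's actual pipeline), the code below scores ISSR bands as a presence/absence matrix, computes band-sharing (Jaccard) distances, and builds a tree by average-linkage clustering; the toy matrix is invented for the example.

```python
# Sketch: ISSR bands scored as a 0/1 presence/absence matrix, pairwise
# distances computed, and a tree built by average-linkage clustering.
# The matrix below is a toy example, not the study's data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Rows: individuals; columns: ISSR bands (1 = band present).
bands = np.array([
    [1, 0, 1, 1, 0, 1],   # haplotype A, individual 1
    [1, 0, 1, 0, 0, 1],   # haplotype A, individual 2
    [0, 1, 0, 1, 1, 0],   # haplotype B, individual 1
    [0, 1, 1, 1, 1, 0],   # haplotype B, individual 2
])

d = pdist(bands.astype(bool), metric="jaccard")  # band-sharing distance
tree = linkage(d, method="average")              # UPGMA-style clustering
print(dendrogram(tree, no_plot=True)["ivl"])     # leaf order groups by haplotype
```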

    Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields

    A significant number of digitized lexical resources remain unexploited due to their unstructured content. Manually structuring such resources is a costly task given their multifold complexity. Our goal is to find an approach to automatically structure digitized dictionaries, independently of the language or the lexicographic school or style. In this paper we present a first version of GROBID-Dictionaries, an open source machine learning system for lexical information extraction. Our approach is twofold: we perform a cascading structure extraction, and we select specific features for training at each level.

    We followed a "divide and conquer" strategy to dismantle text constructs in a digitized dictionary, based on the observation of their layout. Main pages (see Figure 1) in almost any dictionary share three blocks: a header (green), a footer (blue) and a body (orange). The body is, in its turn, constituted of several entries (red). Each lexical entry can be further decomposed (see Figure 2) into: form (green), etymology (blue), sense (red) and/or related entry. The same logic could be applied further to each extracted block, but in the scope of this paper we focus on the first three levels.

    The cascading approach ensures a better understanding of the learning process's output and consequently simplifies feature selection. Having a limited set of mutually exclusive text blocks per level helps significantly in diagnosing the cause of prediction errors, and allows early detection and replacement of irrelevant selected features that could bias a trained model. With such a segmentation, it becomes more straightforward to notice that, for instance, the position of a token on the page is very relevant for detecting headers and footers but has almost no pertinence for capturing a sense in a lexical entry, which is very often split across two pages.

    To implement our approach, we took up the available infrastructure from GROBID [7], a machine learning system for the extraction of bibliographic metadata. GROBID adopts the same cascading approach and uses Conditional Random Fields (CRF) [6] to label text sequences; a minimal illustration of this kind of line labelling is sketched below. GROBID-Dictionaries is planned to generate a TEI-compliant encoding [2, 9] in which the various segmentation levels are associated with an appropriate XML tessellation. Collaboration with the COST action ENeL is ongoing to ensure maximal compatibility with existing dictionary projects.

    Our experiments so far justify our choices: models for the first two levels, trained on two different dictionary samples, have given high precision and recall with a small amount of annotated data. Relying mainly on the text layout, we tried to diversify the selected features for each model, at the token and line levels. We are working on tuning the features and annotating more data, in order to maintain the good results on new samples and to improve the third segmentation level. While only a few task-specific attempts [1] have used machine learning in this research direction, the landscape remains dominated by rule-based techniques [4, 3, 8], which are ad hoc and costly, even impossible, to adapt to new lexical resources.
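
    The sketch referenced above: a minimal example of CRF sequence labelling at the first segmentation level (header / body / footer), under the assumption of layout-derived features. The feature names and the toy page are illustrative, not GROBID's actual feature set.

```python
# Minimal CRF line-labelling sketch with sklearn-crfsuite. The features
# (notably the line's relative vertical position on the page) mirror the
# intuition in the abstract: position is predictive of headers/footers.
import sklearn_crfsuite

def line_features(text, y_pos):
    # y_pos: relative vertical position of the line on the page
    # (0.0 = top, 1.0 = bottom).
    tokens = text.split()
    return {
        "bias": 1.0,
        "first_token_lower": tokens[0].lower(),
        "n_tokens": len(tokens),
        "y_pos": y_pos,
        "is_all_digits": text.strip().isdigit(),
    }

# One toy "page": a sequence of physical lines with relative positions.
page = [
    ("MY DICTIONARY", 0.02),
    ("abacus n. a counting frame", 0.30),
    ("abandon v. to give up", 0.55),
    ("142", 0.97),
]
X_train = [[line_features(t, y) for t, y in page]]
y_train = [["HEADER", "BODY", "BODY", "FOOTER"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # [['HEADER', 'BODY', 'BODY', 'FOOTER']]
```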

    TEI Encoding of a Classical Mixtec Dictionary Using GROBID-Dictionaries

    This paper presents the application of GROBID-Dictionaries (Khemakhem et al. 2017, Khemakhem et al. 2018a, Khemakhem et al. 2018b, Khemakhem et al. 2018c), an open source machine learning system for automatically structuring print dictionaries in digital format into TEI (Text Encoding Initiative), to a historical lexical resource of Colonial Mixtec, 'Voces del Dzaha Dzahui', published by the Dominican friar Francisco Alvarado in the year 1593. GROBID-Dictionaries was applied to a reorganized and modernized version of the historical resource published by Jansen and Perez Jiménez (2009). The resulting TEI dictionary will be integrated into a language documentation project dealing with Mixtepec-Mixtec (ISO 639-3: mix) (Bowers & Romary, 2017, 2018a, 2018b), an under-resourced indigenous language native to the Juxtlahuaca district of Oaxaca, Mexico.
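
    As a rough illustration of the target output format, the sketch below emits a dictionary entry using elements from the TEI dictionary module (entry, form, orth, sense, def). The sample content is a placeholder, not material from the Alvarado dictionary.

```python
# Sketch: building a TEI-style dictionary entry with lxml.
# The headword/definition strings below are placeholders.
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

def tei_entry(headword, definition):
    entry = etree.Element("{%s}entry" % TEI_NS, nsmap={None: TEI_NS})
    form = etree.SubElement(entry, "{%s}form" % TEI_NS, type="lemma")
    etree.SubElement(form, "{%s}orth" % TEI_NS).text = headword
    sense = etree.SubElement(entry, "{%s}sense" % TEI_NS)
    etree.SubElement(sense, "{%s}def" % TEI_NS).text = definition
    return entry

print(etree.tostring(tei_entry("headword", "its definition"),
                     pretty_print=True).decode())
```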

    Enhancing Usability for Automatically Structuring Digitised Dictionaries

    The last decade has seen rapid growth in the number of NLP tools made available to the community. Yet the usability of several e-lexicography tools remains a serious obstacle for researchers with little or no background in computer science. We present in this paper our efforts to overcome this issue in the case of a machine learning system for the automatic segmentation and semantic annotation of digitised dictionaries. Our approach is based on reducing the burden of managing the tool's setup in different execution environments and on lightening the complexity of the training process. We illustrate the possibility of reaching this goal through the adaptation of existing functionalities and through the use of out-of-the-box software deployment technology. We also report on the community's feedback after exposing the new setup to real users from different professional backgrounds.

    Les catalogues et GROBID

    Doctora

    Selling autograph manuscripts in 19th c. Paris: digitising the Revue des Autographes

    In Paris, the manuscript market emerged in the early 1820s. Fixed-price catalogues and auction catalogues were regularly published, describing each document for sale in detail. Since such descriptions are highly formalised, it is possible to extract and structure them (almost) automatically, and thus to create a database of the manuscripts sold in 19th c. Paris; a sketch of this kind of rule-based extraction follows.
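
    A minimal sketch of such rule-based structuring, assuming a hypothetical entry layout (author heading, document type, date, asking price). Real Revue des Autographes entries would need their own patterns, or a trained model as in the GROBID work above.

```python
# Sketch: parsing one formalised catalogue description with a regex.
# The layout, field names, and the sample entry are hypothetical.
import re

ENTRY_RE = re.compile(
    r"^(?P<author>[^.]+)\.\s+"               # author heading, e.g. "Hugo (Victor)"
    r"(?P<doctype>L\.a\.s\.|P\.a\.s\.)\s+"   # catalogue abbreviation for the document type
    r"(?P<date>\d{4})?,?\s*"                 # optional year
    r".*?"                                   # free-text description, skipped lazily
    r"(?P<price>\d+\s?fr\.)\s*$"             # asking price in francs
)

sample = "Hugo (Victor). L.a.s. 1843, to an unnamed correspondent. 25 fr."
m = ENTRY_RE.match(sample)
if m:
    print(m.groupdict())
    # {'author': 'Hugo (Victor)', 'doctype': 'L.a.s.',
    #  'date': '1843', 'price': '25 fr.'}
```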