10 research outputs found

    Pattern Mining for Named Entity Recognition

    Many evaluation campaigns have shown that knowledge-based and data-driven approaches remain equally competitive for Named Entity Recognition (NER). Our research team has developed CasEN, a symbolic system based on finite-state transducers, which achieved promising results during the Ester2 French-speaking evaluation campaign. Despite these encouraging results, manually extending the coverage of such a hand-crafted system is a difficult task. In this paper, we present a novel pattern-mining approach to NER that also supplements our system's knowledge base. The system, mXS, exhaustively searches for hierarchical sequential patterns that aim at detecting Named Entity boundaries. We assess their efficiency by using such patterns in a standalone mode and in combination with our existing system.

    Similarité entre textes basées sur les noms propres (Similarity between texts based on proper names)

    Abstract: Proper names make up about 10% of the text of a newspaper article, in English or in French. Their quantity and informational quality are already exploited by information extraction systems, and proper names have been widely studied in the MUC conferences, which were designed to promote research in information extraction. We have created our own named entity extraction tool based on a linguistic description in the form of finite-state transducers. The extracted proper names are then used for information retrieval: newspaper texts are presented to users as a hierarchy, together with a description of the topics covered by the clusters. In this article we present an automatic text similarity measure that merges a similarity based on all the words with a similarity based on the proper names, and we assess the benefit of using proper names in such a measure to improve clustering. Keywords: Similarity / Hierarchical clustering / Proper names. Revue d'Information Scientifique & Technique Vol.12(2) 2002: 61-7

    ANERsys: An Arabic Named Entity Recognition System Based on Maximum Entropy


    Building a Dictionary of Anthroponyms


    Using Information Extraction to Build a Directory of Conference Announcements


    Finite-State Transducer Cascade to Extract Proper Names in Texts


    Descriptional Complexity of Iterated Uniform Finite-State Transducers

    We introduce the deterministic computational model of an iterated uniform finite-state transducer (IUFST). An IUFST performs the same length-preserving transduction on several left-to-right sweeps. The first sweep takes place on the input string, while any other sweep processes the output of the previous one. The IUFST accepts or rejects upon halting in an accepting or rejecting state along its sweeps. First, we focus on constant sweep bounded IUFSTs. We study their descriptional power vs. deterministic finite automata, and the state cost of implementing language operations. Then, we focus on non-constant sweep bounded IUFSTs, showing a nonregular language hierarchy depending on sweep complexity. The hardness of some classical decision problems on constant sweep bounded IUFSTs is also investigated.
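
The sweep mechanism described in this abstract can be illustrated with a minimal sketch. It is not the authors' formal definition: the names `run_iufst`, `TOY_DELTA` and `accepts_power_of_two`, the `max_sweeps` safety bound, and the choice to test for halting only at the end of each sweep are assumptions made for illustration. The toy machine crosses out every second `a` on each sweep and accepts when exactly one `a` is left.

```python
from typing import Dict, Set, Tuple

# (state, input symbol) -> (next state, output symbol). One output symbol is
# written per input symbol, so every sweep is length-preserving.
Transition = Dict[Tuple[str, str], Tuple[str, str]]


def run_iufst(word: str, delta: Transition, start: str,
              accepting: Set[str], rejecting: Set[str],
              max_sweeps: int = 64) -> Tuple[bool, int]:
    """Return (accepted, sweeps used). Hypothetical helper, not from the paper."""
    tape = list(word)
    for sweep in range(1, max_sweeps + 1):
        state = start              # every sweep restarts in the same initial state
        output = []
        for symbol in tape:        # one left-to-right pass over the tape
            state, written = delta[(state, symbol)]  # KeyError if undefined
            output.append(written)
        if state in accepting:
            return True, sweep
        if state in rejecting:
            return False, sweep
        tape = output              # the next sweep reads this sweep's output
    raise RuntimeError("no halting state reached within max_sweeps")


# Toy transducer (an assumption for illustration): the state counts the a's
# seen in the current sweep as none / one / even (>= 2) / odd (>= 3).
TOY_DELTA: Transition = {
    ("none", "a"): ("one", "a"),   # keep the 1st, 3rd, 5th, ... a
    ("one", "a"): ("even", "x"),   # cross out the 2nd, 4th, ... a
    ("even", "a"): ("odd", "a"),
    ("odd", "a"): ("even", "x"),
    ("none", "x"): ("none", "x"),  # crossed-out cells are copied unchanged
    ("one", "x"): ("one", "x"),
    ("even", "x"): ("even", "x"),
    ("odd", "x"): ("odd", "x"),
}


def accepts_power_of_two(word: str) -> Tuple[bool, int]:
    """Accept a^n exactly when n is a power of two (n >= 1)."""
    return run_iufst(word, TOY_DELTA, "none",
                     accepting={"one"}, rejecting={"none", "odd"})


print(accepts_power_of_two("aaaa"))    # (True, 3)
print(accepts_power_of_two("aaaaaa"))  # (False, 2)
```

In this toy the number of sweeps grows with the input length, so it falls in the non-constant sweep regime, and the language it accepts is nonregular.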

    Iterated Uniform Finite-State Transducers: Descriptional Complexity of Nondeterminism and Two-Way Motion

    An iterated uniform finite-state transducer executes the same length-preserving transduction in iterative sweeps. The first sweep occurs on the input string, while any subsequent sweep works on the output of the previous one. We consider devices with one-way motion and two-way motion, i.e., sweeps are either from left to right only, or alternate from left to right and from right to left. In addition, devices may work deterministically or nondeterministically. Here, we restrict our study to devices performing a constant number of sweeps, which are known to characterize exactly the regular languages. We determine the descriptional costs of removing two-way motion, nondeterminism, and sweeps, and, in particular, the costs for the conversion to deterministic or nondeterministic finite automata. Finally, the special case of unary languages is investigated, and a language family is presented that is immune to the resources of nondeterminism and two-way motion, in the sense that neither resource can reduce the number of states or the number of sweeps.