    Segmenting DNA sequence into words based on statistical language model

    This paper presents a novel method to segment (decode) DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. The bound on the language entropy of a DNA sequence is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised “probability approach to word segmentation” method to segment DNA sequences. A benchmark for the segmentation method is also proposed. In a cross-segmentation test, we find that different genomes may use similar languages that nevertheless belong to different branches, much like English and French/Latin. Finally, we present some possible applications of this method.
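    As an illustration of what a probability-based segmentation can look like, the sketch below splits a sequence into the most probable word sequence under a unigram model, using dynamic programming. It is a minimal sketch under stated assumptions, not the authors' method: the function name `segment`, the toy vocabulary `toy_probs`, and the fallback score `unk_logp` for unknown single bases are all hypothetical, and the abstract does not say which n-gram order or search procedure the paper uses.

```python
import math

def segment(seq, word_probs, max_len=15, unk_logp=-20.0):
    """Return the most probable split of seq into words (unigram model).

    word_probs maps candidate words to probabilities (hypothetical values).
    max_len caps the word length; 15 mirrors the 12-15 bp range reported above.
    unk_logp is an assumed log-probability for a single unknown base.
    """
    n = len(seq)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of seq[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = seq[j:i]
            if piece in word_probs:
                logp = math.log(word_probs[piece])
            elif len(piece) == 1:
                logp = unk_logp      # fallback keeps every prefix reachable
            else:
                continue
            score = best[j] + logp
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(seq[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy vocabulary with made-up probabilities, for illustration only.
toy_probs = {"ATG": 0.4, "GCC": 0.3, "TAA": 0.2, "AT": 0.05, "G": 0.05}
print(segment("ATGGCCTAA", toy_probs))
```

    In an unsupervised setting the word probabilities would themselves be estimated from the genome rather than supplied by hand; that estimation step is outside this sketch.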

    Apport du TAL à la constitution et l'exploitation d'un corpus scolaire de cours préparatoire

    Interest in school learner corpora is growing, but it runs up against the size of these corpora and hence the difficulty of a fully manual analysis. Methods borrowed from natural language processing (NLP; in French, TAL, traitement automatique des langues) could help exploit these corpora. This is nonetheless a challenge for NLP, because texts written by children of this age contain many errors and lie far from standard spelling. Our work aims to adapt proven NLP techniques to facilitate building and exploiting a corpus collected in a first-grade class (cours préparatoire, CP). The goal is twofold. First, we propose an initial definition of a tool that meets the needs of research in linguistics and didactics. Second, for NLP, we aim to characterize and model a type of writing that is distant from the norm. In this master's thesis we present a first error-annotation schema and directions for the automatic analysis of this type of corpus.

    A two-stage statistical word segmentation system for Chinese

    In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and word-formation models. The system was evaluated on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also report and discuss the results of this evaluation.