6 research outputs found

    Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm

    Abstract: Background: Tokenization is an important component of language processing, yet there is no widely accepted tokenization method for English texts, including biomedical texts. Apart from rule-based techniques, tokenization in the biomedical domain has been treated as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer's output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text. Results: Medpost and our adapted Viterbi tokenizer performed best, with accuracies of 92.9% and 92.4%, respectively. Conclusions: Our evaluation of our design pattern and guidelines supports our claim that they are a viable approach to tokenizer construction (producing tokenizers that match leading custom-built tokenizers in a particular domain). Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging. In doing so, POS tag sequences and training data have a significant impact on proper text tokenization.
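
    The abstract above names a token lattice and an adapted Viterbi algorithm but gives no details, so the following is only a generic sketch of the idea, not the authors' method. It assumes the lattice is a list of candidate tokens given as (start, end, tag) character spans, and that paths are scored with bigram tag-transition log-probabilities. The function name, the tags and all probabilities are hypothetical toy values.

```python
import math
from collections import defaultdict

def best_tokenization(text, edges, trans, default=-10.0):
    """Pick the highest-scoring path through a token lattice.

    edges: candidate tokens as (start, end, tag) with end > start.
    trans: dict mapping (previous_tag, tag) to a log-probability.
    default: log-probability used for unseen tag pairs (an assumption).
    Returns a list of (token_text, tag), or None if no path spans the text.
    """
    n = len(text)
    by_start = defaultdict(list)
    for start, end, tag in edges:
        by_start[start].append((end, tag))

    # best[(position, last_tag)] = (score, previous_state)
    best = {(0, "<s>"): (0.0, None)}
    for pos in range(n):
        # Edges only move forward, so states at `pos` are final here.
        for state in [s for s in best if s[0] == pos]:
            score = best[state][0]
            for end, tag in by_start[pos]:
                cand = score + trans.get((state[1], tag), default)
                key = (end, tag)
                if key not in best or cand > best[key][0]:
                    best[key] = (cand, state)

    end_state = max((s for s in best if s[0] == n),
                    key=lambda s: best[s][0], default=None)
    if end_state is None:
        return None

    # Follow back-pointers from the end of the text to the start.
    tokens = []
    state = end_state
    while best[state][1] is not None:
        prev = best[state][1]
        tokens.append((text[prev[0]:state[0]], state[1]))
        state = prev
    return tokens[::-1]

# Toy lattice for "mg/kg": one unit token, or three separate tokens.
example_edges = [(0, 5, "UNIT"), (0, 2, "NOUN"), (2, 3, "SYM"), (3, 5, "NOUN")]
example_trans = {
    ("<s>", "UNIT"): math.log(0.6),
    ("<s>", "NOUN"): math.log(0.4),
    ("NOUN", "SYM"): math.log(0.5),
    ("SYM", "NOUN"): math.log(0.5),
}
print(best_tokenization("mg/kg", example_edges, example_trans))
```

    With these toy scores the single-token reading wins; lowering the ("<s>", "UNIT") probability makes the three-token path win instead, which illustrates how tag sequences can decide between competing tokenizations.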

    Biomedical Term Extraction: NLP Techniques in Computational Medicine

    Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields. Medicine is one field that has gathered a lot of attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task on technical texts is performed by an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, together with several tools for making optimal use of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype.

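    Semi-automatic keyword indexing for integrating external semantic content within a medical collaboration platform [Semi-automatische Verschlagwortung zur Integration externer semantischer Inhalte innerhalb einer medizinischen Kooperationsplattform]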

    With 21 million article citations, PubMed is one of the most comprehensive information systems in medicine. Using a uniform terminology (Medical Subject Headings, MeSH) to index PubMed content makes it easier to navigate such large data collections. Although a controlled vocabulary offers numerous advantages over free-text search when retrieving information, users often find it difficult to map an information need onto the terminology used. This work creates system support that automates the mapping process by automatically indexing text-based content with keywords from a controlled vocabulary. Using a uniform terminology in this way allows PubMed content to be integrated consistently.
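
    As a rough illustration of indexing free text against a controlled vocabulary, the sketch below maps entry terms found in a text to their preferred headings. It is a minimal dictionary-lookup example under stated assumptions, not the system described in this work: the function name is hypothetical and the three-entry vocabulary is a hand-written toy, not an extract of MeSH.

```python
import re

def index_text(text, vocabulary):
    """Return the sorted headings whose entry terms occur in `text`.

    vocabulary: dict mapping a lower-cased entry term to its preferred
    heading (several entry terms may share one heading).
    """
    lowered = text.lower()
    headings = set()
    for term, heading in vocabulary.items():
        # Match whole words only, so "aspirin" does not match "aspirins".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            headings.add(heading)
    return sorted(headings)

# Hypothetical toy vocabulary for illustration only.
toy_vocabulary = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "aspirin": "Aspirin",
}

print(index_text("Aspirin use after a heart attack", toy_vocabulary))
```

    Mapping synonyms to one preferred heading is what lets a user's free-text wording line up with the indexed terminology.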

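    Exome sequencing of a cohort of simplex Caucasian families whose patients have pituitary stalk interruption syndrome [Séquençage d'exomes d'une cohorte de familles caucasiennes simplex dont les patients sont atteints du syndrome d'interruption de la tige hypophysaire]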

    French abstract (translated): Pituitary stalk interruption syndrome (PSIS) is a rare disorder that affects the function of the endocrine system. To date, magnetic resonance imaging (MRI) remains the most widely used method for assessing in vivo the abnormal organogenesis of the pituitary stalk in patients. Absence of the stalk characterizes a permanent growth hormone (GH) deficiency, while the etiology of the syndrome remains unknown. PSIS is defined as congenital hypopituitarism and is characterized by an ectopic posterior pituitary, by anterior pituitary hypoplasia, or by hypoplasia of the pituitary stalk. Our objective is to determine the genetic mutations shared among the affected subjects in the study that could explain the causes of the syndrome. To this end, we analyzed whole-exome sequencing (WES) data from seven simplex Caucasian families, one family of Arab origin, and five others with incomplete pedigrees. These data had previously been analyzed, for the same purpose, by other members of the team using the standard bioinformatics pipeline based on the GATK software. We chose to carry out a new analysis using two different, independent bioinformatics pipelines and then to compare the results. Our protocol consists of assembling, first, two alternative pipelines for detecting point mutations (SNV), composed of one sequence aligner (Bowtie2) and two variant callers (Freebayes and SAMtools). Next, we assembled three pipelines for detecting genomic copy number variations (CNV), sharing one sequence aligner (BWA) and using three CNV callers (CoNIFER, fishingCNV, xHmm). Our results allowed us to identify additional candidate mutations that had never been identified before. In addition, our methodology allowed us to characterize certain false-positive results, and it could therefore help us improve the performance of existing pipelines for detecting genomic variation.

    English abstract: Pituitary stalk interruption syndrome (PSIS) is a rare disorder that affects the function of the endocrine system of affected individuals. The absence of the pituitary stalk, assessed by MRI, characterizes patients with permanent growth hormone deficiency, while the etiology of the syndrome remains unknown. PSIS is defined by clinical hypopituitarism together with anatomical findings including a hypoplastic anterior pituitary, an ectopic posterior pituitary and a reduced or hypoplastic pituitary stalk. We aim to find shared variations (SNP, CNV) in coding regions among affected patients that could explain the origin of the syndrome. We analyzed the exome NGS data from 8 affected French Canadian trio families, with one additional consanguineous Arabic trio family and 5 families with incomplete pedigrees. These data were previously analyzed, for the same objective, by other members of the team using a standardized GATK-based bioinformatics pipeline. We wished to reanalyze the complete data set with two other independent pipelines and then compare the SNP discovery results. In the present part of the study, we built two SNP discovery pipelines, both built around one NGS data aligner (Bowtie2) and each using a different variant caller (Freebayes, SAMtools), and then a CNV discovery pipeline composed of three different CNV callers (CoNIFER, fishingCNV, xHmm). In addition to the candidate mutations identified in the previous analysis, we identified additional candidate mutations that had not been detected and have never been reported. Furthermore, our method helps to uncover the sources of false variant discoveries, which could help improve existing genomic mutation discovery pipelines.
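
    The comparison step described above (calls from two independent pipelines, then variants shared among affected patients) can be pictured with plain set operations. This is a minimal sketch under stated assumptions, not the authors' protocol: variants are reduced to (chromosome, position, reference, alternate) tuples, the function names are hypothetical, and the coordinates are made-up toy values.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str

def compare_callsets(calls_a, calls_b):
    """Split two callers' variants into shared and caller-specific sets."""
    a, b = set(calls_a), set(calls_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}

def shared_across_patients(calls_by_patient):
    """Return the variants present in every patient's call set."""
    sets = [set(calls) for calls in calls_by_patient.values()]
    return set.intersection(*sets) if sets else set()

# Hypothetical toy calls from two callers for one sample.
caller_a = [Variant("chr1", 100, "A", "G"), Variant("chr2", 200, "C", "T")]
caller_b = [Variant("chr1", 100, "A", "G"), Variant("chr3", 300, "G", "A")]

result = compare_callsets(caller_a, caller_b)
print(sorted(result["both"]))
print(sorted(result["only_a"]))
print(sorted(result["only_b"]))
```

    Calls reported by only one caller are natural candidates for false-positive review, while calls found by both and shared across patients are the ones worth prioritizing.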
