6 research outputs found
Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm
Abstract: Background: Tokenization is an important component of language processing, yet there is no widely accepted tokenization method for English texts, including biomedical texts. Other than rule-based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer's output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text. Results: Medpost and our adapted Viterbi tokenizer performed best, with accuracies of 92.9% and 92.4% respectively. Conclusions: Our evaluation of our design pattern and guidelines supports our claim that they are a viable approach to tokenizer construction (producing tokenizers matching leading custom-built tokenizers in a particular domain). Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging. In this disambiguation, POS tag sequences and training data have a significant impact on proper text tokenization.
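As an illustration of the idea in this abstract, the following minimal sketch picks one tokenization out of a lattice of competing candidates by scoring POS tag sequences with a Viterbi-style search. It is not the authors' implementation: the function name `best_tokenization`, the edge format, the `UNSEEN` penalty, and every log-probability below are assumptions made up for the example.

```python
from collections import defaultdict

UNSEEN = -20.0  # assumed log-probability for tag transitions missing from `trans`


def best_tokenization(length, edges, trans, start_tag="<s>"):
    """Viterbi-style search over a token lattice.

    edges: candidate tokens as (start, end, token, tag, emission_logp),
           where start < end are positions in the text.
    trans: dict mapping (previous_tag, tag) to a transition log-probability.
    Returns the best-scoring list of (token, tag) covering [0, length).
    """
    by_start = defaultdict(list)
    for edge in edges:
        by_start[edge[0]].append(edge)

    # best[(position, tag)] = (score, previous state, token that led here)
    best = {(0, start_tag): (0.0, None, None)}
    for pos in range(length):
        states = [state for state in best if state[0] == pos]
        for state in states:
            score = best[state][0]
            for _, end, token, tag, emit in by_start[pos]:
                cand = score + trans.get((state[1], tag), UNSEEN) + emit
                key = (end, tag)
                if key not in best or cand > best[key][0]:
                    best[key] = (cand, state, token)

    finals = [(value[0], key) for key, value in best.items() if key[0] == length]
    if not finals:
        return []  # no candidate path covers the whole text

    # Follow back-pointers from the best final state to the start.
    _, state = max(finals)
    path = []
    while best[state][1] is not None:
        _, prev, token = best[state]
        path.append((token, state[1]))
        state = prev
    return path[::-1]


# Hypothetical lattice for the chunks "IL", "-", "2": one joined reading
# ("IL-2") competes with a three-token reading. All scores are invented.
edges = [
    (0, 3, "IL-2", "NN", -2.0),
    (0, 1, "IL", "NN", -1.5),
    (1, 2, "-", "SYM", -0.5),
    (2, 3, "2", "CD", -0.5),
]
trans = {("<s>", "NN"): -0.5, ("NN", "SYM"): -3.0, ("SYM", "CD"): -1.0}

result = best_tokenization(3, edges, trans)
print(result)
```

With these invented scores the joined reading wins because the tag sequence NN, SYM, CD is penalized; different training data would change the transition scores and could change the chosen tokenization.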
Biomedical Term Extraction: NLP Techniques in Computational Medicine
Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are major contributors to recent advances in classifying documentation and extracting information in assorted fields. Medicine is one field that has attracted a lot of attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task on technical texts is performed by an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, along with several tools for making the best use of the information that corpus contains. This paper also shows how these techniques and tools have been used in a prototype.
Semi-automatic keyword indexing for integrating external semantic content within a medical collaboration platform (Semi-automatische Verschlagwortung zur Integration externer semantischer Inhalte innerhalb einer medizinischen Kooperationsplattform)
With 21 million article citations, PubMed is one of the most extensive information systems in the field of medicine. Using a uniform terminology (Medical Subject Headings, MeSH) to index PubMed content makes it easier to navigate such large data collections. Although a controlled vocabulary offers numerous advantages over free-text search when retrieving information, users often find it difficult to map an information need onto the terminology used. This work creates system support that automates the mapping process by automatically indexing text-based content with keywords from a controlled vocabulary. Using a uniform terminology in this way makes a consistent integration of PubMed content possible.
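To make the mapping from free text to a controlled vocabulary concrete, here is a minimal sketch that assigns headings by simple dictionary lookup. The abstract does not say how the described system performs the indexing, so the lookup strategy, the function name `index_text`, and the tiny `VOCAB` table (a toy mapping, not actual MeSH data) are all assumptions for illustration.

```python
import re

# Toy mapping from entry terms to headings; invented for this example.
VOCAB = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "high blood pressure": "Hypertension",
}


def index_text(text, vocab=VOCAB):
    """Return the sorted headings whose entry terms occur in the text."""
    lowered = text.lower()
    found = set()
    for term, heading in vocab.items():
        # Word boundaries avoid matching a term inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(heading)
    return sorted(found)


headings = index_text("Patient with high blood pressure had a heart attack.")
print(headings)
```

Because synonyms map to the same heading, two texts that use different wording end up with the same keywords, which is the consistency benefit the abstract attributes to a uniform terminology.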
Exome sequencing of a cohort of simplex Caucasian families whose patients have pituitary stalk interruption syndrome (Séquençage d'exomes d'une cohorte de familles caucasiennes simplex dont les patients sont atteints du syndrome d'interruption de la tige hypophysaire)
Pituitary stalk interruption syndrome (PSIS) is a rare disorder that affects the function of the endocrine system. To date, magnetic resonance imaging (MRI) remains the most widely used method for assessing in vivo the abnormal organogenesis of the pituitary stalk in patients. Absence of the stalk characterizes a permanent growth hormone (GH) deficiency, while the etiology of the syndrome remains unknown. PSIS is defined as congenital hypopituitarism and is characterized by an ectopic posterior pituitary, a hypoplastic anterior pituitary, or a hypoplastic pituitary stalk. Our objective is to determine the genetic mutations that are shared among the affected subjects in the study and that could explain the causes of the syndrome. To this end, we analyzed whole-exome sequencing (WES) data from seven simplex Caucasian families, one family of Arab origin, and five others with incomplete pedigrees. These data had previously been analyzed, for the same purpose, by other members of the team using the standard bioinformatics pipeline based on the GATK software. We opted for a new analysis using two different, independent bioinformatics pipelines and then compared the results obtained. Our protocol first assembles two alternative pipelines for detecting single-nucleotide variants (SNVs), composed of one sequence aligner (Bowtie2) and two variant callers (Freebayes and SAMtools). We then assembled three pipelines for detecting genomic copy number variations (CNVs), which share one sequence aligner (BWA) and use three CNV callers (CoNIFER, fishingCNV, xHmm). Our results allowed us to identify additional candidate mutations that had never been identified before. In addition, our methodology allowed us to characterize some false-positive results, so it may help us improve the performance of existing pipelines for detecting genomic variations.
Pituitary stalk interruption syndrome (PSIS) is a rare disorder that affects the function of the
endocrine system of affected individuals. The absence of the pituitary stalk, assessed by MRI,
characterizes patients with permanent growth hormone deficiency, while the etiology of the
syndrome remains unknown. PSIS is defined by clinical hypopituitarism together with anatomical
findings including a hypoplastic anterior pituitary, an ectopic posterior pituitary and a reduced or
hypoplastic pituitary stalk. We aim to find shared variations (SNP, CNV) among affected patients
in coding regions which could explain the origin of the syndrome. We analyzed the exome NGS
data from 8 affected French Canadian trio families, one additional consanguineous Arabic
trio family and 5 families with incomplete pedigrees. These data were previously analyzed, for the
same objective, by other members of the team using a standardized GATK-based bioinformatics
pipeline. We reanalyzed the complete data set with two other independent pipelines
and then compared the SNP discovery results. In the present part of the study, we built
two SNP discovery pipelines, both using a different NGS data aligner (Bowtie2) and each
using a different variant caller (Freebayes, SAMtools), and then a CNV discovery pipeline
composed of three different CNV callers (CoNIFER, fishingCNV, xHmm). In addition to the
candidate mutations identified in the previous analysis, we identified additional candidate
mutations which had not been detected or reported before. Furthermore, our method helps
to identify the sources of false variant discoveries, which could help to improve existing genomic
mutation discovery pipelines.
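The comparison of SNP discovery results across independent pipelines can be pictured as set operations on variant calls. The sketch below is only an illustration under stated assumptions: the function name `concordance`, the (chromosome, position, ref, alt) key, the pipeline labels, and the example variants are hypothetical, and treating single-pipeline calls as candidates for false-positive review is an assumption of this sketch, not a rule stated in the abstract.

```python
from collections import Counter


def concordance(call_sets):
    """Split variant calls into those shared by all pipelines and those
    reported by exactly one pipeline.

    call_sets: dict mapping a pipeline name to a set of
               (chromosome, position, ref, alt) tuples.
    """
    support = Counter()
    for calls in call_sets.values():
        support.update(calls)  # each set adds one vote per variant

    shared = {v for v, n in support.items() if n == len(call_sets)}
    unique = {
        name: {v for v in calls if support[v] == 1}
        for name, calls in call_sets.items()
    }
    return shared, unique


# Invented calls, for illustration only.
call_sets = {
    "bowtie2+freebayes": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr3", 42, "G", "A")},
    "bowtie2+samtools": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr5", 77, "T", "C")},
}

shared, unique = concordance(call_sets)
print(sorted(shared))
print({name: sorted(calls) for name, calls in unique.items()})
```

Calls in `shared` are supported by every pipeline, while calls in `unique` are the ones worth inspecting by hand when looking for pipeline-specific artifacts.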