6 research outputs found

Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm

Author: Barrett Neil
Weber-Jahnke Jens
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract: Background: Tokenization is an important component of language processing, yet there is no widely accepted tokenization method for English texts, including biomedical texts. Other than rule-based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer’s output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text. Results: Medpost and our adapted Viterbi tokenizer performed best, with accuracies of 92.9% and 92.4% respectively. Conclusions: Our evaluation of our design pattern and guidelines supports our claim that they are a viable approach to tokenizer construction (producing tokenizers matching leading custom-built tokenizers in a particular domain). Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging. In doing so, POS tag sequences and training data have a significant impact on proper text tokenization.

Springer - Publisher Connector

PubMed Central

Carolina Digital Repository

Topics in machine learning for biomedical literature analysis and text retrieval

Author: Islamaj Doğan Rezarta
Yeganova Lana
Publication venue: BioMed Central
Publication date: 01/01/2011

Springer - Publisher Connector

PubMed Central

Biomedical Term Extraction: NLP Techniques in Computational Medicine

Author: Campillos Llanos Leonardo
Díaz Julia
Moreno Sandoval Antonio
Redondo Teófilo
Publication venue: 'Universidad Internacional de La Rioja'
Publication date: 14/02/2022

Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields. Medicine is one field that has gathered a lot of attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task on technical texts is performed with an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, together with several tools for making optimal use of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype.

Re-UNIR

Semi-automatische Verschlagwortung zur Integration externer semantischer Inhalte innerhalb einer medizinischen Kooperationsplattform

Author: Carevic Zeljko
Publication venue: Köln : Hochschulbibliothek der Technischen Hochschule Köln
Publication date: 31/07/2012

With 21 million article citations, PubMed is one of the most extensive information systems in the field of medicine. Using a uniform terminology (Medical Subject Headings, MeSH) to index PubMed content makes it easier to navigate such large data collections. Although a controlled vocabulary offers numerous advantages over free-text search for information retrieval, users often find it difficult to map an information need onto the terminology used. This thesis develops system support that automates the mapping process by automatically assigning subject headings to text-based content using a controlled vocabulary. Using a uniform terminology in this way allows PubMed content to be integrated consistently.

ePublications

Séquençage d’exomes d’une cohorte de familles caucasiennes simplex dont les patients sont atteints du syndrome d’interruption de la tige hypophysaire

Author: Jean-Louis Martineau
Publication date: 01/04/2017

Pituitary stalk interruption syndrome (PSIS) is a rare disorder that affects the function of the endocrine system. To date, magnetic resonance imaging (MRI) remains the most widely used method for assessing in vivo the abnormal organogenesis of the pituitary stalk in patients. Absence of the stalk characterizes a permanent growth hormone (GH) deficiency, while the etiology of the syndrome remains unknown. PSIS is defined as congenital hypopituitarism and is characterized by an ectopic posterior pituitary, a hypoplastic anterior pituitary, or a hypoplastic pituitary stalk. Our objective is to identify genetic mutations shared among the affected subjects in the study that could explain the causes of the syndrome. To this end, we analyzed whole-exome sequencing (WES) data from seven simplex Caucasian families, one family of Arab origin, and five others with incomplete pedigrees. These data had previously been analyzed, for the same purpose, by other members of the team using the standard bioinformatics pipeline based on the GATK software. We chose to perform a new analysis using two different, independent bioinformatics pipelines and then to compare the results obtained. Our protocol consists of assembling, first, two alternative pipelines for detecting single-nucleotide variants (SNVs), composed of a sequence alignment tool (Bowtie2) and two variant callers (Freebayes and SAMtools). We then assembled three pipelines for detecting genomic copy number variations (CNVs), which share a sequence alignment tool (BWA) and use three CNV callers (CoNIFER, fishingCNV, xHmm). Our results allowed us to identify additional candidate mutations that had never been identified before. In addition, our methodology allowed us to characterize certain false-positive results, and could therefore help improve the performance of existing pipelines for detecting genomic variations.

Pituitary stalk interruption syndrome (PSIS) is a rare disorder that affects the function of the endocrine system of affected individuals. The absence of the pituitary stalk, assessed by MRI, characterizes patients with permanent growth hormone deficiency, while the etiology of the syndrome remains unknown. PSIS is defined by clinical hypopituitarism together with anatomical findings including a hypoplastic anterior pituitary, an ectopic posterior pituitary, and a reduced or hypoplastic pituitary stalk. We aim to find shared variations (SNP, CNV) in coding regions among affected patients that could explain the origin of the syndrome. We analyzed exome NGS data from 8 affected French Canadian trio families, one additional consanguineous Arabic trio family, and 5 families with incomplete pedigrees. These data were previously analyzed, for the same objective, by other members of the team using a standardized GATK-based bioinformatics pipeline. We set out to reanalyze the complete data set with two other independent pipelines and then compare the SNP discovery results. In the present part of the study, we built two SNP discovery pipelines, both using the NGS read aligner Bowtie2 and each using a different variant caller (Freebayes, SAMtools), and a CNV discovery pipeline composed of three different CNV callers (CoNIFER, fishingCNV, xHmm). In addition to the candidate mutations identified in the previous analysis, we identified additional candidate mutations that had not previously been detected or reported. Furthermore, our method helps to uncover sources of false variant discoveries, which could help improve existing genomic mutation discovery pipelines.

Dépôt Institutionnel Numérique

Séquençage d’exomes d’une cohorte de familles caucasiennes simplex dont les patients sont atteints du syndrome d’interruption de la tige hypophysaire

Author: Bal y Gay Jesús, 1905-1993
Carballeira Johan, 1902-1937
Castelao 1886-1950
Fernández Mato Ramón, 1889-1980
Otero Pedrayo Ramón, 1888-1976
Paz-Andrade Valentín, 1898-1987
Valladares Núñez Marcial, 1821-1903
Publication date: 01/01/1973

Dépôt Institutionnel Numérique

Galiciana