2,009 research outputs found

    A Silver Standard Corpus of Human Phenotype-Gene Relations

    Human phenotype-gene relations are fundamental to fully understanding the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations; however, Relation Extraction tools are needed to recognize them automatically. Most of these tools require an annotated corpus and, to the best of our knowledge, no available corpus is annotated with human phenotype-gene relations. This paper presents the Phenotype-Gene Relations (PGR) corpus, a silver standard corpus of human phenotype and gene annotations and their relations. The corpus consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations. We generated this corpus using Named-Entity Recognition tools, whose results were partially evaluated by eight curators, obtaining a precision of 87.01%. Using the corpus, we obtained promising results with two state-of-the-art deep learning tools, namely a precision of 78.05%. The PGR corpus was made publicly available to the research community. Comment: NAACL 201

    Extracting phenotype-gene relations from biomedical literature using distant supervision and deep learning

    Master's thesis in Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2019. Human phenotype-gene relations are fundamental to fully understanding the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations. Several relation extraction tools have been proposed to identify relations between concepts in highly heterogeneous or unstructured text, namely using distant supervision and deep learning algorithms. However, most of these tools require an annotated corpus, and no available corpus is annotated with human phenotype-gene relations. This work presents the Phenotype-Gene Relations (PGR) corpus, a silver standard corpus of human phenotype and gene annotations and their relations (generated in a fully automated manner), and two relation extraction modules: one using a distantly supervised multi-instance learning algorithm and one using an ontology-based deep learning algorithm. The PGR corpus consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations.
The corpus results were partially evaluated by eight curators, all working in the fields of Biology and Biochemistry, obtaining a precision of 87.01%, with an inter-curator agreement score of 87.58%. Distant supervision (or weak supervision) approaches combine an unlabeled corpus with a knowledge base to identify and extract entities from text, reducing the amount of manual annotation effort required. Distantly supervised multi-instance learning takes advantage of distant supervision and a sparse multi-instance learning algorithm to train a relation extraction classifier, using a gold standard knowledge base of human phenotype-gene relations. Deep learning relation extraction tools for biomedical text mining tasks rarely take advantage of existing domain-specific resources, such as biomedical ontologies. Biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. This work used the Human Phenotype Ontology and the Gene Ontology to represent each candidate pair as the sequence of relations between its ancestors in each ontology. The PGR test set was applied to the developed relation extraction modules, obtaining promising results, namely an F-measure of 55.00% (deep learning module) and 73.48% (distantly supervised multi-instance learning module). The test set was also applied to BioBERT, a pre-trained biomedical language representation model for biomedical text mining, obtaining an F-measure of 67.16%.
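
The distant supervision idea described above (combining unlabeled text with a knowledge base to label candidate pairs) can be illustrated with a minimal sketch. This is not the thesis's implementation: the entity names, the toy knowledge base, and the `label_candidates` helper are all hypothetical, and the entities are assumed to have been recognized already by a Named-Entity Recognition step.

```python
from itertools import product

# Hypothetical toy knowledge base of known (phenotype, gene) relations.
KNOWN_RELATIONS = {("PHENOTYPE_A", "GENE_X")}

# Hypothetical sentences with entities already recognized by an NER step.
SENTENCES = [
    {"text": "sentence 1", "phenotypes": ["PHENOTYPE_A"], "genes": ["GENE_X", "GENE_Y"]},
    {"text": "sentence 2", "phenotypes": ["PHENOTYPE_B"], "genes": ["GENE_X"]},
]


def label_candidates(sentences, known_relations):
    """Label every phenotype-gene pair that co-occurs in a sentence.

    A pair is labeled True when the knowledge base lists it, False otherwise.
    The labels are noisy ("silver"), because co-occurrence does not guarantee
    that the sentence actually expresses the relation.
    """
    labeled = []
    for sentence in sentences:
        for phenotype, gene in product(sentence["phenotypes"], sentence["genes"]):
            label = (phenotype, gene) in known_relations
            labeled.append((sentence["text"], phenotype, gene, label))
    return labeled


labeled = label_candidates(SENTENCES, KNOWN_RELATIONS)
for row in labeled:
    print(row)
```

The resulting labeled pairs could then serve as training instances for a relation classifier; how the thesis groups them for multi-instance learning is not specified in the abstract.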

    BiOnt: Deep Learning using Multiple Biomedical Ontologies for Relation Extraction

    Successful biomedical relation extraction can provide evidence to researchers and clinicians about possible unknown associations between biomedical entities, advancing current knowledge about those entities and their inherent mechanisms. Most biomedical relation extraction systems do not resort to external sources of knowledge, such as domain-specific ontologies. However, using deep learning methods along with biomedical ontologies has recently been shown to effectively advance the biomedical relation extraction field. To perform relation extraction, our deep learning system, BiOnt, employs four types of biomedical ontologies, namely the Gene Ontology, the Human Phenotype Ontology, the Human Disease Ontology, and the Chemical Entities of Biological Interest, regarding gene products, phenotypes, diseases, and chemical compounds, respectively. We tested our system with three data sets that represent three different types of relations between biomedical entities. BiOnt achieved, in F-score, an improvement of 4.93 percentage points for drug-drug interactions (DDI corpus), 4.99 percentage points for phenotype-gene relations (PGR corpus), and 2.21 percentage points for chemical-induced disease relations (BC5CDR corpus), relative to the state of the art. The code supporting this system is available at https://github.com/lasigeBioTM/BiOnt. Comment: ECIR 202

    Using Neural Networks for Relation Extraction from Biomedical Literature

    Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been shown to enhance previous state-of-the-art results. Comment: Artificial Neural Networks book (Springer) - Chapter 1
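
As a rough illustration of the "ancestry information" an ontology can provide about an entity, the sketch below walks a toy is-a hierarchy from a term up to its root. The term names and the single-parent assumption are hypothetical simplifications; real biomedical ontologies are directed acyclic graphs in which a term may have several parents, and the chapter's actual data representation is not described in the abstract.

```python
# Hypothetical toy ontology: each term maps to its single parent (None = root).
TOY_IS_A = {
    "term_d": "term_b",
    "term_b": "term_a",
    "term_c": "term_a",
    "term_a": None,
}


def ancestor_sequence(term, is_a):
    """Return the term followed by its ancestors, ordered from term to root."""
    sequence = [term]
    while is_a[term] is not None:
        term = is_a[term]
        sequence.append(term)
    return sequence


def common_ancestors(term_1, term_2, is_a):
    """Return the ancestors shared by two terms, nearest first (relative to term_1)."""
    ancestors_2 = set(ancestor_sequence(term_2, is_a))
    return [t for t in ancestor_sequence(term_1, is_a) if t in ancestors_2]


print(ancestor_sequence("term_d", TOY_IS_A))
print(common_ancestors("term_d", "term_c", TOY_IS_A))
```

A sequence like this could be fed to a model as one input channel alongside the text itself, which is the general idea behind combining ontologies with deep learning.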

    Extracting Negative Biomedical Relations from Literature

    Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2021. The prevalent source for obtaining scientific knowledge remains the scientific literature. Considering that the focus of biomedical research has shifted from individual entities to whole biological systems, understanding the relations between those entities has become paramount for generating knowledge. Relations between entities can either be positive, if there is evidence of an association, or negative, if there is evidence of no association. To date, most relation extraction systems focus on extracting positive relations; therefore, few knowledge bases contain negative relations. Disregarding negative relations leads to the loss of valuable information that could be used to advance biomedical research. This work presents the Negative Phenotype-Disease Relations (NPDR) dataset, which describes a subset of negative disease-phenotype relations from a gold-standard knowledge base made available by the Human Phenotype Ontology (HPO), and an automatic extraction system developed to annotate the entities and extract the relations from the NPDR dataset. The NPDR dataset was constructed by analysing 177 medical documents and consists of 347 relations manually annotated at the document level, of which 222 were inferred from the HPO gold-standard knowledge base and 125 were newly annotated relations. The main categories of the dataset are the characterization of the entities that participate in the negative relation; the characterization of the sentence that implies the negative relation; and the characterization of the location of the entities and sentences in the article. The automatic extraction system was created to evaluate the impact of the NPDR dataset on the Named-Entity Recognition (NER), Named-Entity Linking (NEL) and Relation Extraction (RE) text mining tasks.
The NER task showed an average of 20.77% more entities annotated when using disease and phenotype synonym lexica generated from the NPDR dataset, compared with the number of annotations produced by the OMIM and HPO lexica. The increase in annotated entities also resulted in 15.11% more relations extracted. The RE task performed poorly, with the highest accuracy being 8.84%.
Extended abstract (translated from Portuguese): Free text remains, to this day, the main means of producing and sharing knowledge. More specifically, the biomedical literature is the main source of clinical and biological knowledge for researchers and clinicians. However, as the information contained in free text, reflected in the number of published scientific articles, grows at an exponential rate, it becomes difficult for researchers to keep up with developments across the various scientific domains. Moreover, extracting relevant textual information is laborious and time-consuming for humans, since most of the information is locked in unstructured free text. Although this task may produce errors when performed by computers, it can only be achieved through automatic processes. Text mining methods are therefore an attractive alternative for reducing the time experts spend obtaining relevant information, while also covering a large volume of data from the biomedical literature. Text mining methods include several tasks, such as Named-Entity Recognition (NER), Named-Entity Linking (NEL) and Relation Extraction (RE). NER identifies the entities mentioned in the text, NEL maps the recognized entities to entries in a database, and RE identifies relations between the recognized entities.
Since the focus of biomedical research has shifted from individual entities, such as genes, proteins or drugs, to biological systems as a whole, automatic RE methods have become fundamental for understanding relations between entities, such as protein-protein interactions, drug-drug interactions, or gene-disease relations. These relations can be classified as negative, if there is evidence of no association between the entities, or positive, if there is evidence of an association. RE can be performed through multiple approaches that differ in the methods they employ. These approaches can be divided into the following groups: co-occurrence, the simplest approach, since it only aims to identify entities in the same sentence; rule-based, with rules defined manually or automatically; and machine learning, which uses annotated biomedical corpora to apply distant supervision. Distant supervision methods can be further categorized as feature-based and kernel-based. To date, most RE systems do not differentiate between positive, negative or false relations, although some exceptions stand out, such as the Excerbt and BeFree systems. The former combines syntactic and semantic analyses with rule-based and machine learning approaches, and was adapted to detect negated lexical representations of lexical items (such as verbs, nouns or adjectives) for the annotation of the Negatome, a database of proteins that do not interact with each other. The latter uses a combination of kernel-based methods, namely the Shallow Linguistic Kernel and the Dependency Kernel. For the annotation of the GAD corpus using this system, a classifier was also trained to distinguish between positive, negative and false relations between genes and diseases. It is estimated that 13.5% of sentences in biomedical literature abstracts contain negated expressions.
Disregarding expressions that could potentially contain negative relations can lead to the loss of valuable information. However, most biomedical relation extraction databases aim only to collect positive relations between biomedical entities, even though negative and positive examples are equally important for training, tuning and evaluating relation extraction systems. Since negative examples are not as well documented as positive ones, few databases contain them. Moreover, most biomedical relation extraction databases do not differentiate between false relations, in which two entities are not related, and negative relations, in which there is a statement of non-association between two entities. Additionally, some silver standard datasets (composed of automatically generated data) also contain false negative relations that are unknown or undocumented. Hence, exploring these relations is a good starting point for expanding biomedical relation databases and populating them with correct negative examples. This work produced a dataset of human phenotype and disease annotations and their negative relations, the Negative Phenotype-Disease Relations (NPDR) dataset, and a module for the automatic annotation of entities and relations. For the first stage of creating the NPDR dataset, it was necessary to collect the PubMed identifiers (PMIDs) associated with the negative relations described in a gold-standard database made available by the Human Phenotype Ontology (HPO). From these PMIDs it was possible to retrieve full-text articles, which were subsequently analysed manually.
This analysis consisted of the description of the entities that participate in the negative relation, comprising the analysis of the phenotypes, diseases and their associated genes; the description of the sentences that suggest the negative relation, encompassing the characterization of the negation token used in the sentence and the co-occurrence of the entities; and the description of the location of the entities and sentences in the article. The NPDR dataset contains a total of 347 relations annotated at the document level, of which 222 were obtained from the HPO gold-standard database and 125 are new relations. To evaluate the impact of the NPDR dataset on the automatic annotation and extraction of entities and their relations, a pipeline that performs NER and RE and extracts negation sentences was implemented, using the articles gathered for the creation of the dataset. NER recognizes human phenotypes and diseases, and RE extracts and classifies the relation between the entities. To obtain the articles in a machine-readable format, two methods were employed. The first method consisted of gathering the PMIDs from the NPDR dataset and converting them to their corresponding PubMed Central identifiers (PMCIDs), in order to retrieve the full-text articles using the PubMed API. The second method consisted of converting the articles gathered for the construction of the NPDR dataset from PDF to text format, using the PDFMiner text extraction tool. The NER stage was carried out using the Minimal Named-Entity Recognizer (MER) tool to extract mentions of phenotypes, diseases and genes from the articles. Finally, using a distant supervision approach, the HPO gold-standard database was used to obtain the relations given by the occurrence of phenotypes in the sentences that suggest the negative relation and the occurrence of related diseases and genes present in the article. Relations were marked as Known if the relation was described in the database, or Unknown otherwise.
For phenotype annotation, two lexica were used: one of official HPO terms and another of synonyms obtained from the NPDR dataset. For disease and gene annotation, the main lexicon was obtained from the Online Mendelian Inheritance in Man (OMIM) database, and the remaining lexica were built from disease synonyms and abbreviations present in the NPDR dataset. Adding the lexica derived from the NPDR dataset made it possible to annotate, on average, 20.77% more entities compared with annotation using the HPO and OMIM lexica. This larger number of entities was also reflected in a 15.11% increase in annotated relations. The RE task performed poorly, with a precision of 8.84% for the detected negative relations.

    Development of a text mining approach to disease network discovery

    Scientific literature is one of the major sources of knowledge for systems biology, in the form of papers, patents and other types of written reports. Text mining methods aim to automatically extract relevant information from the literature. The hypothesis of this thesis was that biological systems could be elucidated by developing text mining solutions that automatically extract relevant information from documents. The first objective was to develop software components to recognize biomedical entities in text, which is the first step in generating a network about a biological system. To this end, a machine learning solution was developed, which can be trained for specific biological entities using an annotated dataset, obtaining high-quality results. Additionally, a rule-based solution was developed, which can be easily adapted to various types of entities. The second objective was to develop an automatic approach to link the recognized entities to a reference knowledge base. A solution based on the PageRank algorithm was developed to match the entities to the concepts that contribute most to the overall coherence. The third objective was to automatically extract relations between entities, to generate knowledge graphs about biological systems. Due to the lack of annotated datasets available for this task, distant supervision was employed to train a relation classifier on a corpus of documents and a knowledge base. The applicability of this approach was demonstrated in two case studies: microRNA-gene relations for cystic fibrosis, obtaining a network of 27 relations using the abstracts of 51 recently published papers; and cell-cytokine relations for tolerogenic cell therapies, obtaining a network of 647 relations from 3264 abstracts. Through a manual evaluation, the information contained in these networks was determined to be relevant.
Additionally, a solution combining deep learning techniques with ontology information was developed, to take advantage of the domain knowledge provided by ontologies. This thesis contributed several solutions that demonstrate the usefulness of text mining methods to systems biology by extracting domain-specific information from the literature. These solutions make it easier to integrate various areas of research, leading to a better understanding of biological systems.

    Improving approximation of domain-focused, corpus-based, lexical semantic relatedness

    Semantic relatedness is a measure that quantifies the strength of a semantic link between two concepts. Often, it can be efficiently approximated with methods that operate on the words that represent these concepts. Approximating semantic relatedness between texts, and between the concepts these texts represent, is an important part of many text and knowledge processing tasks that are crucial in domain-specific scenarios. The problem with most state-of-the-art methods for calculating domain-specific semantic relatedness is their dependence on highly specialized, structured knowledge resources, which makes these methods poorly adaptable to many usage scenarios. On the other hand, domain knowledge in fields such as the Life Sciences has become more and more accessible, but mostly in unstructured form, as texts in large document collections, which makes it more challenging to use in automated processing. In this dissertation, three new corpus-based methods for approximating domain-specific textual semantic relatedness are presented and evaluated with a set of standard benchmarks focused on the field of biomedicine. Nonetheless, the proposed measures are general enough to be adapted to other domain-focused scenarios. The evaluation involves comparisons with other relevant state-of-the-art measures for calculating semantic relatedness, and the results suggest that the methods presented here perform comparably to or better than other approaches. Additionally, the dissertation presents an experiment in which one of the proposed methods is applied within an ontology matching system, DisMatch. The performance of the system was evaluated externally on the biomedically themed 'Phenotype' track of the Ontology Alignment Evaluation Initiative 2016 campaign.
The results of the track indicate that the use of distributional semantic relatedness for ontology matching is promising, as the system presented in this thesis stood out in detecting correct mappings that were not detected by any other system participating in the track. The work presented in the dissertation indicates an improvement over the state of the art through the domain-adapted use of the distributional principle (i.e. the presented methods are corpus-based and do not require additional resources). The ontology matching experiment showcases practical implications of the presented theoretical body of work.

    Automated extraction of genes associated with antibiotic resistance from the biomedical literature

    The detection of bacterial antibiotic resistance phenotypes is important when making clinical decisions about patient treatment. Conventional phenotypic testing involves culturing bacteria, which requires a significant amount of time and work. Whole-genome sequencing is emerging as a fast alternative for resistance prediction, by considering the presence or absence of certain genes. Much research has focused on determining which bacterial genes cause antibiotic resistance, and efforts are being made to consolidate these facts in knowledge bases (KBs). KBs are usually manually curated by domain experts to ensure the highest quality. However, this limits the pace at which new facts are added. Automated extraction of gene-antibiotic resistance relations from the biomedical literature is one solution that can simplify the curation process. This paper reports on the development of a text mining pipeline that takes in English biomedical abstracts and outputs genes that are predicted to cause resistance to antibiotics. To test the generalisability of this pipeline, it was then applied to predict genes associated with Helicobacter pylori antibiotic resistance that are not present in common antibiotic resistance KBs or in publications studying H. pylori. These genes would be candidates for further lab-based antibiotic research and inclusion in these KBs. For relation extraction, state-of-the-art deep learning models were used. These models were trained on a newly developed silver corpus, which was generated by distant supervision of abstracts using the facts obtained from KBs. The top performing model was superior to a co-occurrence model, achieving a recall of 95%, a precision of 60% and an F1-score of 74% on a manually annotated holdout dataset. To our knowledge, this project was the first attempt at developing a complete text mining pipeline that incorporates deep learning models to extract gene-antibiotic resistance relations from the literature.
Additional related data can be found at https://github.com/AndreBrincat/Gene-Antibiotic-Resistance-Relation-Extractio
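
As a quick consistency check on the figures reported in this abstract, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the stated precision (60%) and recall (95%); the `f1_score` helper is written here for illustration and is not taken from the authors' code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: precision 60%, recall 95%.
f1 = f1_score(0.60, 0.95)
print(f"F1 = {f1:.2%}")  # rounds to the reported 74%
```

The result, about 0.7355, agrees with the reported 74% after rounding.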

    Shared genetic variance between obesity and white matter integrity in Mexican Americans.

    Peer reviewed. Obesity is a chronic metabolic disorder that may also lead to reduced white matter integrity, potentially due to shared genetic risk factors. Genetic correlation analyses were conducted in a large cohort of Mexican American families in San Antonio (N = 761, 58% females, ages 18-81 years; 41.3 ± 14.5) from the Genetics of Brain Structure and Function Study. Shared genetic variance was calculated between measures of adiposity [body mass index (BMI; kg/m²) and waist circumference (WC; in)] and whole-brain and regional measurements of cerebral white matter integrity (fractional anisotropy, FA). Whole-brain average and regional FA values for 10 major white matter tracts were calculated from high angular resolution diffusion tensor imaging data (DTI; 1.7 x 1.7 x 3 mm; 55 directions). Additive genetic factors explained intersubject variance in BMI (heritability, h² = 0.58), WC (h² = 0.57), and FA (h² = 0.49). FA shared significant portions of genetic variance with BMI in the genu (rhoG = -0.25), body (rhoG = -0.30), and splenium (rhoG = -0.26) of the corpus callosum, internal capsule (rhoG = -0.29), and thalamic radiation (rhoG = -0.31) (all p's = 0.043). The strongest evidence of shared variance was between BMI/WC and FA in the superior fronto-occipital fasciculus (rhoG = -0.39, p = 0.020; rhoG = -0.39, p = 0.030), which highlights region-specific variation in neural correlates of obesity. This may suggest that increased obesity and reduced white matter integrity share common genetic risk factors.