279 research outputs found

    NetiNeti : Discovery of Scientific Names from Text Using Machine Learning Methods Figure 1

    Figure 1 shows a series of training experiments with the Naïve Bayes classifier using different neighborhoods for contextual features and different numbers of positive and negative training examples; the resulting classifiers were evaluated against our annotated gold standard corpus. The data sets are the results of running NetiNeti on a subset of 136 PubMedCentral tagged open access articles with no stop list. A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org. We present the comparison results of various machine learning algorithms on our annotated corpus. Naïve Bayes and Maximum Entropy with Generalized Iterative Scaling (GIS) parameter estimation are the top two performing algorithms.

    Extracting biomedical relations from biomedical literature

    Master's thesis in Bioinformatics and Computational Biology, presented to the Universidade de Lisboa, through the Faculdade de Ciências, in 2018. Science, and the biomedical field especially, is witnessing a growth in knowledge at a rate that clinicians and researchers struggle to keep up with. Scientific evidence spread across multiple types of publications, the richness of mentions of etiology, molecular mechanisms, anatomical sites and other biomedical terminology that is not uniform across different writings, among other constraints, have encouraged the application of text mining methods to the systematic reviewing process. This work aims to test the positive impact that text mining tools, together with controlled vocabularies (as a way of organizing knowledge to aid later information gathering), have on the systematic reviewing process, through a system capable of creating a classification model whose training is based on a controlled vocabulary (MeSH) and which can be applied to a variety of biomedical literature. For that purpose, this project was divided into two distinct tasks: the creation of a system, consisting of a tool that searches PubMed for scientific articles and saves them according to pre-defined labels, and another tool that classifies a set of articles; and the analysis of the results obtained by the created system when applied to two different practical cases. The system was evaluated through a series of tests, using datasets whose classification was previously known, allowing the confirmation of the obtained results. Afterwards, the system was tested using two independently created datasets that were manually curated by researchers working in the field of study. This last form of evaluation achieved, for example, precision scores ranging from 68% to 81%. The results obtained emphasize the use of text mining tools, along with controlled vocabularies such as MeSH, as a way to create more complex and comprehensive queries that improve performance on classification problems such as those addressed in this work.

    Concept Based Knowledge Discovery from Biomedical Literature

    Philosophiae Doctor - PhD. This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. The experimental results obtained using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. This thesis shows how the technology presented can be integrated with the researcher's own knowledge, experimentation and observations for optimal progression of scientific research. South Afric

    Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications

    The size of the existing academic literature corpus and the incredible rate of new publications create both a great need and an opportunity to harness computational approaches to data and knowledge extraction across all research fields. Elements of this challenge can be met by developments in automation for retrieval of electronic documents, document classification and knowledge extraction. In this thesis, I detail studies of these processes in three related chapters. Although the focus of each chapter is distinct, they contribute to my aim of developing a generalisable pipeline for clinical applications of Natural Language Processing in the academic literature. In chapter one, I describe the development of “Cadmus”, an open-source system developed in Python to generate corpora of biomedical text from the published literature. Cadmus comprises three main steps: search query and metadata collection, document retrieval, and parsing of the retrieved text. I present an example of full-text retrieval for a corpus of over two hundred thousand articles using a gene-based search query, with quality control metrics for this retrieval process and a high-level illustration of the utility of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate was 85.2% with institutional subscription access and 54.4% without. Chapter two details the development of a custom-built Naïve Bayes supervised machine learning document classifier. This binary classifier is based on calculating the relative enrichment of biomedical terms between two classes of documents in a training set. The classifier is trained and tested on a manually classified set of over 8000 abstracts and full-text articles to identify articles containing human phenotype descriptions. 10-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision of 0.76, F1 score of 0.82 and accuracy of 90%.
Chapter three illustrates the clinical applications of automated retrieval, processing, and classification by considering the published literature on paediatric COVID-19. Case reports and similar articles were classified into “severe” and “non-severe” classes, and term enrichment was evaluated to find biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series analysis was employed to illustrate emerging disease entities such as the Multisystem Inflammatory Syndrome in Children (MIS-C) and to consider unrecognised trends through literature-based discovery.
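
The cross-validation figures quoted above (recall, specificity, precision, F1, accuracy) all derive from the same four confusion-matrix counts. The following is a minimal sketch of those standard definitions; the function name `binary_metrics` and the counts used are hypothetical illustrations and are not taken from the thesis.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)        # share of predicted positives that are correct
    recall = ratio(tp, tp + fn)           # share of actual positives that are found
    specificity = ratio(tn, tn + fp)      # share of actual negatives that are rejected
    f1 = ratio(2 * precision * recall, precision + recall)  # harmonic mean of P and R
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which makes it a convenient sanity check on any reported set of scores.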

    Complex Network Analysis for Scientific Collaboration Prediction and Biological Hypothesis Generation

    With the rapid development of digitalized literature, more and more knowledge has been discovered by computational approaches. This thesis addresses the problem of link prediction in co-authorship networks and protein-protein interaction networks derived from the literature. These networks (like most other types of networks) grow over time, and we assume that a machine can learn from past link creations by examining the state of the network at the time of their creation. Our goal is to create a computationally efficient approach to recommend new links for a node in a network (e.g., new collaborations in co-authorship networks and new interactions in protein-protein interaction networks). We treat edges in a network that satisfy certain criteria as training instances for the machine learning algorithms. We analyze the neighborhood structure of each node and derive topological features. Furthermore, each node carries rich semantic information when linked to the literature, which can be used to derive semantic features. Using both types of features, we train machine learning models to predict the probability of connection for new node pairs. We apply our idea of link prediction to two distinct networks: a co-authorship network and a protein-protein interaction network. We demonstrate that the novel features we derive from both the network topology and literature content help improve link prediction accuracy. We also analyze the factors involved in establishing a new link and recurrent connections.
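
To make "topological features derived from the neighborhood structure" concrete, here is a minimal sketch of two widely used neighborhood scores for a candidate node pair: the common-neighbor count and the Jaccard coefficient. These are standard link-prediction features used here as an assumption; the abstract does not say which features the thesis actually uses, and the toy graph and function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of u or v (0.0 if neither has any)."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Toy co-authorship graph (hypothetical): A and D have never collaborated.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
adj = build_adjacency(edges)

print(common_neighbors(adj, "A", "D"))
print(jaccard(adj, "A", "D"))
```

Scores like these can be computed for every unconnected pair and fed, alongside semantic features, into a classifier that estimates the probability of a future link.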

    Mining the Medical and Patent Literature to Support Healthcare and Pharmacovigilance

    Recent advancements in healthcare practices and the increasing use of information technology in the medical domain have led to the rapid generation of free-text data in the form of scientific articles, e-health records, patents, and document inventories. This has spurred the development of sophisticated information retrieval and information extraction technologies. A fundamental requirement for the automatic processing of biomedical text is the identification of information-carrying units such as concepts or named entities. In this context, this work focuses on the identification of medical disorders (such as diseases and adverse effects), which constitute an important category of concepts in medical text. Two methodologies were investigated in this regard: dictionary-based and machine learning-based approaches. Furthermore, the capabilities of the concept recognition techniques were systematically exploited to build a semantic search platform for the retrieval of e-health records and patents. The system facilitates conventional text search as well as semantic and ontological searches. Performance of the adapted retrieval platform for e-health records and patents was evaluated within open assessment challenges (i.e. TRECMED and TRECCHEM respectively), wherein the system was best rated in comparison to several other competing information retrieval platforms. Finally, from the medico-pharma perspective, a strategy for the identification of adverse drug events from medical case reports was developed. Qualitative evaluation as well as an expert validation of the developed system's performance showed robust results. In conclusion, this thesis presents approaches for efficient information retrieval and information extraction from various biomedical literature sources in support of healthcare and pharmacovigilance. The applied strategies have the potential to enhance the literature searches performed by biomedical, healthcare, and patent professionals. This can promote literature-based knowledge discovery, improve the safety and effectiveness of medical practices, and drive research and development in the medical and healthcare arena.

    Using natural language processing techniques to inform research on nanotechnology

    Literature in the field of nanotechnology is exponentially increasing with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With the deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics