
    Biomedical Information Extraction: Mining Disease Associated Genes from Literature

    Disease associated gene discovery is a critical step to realize the future of personalized medicine. However, empirical and clinical validation of disease associated genes is time consuming and expensive. In silico discovery of disease associated genes from literature is therefore becoming the first essential step for biomarker discovery to support hypothesis formulation and decision making. Completion of the Human Genome Project and the advent of high-throughput technology have produced a tremendous amount of data, which has resulted in exponential growth of the biomedical knowledge deposited in literature databases. The sheer quantity of unexplored information causes information overload for biomedical researchers and poses a big challenge for informatics researchers seeking to address users' information extraction needs. This thesis focused on mining disease associated genes from the PubMed literature database using machine learning and graph theory based information extraction (IE) methods. Mining disease associated genes is not trivial and requires pipelines of information extraction steps and methods. Beginning with named entity recognition (NER), the author introduced semantic concept type into the feature space for conditional random fields machine learning and demonstrated the effectiveness of the concept feature for disease NER. The effects of domain specific POS tagging, domain specific dictionaries, and named entity encoding scheme on NER performance were also explored. Experimental results show that combining a knowledge base with the concept feature space can significantly improve overall disease NER performance. The thesis also shows that shallow linguistic features of global and local word sequence context can be used with a string kernel based support vector machine (SVM) for efficient disease-gene relation extraction.
Lastly, the disease-associated gene network was constructed from a concept co-occurrence matrix computed from a disease focused document collection and subjected to systematic topology analysis. The gene network was then merged with a seed-gene expanded network to form a heterogeneous disease-gene network. The author identified and prioritized disease-associated genes by graph centrality measurements. This novel approach provides a new means of disease associated gene extraction from large corpora. Ph.D., Information Studies -- Drexel University, 201
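The network step described in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy documents, the function names, and the choice of degree centrality as the ranking measure are all assumptions made for illustration, since the abstract does not say which centrality measurements were used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of concepts shares a document."""
    counts = Counter()
    for concepts in documents:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def degree_centrality(counts):
    """Normalized degree: fraction of the other nodes each node is linked to."""
    neighbors = {}
    for a, b in counts:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical toy corpus: each document is the list of concepts found in it.
documents = [
    ["BRCA1", "TP53", "breast cancer"],
    ["BRCA1", "breast cancer"],
    ["TP53", "EGFR"],
]

counts = cooccurrence_counts(documents)
centrality = degree_centrality(counts)
# Rank nodes by descending centrality, breaking ties alphabetically.
ranked = sorted(centrality, key=lambda node: (-centrality[node], node))
```

In this toy corpus the co-occurrence counts double as edge weights, and `ranked` lists the concepts from most to least connected. A real pipeline would restrict the ranking to gene nodes and would likely compare several centrality measures.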

    Biomedical Term Extraction: NLP Techniques in Computational Medicine

    Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields. Medicine has gathered a lot of attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task from technical texts is performed via an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied for the identification of key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case, Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, with several tools for optimal exploitation of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype

    Noisy multi-label semi-supervised dimensionality reduction

    Noisy labeled data are a rich source of information and are often easily accessible and cheap to obtain, but label noise can also have many negative consequences if not accounted for. How to fully utilize noisy labels has been studied extensively within the framework of standard supervised machine learning over a period of several decades. However, very little research has been conducted on solving the challenge posed by noisy labels in non-standard settings. This includes situations where only a fraction of the samples are labeled (semi-supervised) and each high-dimensional sample is associated with multiple labels. In this work, we present a novel semi-supervised and multi-label dimensionality reduction method that effectively utilizes information from both noisy multi-labels and unlabeled data. With the proposed Noisy multi-label semi-supervised dimensionality reduction (NMLSDR) method, the noisy multi-labels are denoised and unlabeled data are labeled simultaneously via a specially designed label propagation algorithm. NMLSDR then learns a projection matrix for reducing the dimensionality by maximizing the dependence between the enlarged and denoised multi-label space and the features in the projected space. Extensive experiments on synthetic data, benchmark datasets, as well as a real-world case study, demonstrate the effectiveness of the proposed algorithm and show that it outperforms state-of-the-art multi-label feature extraction algorithms. Comment: 38 page

    Deep generative modelling of the imaged human brain

    Human-machine symbiosis is a very promising opportunity for the field of neurology given that the interpretation of the imaged human brain is a trivial feat for neither entity. However, before machine learning systems can be used in real world clinical situations, many issues with automated analysis must first be solved. In this thesis I aim to address what I consider the three biggest hurdles to the adoption of automated machine learning interpretative systems. For each issue, I will first explain its importance to the reader given the overarching narratives of both neurology and machine learning, and then showcase my proposed solutions to these issues through the use of deep generative models of the imaged human brain. First, I start by addressing what is an uncontroversial and universal sign of intelligence: the ability to extrapolate knowledge to unseen cases. Human neuroradiologists have studied the anatomy of the healthy brain and can therefore, with some success, identify most pathologies present on an imaged brain, even without having ever been previously exposed to them. Current discriminative machine learning systems require vast amounts of labelled data in order to accurately identify diseases. In this first part I provide a generative framework that permits machine learning models to more efficiently leverage unlabelled data for better diagnoses with few or no labels. Secondly, I address a major ethical concern in medicine: equitable evaluation of all patients, regardless of demographics or other identifying characteristics. This is, unfortunately, something that even human practitioners fail at, making the matter ever more pressing: unaddressed biases in data will become biases in the models. To address this concern I suggest a framework through which a generative model synthesises demographically counterfactual brain imaging to successfully reduce the proliferation of demographic biases in discriminative models.
Finally, I tackle the challenge of spatial anatomical inference, a task at the centre of the field of lesion-deficit mapping, which, given brain lesions and associated cognitive deficits, attempts to discover the true functional anatomy of the brain. I provide a new Bayesian generative framework and implementation that allows for greatly improved results on this challenge, hopefully paving part of the road towards a greater and more complete understanding of the human brain

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research
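The bag formulation described above can be made concrete with a minimal sketch. It assumes the so-called standard MIL assumption (a bag is positive when at least one of its instances is positive), which the abstract does not state and which is only one of several assumptions used in the MIL literature. The instance scores, the threshold, and the function name are hypothetical.

```python
def bag_label(instance_scores, threshold=0.5):
    """Label a bag under the standard MIL assumption.

    The bag is positive (1) if its highest-scoring instance reaches the
    threshold, otherwise negative (0). Assumes a non-empty bag.
    """
    return int(max(instance_scores) >= threshold)


# Hypothetical bags: each maps to the scores an instance-level model
# assigned to the instances it contains.
bags = {
    "bag_a": [0.1, 0.2, 0.9],  # one strongly positive instance
    "bag_b": [0.1, 0.3, 0.4],  # no instance reaches the threshold
}

predictions = {name: bag_label(scores) for name, scores in bags.items()}
```

Max pooling over instance scores is the simplest way to go from instance-level predictions to a bag label. Problems where the bag label depends on several instances jointly need a different aggregation rule.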