73 research outputs found
Biomedical Information Extraction: Mining Disease Associated Genes from Literature
Discovering disease-associated genes is a critical step toward personalized medicine. However, empirical and clinical validation of disease-associated genes is time-consuming and expensive. In silico discovery of disease-associated genes from the literature is therefore becoming the essential first step in biomarker discovery, supporting hypothesis formulation and decision making. The completion of the Human Genome Project and the advent of high-throughput technology have produced a tremendous amount of data, resulting in exponential growth of the biomedical knowledge deposited in literature databases. The sheer quantity of unexplored information causes information overload for biomedical researchers and poses a major challenge for informatics researchers seeking to address users' information extraction needs. This thesis focused on mining disease-associated genes from the PubMed literature database using information extraction (IE) methods based on machine learning and graph theory. Mining disease-associated genes is not trivial and requires a pipeline of information extraction steps and methods. Beginning with named entity recognition (NER), the author introduced semantic concept types into the feature space for conditional random field (CRF) learning and demonstrated the effectiveness of the concept feature for disease NER. The effects of domain-specific POS tagging, domain-specific dictionaries, and the named entity encoding scheme on NER performance were also explored. Experimental results show that combining a knowledge base with the concept feature space significantly improves overall disease NER performance. The results also show that shallow linguistic features of global and local word-sequence context can be used with a string-kernel-based support vector machine (SVM) for efficient disease-gene relation extraction.
Lastly, the disease-associated gene network was constructed using a concept co-occurrence matrix computed from a disease-focused document collection and subjected to systematic topology analysis. The gene network was then merged with a seed-gene expanded network to form a heterogeneous disease-gene network. The author identified and prioritized disease-associated genes using graph centrality measures. This novel approach provides a new means of extracting disease-associated genes from large corpora.
Ph.D., Information Studies -- Drexel University, 201
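The network-construction step described in this abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a list of recognized concepts, and it uses weighted degree as the centrality measure; the abstract does not say which centrality measures the thesis used, so that choice, the function names, and the placeholder concept names below are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of concepts shares a document."""
    counts = defaultdict(int)
    for concepts in documents:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return dict(counts)


def weighted_degree(counts):
    """Sum the co-occurrence weights on every edge touching each node."""
    degree = defaultdict(int)
    for (a, b), weight in counts.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


def rank_nodes(degree):
    """Order nodes by decreasing centrality, breaking ties by name."""
    return sorted(degree, key=lambda node: (-degree[node], node))


# Hypothetical toy corpus with placeholder concept names (not real data).
toy_documents = [
    ["GENE_A", "GENE_B", "DISEASE_X"],
    ["GENE_A", "DISEASE_X"],
    ["GENE_B", "GENE_C"],
]
toy_counts = cooccurrence_counts(toy_documents)
toy_degree = weighted_degree(toy_counts)
toy_ranking = rank_nodes(toy_degree)
```

In this toy corpus, GENE_C co-occurs with only one other concept and therefore ranks last. A real pipeline would first need the NER and relation extraction steps to produce the concept lists.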
Biomedical Term Extraction: NLP Techniques in Computational Medicine
Artificial Intelligence (AI), and in particular its branch Natural Language Processing (NLP), is a main contributor to recent advances in classifying documentation and extracting information in assorted fields. Medicine is one field that has attracted a lot of attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. Information extraction from technical texts is typically performed by an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, along with several tools for optimal exploitation of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype
Noisy multi-label semi-supervised dimensionality reduction
Noisy labeled data represent a rich source of information that is often
easily accessible and cheap to obtain, but label noise might also have many
negative consequences if not accounted for. How to fully utilize noisy labels
has been studied extensively within the framework of standard supervised
machine learning over a period of several decades. However, very little
research has been conducted on solving the challenge posed by noisy labels in
non-standard settings. This includes situations where only a fraction of the
samples are labeled (semi-supervised) and each high-dimensional sample is
associated with multiple labels. In this work, we present a novel
semi-supervised and multi-label dimensionality reduction method that
effectively utilizes information from both noisy multi-labels and unlabeled
data. With the proposed Noisy multi-label semi-supervised dimensionality
reduction (NMLSDR) method, the noisy multi-labels are denoised and unlabeled
data are labeled simultaneously via a specially designed label propagation
algorithm. NMLSDR then learns a projection matrix for reducing the
dimensionality by maximizing the dependence between the enlarged and denoised
multi-label space and the features in the projected space. Extensive
experiments on synthetic data, benchmark datasets, as well as a real-world case
study, demonstrate the effectiveness of the proposed algorithm and show that it
outperforms state-of-the-art multi-label feature extraction algorithms.
Comment: 38 page
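For readers unfamiliar with label propagation, the following is a minimal sketch of a generic propagation scheme on a similarity graph. It is not the specially designed NMLSDR algorithm, whose details are not given in this abstract; the update rule, the parameter `alpha`, and the toy graph are assumptions made for illustration only.

```python
def propagate_labels(weights, initial, alpha=0.8, iterations=50):
    """Generic label propagation: F <- alpha * P F + (1 - alpha) * Y.

    weights: n x n similarity matrix as a list of lists.
    initial: n x k label scores Y; all-zero rows mark unlabeled samples.
    P is the row-normalized similarity matrix.
    """
    n = len(weights)
    k = len(initial[0])
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total if total else 0.0 for w in row])
    scores = [list(row) for row in initial]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            new_row = []
            for c in range(k):
                # Average the neighbours' current scores for label c.
                neighbour = sum(transition[i][j] * scores[j][c] for j in range(n))
                new_row.append(alpha * neighbour + (1 - alpha) * initial[i][c])
            updated.append(new_row)
        scores = updated
    return scores


# Hypothetical toy graph: two tight pairs (0-1 and 2-3) joined by a weak link.
toy_weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Samples 0 and 3 carry one label each; samples 1 and 2 are unlabeled.
toy_initial = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
toy_scores = propagate_labels(toy_weights, toy_initial)
```

In this toy graph each unlabeled sample ends up with a higher score for the label of its strongly connected neighbour than for the other label.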
Deep generative modelling of the imaged human brain
Human-machine symbiosis is a very promising opportunity for the field of neurology given that the interpretation of the imaged human brain is a trivial feat
for neither entity. However, before machine learning systems can be used in
real world clinical situations, many issues with automated analysis must first be
solved. In this thesis I aim to address what I consider the three biggest hurdles
to the adoption of automated machine learning interpretative systems. For each
issue, I will first explain its importance to the reader, given the overarching
narratives of both neurology and machine learning, and then showcase my proposed solutions to these issues through the use of deep generative models of the
imaged human brain.
First, I start by addressing what is an uncontroversial and universal sign of intelligence: the ability to extrapolate knowledge to unseen cases. Human neuroradiologists have studied the anatomy of the healthy brain and can therefore,
with some success, identify most pathologies present on an imaged brain, even
without having ever been previously exposed to them. Current discriminative
machine learning systems require vast amounts of labelled data in order to accurately identify diseases. In this first part I provide a generative framework that
permits machine learning models to more efficiently leverage unlabelled data for
better diagnoses with either no labels or only a small number of them.
Secondly, I address a major ethical concern in medicine: equitable evaluation
of all patients, regardless of demographics or other identifying characteristics.
This is, unfortunately, something that even human practitioners fail at, making
the matter ever more pressing: unaddressed biases in data will become biases
in the models. To address this concern I suggest a framework through which
a generative model synthesises demographically counterfactual brain imaging
to successfully reduce the proliferation of demographic biases in discriminative
models.
Finally, I tackle the challenge of spatial anatomical inference, a task at the centre
of the field of lesion-deficit mapping, which given brain lesions and associated
cognitive deficits attempts to discover the true functional anatomy of the brain.
I provide a new Bayesian generative framework and implementation that allows
for greatly improved results on this challenge, hopefully paving part of the road
towards a greater and more complete understanding of the human brain.
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight into how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research.
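The bag formulation described in this abstract can be made concrete with a minimal sketch. It assumes the standard MIL assumption, under which a bag is positive when at least one of its instances is positive, so the bag score is the maximum instance score. The survey covers other assumptions as well; the function names, the threshold, and the scores below are hypothetical.

```python
def bag_score(instance_scores):
    """Score a bag as its most positive instance (standard MIL assumption)."""
    return max(instance_scores)


def predict_bag(instance_scores, threshold=0.5):
    """Label the bag positive if any instance score reaches the threshold."""
    return bag_score(instance_scores) >= threshold


# Hypothetical instance scores from some instance-level classifier.
toy_bags = {
    "bag_with_one_positive": [0.1, 0.2, 0.9],
    "bag_all_negative": [0.1, 0.3, 0.2],
}
toy_predictions = {name: predict_bag(scores) for name, scores in toy_bags.items()}
```

Only the bag label is needed for training under this formulation, which is why it suits weakly labeled data: no individual instance has to be annotated.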