Counterfactually Guided Off-policy Transfer in Clinical Settings
Domain shift creates significant challenges for sequential decision making in
healthcare since the target domain may be data-scarce and confounded. In this
paper, we propose a method for off-policy transfer by modeling the underlying
generative process with a causal mechanism. We use informative priors from the
source domain to augment counterfactual trajectories in the target in a
principled manner. We demonstrate how this addresses data-scarcity in the
presence of unobserved confounding. The causal parametrization of our sampling
procedure guarantees that counterfactual quantities can be estimated from
scarce observational target data, maintaining intuitive stability properties.
Policy learning in the target domain is further regularized via the source
policy through KL-divergence. Through evaluation on a simulated sepsis
treatment task, our counterfactual policy transfer procedure significantly
improves the performance of a learned treatment policy when assumptions of
"no-unobserved confounding" are relaxed.
Comment: 24 pages (including appendix), 18 figures
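The KL-divergence regularization of the target policy toward the source policy can be sketched as follows. This is an illustrative toy with made-up three-action policies and a hypothetical weight beta, not the paper's implementation:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_objective(expected_return, target_policy, source_policy, beta):
    """Target-domain objective: expected return minus a KL penalty that
    keeps the learned target policy close to the source policy."""
    return expected_return - beta * kl_divergence(target_policy, source_policy)

# Toy example: a 3-action target policy regularized toward the source policy.
# All numbers are invented for illustration.
target = [0.7, 0.2, 0.1]
source = [0.5, 0.3, 0.2]
obj = regularized_objective(10.0, target, source, beta=0.5)
```

In practice the KL term would be computed per state under the learned target policy, with beta trading off target-domain return against fidelity to the source policy.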
Comparing Attributional and Relational Similarity as a Means to Identify Clinically Relevant Drug-gene Relationships
In emerging domains, such as precision oncology, knowledge extracted from explicit assertions may be insufficient to identify relationships of interest. One solution to this problem involves drawing inference on the basis of similarity. Computational methods have been developed to estimate the semantic similarity and relatedness between terms and relationships that are distributed across corpora of literature such as Medline abstracts and other forms of human-readable text. Most research on distributional similarity has focused on the notion of attributional similarity, which estimates the similarity between entities based on the contexts in which they occur across a large corpus. A relatively under-researched area concerns relational similarity, in which the similarity between pairs of entities is estimated from the contexts in which these entity pairs occur together. While it seems intuitive that models capturing the structure of the relationships between entities might mediate the identification of biologically important relationships, there is to date no comparison of the relative utility of attributional and relational models for this purpose. In this research, I compare the performance of a range of relational and attributional similarity methods on the task of identifying drugs that may be therapeutically useful in the context of particular aberrant genes, as identified by a team of human experts. My hypothesis is that relational similarity will be of greater utility than attributional similarity as a means to identify biological relationships that may provide answers to clinical questions (such as “which drugs INHIBIT gene x?”) in the context of rapidly evolving domains.
My results show that models based on relational similarity outperformed models based on attributional similarity on this task. As the methods explained in this research can be applied to identify any sort of relationship for which cue pairs exist, my results suggest that relational similarity may be a suitable approach to apply to other biomedical problems. Furthermore, I found models based on neural word embeddings (NWE) to be particularly useful for this task, given their higher performance than Random Indexing-based models and the significantly lower computational effort needed to create them. NWE methods (such as those produced by the popular word2vec tool) are a relatively recent development in the domain of distributional semantics and are considered by many to be the state of the art in semantic language modeling. However, their application to identifying biologically important relationships from Medline in general, and in the domain of precision oncology specifically, has not been well studied.
The results of this research can guide the design and implementation of biomedical question answering and other relationship extraction applications for precision medicine, precision oncology, and other similar domains where there is rapid emergence of novel knowledge. The methods developed and evaluated in this project can help NLP applications provide more accurate results by leveraging corpus-based methods that are scalable and robust by design.
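The contrast between attributional and relational similarity can be illustrated with the common vector-offset view of relational similarity: two entity pairs are relationally similar when their embedding difference vectors point in similar directions. The vectors below are made-up toy values, not embeddings from this study:

```python
import math

# Toy word vectors (hypothetical values; in practice these would come
# from a model such as word2vec trained on Medline abstracts).
vec = {
    "imatinib": [0.9, 0.1, 0.3],
    "BCR-ABL":  [0.2, 0.8, 0.4],
    "drugX":    [0.8, 0.2, 0.3],
    "geneY":    [0.3, 0.7, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def relational_similarity(pair1, pair2):
    """Compare the offset vectors of two entity pairs: similar offsets
    suggest the pairs stand in a similar relationship (e.g. INHIBITS)."""
    off1 = [x - y for x, y in zip(vec[pair1[0]], vec[pair1[1]])]
    off2 = [x - y for x, y in zip(vec[pair2[0]], vec[pair2[1]])]
    return cosine(off1, off2)

score = relational_similarity(("imatinib", "BCR-ABL"), ("drugX", "geneY"))
```

Attributional similarity, by contrast, would compare the vectors of two single entities directly, ignoring the pair structure.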
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Traditional Chinese Medicine and biodiversity.
Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 tables
The Ensemble MESH-Term Query Expansion Models Using Multiple LDA Topic Models and ANN Classifiers in Health Information Retrieval
Information retrieval in the health field faces several challenges. Health information terminology is difficult for consumers (laypeople) to understand. Formulating a query with professional terms is not easy for consumers, because health-related terms are more familiar to health professionals. If health terms related to a query were automatically added, it would help consumers find relevant information. The proposed query expansion (QE) models show how to expand a query using MeSH (Medical Subject Headings) terms. Documents were represented by the MeSH terms (i.e., Bag-of-MeSH) included in the full-text articles, and these MeSH terms were then used to generate LDA (Latent Dirichlet Allocation) topic models. A query and the top k retrieved documents were used to find MeSH terms as topic words related to the query.
LDA topic words were filtered by 1) threshold values of topic probability (TP) and word probability (WP) or 2) an ANN (Artificial Neural Network) classifier. Threshold values were effective in an LDA model with a specific number of topics for increasing IR performance in terms of infAP (inferred Average Precision) and infNDCG (inferred Normalized Discounted Cumulative Gain), which are common IR metrics for large data collections with incomplete judgments. The top k words were chosen by a word score based on (TP * WP) and retrieved-document ranking in an LDA model with specific thresholds. The QE model with specific thresholds for TP and WP showed improved mean infAP and infNDCG scores in an LDA model, compared with the baseline result. However, the threshold values optimized for a particular LDA model did not perform well in other LDA models with different numbers of topics.
An ANN classifier was employed to overcome the QE model's dependence on LDA thresholds by automatically categorizing MeSH terms (positive/negative/neutral) for QE. ANN classifiers were trained on word features related to the LDA model and the collection. Two types of QE models using an LDA model and an ANN classifier were proposed: 1) Word Score Weighting (WSW), where the probability of a word being positive/negative/neutral was used to weight the original word score, and 2) Positive Word Selection (PWS), where positive words were identified by the ANN classifier. Forty WSW models showed better average mean infAP and infNDCG scores than the PWS models when the top 7 words were selected for QE. Both approaches based on a binary ANN classifier were effective in increasing infAP and infNDCG, statistically significantly, compared with the scores of the baseline run. A 3-class classifier performed worse than the binary classifier.
The proposed ensemble QE models integrated multiple ANN classifiers with multiple LDA models. Ensemble QE models combined multiple WSW/PWS models with one or more classifiers. Multiple classifiers were more effective than a single classifier in selecting relevant words for QE. In the ensemble QE (WSW/PWS) models, adding the top k words to the original queries was effective in increasing infAP and infNDCG scores. The ensemble QE model (WSW) using three classifiers showed statistically significant improvements in the mean infAP and infNDCG scores for 30 queries when the top 3 words were added. The ensemble QE model (PWS) using four classifiers showed statistically significant improvements for 30 queries in the mean infAP and infNDCG scores.
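The TP * WP word score described above can be sketched as follows. The topic and word probabilities are invented illustrative numbers, and the aggregation rule (keeping each word's maximum score across topics) is an assumption of this sketch, not necessarily the thesis's exact procedure:

```python
def score_words(topics, top_k):
    """topics: list of (topic_prob, [(word, word_prob), ...]) pairs.
    Returns the top_k candidate MeSH terms ranked by topic_prob * word_prob,
    keeping each word's best score across topics."""
    scored = {}
    for tp, words in topics:
        for word, wp in words:
            scored[word] = max(scored.get(word, 0.0), tp * wp)
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Hypothetical output of an LDA model over Bag-of-MeSH documents.
topics = [
    (0.6, [("diabetes mellitus", 0.30), ("insulin", 0.20)]),
    (0.3, [("hypoglycemia", 0.35), ("insulin", 0.10)]),
]
expansion_terms = score_words(topics, top_k=2)
```

The selected terms would then be appended to the consumer's original query before retrieval.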
Scientific document summarization via citation contextualization and scientific discourse
The rapid growth of scientific literature has made it difficult for researchers to quickly learn about the developments in their respective fields.
Scientific document summarization addresses this challenge by providing
summaries of the important contributions of scientific papers. We present a
framework for scientific summarization which takes advantage of the citations
and the scientific discourse structure. Citation texts often lack the evidence and context to support the content of the cited paper and are sometimes even inaccurate. We first address the inaccuracy of citation texts by
finding the relevant context from the cited paper. We propose three approaches
for contextualizing citations which are based on query reformulation, word
embeddings, and supervised learning. We then train a model to identify the
discourse facets for each citation. We finally propose a method for summarizing
scientific papers by leveraging the faceted citations and their corresponding
contexts. We evaluate our proposed method on two scientific summarization
datasets in the biomedical and computational linguistics domains. Extensive
evaluation results show that our methods can improve over the state of the art
by large margins.
Comment: Preprint. The final publication is available at Springer via http://dx.doi.org/10.1007/s00799-017-0216-8, International Journal on Digital Libraries (IJDL) 201
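Citation contextualization, one of the steps described above, can be framed as a retrieval problem: rank the cited paper's sentences against the citation text. The lexical-overlap scorer below is a deliberately simplified stand-in for the query reformulation, word embedding, and supervised approaches the abstract names, with made-up example sentences:

```python
def contextualize(citation_text, cited_sentences, top_k=1):
    """Rank sentences of the cited paper by the number of words they
    share with the citation text; return the top_k as context."""
    query = set(citation_text.lower().split())

    def overlap(sentence):
        return len(query & set(sentence.lower().split()))

    return sorted(cited_sentences, key=overlap, reverse=True)[:top_k]

# Hypothetical citation text and cited-paper sentences.
citation = "their model improves summarization accuracy"
sentences = [
    "We describe the dataset in detail.",
    "our model improves summarization accuracy on both benchmarks",
]
context = contextualize(citation, sentences)
```

The retrieved context, together with the citation's discourse facet, would then feed the faceted summarization stage.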
External Validity: From Do-Calculus to Transportability Across Populations
The generalizability of empirical findings to new environments, settings or
populations, often called "external validity," is essential in most scientific
explorations. This paper treats a particular problem of generalizability,
called "transportability," defined as a license to transfer causal effects
learned in experimental studies to a new population, in which only
observational studies can be conducted. We introduce a formal representation
called "selection diagrams" for expressing knowledge about differences and
commonalities between populations of interest and, using this representation,
we reduce questions of transportability to symbolic derivations in the
do-calculus. This reduction yields graph-based procedures for deciding, prior
to observing any data, whether causal effects in the target population can be
inferred from experimental findings in the study population. When the answer is
affirmative, the procedures identify what experimental and observational
findings need be obtained from the two populations, and how they can be
combined to ensure bias-free transport.
Comment: Published at http://dx.doi.org/10.1214/14-STS486 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org). arXiv admin note: text overlap with arXiv:1312.748
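When the derivation succeeds, it can license a transport formula such as P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z), which combines the experimental conditional from the study population with the covariate distribution P*(z) observed in the target population. A toy numeric check, with made-up probabilities over a binary covariate Z:

```python
# P(y=1 | do(x=1), z), estimated experimentally in the study population
# (hypothetical numbers for illustration):
p_y_given_dox_z = {0: 0.2, 1: 0.7}

# P*(z), observed in the target population (also hypothetical):
p_star_z = {0: 0.4, 1: 0.6}

# Transported causal effect in the target population:
# P*(y=1 | do(x=1)) = sum_z P(y=1 | do(x=1), z) * P*(z)
p_star_y_dox = sum(p_y_given_dox_z[z] * p_star_z[z] for z in p_star_z)
```

The graphical procedures in the paper decide whether such a formula exists and which terms must come from which population; this snippet only evaluates one such formula once it has been licensed.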
Methods and Techniques for Clinical Text Modeling and Analytics
Nowadays, a large portion of clinical data exists only in free text. The wide adoption of Electronic Health Records (EHRs) has increased access to clinical documents, which presents both challenges and opportunities for clinical Natural Language Processing (NLP) researchers. Given free-text clinical notes as input, an ideal system for clinical text understanding should be able to support clinical decisions. At the corpus level, the system should recommend similar notes based on disease or patient types and provide medication recommendations, or any other type of recommendation, based on patients' symptoms and similar medical cases. At the document level, it should return a list of important clinical concepts. Moreover, the system should be able to make diagnostic inferences over clinical concepts and output a diagnosis. Unfortunately, no prior work has systematically studied such a system. This study focuses on developing and applying methods and techniques for different aspects of the system for clinical text understanding, at both the corpus and document levels. We address two major research questions. First, how can we model the underlying relationships in clinical notes at the corpus level? Document clustering methods can group clinical notes into meaningful clusters, which can help physicians and patients understand medical conditions and diseases from clinical notes. We use Nonnegative Matrix Factorization (NMF) and Multi-view NMF to cluster clinical notes based on extracted medical concepts. The clustering results reveal latent patterns among clinical notes, and our method provides a feasible way to visualize a corpus of clinical documents. Based on the extracted concepts, we further build a symptom-medication (Symp-Med) graph to model symptom-medication relations in the clinical notes corpus, and we develop two Symp-Med matching algorithms to predict and recommend medications for patients based on their symptoms.
Second, we address the question of how to integrate structured knowledge with unstructured text to improve results on clinical NLP tasks. On the one hand, unstructured clinical text contains a wealth of information about medical conditions. On the other hand, structured Knowledge Bases (KBs) are frequently used to support clinical NLP tasks. We propose graph-regularized word embedding models to integrate knowledge from both KBs and free text. We evaluate our models on standard datasets and biomedical NLP tasks, and the results show encouraging improvements. We further apply the graph-regularized word embedding models and present a novel approach to automatically infer the most probable diagnosis from a given clinical narrative.
Ph.D., Information Studies -- Drexel University, 201
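The Symp-Med matching idea from the first research question can be sketched as a weighted-graph lookup. The graph contents below are hypothetical, and the scoring rule (summing edge weights across the patient's symptoms) is an assumption of this sketch, not the authors' algorithm:

```python
# Hypothetical Symp-Med graph: edge weights stand in for co-occurrence
# strengths extracted from a corpus of clinical notes.
symp_med = {
    "fever":    {"acetaminophen": 12, "ibuprofen": 8},
    "headache": {"ibuprofen": 10, "acetaminophen": 5},
}

def recommend(symptoms, top_k=1):
    """Score each medication by summing its edge weights over the
    patient's symptoms; return the top_k medications."""
    scores = {}
    for s in symptoms:
        for med, w in symp_med.get(s, {}).items():
            scores[med] = scores.get(med, 0) + w
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

meds = recommend(["fever", "headache"])
```

A real system would normalize the weights and restrict candidates by contraindications, but the graph-lookup core is the same.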
Multimodal information retrieval in medical imaging repositories
The proliferation of digital medical imaging modalities in hospitals and other
diagnostic facilities has created huge repositories of valuable data, often
not fully explored. Moreover, the past few years show a growing trend
of data production. As such, studying new ways to index, process and
retrieve medical images becomes an important subject to be addressed by
the wider community of radiologists, scientists and engineers. Content-based
image retrieval, which encompasses various methods, can exploit the visual
information of a medical imaging archive, and is known to be beneficial to
practitioners and researchers. However, the integration of the latest systems
for medical image retrieval into clinical workflows is still rare, and their
effectiveness still shows room for improvement.
This thesis proposes solutions and methods for multimodal information
retrieval, in the context of medical imaging repositories. The major
contributions are a search engine for medical imaging studies supporting
multimodal queries in an extensible archive; a framework for automated
labeling of medical images for content discovery; and an assessment and
proposal of feature learning techniques for concept detection from medical
images, exhibiting greater potential than feature extraction algorithms that
were pertinently used in similar tasks. These contributions, each in their
own dimension, seek to narrow the scientific and technical gap towards
the development and adoption of novel multimodal medical image retrieval
systems, to ultimately become part of the workflows of medical practitioners,
teachers, and researchers in healthcare.
A Survey of Deep Active Learning
Active learning (AL) attempts to maximize a model's performance gain while labeling as few samples as possible. Deep learning (DL) is greedy for data and requires a large supply of it to optimize massive numbers of parameters so that the model learns to extract high-quality features. In recent years, owing to the rapid development of internet technology, we live in an era of information torrents with massive amounts of data. As a result, DL has attracted strong interest from researchers and has developed rapidly. Compared with DL, AL has received relatively little attention, mainly because, before the rise of DL, traditional machine learning required relatively few labeled samples, so early AL struggled to demonstrate its full value. Although DL has made breakthroughs in various fields, much of this success owes to the availability of large existing annotated datasets. However, acquiring large numbers of high-quality annotated datasets consumes a great deal of manpower, which is infeasible in fields that require high expertise, notably speech recognition, information extraction, and medical imaging. AL has therefore gradually received the attention it deserves. A natural question is whether AL can be used to reduce the cost of sample annotation while retaining the powerful learning capabilities of DL. From this question, deep active learning (DAL) has emerged. Although related research has been quite abundant, a comprehensive survey of DAL has been lacking. This article fills that gap: we provide a formal classification of the existing work and a comprehensive, systematic overview. In addition, we analyze and summarize the development of DAL from the perspective of applications. Finally, we discuss open questions and problems in DAL and suggest some possible directions for its development.
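A minimal sketch of the simplest AL query strategy, pool-based uncertainty sampling by predictive entropy, may make the idea concrete. The model's predicted class distributions here are mocked, not produced by a trained network:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_labeling(pool_probs, budget):
    """pool_probs: {sample_id: predicted class distribution}.
    Returns the `budget` sample ids the model is least certain about,
    i.e. those with the highest predictive entropy."""
    return sorted(pool_probs, key=lambda s: entropy(pool_probs[s]),
                  reverse=True)[:budget]

# Mocked model outputs over an unlabeled pool of three samples.
pool = {
    "a": [0.98, 0.02],   # confident prediction -> low entropy
    "b": [0.55, 0.45],   # uncertain prediction -> high entropy
    "c": [0.80, 0.20],
}
queries = select_for_labeling(pool, budget=1)
```

A DAL loop would alternate this selection step with retraining the deep model on the newly labeled samples.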
Living analytics methods for the social web
[no abstract]