Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective
This paper presents a Lisp architecture for a portable NLP system, termed
LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard,
customized and in-house developed NLP tools. Our system facilitates portability
across different institutions and data systems by incorporating an enriched
Common Data Model (CDM) to standardize necessary data elements. It utilizes
UMLS to perform domain adaptation when integrating generic domain NLP tools. It
also features stand-off annotations that are specified by positional reference
to the original document. We built an interval tree based search engine to
efficiently query and retrieve the stand-off annotations by specifying
positional requirements. We also developed a utility to convert an inline
annotation format to stand-off annotations to enable the reuse of clinical text
datasets with inline annotations. We experimented with our system on several
NLP-facilitated tasks, including computational phenotyping for lymphoma patients
and semantic relation extraction for clinical notes. These experiments
showcased the broader applicability and utility of LAPNLP.
Comment: 6 pages, accepted by IEEE BIBM 2018 as regular paper
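The conversion from inline to stand-off annotations and the positional query described in this abstract can be illustrated with a minimal Python sketch. It is not the LAPNLP implementation (which is written in Lisp): the `<label>...</label>` inline format, the `Annotation` class, and the function names are assumptions made for illustration, and a linear scan stands in for the interval tree used by the authors.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset into the plain text
    end: int    # exclusive character offset into the plain text
    label: str


# Hypothetical inline format: non-nested <label>...</label> spans.
TAG = re.compile(r"<(?P<label>\w+)>(?P<body>.*?)</(?P=label)>", re.S)


def inline_to_standoff(marked_up):
    """Strip inline tags and return (plain_text, stand-off annotations)."""
    parts, annotations = [], []
    cursor = 0   # position in the marked-up text
    offset = 0   # position in the plain text being built
    for match in TAG.finditer(marked_up):
        before = marked_up[cursor:match.start()]
        parts.append(before)
        offset += len(before)
        body = match.group("body")
        annotations.append(
            Annotation(offset, offset + len(body), match.group("label"))
        )
        parts.append(body)
        offset += len(body)
        cursor = match.end()
    parts.append(marked_up[cursor:])
    return "".join(parts), annotations


def overlapping(annotations, start, end):
    """Return annotations overlapping [start, end).

    Linear scan for clarity; an interval tree would answer the same
    query without visiting every annotation.
    """
    return [a for a in annotations if a.start < end and start < a.end]


# Made-up example sentence, not taken from any clinical dataset.
marked = (
    "Patient has <problem>diffuse large B-cell lymphoma</problem> "
    "treated with <treatment>R-CHOP</treatment>."
)
plain, annotations = inline_to_standoff(marked)
hits = overlapping(annotations, 0, 20)
```

Because each annotation stores only offsets into the untouched text, `plain[a.start:a.end]` recovers the annotated span without modifying the original document.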
Conditional Hierarchical Bayesian Tucker Decomposition
Our research focuses on studying and developing methods for reducing the
dimensionality of large datasets, common in biomedical applications. A major
problem when learning information about patients based on genetic sequencing
data is that there are often more feature variables (genetic data) than
observations (patients). This makes direct supervised learning difficult. One
way of reducing the feature space is to use latent Dirichlet allocation in
order to group genetic variants in an unsupervised manner. Latent Dirichlet
allocation is a common model in natural language processing, which describes a
document as a mixture of topics, each with a probability of generating certain
words. This can be generalized as a Bayesian tensor decomposition to account
for multiple feature variables. While we made some progress improving and
modifying these methods, our significant contributions are with hierarchical
topic modeling. We developed distinct methods of incorporating hierarchical
topic modeling, based on nested Chinese restaurant processes and Pachinko
Allocation Machine, into Bayesian tensor decompositions. We apply these models
to predict whether or not patients have autism spectrum disorder based on
genetic sequencing data. We examine a dataset from the National Database for Autism
Research consisting of paired siblings -- one with autism, and the other
without -- and counts of their genetic variants. Additionally, we linked the
genes with their Reactome biological pathways. We combine this information into
a tensor of patients, counts of their genetic variants, and the membership of
these genes in pathways. Once we decompose this tensor, we use logistic
regression on the reduced features in order to predict if patients have autism.
We also perform a similar analysis of a dataset of patients with one of four
common types of cancer (breast, lung, prostate, and colorectal).
Comment: 20 pages, added model evaluation and log-likelihood section
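The overall pipeline in this abstract (build a patient × gene × pathway tensor, reduce it, then classify with logistic regression) can be sketched as follows. This is only an illustration under strong assumptions: a truncated SVD of the patient-mode unfolding stands in for the hierarchical Bayesian Tucker decomposition, the logistic regression is a plain gradient-descent version, and all data, sizes, and function names are synthetic or hypothetical.

```python
import numpy as np


def build_tensor(counts, membership):
    """counts: (patients, genes); membership: (genes, pathways), 0/1.

    Returns T with T[p, g, k] = counts[p, g] * membership[g, k].
    """
    return counts[:, :, None] * membership[None, :, :]


def patient_features(tensor, rank):
    """Reduce the patient mode via truncated SVD of the mode-0 unfolding.

    A non-Bayesian stand-in for the decomposition described above.
    """
    unfolded = tensor.reshape(tensor.shape[0], -1)
    u, s, _ = np.linalg.svd(unfolded, full_matrices=False)
    scores = u[:, :rank] * s[:rank]
    # Standardize so gradient descent behaves well.
    return (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-12)


def fit_logistic(x, y, steps=500, lr=0.1):
    """Gradient-descent logistic regression; last weight is the bias."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        w -= lr * xb.T @ (p - y) / len(y)
    return w


def predict(x, w):
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    return (xb @ w > 0).astype(int)


# Synthetic data only: 40 patients, 12 genes, 5 pathways, random labels.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(40, 12)).astype(float)
membership = (rng.random((12, 5)) < 0.3).astype(float)
labels = rng.integers(0, 2, size=40).astype(float)

tensor = build_tensor(counts, membership)
features = patient_features(tensor, rank=3)
weights = fit_logistic(features, labels)
preds = predict(features, weights)
```

With random labels the predictions carry no signal; the sketch only shows how the reduced patient features feed the classifier.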
Automatic classification of registered clinical trials towards the Global Burden of Diseases taxonomy of diseases and injuries
Includes details on the implementation of MetaMap and IntraMap, prioritization rules, the test set of clinical trials and the classification of the external test set according to the 171 GBD categories.
Dataset S1: Expert-based enrichment database for the classification according to the 28 GBD categories. Manual classification of 503 UMLS concepts that could not be mapped to any of the 28 GBD categories.
Dataset S2: Expert-based enrichment database for the classification according to the 171 GBD categories. Manual classification of 655 UMLS concepts that could not be mapped to any of the 171 GBD categories, among which 108 could be projected to candidate GBD categories.
Table S1: Excluded residual GBD categories for the grouping of the GBD cause list in 171 GBD categories. A grouping of 193 GBD categories was defined during the GBD 2010 study to inform policy makers about the main health problems per country. From these 193 GBD categories, we excluded the 22 residual categories listed in the Table. We developed a classifier for the remaining 171 GBD categories. Among these residual categories, the only excluded categories in the grouping of 28 GBD categories were "Other infectious diseases" and "Other endocrine, nutritional, blood, and immune disorders".
Table S2: Per-category evaluation of performance of the classifier for the 171 GBD categories plus the "No GBD" category. Number of trials per GBD category from the test set of 2,763 clinical trials. Sensitivities, specificities (in %) and likelihood ratios for each of the 171 GBD categories plus the "No GBD" category for the classifier using the Word Sense Disambiguation server, the expert-based enrichment database and the priority to the health condition field.
Table S3: Performance of the 8 versions of the classifier for the 171 GBD categories. Exact-matching and weighted averaged sensitivities and specificities for 8 versions of the classifier for the 171 GBD categories. Exact-matching corresponds to the proportion (in %) of trials for which the automatic GBD classification is correct. Exact-matching was estimated over all trials (N = 2,763), trials concerning a unique GBD category (N = 2,092), trials concerning 2 or more GBD categories (N = 187), and trials not relevant for the GBD (N = 484). The weighted averaged sensitivity and specificity correspond to the weighted average across GBD categories of the sensitivities and specificities for each GBD category plus the "No GBD" category (in %). The 8 versions correspond to the combinations of the use or not of the Word Sense Disambiguation server during the text annotation, the expert-based enrichment database, and the priority to the health condition field as a prioritization rule.
Table S4: Per-category evaluation of the performance of the baseline for the 28 GBD categories plus the "No GBD" category. Number of trials per GBD category from the test set of 2,763 clinical trials. Sensitivities and specificities (in %) of the 28 GBD categories plus the "No GBD" category for the classification of clinical trial records towards GBD categories without using the UMLS knowledge source but based on the recognition in free text of the names of diseases defining each GBD category only. For the baseline, a clinical trial record was classified with a GBD category if at least one of the 291 disease names from the GBD cause list defining that GBD category appeared verbatim in the condition field, the public or scientific titles, separately, or in at least one of these three text fields. (DOCX 84 kb)
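The verbatim-matching baseline and the evaluation measures described for Tables S3 and S4 can be sketched in Python as below. This is a minimal illustration, not the authors' code: the field names, the two-category name list, and the three toy trial records are hypothetical, and case-insensitive substring matching is an assumption about what "verbatim" means here.

```python
def classify_baseline(record, names_by_category,
                      fields=("condition", "public_title", "scientific_title")):
    """Assign every category whose disease names appear in the given fields."""
    text = " ".join(record.get(field, "") for field in fields).lower()
    return {
        category
        for category, names in names_by_category.items()
        if any(name.lower() in text for name in names)
    }


def exact_matching(predicted, actual):
    """Proportion of trials whose predicted category set equals the true set."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def sensitivity_specificity(predicted, actual, category):
    """Per-category sensitivity and specificity over a list of trials."""
    tp = fn = fp = tn = 0
    for p, a in zip(predicted, actual):
        if category in a:
            if category in p:
                tp += 1
            else:
                fn += 1
        else:
            if category in p:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical name list and trial records, for illustration only.
names_by_category = {"Malaria": ["malaria"], "Tuberculosis": ["tuberculosis"]}
records = [
    {"condition": "Severe malaria", "public_title": "Artesunate for malaria"},
    {"condition": "Pulmonary TB", "public_title": "Treatment shortening study"},
    {"condition": "Healthy volunteers", "public_title": "Pharmacokinetics study"},
]
actual = [{"Malaria"}, {"Tuberculosis"}, set()]

predicted = [classify_baseline(r, names_by_category) for r in records]
```

The second record shows the limitation of verbatim matching: the abbreviation "TB" is missed because only the full disease name is searched for, so an empty set (corresponding to "No GBD") is returned.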