DAPPER: Scaling Dynamic Author Persona Topic Model to Billion Word Corpora
Extracting common narratives from multi-author dynamic text corpora requires
complex models, such as the Dynamic Author Persona (DAP) topic model. However,
such models are complex and can struggle to scale to large corpora, often
because of challenging non-conjugate terms. To overcome such challenges, in
this paper we adapt new ideas in approximate inference to the DAP model,
resulting in the DAP Performed Exceedingly Rapidly (DAPPER) topic model.
Specifically, we develop Conjugate-Computation Variational Inference (CVI)
based variational Expectation-Maximization (EM) for learning the model,
yielding fast, closed form updates for each document, replacing iterative
optimization in earlier work. Our results show significant improvements in
model fit and training time without needing to compromise the model's temporal
structure or the application of Regularized Variational Inference (RVI). We
demonstrate the scalability and effectiveness of the DAPPER model by extracting
health journeys from the CaringBridge corpus --- a collection of 9 million
journals written by 200,000 authors during health crises.
Comment: Published in IEEE International Conference on Data Mining, November
2018, Singapore
When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, $\ell_2$-consistency and Neuroscience Applications
Many studies in biomedical and health sciences involve small sample sizes due
to logistic or financial constraints. Often, identifying weak (but
scientifically interesting) associations between a set of predictors and a
response necessitates pooling datasets from multiple diverse labs or groups.
While there is a rich literature in statistical machine learning to address
distributional shifts and inference in multi-site datasets, it is less clear
when such pooling is guaranteed to help (and when it does not) --
independent of the inference algorithms we use. In this paper, we present a
hypothesis test to answer this question, both for classical and high
dimensional linear regression. We precisely identify regimes where pooling
datasets across multiple sites is sensible, and how such policy decisions can
be made via simple checks executable on each site before any data transfer ever
happens. With a focus on Alzheimer's disease studies, we present empirical
results showing that in regimes suggested by our analysis, pooling a local
dataset with data from an international study improves power.
Comment: 34th International Conference on Machine Learning
CheXbreak: Misclassification Identification for Deep Learning Models Interpreting Chest X-rays
A major obstacle to the integration of deep learning models for chest x-ray
interpretation into clinical settings is the lack of understanding of their
failure modes. In this work, we first investigate whether there are patient
subgroups that chest x-ray models are likely to misclassify. We find that
patient age and the radiographic finding of lung lesion, pneumothorax or
support devices are statistically relevant features for predicting
misclassification for some chest x-ray models. Second, we develop
misclassification predictors on chest x-ray models using their outputs and
clinical features. We find that our best performing misclassification
identifier achieves an AUROC close to 0.9 for most diseases. Third, employing
our misclassification identifiers, we develop a corrective algorithm to
selectively flip model predictions that have high likelihood of
misclassification at inference time. We observe F1 improvement on the
prediction of Consolidation (0.008 [95% CI 0.005, 0.010]) and Edema (0.003,
[95% CI 0.001, 0.006]). By carrying out our investigation on ten distinct and
high-performing chest x-ray models, we are able to derive insights across model
architectures and offer a generalizable framework applicable to other medical
imaging tasks.
Comment: In Proceedings of the 2021 Conference on Machine Learning for Health
Care, 2021. In ACM Conference on Health, Inference, and Learning (ACM-CHIL)
Workshop 202
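The corrective step described in the abstract above (flipping predictions that are likely to be misclassified) can be illustrated with a minimal sketch. It assumes binary predictions and a separate misclassification probability per prediction; the function name `flip_likely_errors` and the `threshold` parameter are hypothetical and do not come from the paper.

```python
def flip_likely_errors(predictions, misclass_probs, threshold=0.5):
    """Flip each binary prediction whose estimated probability of being
    wrong exceeds `threshold` (hypothetical parameter, not from the paper)."""
    if len(predictions) != len(misclass_probs):
        raise ValueError("predictions and misclass_probs must have equal length")
    return [1 - p if q > threshold else p
            for p, q in zip(predictions, misclass_probs)]


# Four binary predictions with made-up misclassification probabilities.
corrected = flip_likely_errors([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7], threshold=0.6)
print(corrected)  # [0, 0, 1, 1]
```

A higher threshold flips fewer predictions, so in practice it would presumably be tuned on held-out data.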
Enabling Counterfactual Survival Analysis with Balanced Representations
Balanced representation learning methods have been applied successfully to
counterfactual inference from observational data. However, approaches that
account for survival outcomes are relatively limited. Survival data are
frequently encountered across diverse medical applications, e.g., drug
development, risk profiling, and clinical trials, and such data are also
relevant in fields like manufacturing (e.g., for equipment monitoring). When
the outcome of interest is a time-to-event, special precautions for handling
censored events need to be taken, as ignoring censored outcomes may lead to
biased estimates. We propose a theoretically grounded unified framework for
counterfactual inference applicable to survival outcomes. Further, we formulate
a nonparametric hazard ratio metric for evaluating average and individualized
treatment effects. Experimental results on real-world and semi-synthetic
datasets, the latter of which we introduce, demonstrate that the proposed
approach significantly outperforms competitive alternatives in both
survival-outcome prediction and treatment-effect estimation.
Comment: Accepted at ACM Conference on Health, Inference, and Learning (ACM
CHIL 2021). Code at
https://github.com/paidamoyo/counterfactual_survival_analysi
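To illustrate why censored observations need special handling, the sketch below implements the classical Kaplan-Meier estimator. This is a standard textbook technique, not the framework proposed in the paper; the function name and the toy data are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  : observed times (event or censoring)
    events : 1 if the event was observed, 0 if the subject was censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without counting as events.
        n_at_risk -= removed
    return curve


# Toy data: the subject observed until time 2 is censored.
km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(km_curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored subject still counts in the risk set at time 1. Dropping that subject entirely would give a different estimate (2/3 instead of 0.75 at time 1), which is the kind of bias the abstract warns about.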
Integrating Physiological Time Series and Clinical Notes with Deep Learning for Improved ICU Mortality Prediction
Intensive Care Unit Electronic Health Records (ICU EHRs) store multimodal
data about patients including clinical notes, sparse and irregularly sampled
physiological time series, lab results, and more. To date, most methods
designed to learn predictive models from ICU EHR data have focused on a single
modality. In this paper, we leverage the recently proposed
interpolation-prediction deep learning architecture (Shukla and Marlin 2019) as
a basis for exploring how physiological time series data and clinical notes can
be integrated into a unified mortality prediction model. We study both early
and late fusion approaches and demonstrate how the relative predictive value of
clinical text and physiological data changes over time. Our results show that a
late fusion approach can provide a statistically significant improvement in
mortality prediction performance over using individual modalities in isolation.
Comment: Presented at ACM Conference on Health, Inference and Learning
(Workshop Track), 202
Generative Adversarial Networks for Failure Prediction
Prognostics and Health Management (PHM) is an emerging engineering discipline
which is concerned with the analysis and prediction of equipment health and
performance. One of the key challenges in PHM is to accurately predict
impending failures in the equipment. In recent years, solutions for failure
prediction have evolved from building complex physical models to the use of
machine learning algorithms that leverage the data generated by the equipment.
However, failure prediction problems pose a set of unique challenges that make
direct application of traditional classification and prediction algorithms
impractical. These challenges include the highly imbalanced training data, the
extremely high cost of collecting more failure samples, and the complexity of
the failure patterns. Traditional oversampling techniques cannot capture such
complexity and consequently overfit the training data. This paper addresses
these challenges by proposing a novel algorithm for failure prediction using
Generative Adversarial Networks (GAN-FP). GAN-FP
utilizes two GAN networks to simultaneously generate training samples and build
an inference network that can be used to predict failures for new samples.
GAN-FP first adopts an infoGAN to generate realistic failure and non-failure
samples, and initialize the weights of the first few layers of the inference
network. The inference network is then tuned by optimizing a weighted loss
objective using only real failure and non-failure samples. The inference
network is further tuned using a second GAN whose purpose is to guarantee the
consistency between the generated samples and corresponding labels. GAN-FP can
be used for other imbalanced classification problems as well.
Comment: ECML PKDD 2019 (The European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases, 2019)
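The abstract above mentions tuning the inference network with a weighted loss on imbalanced failure data but does not specify its form. As a hedged illustration, the sketch below shows one common choice, a class-weighted binary cross-entropy; the function name `weighted_bce` and the `pos_weight` value are assumptions, not details taken from the paper.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight=10.0, eps=1e-12):
    """Mean binary cross-entropy in which the rare positive (failure) class
    is up-weighted by `pos_weight` (an assumed, illustrative value)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A missed failure is penalized more heavily than an equally confident false alarm.
missed_failure = weighted_bce([1], [0.1])
false_alarm = weighted_bce([0], [0.9])
print(missed_failure > false_alarm)  # True
```

With `pos_weight=1.0` this reduces to the ordinary binary cross-entropy.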
DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction
Clinical trials are essential for drug development but often suffer from
expensive, inaccurate and insufficient patient recruitment. The core problem of
patient-trial matching is to find qualified patients for a trial, where patient
information is stored in electronic health records (EHR) while trial
eligibility criteria (EC) are described in text documents available on the web.
How to represent longitudinal patient EHR? How to extract complex logical rules
from EC? Most existing works rely on manual rule-based extraction, which is
time consuming and inflexible for complex inference. To address these
challenges, we propose DeepEnroll, a cross-modal inference learning model to
jointly encode enrollment criteria (text) and patient records (tabular data)
into a shared latent space for matching inference. DeepEnroll applies a
pre-trained Bidirectional Encoder Representations from Transformers (BERT) model
to encode clinical trial information into sentence embeddings, and uses a
hierarchical embedding model to represent patient longitudinal EHR. In
addition, DeepEnroll is augmented by a numerical information embedding and
entailment module to reason over numerical information in both EC and EHR.
These encoders are trained jointly to optimize the patient-trial matching score.
We evaluated DeepEnroll on the trial-patient matching task using real-world
datasets. DeepEnroll outperformed the best baseline by up to 12.4%
in average F1.
Comment: accepted by The World Wide Web Conference 202
CheXseen: Unseen Disease Detection for Deep Learning Interpretation of Chest X-rays
We systematically evaluate the performance of deep learning models in the
presence of diseases not labeled for or present during training. First, we
evaluate whether deep learning models trained on a subset of diseases (seen
diseases) can detect the presence of any one of a larger set of diseases. We
find that models tend to falsely classify diseases outside of the subset
(unseen diseases) as "no disease". Second, we evaluate whether models trained
on seen diseases can detect seen diseases when co-occurring with diseases
outside the subset (unseen diseases). We find that models are still able to
detect seen diseases even when co-occurring with unseen diseases. Third, we
evaluate whether feature representations learned by models may be used to
detect the presence of unseen diseases given a small labeled set of unseen
diseases. We find that the penultimate layer of the deep neural network
provides useful features for unseen disease detection. Our results can inform
the safe clinical deployment of deep learning models trained on a
non-exhaustive set of disease classes.
Comment: Accepted to ACM Conference on Health, Inference, and Learning
(ACM-CHIL) Workshop 202
Reading Industrial Inspection Sheets by Inferring Visual Relations
The traditional mode of recording faults in heavy factory equipment has been
via hand marked inspection sheets, wherein a machine engineer manually marks
the faulty machine regions on a paper outline of the machine. Over the years,
millions of such inspection sheets have been recorded and the data within these
sheets has remained inaccessible. However, with industries going digital and
waking up to the potential value of fault data for machine health monitoring,
there is an increased impetus towards digitization of these hand marked
inspection records. To target this digitization, we propose a novel visual
pipeline combining state-of-the-art deep learning models with domain knowledge
and low level vision techniques, followed by inference of visual relationships.
Our framework is robust to the presence of both static and non-static
background in the document, variability in the machine template diagrams,
unstructured shape of graphical objects to be identified and variability in the
strokes of handwritten text. The proposed pipeline incorporates a capsule and
spatial transformer network based classifier for accurate text reading, and a
customized CTPN network for text detection in addition to hybrid techniques for
arrow detection and dialogue cloud removal. We have tested our approach on a
real world dataset of 50 inspection sheets for large containers and boilers.
The results are visually appealing and the pipeline achieved an accuracy of
87.1% for text detection and 94.6% for text reading.
Comment: Published in 3rd International Workshop on Robust Reading at Asian
Conference on Computer Vision 201
A large-scale Twitter dataset for drug safety applications mined from publicly existing resources
With the increase in popularity of deep learning models for natural language
processing (NLP) tasks, in the field of Pharmacovigilance, more specifically
for the identification of Adverse Drug Reactions (ADRs), there is an inherent
need for large-scale social-media datasets aimed at such tasks. Most
researchers spend large amounts of time crawling Twitter or buy expensive
pre-curated datasets and then annotate them manually; these approaches do not
scale well as more and more data keeps flowing into Twitter. In
this work we re-purpose a publicly available archived dataset of more than 9.4
billion Tweets with the objective of creating a very large dataset of drug
usage-related tweets. Using existing manually curated datasets from the
literature, we then validate our filtered tweets for relevance using machine
learning methods, with the end result of a publicly available dataset of
1,181,993 tweets for public use. We provide all code and a detailed
procedure for extracting this dataset, along with the selected tweet ids, for
researchers to use.
Comment: 8 tables, 2 figures, 7 pages, accepted after peer review as a
workshop paper in ACM Conference on Health, Inference, and Learning (CHIL)
2020 https://www.chilconference.org/agenda