Intrinsically Dynamic Network Communities
Community finding algorithms for networks have recently been extended to
dynamic data. Most of these recent methods aim at exhibiting community
partitions from successive graph snapshots and thereafter connecting or
smoothing these partitions using clever time-dependent features and sampling
techniques. These approaches nonetheless achieve longitudinal rather than
dynamic community detection. We assume that communities are fundamentally
defined by the repetition of interactions among a set of nodes over time.
According to this definition, analyzing the data by considering successive
snapshots induces a significant loss of information: we suggest that it blurs
essentially dynamic phenomena - such as communities based on repeated
inter-temporal interactions, nodes switching from one community to another over
time, or the possibility that a community survives while its members are
entirely replaced over a longer time period. We propose a formalism which
aims at tackling this issue in the context of time-directed datasets (such as
citation networks), and present several illustrations on both empirical and
synthetic dynamic networks. Finally, we introduce intrinsically dynamic
metrics to qualify temporal community structure and emphasize their possible
role as an estimator of the quality of the community detection - taking into
account the fact that various empirical contexts may call for distinct
'community' definitions and detection criteria.
Comment: 27 pages, 11 figures
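The idea that a community is defined by repeated interactions over time, rather than by structure inside one snapshot, can be illustrated with a minimal sketch. This is not the formalism proposed in the paper: the function name `repeated_pairs`, the `(time, u, v)` event format, the `min_repeats` threshold, and the choice to treat interactions as undirected are all assumptions made for illustration.

```python
from collections import defaultdict


def repeated_pairs(events, min_repeats=2):
    """Return node pairs that interact at `min_repeats` or more distinct times.

    `events` is an iterable of (time, u, v) tuples. Interactions are treated
    as undirected here, which is a simplifying assumption.
    """
    times = defaultdict(set)
    for t, u, v in events:
        # frozenset makes (u, v) and (v, u) count as the same pair
        times[frozenset((u, v))].add(t)
    return {
        tuple(sorted(pair))
        for pair, stamps in times.items()
        if len(stamps) >= min_repeats
    }


# Hypothetical toy data: a and b interact at times 1, 5 and 9.
events = [(1, "a", "b"), (2, "b", "c"), (5, "b", "a"), (9, "a", "b")]
print(repeated_pairs(events))
```

If the same events were cut into snapshots covering times 1-3, 4-6 and 7-9, the pair (a, b) would appear exactly once in each snapshot, so the repetition is only visible when the whole time axis is considered together.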
Social media mining for identification and exploration of health-related information from pregnant women
Widespread use of social media has led to the generation of substantial
amounts of information about individuals, including health-related information.
Social media provides the opportunity to study health-related information about
selected population groups who may be of interest for a particular study. In
this paper, we explore the possibility of utilizing social media to perform
targeted data collection and analysis from a particular population group --
pregnant women. We hypothesize that we can use social media to identify cohorts
of pregnant women and follow them over time to analyze crucial health-related
information. To identify potentially pregnant women, we employ simple
rule-based searches that attempt to detect pregnancy announcements with
moderate precision. To further filter out false positives and noise, we employ
a supervised classifier trained on a small amount of hand-annotated data. We then
collect their posts over time to create longitudinal health timelines and
attempt to divide the timelines into different pregnancy trimesters. Finally,
we assess the usefulness of the timelines by performing a preliminary analysis
to estimate drug intake patterns of our cohort at different trimesters. Our
rule-based cohort identification technique collected 53,820 users over thirty
months from Twitter. Our pregnancy announcement classification technique
achieved an F-measure of 0.81 for the pregnancy class, resulting in 34,895 user
timelines. Analysis of the timelines revealed that pertinent health-related
information, such as drug intake and adverse reactions, can be mined from the
data. Our approach to using user timelines in this fashion has produced very
encouraging results and can be employed for other important tasks where
cohorts, for which health-related information may not be available from other
sources, are required to be followed over time to derive population-based
estimates.
Comment: 9 pages
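The first stage (rule-based detection of announcements) and the reported F-measure can be sketched as follows. The abstract does not give the actual rules, so the two regular expressions below are hypothetical examples, and the helper names `looks_like_announcement` and `f_measure` are assumptions. The F-measure shown is the standard harmonic mean of precision and recall for the positive class.

```python
import re

# Hypothetical patterns; the rules actually used in the paper are not listed
# in the abstract.
ANNOUNCEMENT_PATTERNS = [
    re.compile(r"\bi(?:'m| am) (?:\d+ (?:weeks|months) )?pregnant\b", re.IGNORECASE),
    re.compile(r"\bmy due date\b", re.IGNORECASE),
]


def looks_like_announcement(text):
    """High-recall first pass: True if any pattern matches the post."""
    return any(p.search(text) for p in ANNOUNCEMENT_PATTERNS)


def f_measure(gold, predicted):
    """F1 for the positive class, given parallel lists of booleans."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(looks_like_announcement("I'm 12 weeks pregnant!"))
print(f_measure([True, True, False, False], [True, False, True, False]))
```

Posts that pass such a filter would then go to the supervised classifier, which removes the false positives that loose patterns inevitably let through.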
Machine learning of structured and unstructured healthcare data
The widespread adoption of Electronic Health Records (EHR) systems in healthcare institutions in the United States makes machine learning based on large-scale and real-world clinical data feasible and affordable. Machine learning of healthcare data, or healthcare data analytics, has achieved numerous successes in various applications. However, there are still many challenges for machine learning of healthcare data, both structured and unstructured. Longitudinal structured clinical data (e.g., lab test results, diagnoses, and medications) have an enormous variety of categories, are collected at irregularly spaced visits, and are sparsely distributed. Studies on analyzing longitudinal structured EHR data for tasks such as disease prediction and visualization are still limited. For unstructured clinical notes, existing studies mostly focus on disease prediction or cohort selection. Studies on mining clinical notes with the direct purpose of reducing costs for healthcare providers or institutions are limited. To fill these gaps, this dissertation addresses three research topics.
The first topic is about developing state-of-the-art predictive models to detect diabetic retinopathy using longitudinal structured EHR data. Major deep-learning-based temporal models for disease prediction are studied, implemented, and evaluated. Experimental results on a large-scale dataset show that temporal deep learning models outperform non-temporal random forest models in terms of AUPRC and recall.
The second topic is about clustering temporal disease networks to visualize comorbidity progression. We propose a clustering technique to outline comorbidity progression phases as well as a new disease clustering method to simplify the visualization. Two case studies on Clostridioides difficile and stroke show the methods are effective.
The third topic is clinical information extraction for medical billing. We propose a framework that consists of two methods, one rule-based and one deep-learning-based, to extract patient history information directly from clinical notes to facilitate Evaluation and Management Services (E/M) billing. Initial results of the two prototype systems on an annotated dataset are promising and point to potential improvements.
Overcoming data scarcity of Twitter: using tweets as bootstrap with application to autism-related topic content analysis
Notwithstanding recent work which has demonstrated the potential of using
Twitter messages for content-specific data mining and analysis, the depth of
such analysis is inherently limited by the scarcity of data imposed by the
140-character tweet limit. In this paper we describe a novel approach for targeted
knowledge exploration which uses tweet content analysis as a preliminary step.
This step is used to bootstrap more sophisticated data collection from directly
related but much richer content sources. In particular we demonstrate that
valuable information can be collected by following URLs included in tweets. We
automatically extract content from the corresponding web pages and, treating
each web page as a document linked to the original tweet, show how a temporal
topic model based on a hierarchical Dirichlet process can be used to track the
evolution of a complex topic structure of a Twitter community. Using
autism-related tweets we demonstrate that our method is capable of capturing a
much more meaningful picture of information exchange than user-chosen hashtags.
Comment: IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining, 201
Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks
Predicting the future health information of patients from the historical
Electronic Health Records (EHR) is a core research task in the development of
personalized healthcare. Patient EHR data consist of sequences of visits over
time, where each visit contains multiple medical codes, including diagnosis,
medication, and procedure codes. The most important challenges for this task
are to model the temporality and high dimensionality of sequential EHR data and
to interpret the prediction results. Existing work solves this problem by
employing recurrent neural networks (RNNs) to model EHR data and utilizing
a simple attention mechanism to interpret the results. However, RNN-based
approaches suffer from the problem that the performance of RNNs drops when the
length of sequences is large, and the relationships between subsequent visits
are ignored by current RNN-based approaches. To address these issues, we
propose Dipole, an end-to-end, simple and robust model for predicting
patients' future health information. Dipole employs bidirectional recurrent
neural networks to remember all the information of both the past visits and the
future visits, and it introduces three attention mechanisms to measure the
relationships of different visits for the prediction. With the attention
mechanisms, Dipole can interpret the prediction results effectively. Dipole
also allows us to interpret the learned medical code representations, which are
positively confirmed by medical experts. Experimental results on two real-world
EHR datasets show that the proposed Dipole can significantly improve the
prediction accuracy compared with the state-of-the-art diagnosis prediction
approaches and provide clinically meaningful interpretation.
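How attention over per-visit hidden states yields both a prediction input and an interpretation can be sketched in a few lines. The abstract does not specify Dipole's three attention mechanisms, so the code below shows one generic form only (a linear score per visit followed by a softmax); the names `softmax` and `attend`, the weight vector, and the toy inputs are assumptions, and the hidden states stand in for concatenated forward and backward RNN outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, weights, bias=0.0):
    """Weight each visit's hidden state and return (alphas, context).

    hidden_states: list of equal-length vectors, one per visit.
    weights: scoring vector of the same length (learned in a real model).
    """
    scores = [
        sum(w * h for w, h in zip(weights, state)) + bias
        for state in hidden_states
    ]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of the visit states
    context = [
        sum(a * state[d] for a, state in zip(alphas, hidden_states))
        for d in range(dim)
    ]
    return alphas, context


# Hypothetical toy input: three visits with 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = attend(states, weights=[0.5, 2.0])
print(alphas, context)
```

The `alphas` sum to one across visits, which is what makes them readable as the relative contribution of each past visit to the prediction.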