Personalized modeling for prediction with decision-path models
Deriving predictive models in medicine typically relies on a population approach, in which a single model is developed from a dataset of individuals. In this paper we describe and evaluate a personalized approach in which we construct a new type of decision tree model, called a decision-path model, that takes advantage of the particular features of a given person of interest. We introduce three personalized methods that derive personalized decision-path models. We compared the performance of these methods to that of Classification And Regression Trees (CART), a population decision-tree method, for predicting seven different outcomes in five medical datasets. Two of the three personalized methods performed statistically significantly better on area under the ROC curve (AUC) and Brier skill score than CART. The personalized approach of learning decision-path models is a new approach to predictive modeling that can perform better than a population approach.
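The core object in this abstract is the decision path: the sequence of tests a particular person's features trigger on the way from root to leaf. The sketch below is a minimal, hypothetical illustration of extracting that path from a population tree; the tree structure, feature names ("age", "bp"), and thresholds are invented for illustration and are not the paper's datasets or algorithms.

```python
# Hypothetical sketch: follow one person of interest through a decision
# tree (nested dicts) and record the tests along their decision path.

def decision_path(tree, person):
    """Walk the tree for one instance; return (list of tests, leaf value)."""
    path, node = [], tree
    while "leaf" not in node:
        feat, thr = node["feature"], node["threshold"]
        go_left = person[feat] <= thr
        path.append((feat, "<=" if go_left else ">", thr))
        node = node["left"] if go_left else node["right"]
    return path, node["leaf"]

# Toy population tree: predicted risk from age and blood pressure.
tree = {
    "feature": "age", "threshold": 50,
    "left": {"leaf": 0.1},
    "right": {
        "feature": "bp", "threshold": 140,
        "left": {"leaf": 0.3},
        "right": {"leaf": 0.8},
    },
}

path, risk = decision_path(tree, {"age": 63, "bp": 150})
print(path)  # [('age', '>', 50), ('bp', '>', 140)]
print(risk)  # 0.8
```

A personalized method could then focus model construction on the region of feature space selected by this path, rather than on the full population.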
A Deep Learning Anomaly Detection Method in Textual Data
In this article, we propose using deep learning and transformer architectures
combined with classical machine learning algorithms to detect and identify
anomalies in textual data. The deep learning model provides crucial contextual
information about the text, converting all textual content into a numerical
representation. We used multiple machine learning methods, including Sentence
Transformers, autoencoders, logistic regression, and distance-based
calculations, to predict anomalies. The methods are tested on text data into
which synthetic data from different sources is injected, either as anomalies or
as targets. We describe several methods and algorithms from the field of
outlier detection and present the results of the best-performing technique.
These results suggest that our algorithm could potentially reduce
false-positive rates compared with the other anomaly detection methods we
tested.
Comment: 8 pages, 4 figures
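The distance-based step described above can be sketched in a few lines: embed each text as a vector, compute the corpus centroid, and score texts by their distance from it. The paper uses Sentence Transformer embeddings; in this self-contained toy, a bag-of-words count vector stands in for the learned embedding, and the corpus and injected anomaly are invented examples.

```python
# Toy sketch of distance-to-centroid anomaly scoring for text.
# Bag-of-words vectors stand in for transformer embeddings.
import math
from collections import Counter

def embed(text, vocab):
    counts = Counter(text.lower().split())
    return [counts.get(w, 0) for w in vocab]

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

texts = [
    "the patient was admitted for observation",
    "the patient was discharged after observation",
    "buy cheap watches now now now",   # injected anomaly
]
vocab = sorted({w for t in texts for w in t.lower().split()})
vecs = [embed(t, vocab) for t in texts]
c = centroid(vecs)
scores = [distance(v, c) for v in vecs]
# The injected text lies furthest from the centroid.
print(scores.index(max(scores)))  # 2
```

In the full pipeline, the same scoring would run on transformer embeddings, with a threshold (or a downstream classifier such as logistic regression) deciding which scores count as anomalies.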
A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data
The interactive exploration of large and evolving datasets is challenging as
relationships between underlying variables may not be fully understood. There
may be hidden trends and patterns in the data that are worthy of further
exploration and analysis. We present a system that methodically explores
multiple combinations of variables using a searchlight technique and identifies
outliers. An iterative k-means clustering algorithm is applied to features
derived through a split-apply-combine paradigm used in the database literature.
Outliers are identified as singleton or small clusters. This algorithm is swept
across the dataset in a searchlight manner. The dimensions that contain
outliers are combined in pairs with other dimensions using a subset scan
technique to gain further insight into the outliers. We illustrate this system
by analyzing open health care data released by New York State. We apply our
iterative k-means searchlight followed by subset scanning. Several anomalous
trends in the data are identified, including cost overruns at specific
hospitals, and increases in diagnoses such as suicides. These constitute novel
findings in the literature, and are of potential use to regulatory agencies,
policy makers and concerned citizens.
Comment: 2018 International Joint Conference on Neural Networks (IJCNN)
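The searchlight idea above, run k-means over one dimension at a time and flag points that land in singleton or very small clusters, can be sketched as follows. This tiny pure-Python k-means and the cost/rate data are illustrative stand-ins, not the paper's pipeline or the New York State dataset.

```python
# Sketch of an iterative k-means searchlight: cluster each column
# separately and report values falling in clusters smaller than min_size.

def kmeans_1d(values, k, iters=20):
    # Seed centers by taking evenly spaced sorted values.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def searchlight_outliers(rows, k=2, min_size=2):
    """Sweep each column; collect values from undersized clusters."""
    flagged = {}
    for col in range(len(rows[0])):
        values = [r[col] for r in rows]
        small = [v for c in kmeans_1d(values, k)
                 if 0 < len(c) < min_size for v in c]
        if small:
            flagged[col] = small
    return flagged

rows = [
    (10, 1.0), (11, 1.1), (12, 0.9),   # typical costs / rates
    (10, 1.0), (11, 1.2), (95, 1.1),   # 95 is a cost overrun
]
print(searchlight_outliers(rows))  # {0: [95]}
```

The paper's system then takes the dimensions that produced outliers (column 0 here) and pairs them with other dimensions via subset scanning to characterize the anomaly further.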
RobustSPAM for Inference from Noisy Longitudinal Data and Preservation of Privacy
The availability of complex temporal datasets in social, health and consumer contexts has driven the development of pattern mining techniques that enable the use of classical machine learning tools for model building. In this work we introduce a robust temporal pattern mining framework for finding predictive patterns in complex, timestamped, multivariate, and noisy data. We design an algorithm, RobustSPAM, that enables mining of temporal patterns from data with noisy timestamps. We apply our algorithm to social care data from a local government body and investigate how the efficiency and accuracy of the method depend on the level of noise. We further explore the trade-off between the loss of predictivity due to perturbation of timestamps and the risk of person re-identification.
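One way to picture robustness to noisy timestamps is pattern support with a gap tolerance: an event sequence still counts as supporting a temporal pattern if each inter-event gap is within some tolerance of the pattern's gap. The sketch below illustrates only that idea; it is not the RobustSPAM algorithm, and the event labels, gaps, and greedy left-to-right matching (which never backtracks over skipped candidates) are simplifying assumptions.

```python
# Illustrative gap-tolerant pattern matching over noisy timestamps.
# A real temporal pattern miner would search candidates far more carefully.

def supports(events, pattern, tol):
    """events: [(time, label)] sorted by time.
    pattern: [(gap_from_previous, label)]; the first gap is ignored.
    Greedy match allowing each gap to deviate by up to tol."""
    i, prev_t = 0, None
    for gap, label in pattern:
        while i < len(events):
            t, lbl = events[i]
            i += 1
            ok_gap = prev_t is None or abs((t - prev_t) - gap) <= tol
            if lbl == label and ok_gap:
                prev_t = t
                break
        else:
            return False
    return True

events = [(0, "visit"), (6, "alert"), (15, "visit"), (21, "fall")]
pattern = [(0, "visit"), (7, "alert"), (14, "fall")]  # gaps in days
print(supports(events, pattern, tol=2))  # True: gaps 6 and 15 are within 2 of 7 and 14
print(supports(events, pattern, tol=0))  # False
```

The trade-off studied in the paper then appears directly: larger timestamp perturbation (for privacy) requires a larger tolerance to keep the same patterns, at some cost in predictive precision.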