3,352 research outputs found
Utilizing Data Mining Techniques and Ensemble Learning to Predict Development of Surgical Site Infections in Gynecologic Cancer Patients
Surgical site infections are costly to both patients and hospitals, increase patient mortality, and are the most common form of a hospital acquired infection. Gynecological cancer surgery patients are already at higher risk of developing an infection due to the suppression of their immune system. This research leverages popular data mining techniques to create a prediction model to identify high risk patients. Implemented techniques include logistic regression, naive Bayes, recursive partitioning and regression trees, random forest, feed forward neural network, k-nearest neighbor, and support vector machines with linear kernel. Weighted stacked generalization was implemented to improve upon the individual base level model’s performance. The chosen meta level classifiers were support vector machines with linear kernel, logistic regression, and k-nearest neighbor. The result is a model that identifies high-risk patients immediately following a surgical procedure with an AUC of 0.6864, accuracy of 0.6744, sensitivity of 0.7, and specificity of 0.6728
Multiclass Data Segmentation using Diffuse Interface Methods on Graphs
We present two graph-based algorithms for multiclass segmentation of
high-dimensional data. The algorithms use a diffuse interface model based on
the Ginzburg-Landau functional, related to total variation compressed sensing
and image processing. A multiclass extension is introduced using the Gibbs
simplex, with the functional's double-well potential modified to handle the
multiclass case. The first algorithm minimizes the functional using a convex
splitting numerical scheme. The second algorithm is a uses a graph adaptation
of the classical numerical Merriman-Bence-Osher (MBO) scheme, which alternates
between diffusion and thresholding. We demonstrate the performance of both
algorithms experimentally on synthetic data, grayscale and color images, and
several benchmark data sets such as MNIST, COIL and WebKB. We also make use of
fast numerical solvers for finding the eigenvectors and eigenvalues of the
graph Laplacian, and take advantage of the sparsity of the matrix. Experiments
indicate that the results are competitive with or better than the current
state-of-the-art multiclass segmentation algorithms.Comment: 14 page
Ranking Median Regression: Learning to Order through Local Consensus
This article is devoted to the problem of predicting the value taken by a
random permutation , describing the preferences of an individual over a
set of numbered items say, based on the observation of
an input/explanatory r.v. e.g. characteristics of the individual), when
error is measured by the Kendall distance. In the probabilistic
formulation of the 'Learning to Order' problem we propose, which extends the
framework for statistical Kemeny ranking aggregation developped in
\citet{CKS17}, this boils down to recovering conditional Kemeny medians of
given from i.i.d. training examples . For this reason, this statistical learning problem is
referred to as \textit{ranking median regression} here. Our contribution is
twofold. We first propose a probabilistic theory of ranking median regression:
the set of optimal elements is characterized, the performance of empirical risk
minimizers is investigated in this context and situations where fast learning
rates can be achieved are also exhibited. Next we introduce the concept of
local consensus/median, in order to derive efficient methods for ranking median
regression. The major advantage of this local learning approach lies in its
close connection with the widely studied Kemeny aggregation problem. From an
algorithmic perspective, this permits to build predictive rules for ranking
median regression by implementing efficient techniques for (approximate) Kemeny
median computations at a local level in a tractable manner. In particular,
versions of -nearest neighbor and tree-based methods, tailored to ranking
median regression, are investigated. Accuracy of piecewise constant ranking
median regression rules is studied under a specific smoothness assumption for
's conditional distribution given
Representing complex data using localized principal components with application to astronomical data
Often the relation between the variables constituting a multivariate data
space might be characterized by one or more of the terms: ``nonlinear'',
``branched'', ``disconnected'', ``bended'', ``curved'', ``heterogeneous'', or,
more general, ``complex''. In these cases, simple principal component analysis
(PCA) as a tool for dimension reduction can fail badly. Of the many alternative
approaches proposed so far, local approximations of PCA are among the most
promising. This paper will give a short review of localized versions of PCA,
focusing on local principal curves and local partitioning algorithms.
Furthermore we discuss projections other than the local principal components.
When performing local dimension reduction for regression or classification
problems it is important to focus not only on the manifold structure of the
covariates, but also on the response variable(s). Local principal components
only achieve the former, whereas localized regression approaches concentrate on
the latter. Local projection directions derived from the partial least squares
(PLS) algorithm offer an interesting trade-off between these two objectives. We
apply these methods to several real data sets. In particular, we consider
simulated astrophysical data from the future Galactic survey mission Gaia.Comment: 25 pages. In "Principal Manifolds for Data Visualization and
Dimension Reduction", A. Gorban, B. Kegl, D. Wunsch, and A. Zinovyev (eds),
Lecture Notes in Computational Science and Engineering, Springer, 2007, pp.
180--204,
http://www.springer.com/dal/home/generic/search/results?SGWID=1-40109-22-173750210-
The New Software Package for Dynamic Hierarchical Clustering for Circles Types of Shapes
In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in
large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of
methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods
for clustering mixed numerical and categorical data in large databases. One of the most accuracy approach
based on dynamic modeling of cluster similarity is called Chameleon. In this paper we present a modified
hierarchical clustering algorithm that used the main idea of Chameleon and the effectiveness of suggested
approach will be demonstrated by the experimental results
- …