Random projection to preserve patient privacy
With the availability of accessible and widely used cloud services, it is natural that large components of healthcare systems migrate to them; for example, patient databases can be stored and processed in the cloud. Such cloud services provide enhanced flexibility and additional gains, such as availability and ease of data sharing. This trend, however, poses serious threats to the privacy of patients and to the trust that individuals must place in the healthcare system itself. Thus, there is a strong need for privacy preservation, which can be achieved through a variety of approaches. In this paper, we study the application of a random projection-based approach to patient data as a means to achieve two goals: (1) provably mask the identity of users under some adversarial-attack settings, and (2) preserve enough information to allow for aggregate data analysis and the application of machine-learning techniques. As far as we know, such approaches have not been applied and tested on medical data. We analyze the tradeoff between the loss of accuracy in the outcome of machine-learning algorithms and the resilience against an adversary. We show that random projections are strong against known input/output attacks while offering high-quality data, as long as the projected space is smaller than the original space and the amount of leaked data available to the adversary is limited.
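The masking idea the abstract describes can be sketched in a few lines: multiply the patient feature matrix by a random Gaussian matrix that maps into a lower-dimensional space, and release only the projected data. The data, dimensions, and variable names below are illustrative assumptions, not the paper's actual setup; the point is that pairwise distances are approximately preserved (the Johnson-Lindenstrauss property), so aggregate analysis remains possible on the masked data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical patient feature matrix: 100 patients, 50 features each.
n_patients, d, k = 100, 50, 20            # k < d: projected space smaller than original
X = rng.normal(size=(n_patients, d))

# Random Gaussian projection matrix, scaled so squared norms are preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R                                  # masked data released to the cloud

def sq_dists(A):
    """Matrix of pairwise squared Euclidean distances between rows of A."""
    sq = (A * A).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

orig = sq_dists(X)   # distances in the original space
proj = sq_dists(Y)   # distances after projection: approximately equal on average
```

Because the adversary only sees `Y` and (without `R`) cannot invert the many-to-one map back to `X`, the projection acts as a mask, while any distance-based learning algorithm run on `Y` behaves much as it would on `X`.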
High-dimensional classification using features annealed independence rules
Classification using high-dimensional features arises frequently in many contemporary statistical studies, such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina [Bernoulli 10 (2004) 989--1010] show that the Fisher discriminant performs poorly due to diverging spectra, and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as poor as random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as poorly as random guessing. Thus, it is important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistic, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
Comment: Published at http://dx.doi.org/10.1214/07-AOS504 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
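The core idea of the abstract — rank features by the absolute two-sample t-statistic and apply an independence (diagonal-covariance) rule to only the top-ranked ones — can be sketched as follows. This is an illustrative simplification with made-up synthetic data and helper names; the paper's actual procedure, including choosing the number of features from the classification-error bound, is more involved.

```python
import numpy as np

def t_statistics(X, y):
    """Two-sample t-statistic for each feature; classes coded 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    num = X1.mean(axis=0) - X0.mean(axis=0)
    den = np.sqrt(X0.var(axis=0, ddof=1) / n0 + X1.var(axis=0, ddof=1) / n1)
    return num / den

def fair_predict(Xtr, ytr, Xte, m):
    """Independence rule restricted to the m features with largest |t|."""
    keep = np.argsort(np.abs(t_statistics(Xtr, ytr)))[::-1][:m]
    X0, X1 = Xtr[ytr == 0][:, keep], Xtr[ytr == 1][:, keep]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class variance per feature (diagonal covariance only).
    s2 = (((X0 - mu0) ** 2).sum(axis=0) +
          ((X1 - mu1) ** 2).sum(axis=0)) / (len(X0) + len(X1) - 2)
    # Assign each test point to the nearer centroid, coordinates weighted by 1/s2.
    d0 = (((Xte[:, keep] - mu0) ** 2) / s2).sum(axis=1)
    d1 = (((Xte[:, keep] - mu1) ** 2) / s2).sum(axis=1)
    return (d1 < d0).astype(int)

# Synthetic demo: only the first 5 of 200 features carry signal,
# mimicking the "few informative genes" setting of the abstract.
rng = np.random.default_rng(1)
n, d, shift = 100, 200, 1.5
ytr = yte = np.repeat([0, 1], n // 2)
Xtr = rng.normal(size=(n, d)); Xtr[ytr == 1, :5] += shift
Xte = rng.normal(size=(n, d)); Xte[yte == 1, :5] += shift
acc = (fair_predict(Xtr, ytr, Xte, m=5) == yte).mean()
```

Using all 200 features instead of `m=5` degrades accuracy here, which is exactly the noise-accumulation phenomenon the abstract describes.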
Dimension reduction for linear separation with curvilinear distances
High-dimensional data in its original raw form may contain clearly classifiable clusters which are nevertheless difficult to identify in the high-dimensional representation. By reducing the dimensions it may be possible to apply a simple classification technique to extract this cluster information whilst retaining the overall topology of the data set. The supervised method presented here takes a high-dimensional data set consisting of multiple clusters, employs curvilinear distance as the relation between points, and projects into a lower dimension according to this relationship. This representation allows for linear separation of the non-separable high-dimensional cluster data, and the classification to a cluster of any subsequent unseen data point drawn from the same higher-dimensional space.
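The abstract does not spell out its algorithm, but a common way to realize "curvilinear distance" is the geodesic distance along a nearest-neighbour graph, followed by a distance-preserving embedding (the Isomap/CDA family). The sketch below is that generic recipe, offered only as an illustration of the idea, not as the paper's method; the spiral data and function names are assumptions.

```python
import numpy as np

def curvilinear_distances(X, k=4):
    """Curvilinear (geodesic) distances: Euclidean edge lengths along a
    k-nearest-neighbour graph, with shortest paths via Floyd-Warshall."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:   # edges to the k nearest neighbours
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                         # Floyd-Warshall all-pairs shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

def embed(G, dim=2):
    """Classical MDS on the distance matrix -> low-dimensional projection."""
    n = len(G)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J               # double-centred squared distances
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Demo: points along a 2-D spiral, whose along-curve (curvilinear) distances
# differ sharply from straight-line distances.
t = np.linspace(0.5, 4 * np.pi, 40)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
G = curvilinear_distances(X, k=4)
Y = embed(G, dim=2)
```

In this low-dimensional representation, clusters that wind around one another in the original space are pulled apart along the curve, which is what makes a simple linear separator sufficient afterwards.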