69 research outputs found

    Variational Deep Semantic Hashing for Text Documents

    As the amount of textual data has rapidly increased over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods has been utilized, but they often lack the expressiveness and modeling flexibility needed to learn effective representations. Recent advances in deep learning across a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which makes them well suited to text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised, while the second is supervised, utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks, and thus they are capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results demonstrate the effectiveness of the proposed supervised learning models for text hashing. Comment: 11 pages, 4 figures
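To make the idea of similarity search over compact binary codes concrete, here is a minimal sketch. It assumes an encoder has already produced real-valued latent vectors; the median-threshold binarization, the toy vectors, and the function names `binarize`, `hamming`, and `nearest` are illustrative assumptions, not details taken from the abstract.

```python
from statistics import median


def binarize(latents):
    """Turn real-valued latent vectors into binary codes.

    Assumption: each dimension is thresholded at its median over the
    collection, a common heuristic that balances the bits.
    """
    dims = len(latents[0])
    thresholds = [median(row[d] for row in latents) for d in range(dims)]
    return [
        [1 if row[d] > thresholds[d] else 0 for d in range(dims)]
        for row in latents
    ]


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(query_code, codes, k=1):
    """Indices of the k codes closest to the query in Hamming distance."""
    ranked = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
    return ranked[:k]


# Hypothetical latent vectors for four documents (two rough topics).
latents = [
    [0.9, -1.2, 0.3, 2.0],
    [1.1, -0.8, 0.1, 1.7],
    [-0.7, 1.5, -0.4, -1.9],
    [-1.0, 0.9, -0.2, -2.2],
]
codes = binarize(latents)
neighbours = nearest(codes[0], codes, k=2)
```

In this toy data the first two documents receive the same code, so they are each other's nearest neighbours, while the other two documents differ from them in every bit.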

    Graph Perturbation as Noise Graph Addition: A New Perspective for Graph Anonymization

    Different types of data privacy techniques have been applied to graphs and social networks. They have been used under different assumptions about intruders' knowledge, i.e., different assumptions about what can lead to disclosure. The analysis of different methods is also guided by how data protection techniques influence the analysis of the data, i.e., information loss or data utility. One of the techniques proposed for graphs is graph perturbation. Several algorithms have been proposed for this purpose. They proceed by adding or removing edges, although some also consider adding and removing nodes. In this paper we propose to study these graph perturbation techniques from a different perspective. Following the model of standard database perturbation as noise addition, we propose to study graph perturbation as noise graph addition. We think that changing the perspective of graph sanitization in this direction will make it possible to study the properties of perturbed graphs in a more systematic way.
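One way to read "noise graph addition" is sketched below, under stated assumptions: graphs share a node set, addition is taken to be the symmetric difference of edge sets (so a noise edge removes an existing edge and adds a missing one), and the noise graph is drawn by including each possible edge independently with probability `p`. The abstract fixes none of these choices; the names `noise_graph` and `add_graphs` are hypothetical.

```python
import random
from itertools import combinations


def noise_graph(nodes, p, rng):
    """Random graph on `nodes`: each possible edge is kept with probability p."""
    return {frozenset(pair) for pair in combinations(nodes, 2) if rng.random() < p}


def add_graphs(edges_a, edges_b):
    """Assumed graph addition: symmetric difference of the two edge sets."""
    return edges_a ^ edges_b


# A small path graph 0-1-2-3, stored as a set of undirected edges.
original = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}

rng = random.Random(0)  # seeded so the sketch is reproducible
noise = noise_graph(range(4), 0.3, rng)
perturbed = add_graphs(original, noise)

# Under this definition, adding the same noise graph again undoes the perturbation.
recovered = add_graphs(perturbed, noise)
```

Treating the perturbation as an explicit second graph means edge addition and edge removal are handled by a single operation, and the amount of distortion is just the number of edges in the noise graph.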

    De-identifying a public use microdata file from the Canadian national discharge abstract database

    Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records. Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy. Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression. Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
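The abstract does not describe how the re-identification probability was computed, so the following is only a sketch of one common convention: a record's risk is taken to be 1 divided by the number of records sharing its quasi-identifier values. Under that assumption a threshold of 0.05 corresponds to groups of at least 20 records, and 0.04 to groups of at least 25. The field names and the helper names `reidentification_risk` and `flag_for_suppression` are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Per-record risk, assumed to be 1 / (size of the record's equivalence class)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]


def flag_for_suppression(records, quasi_identifiers, threshold=0.05):
    """True for records whose assumed risk exceeds the threshold."""
    risks = reidentification_risk(records, quasi_identifiers)
    return [risk > threshold for risk in risks]


# Hypothetical data: 20 records share one combination, 1 record is unique.
records = [{"age_group": "40-49", "region": "A"} for _ in range(20)]
records.append({"age_group": "90+", "region": "B"})

quasi_identifiers = ["age_group", "region"]
risks = reidentification_risk(records, quasi_identifiers)
flags = flag_for_suppression(records, quasi_identifiers, threshold=0.05)
```

Here the 20 matching records sit exactly at the 0.05 threshold and are kept, while the unique record has a risk of 1.0 and would be a candidate for suppression or generalization.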

    Bone diagenesis: New data from infrared spectroscopy and X-ray diffraction

    This paper combines non-destructive high-resolution Fourier transform infrared spectroscopic techniques (attenuated total reflectance in the mid-infrared - ATR, and diffuse reflectance in the near-infrared - NIR) with X-ray diffraction and Rietveld analysis in the study of bone diagenesis. Sixty fossil bones from two Upper Miocene sites in Greece (Pikermi and Chalkoutsi) and one Upper Pleistocene site in Cyprus (Aghia Napa) are investigated in comparison to various mineral and biological apatites. Diagenetic trends common to all these sites include a subtle but systematic decrease of the unit cell volume and a-axis of carbonate hydroxylapatite, as well as a parallel increase of the coherence length along the c-axis. Chemometric modelling reveals that the changes in the unit cell and the coherence length are highly correlated to (and can be predicted on the basis of) the ATR spectra. Besides using chemometrics as a convenient predictive tool, we have been able to identify that the correlation with the XRD data is primarily based on the intensity of infrared bands at 577, 865 and 1092 cm⁻¹, as well as on the position of the ν1 phosphate mode at ca. 960 cm⁻¹. These structural changes constitute the vibrational signature of diagenesis throughout our set of bone samples and can be accounted for by the stabilization of a distorted CO₃²⁻ species in the B-sites of apatite, and to a lesser extent by the substitution of OH⁻ by F⁻. NIR spectroscopy allowed for the identification of a well-defined H₂O species, absorbing at 5318 and 7240 cm⁻¹. This species is labile, appears to characterize mostly biogenic apatite, and is therefore considered to be chemisorbed on the surface of the crystallites. © 2008