
    Multimapper: Data Density Sensitive Topological Visualization

    Mapper is an algorithm that summarizes the topological information contained in a dataset and provides an insightful visualization. It takes as input a point cloud, which is possibly high-dimensional, a filter function on it, and an open cover of the range of the function. It returns the nerve simplicial complex of the pullback of the cover. Mapper can be considered a discrete approximation of the topological construct called the Reeb space, as analysed in the 1-dimensional case by [Carriere et al., 2018]. Despite its success in obtaining insights in various fields, such as in [Kamruzzaman et al., 2016], Mapper is an ad hoc technique requiring lots of parameter tuning. There is also no measure to quantify the goodness of the resulting visualization, which often deviates from the Reeb space in practice. In this paper, we introduce a new cover selection scheme for data that reduces the obscuration of topological information at both the computation and visualization steps. To achieve this, we replace global scale selection of the cover with a scale selection scheme sensitive to the local density of data points. We also propose a method to detect some deviations of Mapper from the Reeb space via computation of persistence features on the Mapper graph. (Comment: Accepted at ICDM)
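    As background, the pipeline the abstract describes can be sketched roughly as follows. This is a minimal toy version of the classic Mapper construction, not the authors' density-sensitive cover selection; the height filter, uniform interval cover, and single-linkage clustering are illustrative assumptions.

```python
import itertools
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def mapper_graph(X, f, n_intervals=10, overlap=0.25, cut=0.3):
    """Toy Mapper: cover the filter range with overlapping intervals,
    cluster the preimage of each interval, and connect any two clusters
    that share a data point (1-skeleton of the nerve)."""
    lo, hi = f.min(), f.max()
    step = (hi - lo) / n_intervals
    nodes = []  # each node is a set of point indices
    for i in range(n_intervals):
        a = lo + i * step - overlap * step
        b = lo + (i + 1) * step + overlap * step
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx) == 1:
            nodes.append(set(idx))
            continue
        if len(idx) < 2:
            continue
        labels = fcluster(linkage(X[idx], method="single"),
                          t=cut, criterion="distance")
        for lab in np.unique(labels):
            nodes.append(set(idx[labels == lab]))
    edges = {(u, v) for u, v in itertools.combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges

# Example: height function on a noisy circle; the output graph is cycle-like.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(500, 2))
nodes, edges = mapper_graph(X, X[:, 1])
print(len(nodes), "nodes,", len(edges), "edges")
```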

    Publishing and sharing sensitive data

    Sensitive data has often been excluded from discussions about data publication and sharing. It was believed that sharing sensitive data is not ethical or that it is too difficult to do safely. This opinion has changed with greater understanding and use of methods to ‘de-sensitise’ (i.e., confidentialise) data, that is, to modify the data to remove information so that participants or subjects are no longer identifiable, and with the capacity to grant ‘conditional access’ to data. Requirements from publishers and funding bodies for researchers to publish and share their data have also seen sensitive data sharing increase. This Guide outlines best practice for the publication and sharing of sensitive research data in the Australian context. The Guide follows the sequence of steps that are necessary for publishing and sharing sensitive data, as outlined in the ‘Publishing and Sharing Sensitive Data Decision Tree’, and provides the detail and context for the steps in that Decision Tree. References for further reading are provided for those who are interested. By following the sections below, and the steps within them, you will be able to make clear, lawful, and ethical decisions about sharing your data safely. It can be done in most cases!

    How the Guide interacts with your institutional policies

    This Guide is not intended to override institutional policies on data management or publication. Most researchers operate within the policies of their institution and/or funding arrangement and must, therefore, ensure their decisions about data publication align with these policies. This is particularly relevant for Intellectual Property and, sometimes, for your classification of sensitive data (e.g., the NSW Government Department of Environment & Heritage’s Sensitive Data Species Policy) or your selection of a data repository. The Guide indicates the steps at which you should check your institutional policies.

    Simple data-driven context-sensitive lemmatization

    Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning and automatically induce the class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. An SES describes the transformations that have to be applied to the input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
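    A rough sketch of the core idea described above: compute an edit script between the reversed word form and the reversed lemma and reuse that script as a class label. The difflib-based script representation and the toy plural/singular training pairs are illustrative assumptions, not the authors' implementation.

```python
import difflib

def shortest_edit_script(form: str, lemma: str) -> tuple:
    """Edit script turning the reversed form into the reversed lemma.
    Reversing keeps the (usually suffixal) changes near the string start,
    so many word forms share the same script, which makes it a usable label."""
    src, dst = form[::-1], lemma[::-1]
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=dst).get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, src[i1:i2], dst[j1:j2]))
    return tuple(ops)

def apply_script(form: str, script: tuple) -> str:
    """Apply a previously induced edit script to a new word form."""
    s = form[::-1]
    # Apply edits right-to-left so earlier offsets stay valid.
    for tag, i, old, new in sorted(script, key=lambda op: -op[1]):
        s = s[:i] + new + s[i + len(old):]
    return s[::-1]

# Toy example: a script learned from one plural noun generalizes to others.
script = shortest_edit_script("dogs", "dog")   # delete the trailing "s"
print(script)
print(apply_script("cats", script))            # -> "cat"
print(apply_script("tables", script))          # -> "table"
```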

    Optimal Las Vegas Locality Sensitive Data Structures

    We show that approximate similarity (near neighbour) search can be solved in high dimensions with performance matching state-of-the-art (data-independent) Locality Sensitive Hashing, but with a guarantee of no false negatives. Specifically, we give two data structures for common problems. For $c$-approximate near neighbour in Hamming space we get query time $dn^{1/c+o(1)}$ and space $dn^{1+1/c+o(1)}$, matching that of \cite{indyk1998approximate} and answering a long-standing open question from \cite{indyk2000dimensionality} and \cite{pagh2016locality} in the affirmative. By means of a new deterministic reduction from $\ell_1$ to Hamming we also solve $\ell_1$ and $\ell_2$ with query time $d^2 n^{1/c+o(1)}$ and space $d^2 n^{1+1/c+o(1)}$. For $(s_1,s_2)$-approximate Jaccard similarity we get query time $dn^{\rho+o(1)}$ and space $dn^{1+\rho+o(1)}$, where $\rho=\log\frac{1+s_1}{2s_1}\big/\log\frac{1+s_2}{2s_2}$, when sets have equal size, matching the performance of \cite{tobias2016}. The algorithms are based on space partitions, as with classic LSH, but we construct these using a combination of brute force, tensoring, perfect hashing and splitter functions à la \cite{naor1995splitters}. We also show a new dimensionality reduction lemma with 1-sided error.
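    For contrast with the paper's Las Vegas guarantee, here is a minimal sketch of classical data-independent bit-sampling LSH for Hamming space, the randomized baseline that can produce false negatives. The table count, bits per table, and candidate-verification loop are illustrative assumptions, not the construction in the paper.

```python
import random

def build_hamming_lsh(points, n_tables=12, bits_per_table=10, seed=0):
    """Classic bit-sampling LSH: each table hashes a binary vector by the
    bits at a random subset of coordinates."""
    rng = random.Random(seed)
    d = len(points[0])
    tables = []
    for _ in range(n_tables):
        coords = rng.sample(range(d), bits_per_table)
        buckets = {}
        for idx, p in enumerate(points):
            buckets.setdefault(tuple(p[c] for c in coords), []).append(idx)
        tables.append((coords, buckets))
    return tables

def query(tables, points, q, r):
    """Return stored points within Hamming distance r of q that collide with
    it in some table; near points can be missed (false negatives)."""
    seen, out = set(), []
    for coords, buckets in tables:
        for idx in buckets.get(tuple(q[c] for c in coords), []):
            if idx not in seen:
                seen.add(idx)
                if sum(a != b for a, b in zip(points[idx], q)) <= r:
                    out.append(idx)
    return out

# Example on random 32-bit vectors.
rng = random.Random(1)
pts = [[rng.randint(0, 1) for _ in range(32)] for _ in range(1000)]
tables = build_hamming_lsh(pts)
q = list(pts[0])
q[3] ^= 1                       # a query at Hamming distance 1 from pts[0]
print(query(tables, pts, q, r=2))   # typically finds index 0
```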

    Crime applications and social machines: crowdsourcing sensitive data

    The authors explore some issues with the United Kingdom (U.K.) crime reporting and recording systems, which currently produce Open Crime Data. The availability of Open Crime Data seems to create a potential data ecosystem which would encourage crowdsourcing, or the creation of social machines, in order to counter some of these issues. While such solutions are enticing, we suggest that the theoretical solution in fact brings to light fairly compelling problems, which highlight some limitations of crowdsourcing as a means of addressing Berners-Lee’s “social constraint.” The authors present a thought experiment, a Gedankenexperiment, in order to explore the implications, both good and bad, of a social machine in such a sensitive space, and suggest a Web Science perspective to pick apart the ramifications of this thought experiment as a theoretical approach to the characterisation of social machines.

    Knowing Your Population: Privacy-Sensitive Mining of Massive Data

    Location and mobility patterns of individuals are important to environmental planning, societal resilience, public health, and a host of commercial applications. Mining telecommunication traffic and transaction data for such purposes is controversial, in particular raising issues of privacy. However, our hypothesis is that privacy-sensitive uses are possible and often beneficial enough to warrant considerable research and development efforts. Our work contends that people's behavior can yield patterns of significant commercial and research value. For such purposes, methods and algorithms for mining telecommunication data to extract commonly used routes and locations, articulated through time-geographical constructs, are described in a case study within the area of transportation planning and analysis. From the outset, these were designed to balance the privacy of subscribers against the added value of mobility patterns derived from their mobile communication traffic and transaction data. Our work directly contrasts with the current, commonly held notion that value can only be added to services by directly monitoring the behavior of individuals, as in current attempts at location-based services. We position our work within relevant legal frameworks for privacy and data protection, and show that our methods comply with such requirements and also follow best practice.

    Redrawing the Boundaries on Purchasing Data from Privacy-Sensitive Individuals

    We prove new positive and negative results concerning the existence of truthful and individually rational mechanisms for purchasing private data from individuals with unbounded and sensitive privacy preferences. We strengthen the impossibility results of Ghosh and Roth (EC 2011) by extending them to a much wider class of privacy valuations. In particular, these include privacy valuations that are based on $(\epsilon, \delta)$-differentially private mechanisms for non-zero $\delta$, ones where the privacy costs are measured in a per-database manner (rather than taking the worst case), and ones that do not depend on the payments made to players (which might not be observable to an adversary). To bypass this impossibility result, we study a natural special setting where individuals have monotonic privacy valuations, which captures common contexts where certain values for private data are expected to lead to higher valuations for privacy (e.g., having a particular disease). We give new mechanisms that are individually rational for all players with monotonic privacy valuations, truthful for all players whose privacy valuations are not too large, and accurate if there are not too many players with too-large privacy valuations. We also prove matching lower bounds showing that in some respects our mechanism cannot be improved significantly.

    Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing

    We prove a tight lower bound for the exponent $\rho$ for data-dependent Locality-Sensitive Hashing schemes, recently used to design efficient solutions for $c$-approximate nearest neighbor search. In particular, our lower bound matches the bound of $\rho \le \frac{1}{2c-1}+o(1)$ for the $\ell_1$ space, obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15]. In recent years it emerged that data-dependent hashing is strictly superior to classical Locality-Sensitive Hashing, where the hash function is data-independent. In the latter setting, the best exponent has long been known: for the $\ell_1$ space, the tight bound is $\rho = 1/c$, with the upper bound from [Indyk-Motwani, STOC'98] and the matching lower bound from [O'Donnell-Wu-Zhou, ITCS'11]. We prove that, even if the hashing is data-dependent, it must hold that $\rho \ge \frac{1}{2c-1}-o(1)$. To prove the result, we need to formalize the exact notion of data-dependent hashing that also captures the complexity of the hash functions (in addition to their collision properties). Without restricting such complexity, we would allow for obviously infeasible solutions such as the Voronoi diagram of a dataset. To preclude such solutions, we require our hash functions to be succinct. This condition is satisfied by all the known algorithmic results. (Comment: 16 pages, no figures)
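    To make the stated gap concrete, a small arithmetic check of the two exponents for a few values of $c$, assuming only the bounds quoted above ($\rho = 1/c$ for data-independent hashing, $\rho = 1/(2c-1)$ for data-dependent hashing).

```python
# Query-time exponents for c-approximate near neighbour in l_1:
# data-independent LSH achieves rho = 1/c (tight), while data-dependent
# hashing is tight at rho = 1/(2c - 1), a strictly smaller exponent for c > 1.
for c in (1.5, 2.0, 3.0, 4.0):
    rho_indep = 1.0 / c
    rho_dep = 1.0 / (2.0 * c - 1.0)
    print(f"c = {c}: data-independent rho = {rho_indep:.3f}, "
          f"data-dependent rho = {rho_dep:.3f}")
```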