Multimapper: Data Density Sensitive Topological Visualization
Mapper is an algorithm that summarizes the topological information contained
in a dataset and provides an insightful visualization. It takes as input a
point cloud which is possibly high-dimensional, a filter function on it and an
open cover on the range of the function. It returns the nerve simplicial
complex of the pullback of the cover. Mapper can be considered a discrete
approximation of the topological construct called Reeb space, as analysed in
the one-dimensional case by [Carriere et al., 2018]. Despite its success in
obtaining insights in various fields such as in [Kamruzzaman et al., 2016],
Mapper is an ad hoc technique requiring lots of parameter tuning. There is also
no measure to quantify goodness of the resulting visualization, which often
deviates from the Reeb space in practice. In this paper, we introduce a new
cover selection scheme for data that reduces the obscuration of topological
information at both the computation and visualisation steps. To achieve this,
we replace global scale selection of cover with a scale selection scheme
sensitive to local density of data points. We also propose a method to detect
some deviations in Mapper from Reeb space via computation of persistence
features on the Mapper graph.
Comment: Accepted at ICDM
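To make the construction concrete, the following is a minimal sketch of the standard Mapper pipeline (not the Multimapper cover-selection scheme proposed here). The one-dimensional filter, the globally scaled interval cover, the fixed clustering threshold eps and the toy data are illustrative assumptions, not parameters from the paper.

# Minimal Mapper sketch: nerve of the pullback of an overlapping interval cover.
# Global scale selection (one n_intervals/overlap/eps for the whole dataset) is
# exactly what the paper argues against; it is used here only for brevity.
import itertools
import numpy as np
import networkx as nx

def cover_intervals(fmin, fmax, n_intervals=5, overlap=0.3):
    # Overlapping intervals covering the range of the filter function.
    length = (fmax - fmin) / n_intervals
    pad = length * overlap
    return [(fmin + i * length - pad, fmin + (i + 1) * length + pad)
            for i in range(n_intervals)]

def eps_clusters(points, idx, eps):
    # Connected components of the eps-neighbourhood graph restricted to idx.
    g = nx.Graph()
    g.add_nodes_from(idx)
    for a, b in itertools.combinations(idx, 2):
        if np.linalg.norm(points[a] - points[b]) <= eps:
            g.add_edge(a, b)
    return [frozenset(c) for c in nx.connected_components(g)]

def mapper_graph(points, filter_values, n_intervals=5, overlap=0.3, eps=0.5):
    # Nodes are clusters inside each preimage; edges connect clusters that
    # share data points (the 1-skeleton of the nerve simplicial complex).
    clusters = []
    for lo, hi in cover_intervals(filter_values.min(), filter_values.max(),
                                  n_intervals, overlap):
        preimage = [i for i, f in enumerate(filter_values) if lo <= f <= hi]
        clusters.extend(eps_clusters(points, preimage, eps))
    graph = nx.Graph()
    graph.add_nodes_from(range(len(clusters)))
    for (i, a), (j, b) in itertools.combinations(enumerate(clusters), 2):
        if a & b:
            graph.add_edge(i, j)
    return graph, clusters

# Toy usage: a noisy circle filtered by its x-coordinate typically yields a
# cycle-shaped Mapper graph.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
g, _ = mapper_graph(pts, pts[:, 0])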
Publishing and sharing sensitive data
Sensitive data has often been excluded from discussions about data publication and sharing. It was believed that sharing sensitive data was unethical or too difficult to do safely. This opinion has changed with greater understanding and use of methods to ‘de-sensitise’ (i.e., confidentialise) data, that is, to modify the data so that participants or subjects are no longer identifiable, and with the capacity to grant ‘conditional access’ to data. Requirements from publishers and funding bodies for researchers to publish and share their data have also seen sensitive data sharing increase.
This guide outlines best practice for the publication and sharing of sensitive research data in the Australian context. The Guide follows the sequence of steps that are necessary for publishing and sharing sensitive data, as outlined in the ‘Publishing and Sharing Sensitive Data Decision Tree’, and provides the detail and context for the steps in that Decision Tree. References for further reading are provided for those who are interested.
By following the sections below, and steps within, you will be able to make clear, lawful, and ethical decisions about sharing your data safely. It can be done in most cases!
How the Guide interacts with your institutional policies
This Guide is not intended to override institutional policies on data management or publication. Most researchers operate within the policies of their institution and/or funding arrangement and must, therefore, ensure their decisions about data publication align with these policies. This is particularly relevant for Intellectual Property and, sometimes, your classification of sensitive data (e.g., NSW Government Department of Environment & Heritage, Sensitive Data Species Policy) or selection of data repository. The Guide indicates the steps at which you should check your institutional policies.
Simple data-driven context-sensitive lemmatization
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. An SES describes the transformations that have to be applied to the
input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
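A minimal sketch of the label-induction step described above, with the classifier itself omitted: the edit script is computed between the reversed word form and the reversed lemma so that suffix changes generalise across stems. The use of Python's difflib (which approximates rather than guarantees a shortest script) and the toy English pair are illustrative assumptions, not the authors' implementation.

import difflib

def edit_script(word, lemma):
    # Edit-script label mapping the REVERSED word form to the REVERSED lemma;
    # difflib is a stand-in for a true shortest-edit-script algorithm.
    w, l = word[::-1], lemma[::-1]
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, w, l).get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, l[j1:j2]))
    return tuple(ops)

def apply_script(word, script):
    # Apply an induced script to a new (reversed) word form.
    w = list(word[::-1])
    for tag, i1, i2, repl in reversed(script):  # right to left keeps offsets valid
        w[i1:i2] = repl
    return "".join(w)[::-1]

# A label induced from one training pair transfers to unseen forms that share
# the same inflectional pattern; a context-sensitive classifier would choose
# among such labels for each token in running text.
label = edit_script("walked", "walk")   # effectively: strip the "-ed" suffix
print(apply_script("talked", label))    # -> "talk"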
Optimal Las Vegas Locality Sensitive Data Structures
We show that approximate similarity (near neighbour) search can be solved in
high dimensions with performance matching state of the art (data independent)
Locality Sensitive Hashing, but with a guarantee of no false negatives.
Specifically, we give two data structures for common problems.
For $c$-approximate near neighbour in Hamming space we get query time
$dn^{1/c+o(1)}$ and space $dn + n^{1+1/c+o(1)}$, matching that of
\cite{indyk1998approximate} and answering a long standing open question
from~\cite{indyk2000dimensionality} and~\cite{pagh2016locality} in the
affirmative.
By means of a new deterministic reduction from $\ell_2$ to Hamming we also
solve $\ell_2$ and $\ell_1$ with query time $dn^{1/c+o(1)}$ and space $dn + n^{1+1/c+o(1)}$.
For $(s_1, s_2)$-approximate Jaccard similarity we get query time
$dn^{\rho+o(1)}$ and space $dn + n^{1+\rho+o(1)}$, where
$\rho = \log(1/s_1)/\log(1/s_2)$, when sets have equal
size, matching the performance of~\cite{tobias2016}.
The algorithms are based on space partitions, as with classic LSH, but we
construct these using a combination of brute force, tensoring, perfect hashing
and splitter functions \`a la~\cite{naor1995splitters}. We also show a new
dimensionality reduction lemma with 1-sided error.
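For concreteness, plugging illustrative values (not taken from the paper) into the exponents above:

\[
c = 2:\quad \rho = \tfrac{1}{c} = 0.5, \qquad
\text{query time } dn^{0.5+o(1)}, \quad \text{space } dn + n^{1.5+o(1)},
\]
\[
(s_1, s_2) = (0.5,\, 0.25):\quad
\rho = \frac{\log(1/s_1)}{\log(1/s_2)} = \frac{\log 2}{\log 4} = 0.5,
\]

the same polynomial costs as classic Monte Carlo LSH, but here without false negatives.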
Crime applications and social machines: crowdsourcing sensitive data
The authors explore some issues with the United Kingdom (U.K.) crime reporting and recording systems which currently produce Open Crime Data. The availability of Open Crime Data seems to create a potential data ecosystem which would encourage crowdsourcing, or the creation of social machines, in order to counter some of these issues. While such solutions are enticing, we suggest that in fact the theoretical solution brings to light fairly compelling problems, which highlight some limitations of crowdsourcing as a means of addressing Berners-Lee’s “social constraint.” The authors present a thought experiment – a Gedankenexperiment – in order to explore the implications, both good and bad, of a social machine in such a sensitive space, and suggest a Web Science perspective to pick apart the ramifications of this thought experiment as a theoretical approach to the characterisation of social machines.
Knowing Your Population: Privacy-Sensitive Mining of Massive Data
Location and mobility patterns of individuals are important to environmental
planning, societal resilience, public health, and a host of commercial
applications. Mining telecommunication traffic and transactions data for such
purposes is controversial, in particular raising issues of privacy. However,
our hypothesis is that privacy-sensitive uses are possible and often beneficial
enough to warrant considerable research and development efforts. Our work
contends that people's behavior can yield patterns of significant
commercial and research value. For such purposes, methods and algorithms for
mining telecommunication data to extract commonly used routes and locations,
articulated through time-geographical constructs, are described in a case study
within the area of transportation planning and analysis. From the outset, these
were designed to balance the privacy of subscribers and the added value of
mobility patterns derived from their mobile communication traffic and
transactions data. Our work directly contrasts the current, commonly held
notion that value can only be added to services by directly monitoring the
behavior of individuals, such as in current attempts at location-based
services. We position our work within relevant legal frameworks for privacy and
data protection, and show that our methods comply with such requirements and
also follow best practice.
Redrawing the Boundaries on Purchasing Data from Privacy-Sensitive Individuals
We prove new positive and negative results concerning the existence of
truthful and individually rational mechanisms for purchasing private data from
individuals with unbounded and sensitive privacy preferences. We strengthen the
impossibility results of Ghosh and Roth (EC 2011) by extending them to a much
wider class of privacy valuations. In particular, these include privacy
valuations that are based on ({\epsilon}, {\delta})-differentially private
mechanisms for non-zero {\delta}, ones where the privacy costs are measured in
a per-database manner (rather than taking the worst case), and ones that do not
depend on the payments made to players (which might not be observable to an
adversary). To bypass this impossibility result, we study a natural special
setting where individuals have monotonic privacy valuations, which captures
common contexts where certain values for private data are expected to lead to
higher valuations for privacy (e.g. having a particular disease). We give new
mechanisms that are individually rational for all players with monotonic
privacy valuations, truthful for all players whose privacy valuations are not
too large, and accurate if there are not too many players with too-large
privacy valuations. We also prove matching lower bounds showing that in some
respects our mechanism cannot be improved significantly.
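For reference, the notion invoked above is the standard textbook definition (not a contribution of this paper): a mechanism $M$ is $(\epsilon, \delta)$-differentially private if for all neighbouring databases $D, D'$ and all sets $S$ of outputs,

\[
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \Pr[M(D') \in S] + \delta,
\]

and the strengthened impossibility results concern privacy valuations derived from such mechanisms even when $\delta > 0$.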
Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing
We prove a tight lower bound for the exponent $\rho$ for data-dependent
Locality-Sensitive Hashing schemes, recently used to design efficient solutions
for the $c$-approximate nearest neighbor search. In particular, our lower bound
matches the bound of $\rho = \frac{1}{2c-1} + o(1)$ for the $\ell_1$ space,
obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15].
In recent years it emerged that data-dependent hashing is strictly superior
to the classical Locality-Sensitive Hashing, when the hash function is
data-independent. In the latter setting, the best exponent has already been
known: for the $\ell_1$ space, the tight bound is $\rho = 1/c$, with the upper
bound from [Indyk-Motwani, STOC'98] and the matching lower bound from
[O'Donnell-Wu-Zhou, ITCS'11].
We prove that, even if the hashing is data-dependent, it must hold that
$\rho \ge \frac{1}{2c-1} - o(1)$. To prove the result, we need to formalize the
exact notion of data-dependent hashing that also captures the complexity of the
hash functions (in addition to their collision properties). Without restricting
such complexity, we would allow for obviously infeasible solutions such as the
Voronoi diagram of a dataset. To preclude such solutions, we require our hash
functions to be succinct. This condition is satisfied by all the known
algorithmic results.
Comment: 16 pages, no figures
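To make the gap concrete, for an illustrative approximation factor $c = 2$ (value not from the paper) in the $\ell_1$ setting:

\[
\rho_{\text{data-independent}} = \frac{1}{c} = 0.5,
\qquad
\rho_{\text{data-dependent}} = \frac{1}{2c-1} = \frac{1}{3},
\]

so data-dependent hashing gives a polynomially better query exponent, and the lower bound above shows that, for succinct hash functions, this exponent cannot be improved beyond $o(1)$ terms.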
