Multimapper: Data Density Sensitive Topological Visualization
Mapper is an algorithm that summarizes the topological information contained
in a dataset and provides an insightful visualization. It takes as input a
point cloud which is possibly high-dimensional, a filter function on it and an
open cover on the range of the function. It returns the nerve simplicial
complex of the pullback of the cover. Mapper can be considered a discrete
approximation of the topological construct called Reeb space, as analysed in
the one-dimensional case by [Carriere et al., 2018]. Despite its success in
obtaining insights in various fields such as in [Kamruzzaman et al., 2016],
Mapper is an ad hoc technique requiring lots of parameter tuning. There is also
no measure to quantify goodness of the resulting visualization, which often
deviates from the Reeb space in practice. In this paper, we introduce a new
cover selection scheme for data that reduces the obscuration of topological
information at both the computation and visualisation steps. To achieve this,
we replace global scale selection of cover with a scale selection scheme
sensitive to local density of data points. We also propose a method to detect
some deviations in Mapper from Reeb space via computation of persistence
features on the Mapper graph.
Comment: Accepted at ICDM
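To make the construction concrete, the following is a minimal sketch of the standard Mapper pipeline (not the Multimapper cover-selection scheme proposed here). The one-dimensional filter, the globally scaled interval cover, the fixed clustering threshold eps and the toy data are illustrative assumptions, not parameters from the paper.

# Minimal Mapper sketch: nerve of the pullback of an overlapping interval cover.
# Global scale selection (one n_intervals/overlap/eps for the whole dataset) is
# exactly what the paper argues against; it is used here only for brevity.
import itertools
import numpy as np
import networkx as nx

def cover_intervals(fmin, fmax, n_intervals=5, overlap=0.3):
    # Overlapping intervals covering the range of the filter function.
    length = (fmax - fmin) / n_intervals
    pad = length * overlap
    return [(fmin + i * length - pad, fmin + (i + 1) * length + pad)
            for i in range(n_intervals)]

def eps_clusters(points, idx, eps):
    # Connected components of the eps-neighbourhood graph restricted to idx.
    g = nx.Graph()
    g.add_nodes_from(idx)
    for a, b in itertools.combinations(idx, 2):
        if np.linalg.norm(points[a] - points[b]) <= eps:
            g.add_edge(a, b)
    return [frozenset(c) for c in nx.connected_components(g)]

def mapper_graph(points, filter_values, n_intervals=5, overlap=0.3, eps=0.5):
    # Nodes are clusters inside each preimage; edges connect clusters that
    # share data points (the 1-skeleton of the nerve simplicial complex).
    clusters = []
    for lo, hi in cover_intervals(filter_values.min(), filter_values.max(),
                                  n_intervals, overlap):
        preimage = [i for i, f in enumerate(filter_values) if lo <= f <= hi]
        clusters.extend(eps_clusters(points, preimage, eps))
    graph = nx.Graph()
    graph.add_nodes_from(range(len(clusters)))
    for (i, a), (j, b) in itertools.combinations(enumerate(clusters), 2):
        if a & b:
            graph.add_edge(i, j)
    return graph, clusters

# Toy usage: a noisy circle filtered by its x-coordinate typically yields a
# cycle-shaped Mapper graph.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
g, _ = mapper_graph(pts, pts[:, 0])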
Publishing and sharing sensitive data
Sensitive data has often been excluded from discussions about data publication and sharing. It was believed that sharing sensitive data was unethical or too difficult to do safely. This opinion has changed with greater understanding and use of methods to ‘de-sensitise’ (i.e., confidentialise) data, that is, to modify the data so that participants or subjects are no longer identifiable, and with the capacity to grant ‘conditional access’ to data. Requirements from publishers and funding bodies for researchers to publish and share their data have also seen sensitive data sharing increase.
This guide outlines best practice for the publication and sharing of sensitive research data in the Australian context. The Guide follows the sequence of steps that are necessary for publishing and sharing sensitive data, as outlined in the ‘Publishing and Sharing Sensitive Data Decision Tree’, and provides the detail and context for the steps in that Decision Tree. References for further reading are provided for those who are interested.
By following the sections below, and steps within, you will be able to make clear, lawful, and ethical decisions about sharing your data safely. It can be done in most cases!
How the Guide interacts with your institutional policies
This Guide is not intended to override institutional policies on data management or publication. Most researchers operate within the policies of their institution and/or funding arrangement and must, therefore, ensure their decisions about data publication align with these policies. This is particularly relevant for Intellectual Property and, sometimes, your classification of sensitive data (e.g., NSW Government Department of Environment & Heritage, Sensitive Data Species Policy) or selection of data repository. The Guide indicates the steps at which you should check your institutional policies.
Simple data-driven context-sensitive lemmatization
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. An SES describes the transformations that have to be applied to the
input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
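A minimal sketch of the label-induction step described above, with the classifier itself omitted: the edit script is computed between the reversed word form and the reversed lemma so that suffix changes generalise across stems. The use of Python's difflib (which approximates rather than guarantees a shortest script) and the toy English pair are illustrative assumptions, not the authors' implementation.

import difflib

def edit_script(word, lemma):
    # Edit-script label mapping the REVERSED word form to the REVERSED lemma;
    # difflib is a stand-in for a true shortest-edit-script algorithm.
    w, l = word[::-1], lemma[::-1]
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, w, l).get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, l[j1:j2]))
    return tuple(ops)

def apply_script(word, script):
    # Apply an induced script to a new (reversed) word form.
    w = list(word[::-1])
    for tag, i1, i2, repl in reversed(script):  # right to left keeps offsets valid
        w[i1:i2] = repl
    return "".join(w)[::-1]

# A label induced from one training pair transfers to unseen forms that share
# the same inflectional pattern; a context-sensitive classifier would choose
# among such labels for each token in running text.
label = edit_script("walked", "walk")   # effectively: strip the "-ed" suffix
print(apply_script("talked", label))    # -> "talk"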
Optimal Las Vegas Locality Sensitive Data Structures
We show that approximate similarity (near neighbour) search can be solved in
high dimensions with performance matching state of the art (data independent)
Locality Sensitive Hashing, but with a guarantee of no false negatives.
Specifically, we give two data structures for common problems.
For $c$-approximate near neighbour in Hamming space we get query time
$dn^{1/c+o(1)}$ and space $dn + n^{1+1/c+o(1)}$, matching that of
\cite{indyk1998approximate} and answering a long standing open question
from~\cite{indyk2000dimensionality} and~\cite{pagh2016locality} in the
affirmative.
By means of a new deterministic reduction from $\ell_2$ to Hamming we also
solve $\ell_2$ and $\ell_1$ with query time $dn^{1/c+o(1)}$ and space $dn + n^{1+1/c+o(1)}$.
For $(s_1, s_2)$-approximate Jaccard similarity we get query time
$dn^{\rho+o(1)}$ and space $dn + n^{1+\rho+o(1)}$, where
$\rho = \log(1/s_1)/\log(1/s_2)$, when sets have equal
size, matching the performance of~\cite{tobias2016}.
The algorithms are based on space partitions, as with classic LSH, but we
construct these using a combination of brute force, tensoring, perfect hashing
and splitter functions \`a la~\cite{naor1995splitters}. We also show a new
dimensionality reduction lemma with 1-sided error.
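For concreteness, plugging illustrative values (not taken from the paper) into the exponents above:

\[
c = 2:\quad \rho = \tfrac{1}{c} = 0.5, \qquad
\text{query time } dn^{0.5+o(1)}, \quad \text{space } dn + n^{1.5+o(1)},
\]
\[
(s_1, s_2) = (0.5,\, 0.25):\quad
\rho = \frac{\log(1/s_1)}{\log(1/s_2)} = \frac{\log 2}{\log 4} = 0.5,
\]

the same polynomial costs as classic Monte Carlo LSH, but here without false negatives.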
Crime applications and social machines: crowdsourcing sensitive data
The authors explore some issues with the United Kingdom (U.K.) crime reporting and recording systems which currently produce Open Crime Data. The availability of Open Crime Data seems to create a potential data ecosystem which would encourage crowdsourcing, or the creation of social machines, in order to counter some of these issues. While such solutions are enticing, we suggest that in fact the theoretical solution brings to light fairly compelling problems, which highlight some limitations of crowdsourcing as a means of addressing Berners-Lee’s “social constraint.” The authors present a thought experiment – a Gedankenexperiment – in order to explore the implications, both good and bad, of a social machine in such a sensitive space, and suggest a Web Science perspective to pick apart the ramifications of this thought experiment as a theoretical approach to the characterisation of social machines.
Knowing Your Population: Privacy-Sensitive Mining of Massive Data
Location and mobility patterns of individuals are important to environmental
planning, societal resilience, public health, and a host of commercial
applications. Mining telecommunication traffic and transactions data for such
purposes is controversial, in particular raising issues of privacy. However,
our hypothesis is that privacy-sensitive uses are possible and often beneficial
enough to warrant considerable research and development efforts. Our work
contends that people's behavior can yield patterns of significant
commercial and research value. For such purposes, methods and algorithms for
mining telecommunication data to extract commonly used routes and locations,
articulated through time-geographical constructs, are described in a case study
within the area of transportation planning and analysis. From the outset, these
were designed to balance the privacy of subscribers and the added value of
mobility patterns derived from their mobile communication traffic and
transactions data. Our work directly contrasts the current, commonly held
notion that value can only be added to services by directly monitoring the
behavior of individuals, such as in current attempts at location-based
services. We position our work within relevant legal frameworks for privacy and
data protection, and show that our methods comply with such requirements and
also follow best practice.
Redrawing the Boundaries on Purchasing Data from Privacy-Sensitive Individuals
We prove new positive and negative results concerning the existence of
truthful and individually rational mechanisms for purchasing private data from
individuals with unbounded and sensitive privacy preferences. We strengthen the
impossibility results of Ghosh and Roth (EC 2011) by extending them to a much
wider class of privacy valuations. In particular, these include privacy
valuations that are based on ({\epsilon}, {\delta})-differentially private
mechanisms for non-zero {\delta}, ones where the privacy costs are measured in
a per-database manner (rather than taking the worst case), and ones that do not
depend on the payments made to players (which might not be observable to an
adversary). To bypass this impossibility result, we study a natural special
setting where individuals have monotonic privacy valuations, which captures
common contexts where certain values for private data are expected to lead to
higher valuations for privacy (e.g. having a particular disease). We give new
mechanisms that are individually rational for all players with monotonic
privacy valuations, truthful for all players whose privacy valuations are not
too large, and accurate if there are not too many players with too-large
privacy valuations. We also prove matching lower bounds showing that in some
respects our mechanism cannot be improved significantly.
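For reference, the notion invoked above is the standard textbook definition (not a contribution of this paper): a mechanism $M$ is $(\epsilon, \delta)$-differentially private if for all neighbouring databases $D, D'$ and all sets $S$ of outputs,

\[
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \Pr[M(D') \in S] + \delta,
\]

and the strengthened impossibility results concern privacy valuations derived from such mechanisms even when $\delta > 0$.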
Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing
We prove a tight lower bound for the exponent $\rho$ for data-dependent
Locality-Sensitive Hashing schemes, recently used to design efficient solutions
for the $c$-approximate nearest neighbor search. In particular, our lower bound
matches the bound of $\rho = \frac{1}{2c-1} + o(1)$ for the $\ell_1$ space,
obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15].
In recent years it emerged that data-dependent hashing is strictly superior
to the classical Locality-Sensitive Hashing, when the hash function is
data-independent. In the latter setting, the best exponent has already been
known: for the $\ell_1$ space, the tight bound is $\rho = 1/c$, with the upper
bound from [Indyk-Motwani, STOC'98] and the matching lower bound from
[O'Donnell-Wu-Zhou, ITCS'11].
We prove that, even if the hashing is data-dependent, it must hold that
$\rho \ge \frac{1}{2c-1} - o(1)$. To prove the result, we need to formalize the
exact notion of data-dependent hashing that also captures the complexity of the
hash functions (in addition to their collision properties). Without restricting
such complexity, we would allow for obviously infeasible solutions such as the
Voronoi diagram of a dataset. To preclude such solutions, we require our hash
functions to be succinct. This condition is satisfied by all the known
algorithmic results.
Comment: 16 pages, no figures
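To make the gap concrete, for an illustrative approximation factor $c = 2$ (value not from the paper) in the $\ell_1$ setting:

\[
\rho_{\text{data-independent}} = \frac{1}{c} = 0.5,
\qquad
\rho_{\text{data-dependent}} = \frac{1}{2c-1} = \frac{1}{3},
\]

so data-dependent hashing gives a polynomially better query exponent, and the lower bound above shows that, for succinct hash functions, this exponent cannot be improved beyond $o(1)$ terms.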
