TimeCrypt: Encrypted Data Stream Processing at Scale with Cryptographic Access Control
A growing number of devices and services collect detailed time series data
that is stored in the cloud. Protecting the confidentiality of this vast and
continuously generated data is an acute need for many applications in this
space. At the same time, we must preserve the utility of this data by enabling
authorized services to securely and selectively access and run analytics. This
paper presents TimeCrypt, a system that provides scalable and real-time
analytics over large volumes of encrypted time series data. TimeCrypt allows
users to define expressive data access and privacy policies and enforces them
cryptographically. In TimeCrypt, data is encrypted end-to-end,
and authorized parties can only decrypt and verify queries within their
authorized access scope. Our evaluation of TimeCrypt shows that its memory
overhead and performance are competitive and close to operating on data in the
clear.
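To make the access-scoping idea concrete, here is a minimal sketch assuming a hypothetical per-interval key-derivation scheme (illustrative names throughout; TimeCrypt's actual construction differs and is described in the paper). Deriving an independent key per time interval lets the data owner hand a consumer exactly the keys for an authorized window and nothing more. The sketch uses the third-party cryptography package for the symmetric cipher.

    import base64
    import hashlib
    import hmac

    from cryptography.fernet import Fernet  # third-party: pip install cryptography

    MASTER = b"owner-held master secret"  # never leaves the data owner

    def interval_key(master: bytes, interval: int) -> bytes:
        # Hypothetical derivation: one independent symmetric key per interval.
        digest = hmac.new(master, str(interval).encode(), hashlib.sha256).digest()
        return base64.urlsafe_b64encode(digest)  # 32-byte digest -> valid Fernet key

    def encrypt_point(master: bytes, interval: int, payload: bytes) -> bytes:
        return Fernet(interval_key(master, interval)).encrypt(payload)

    # The owner shares only the keys for intervals 10..19; ciphertexts outside
    # that window remain opaque to this consumer.
    shared = {i: interval_key(MASTER, i) for i in range(10, 20)}
    ciphertext = encrypt_point(MASTER, 12, b"temp=21.5")
    print(Fernet(shared[12]).decrypt(ciphertext))  # b'temp=21.5'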
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources have data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or to solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, enabling the integration result to be meaningfully used right away. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This shows that our approach indeed reduces development effort, rather than merely shifting it to rule definition and threshold tuning, because setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used.
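As a rough sketch of the thresholding idea (hypothetical thresholds and a stand-in similarity function, not the system described in the report): pairs above a high threshold are merged, pairs below a low threshold are kept apart, and only the in-between hard cases are stored as probabilistic alternatives instead of being resolved by hand.

    from difflib import SequenceMatcher

    T_LOW, T_HIGH = 0.4, 0.9  # rough, "safe" thresholds

    def similarity(a: str, b: str) -> float:
        # Stand-in string similarity; a real system would use tuned measures.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def integrate(a: str, b: str):
        s = similarity(a, b)
        if s >= T_HIGH:
            return ("merge", 1.0)      # confident duplicate
        if s <= T_LOW:
            return ("distinct", 1.0)   # confident non-duplicate
        # Hard case: keep both possible worlds, weighted by the score,
        # in the probabilistic database rather than resolving manually.
        return ("uncertain", s)

    print(integrate("Jon Smith", "John Smith"))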
In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling
Entity resolution (ER) presents unique challenges for evaluation methodology.
While crowdsourcing platforms acquire ground truth, sound approaches to
sampling must drive labelling efforts. In ER, extreme class imbalance between
matching and non-matching records can lead to enormous labelling requirements
when seeking statistically consistent estimates for rigorous evaluation. This
paper addresses this important challenge with the OASIS algorithm: a sampler
and F-measure estimator for ER evaluation. OASIS draws samples from a (biased)
instrumental distribution, chosen to ensure estimators with optimal asymptotic
variance. As new labels are collected OASIS updates this instrumental
distribution via a Bayesian latent variable model of the annotator oracle, to
quickly focus on unlabelled items providing more information. We prove that
the resulting estimates of F-measure, precision, and recall converge to the true
population values. Thorough comparisons of sampling methods on a variety of ER
datasets demonstrate significant labelling reductions of up to 83% without loss of estimation accuracy.
Comment: 13 pages, 5 figures
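The estimator at the core is easy to sketch in isolation. Below is a generic importance-sampling estimate of the F-measure on synthetic data: a simplified sketch of the idea behind OASIS, without its adaptive Bayesian re-weighting of the instrumental distribution; the data-generating choices and names are illustrative.

    import random

    random.seed(0)
    N = 100_000
    truth = [1 if random.random() < 0.001 else 0 for _ in range(N)]  # rare matches
    scores = [0.5 * t + 0.5 * random.random() for t in truth]        # imperfect classifier
    pred = [1 if s > 0.7 else 0 for s in scores]

    # Instrumental distribution biased toward likely matches, then normalized.
    z = sum(s + 1e-6 for s in scores)
    q = [(s + 1e-6) / z for s in scores]

    n = 2_000
    sample = random.choices(range(N), weights=q, k=n)   # biased sampling
    tp = fp = fn = 0.0
    for i in sample:
        w = (1.0 / N) / q[i]        # importance weight vs. the uniform target
        y, yh = truth[i], pred[i]   # y would come from the labelling oracle
        tp += w * y * yh
        fp += w * (1 - y) * yh
        fn += w * y * (1 - yh)

    # Constant factors cancel in the ratio, so the weighted sums suffice.
    print("estimated F-measure:", 2 * tp / (2 * tp + fp + fn))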
Clustering Via Crowdsourcing
In recent years, crowdsourcing, a.k.a. human-aided computation, has emerged as
an effective platform for solving problems that are considered complex for
machines alone. Using humans is time-consuming and costly due to monetary
compensation. Therefore, a crowd-based algorithm must judiciously use any
information computed through an automated process, and adaptively ask a minimum
number of questions to the crowd.
One such problem which has received significant attention is {\em entity
resolution}. Formally, we are given a graph $G = (V, E)$ with unknown edge set
$E$, where $G$ is a union of $k$ (again unknown, but typically large, $n^{\alpha}$
for $\alpha > 0$) disjoint cliques $G_i(V_i, E_i)$, $i = 1, \ldots, k$. The goal is
to retrieve the sets $V_i$ by making a minimum number of pair-wise queries $V \times V \to \{\pm 1\}$ to an oracle (the crowd). When the answer to each query is
correct, e.g. via resampling, then this reduces to finding connected components
in a graph. On the other hand, when crowd answers may be incorrect, it
corresponds to clustering over a minimum number of noisy inputs. Even with
perfect answers, a simple lower and upper bound of $\Theta(nk)$ on the query
complexity can be shown. A major contribution of this paper is to reduce the
query complexity to linear or even sublinear in $n$ when mild side information
is provided by a machine, and even in the presence of crowd errors which are not
correctable via resampling. We develop new information-theoretic lower bounds
on the query complexity of clustering with side information and errors, and our
upper bounds closely match them. Our algorithms are naturally parallelizable,
and also give near-optimal bounds on the number of adaptive rounds required to
match the query complexity.
Comment: 36 pages
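The perfect-oracle reduction mentioned above is simple to illustrate. A toy sketch with a simulated error-free crowd: comparing each element against one representative per cluster discovered so far recovers the cliques with $O(nk)$ queries, exactly the connected-components view; all names are illustrative.

    class UnionFind:
        def __init__(self, n):
            self.parent = list(range(n))
        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    hidden = [0, 0, 1, 1, 2]                       # ground-truth cluster per element
    oracle = lambda u, v: hidden[u] == hidden[v]   # the crowd, assumed error-free

    n = len(hidden)
    uf = UnionFind(n)
    reps = []                        # one representative per cluster found so far
    for v in range(n):
        for r in reps:               # at most k comparisons per element: O(nk) queries
            if oracle(v, r):
                uf.union(v, r)
                break
        else:
            reps.append(v)
    print([uf.find(v) for v in range(n)])   # recovered component labels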
Semisupervised Clustering by Queries and Locally Encodable Source Coding
Source coding is the canonical problem of data compression in information
theory. In locally encodable source coding, each compressed bit depends on
only a few bits of the input. In this paper, we show that a recently popular
model of semi-supervised clustering is equivalent to locally encodable source
coding. In this model, the task is to perform multiclass labeling of unlabeled
elements. At the beginning, we can ask in parallel a set of simple queries to
an oracle who provides (possibly erroneous) binary answers to the queries. The
queries cannot involve more than two (or a fixed constant number of) elements.
Now the labeling of all the elements (or clustering) must be performed based on
the noisy query answers. The goal is to recover all the correct labelings while
minimizing the number of such queries. The equivalence to locally encodable
source codes leads us to find lower bounds on the number of queries required in
a variety of scenarios. We provide querying schemes based on pairwise `same
cluster' queries and on pairwise AND queries, and show provable performance
guarantees for each of the schemes.
Comment: 16 pages, 11 figures. Some of the results of this paper have appeared
in the proceedings of the 2017 Conference on Neural Information Processing
Systems (NeurIPS 2017).
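A toy sketch of why pairwise AND queries pin down binary labels in the noiseless case (the paper's contribution is the noisy analysis and the lower bounds; this only illustrates the query type, and it assumes at least two elements carry label 1):

    labels = [0, 1, 0, 1, 1, 0]                  # hidden ground-truth labels
    AND = lambda u, v: labels[u] & labels[v]     # oracle answer for a pair

    n = len(labels)
    # Any pair answering 1 exposes two label-1 elements; one serves as a pivot.
    pivot = next(u for u in range(n) for v in range(u + 1, n) if AND(u, v))
    # Querying every element against the pivot then reads off its label exactly.
    recovered = [AND(pivot, x) for x in range(n)]
    print(recovered == labels)   # True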
Stance and Sentiment in Tweets
We can often detect from a person's utterances whether he/she is in favor of
or against a given target entity -- their stance towards the target. However, a
person may express the same stance towards a target by using negative or
positive language. Here for the first time we present a dataset of
tweet--target pairs annotated for both stance and sentiment. The targets may or
may not be referred to in the tweets, and they may or may not be the target of
opinion in the tweets. Partitions of this dataset were used as training and
test sets in a SemEval-2016 shared task competition. We propose a simple stance
detection system that outperforms submissions from all 19 teams that
participated in the shared task. Additionally, access to both stance and
sentiment annotations allows us to explore several research questions. We show
that while knowing the sentiment expressed by a tweet is beneficial for stance
classification, it alone is not sufficient. Finally, we use additional
unlabeled data through distant supervision techniques and word embeddings to
further improve stance classification.
Comment: 22 pages
Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals
We show that many machine learning goals, such as improved fairness metrics,
can be expressed as constraints on the model's predictions, which we call rate
constraints. We study the problem of training non-convex models subject to
these rate constraints (or any non-convex and non-differentiable constraints).
In the non-convex setting, the standard approach of Lagrange multipliers may
fail. Furthermore, if the constraints are non-differentiable, then one cannot
optimize the Lagrangian with gradient-based methods. To solve these issues, we
introduce the proxy-Lagrangian formulation. This new formulation leads to an
algorithm that produces a stochastic classifier by playing a two-player
non-zero-sum game solving for what we call a semi-coarse correlated
equilibrium, which in turn corresponds to an approximately optimal and feasible
solution to the constrained optimization problem. We then give a procedure
which shrinks the randomized solution down to one that is a mixture of at most
$m+1$ deterministic solutions, given $m$ constraints. This culminates in
algorithms that can solve non-convex constrained optimization problems with
possibly non-differentiable and non-convex constraints with theoretical
guarantees. We provide extensive experimental results enforcing a wide range of
policy goals including different fairness metrics, and other goals on accuracy,
coverage, recall, and churn.
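A toy sketch of the two-player structure under simplifying assumptions not taken from the paper (a single parameter, a coverage-style rate constraint, and a sigmoid surrogate standing in for the proxy constraint): the model player only ever differentiates the proxy, while the multiplier player reacts to the true, non-differentiable rate.

    import math
    import random

    random.seed(1)
    X = [random.gauss(0, 1) for _ in range(500)]

    theta, lam, lr = 0.0, 0.0, 0.05
    target = 0.5   # rate constraint: fraction predicted positive >= 0.5

    for step in range(200):
        # Model player: descend the proxy-Lagrangian, where the 0/1 coverage
        # indicator is replaced by a differentiable sigmoid surrogate.
        sig = [1 / (1 + math.exp(-(x - theta))) for x in X]
        dsig = [s * (1 - s) for s in sig]
        # Toy objective (theta - 1)^2 stands in for the accuracy loss;
        # d/dtheta of -mean(sigmoid) is +mean(dsig).
        grad = 2 * (theta - 1.0) + lam * (sum(dsig) / len(X))
        theta -= lr * grad
        # Multiplier player: ascend using the TRUE non-differentiable rate.
        coverage = sum(1 for x in X if x > theta) / len(X)
        lam = max(0.0, lam + lr * (target - coverage))

    coverage = sum(1 for x in X if x > theta) / len(X)
    print(f"theta={theta:.3f}  coverage={coverage:.2f}  lambda={lam:.2f}")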
Combination Strategies for Semantic Role Labeling
This paper introduces and analyzes a battery of inference models for the
problem of semantic role labeling: one based on constraint satisfaction, and
several strategies that model the inference as a meta-learning problem using
discriminative classifiers. These classifiers are developed with a rich set of
novel features that encode proposition and sentence-level information. To our
knowledge, this is the first work that: (a) performs a thorough analysis of
learning-based inference models for semantic role labeling, and (b) compares
several inference strategies in this context. We evaluate the proposed
inference strategies in the framework of the CoNLL-2005 shared task using only
automatically-generated syntactic information. The extensive experimental
evaluation and analysis indicate that all the proposed inference strategies
are successful (they all outperform the current best results reported in the
CoNLL-2005 evaluation exercise), but each of the proposed approaches has its
advantages and disadvantages. Several important traits of a state-of-the-art
SRL combination strategy emerge from this analysis: (i) individual models
should be combined at the granularity of candidate arguments rather than at the
granularity of complete solutions; (ii) the best combination strategy uses an
inference model based on learning; and (iii) the learning-based inference
benefits from max-margin classifiers and global feedback
Evaluating the word-expert approach for Named-Entity Disambiguation
Named Entity Disambiguation (NED) is the task of linking a named-entity
mention to an instance in a knowledge-base, typically Wikipedia. This task is
closely related to word-sense disambiguation (WSD), where the supervised
word-expert approach has prevailed. In this work we present the results of the
word-expert approach to NED, where one classifier is built for each target
entity mention string. The resources necessary to build the system, a
dictionary and a set of training instances, have been automatically derived
from Wikipedia. We provide empirical evidence of the value of this approach, as
well as a study of the differences between WSD and NED, including ambiguity and
synonymy statistics.
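A toy sketch of the word-expert setup (illustrative data and a bare count-based scorer; the paper's classifiers use richer features, with the dictionary and training instances derived automatically from Wikipedia): one independent model per mention string, trained only on that string's instances.

    from collections import Counter, defaultdict

    # Hypothetical training data: (mention string, context words, gold entity).
    train = [
        ("Java", ["coffee", "island"], "Java_(island)"),
        ("Java", ["class", "compiler"], "Java_(programming_language)"),
        ("Java", ["JVM", "compiler"], "Java_(programming_language)"),
    ]

    # One "expert" per mention string: per-entity context-word counts.
    experts = defaultdict(lambda: defaultdict(Counter))
    for mention, ctx, entity in train:
        for w in ctx:
            experts[mention][entity][w] += 1

    def disambiguate(mention, ctx):
        candidates = experts[mention]
        # Score each candidate entity by overlap with its training contexts.
        return max(candidates, key=lambda e: sum(candidates[e][w] for w in ctx))

    print(disambiguate("Java", ["compiler", "JVM"]))  # Java_(programming_language)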
Context-Aware Hierarchical Online Learning for Performance Maximization in Mobile Crowdsourcing
In mobile crowdsourcing (MCS), mobile users accomplish outsourced human
intelligence tasks. MCS requires an appropriate task assignment strategy, since
different workers may have different performance in terms of acceptance rate
and quality. Task assignment is challenging, since a worker's performance (i)
may fluctuate, depending on both the worker's current personal context and the
task context, and (ii) is not known a priori but has to be learned over time.
Moreover, learning context-specific worker performance requires access to
context information, which may not be available at a central entity due to
communication overhead or privacy concerns. Additionally, evaluating worker
performance might require costly quality assessments. In this paper, we propose
a context-aware hierarchical online learning algorithm addressing the problem
of performance maximization in MCS. In our algorithm, a local controller (LC)
in the mobile device of a worker regularly observes the worker's context,
her/his decisions to accept or decline tasks and the quality in completing
tasks. Based on these observations, the LC regularly estimates the worker's
context-specific performance. The mobile crowdsourcing platform (MCSP) then
selects workers based on performance estimates received from the LCs. This
hierarchical approach enables the LCs to learn context-specific worker
performance and it enables the MCSP to select suitable workers. In addition,
our algorithm preserves worker context locally, and it keeps the number of
required quality assessments low. We prove that our algorithm converges to the
optimal task assignment strategy. Moreover, the algorithm outperforms simpler
task assignment strategies in experiments based on synthetic and real data.
Comment: 18 pages, 10 figures
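A simplified sketch of the hierarchical split, under assumptions not in the abstract (two discrete contexts, Bernoulli task outcomes, a standard UCB index): each local controller keeps context-specific performance estimates on the worker's device, and the platform selects workers from those estimates alone, never seeing raw observations.

    import math
    import random

    random.seed(2)

    class LocalController:
        """Runs on the worker's device; raw observations never leave it."""
        def __init__(self, skill):
            self.skill = skill     # hidden context-specific success probability
            self.total, self.count = {}, {}
        def observe(self, ctx):
            # The LC locally records context and task outcome.
            reward = 1.0 if random.random() < self.skill[ctx] else 0.0
            self.total[ctx] = self.total.get(ctx, 0.0) + reward
            self.count[ctx] = self.count.get(ctx, 0) + 1
            return reward
        def estimate(self, ctx, t):
            # Only this scalar estimate is reported to the platform (MCSP).
            n = self.count.get(ctx, 0)
            if n == 0:
                return float("inf")   # force initial exploration
            return self.total[ctx] / n + math.sqrt(2 * math.log(t) / n)

    workers = [LocalController({"day": 0.9, "night": 0.3}),
               LocalController({"day": 0.4, "night": 0.8})]

    earned, rounds = 0.0, 2_000
    for t in range(1, rounds + 1):
        ctx = random.choice(["day", "night"])                    # this round's context
        chosen = max(workers, key=lambda w: w.estimate(ctx, t))  # MCSP's choice
        earned += chosen.observe(ctx)

    print("average reward:", earned / rounds)   # nears the optimal mix (~0.85 here)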