5,992 research outputs found

    TimeCrypt: Encrypted Data Stream Processing at Scale with Cryptographic Access Control

    Full text link
    A growing number of devices and services collect detailed time series data that is stored in the cloud. Protecting the confidentiality of this vast and continuously generated data is an acute need for many applications in this space. At the same time, we must preserve the utility of this data by enabling authorized services to securely and selectively access and run analytics. This paper presents TimeCrypt, a system that provides scalable and real-time analytics over large volumes of encrypted time series data. TimeCrypt allows users to define expressive data access and privacy policies and enforces it cryptographically via encryption. In TimeCrypt, data is encrypted end-to-end, and authorized parties can only decrypt and verify queries within their authorized access scope. Our evaluation of TimeCrypt shows that its memory overhead and performance are competitive and close to operating on data in the clear

    Qualitative Effects of Knowledge Rules in Probabilistic Data Integration

    Get PDF
    One of the problems in data integration is data overlap: the fact that different data sources have data on the same real world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proofs impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database enabling it to already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This proves that our approach indeed reduces development effort — and not merely shifts the effort to rule definition and threshold tuning — by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used

    In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling

    Full text link
    Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates for rigorous evaluation. This paper addresses this important challenge with the OASIS algorithm: a sampler and F-measure estimator for ER evaluation. OASIS draws samples from a (biased) instrumental distribution, chosen to ensure estimators with optimal asymptotic variance. As new labels are collected OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on unlabelled items providing more information. We prove that resulting estimates of F-measure, precision, recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 83% without loss to estimate accuracy.Comment: 13 pages, 5 figure

    Clustering Via Crowdsourcing

    Full text link
    In recent years, crowdsourcing, aka human aided computation has emerged as an effective platform for solving problems that are considered complex for machines alone. Using human is time-consuming and costly due to monetary compensations. Therefore, a crowd based algorithm must judiciously use any information computed through an automated process, and ask minimum number of questions to the crowd adaptively. One such problem which has received significant attention is {\em entity resolution}. Formally, we are given a graph G=(V,E)G=(V,E) with unknown edge set EE where GG is a union of kk (again unknown, but typically large O(nα)O(n^\alpha), for α>0\alpha>0) disjoint cliques Gi(Vi,Ei)G_i(V_i, E_i), i=1,,ki =1, \dots, k. The goal is to retrieve the sets ViV_is by making minimum number of pair-wise queries V×V{±1}V \times V\to\{\pm1\} to an oracle (the crowd). When the answer to each query is correct, e.g. via resampling, then this reduces to finding connected components in a graph. On the other hand, when crowd answers may be incorrect, it corresponds to clustering over minimum number of noisy inputs. Even, with perfect answers, a simple lower and upper bound of Θ(nk)\Theta(nk) on query complexity can be shown. A major contribution of this paper is to reduce the query complexity to linear or even sublinear in nn when mild side information is provided by a machine, and even in presence of crowd errors which are not correctable via resampling. We develop new information theoretic lower bounds on the query complexity of clustering with side information and errors, and our upper bounds closely match with them. Our algorithms are naturally parallelizable, and also give near-optimal bounds on the number of adaptive rounds required to match the query complexity.Comment: 36 page

    Semisupervised Clustering by Queries and Locally Encodable Source Coding

    Full text link
    Source coding is the canonical problem of data compression in information theory. In a locally encodable source coding, each compressed bit depends on only few bits of the input. In this paper, we show that a recently popular model of semi-supervised clustering is equivalent to locally encodable source coding. In this model, the task is to perform multiclass labeling of unlabeled elements. At the beginning, we can ask in parallel a set of simple queries to an oracle who provides (possibly erroneous) binary answers to the queries. The queries cannot involve more than two (or a fixed constant number of) elements. Now the labeling of all the elements (or clustering) must be performed based on the noisy query answers. The goal is to recover all the correct labelings while minimizing the number of such queries. The equivalence to locally encodable source codes leads us to find lower bounds on the number of queries required in a variety of scenarios. We provide querying schemes based on pairwise `same cluster' queries - and pairwise AND queries and show provable performance guarantees for each of the schemes.Comment: 16 pages, 11 figures. Some of the results of this paper have appeared in the proceedings of the 2017 Conference on Neural Information Processing Systems (NeurIPS 2017

    Stance and Sentiment in Tweets

    Full text link
    We can often detect from a person's utterances whether he/she is in favor of or against a given target entity -- their stance towards the target. However, a person may express the same stance towards a target by using negative or positive language. Here for the first time we present a dataset of tweet--target pairs annotated for both stance and sentiment. The targets may or may not be referred to in the tweets, and they may or may not be the target of opinion in the tweets. Partitions of this dataset were used as training and test sets in a SemEval-2016 shared task competition. We propose a simple stance detection system that outperforms submissions from all 19 teams that participated in the shared task. Additionally, access to both stance and sentiment annotations allows us to explore several research questions. We show that while knowing the sentiment expressed by a tweet is beneficial for stance classification, it alone is not sufficient. Finally, we use additional unlabeled data through distant supervision techniques and word embeddings to further improve stance classification.Comment: 22 page

    Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals

    Full text link
    We show that many machine learning goals, such as improved fairness metrics, can be expressed as constraints on the model's predictions, which we call rate constraints. We study the problem of training non-convex models subject to these rate constraints (or any non-convex and non-differentiable constraints). In the non-convex setting, the standard approach of Lagrange multipliers may fail. Furthermore, if the constraints are non-differentiable, then one cannot optimize the Lagrangian with gradient-based methods. To solve these issues, we introduce the proxy-Lagrangian formulation. This new formulation leads to an algorithm that produces a stochastic classifier by playing a two-player non-zero-sum game solving for what we call a semi-coarse correlated equilibrium, which in turn corresponds to an approximately optimal and feasible solution to the constrained optimization problem. We then give a procedure which shrinks the randomized solution down to one that is a mixture of at most m+1m+1 deterministic solutions, given mm constraints. This culminates in algorithms that can solve non-convex constrained optimization problems with possibly non-differentiable and non-convex constraints with theoretical guarantees. We provide extensive experimental results enforcing a wide range of policy goals including different fairness metrics, and other goals on accuracy, coverage, recall, and churn

    Combination Strategies for Semantic Role Labeling

    Full text link
    This paper introduces and analyzes a battery of inference models for the problem of semantic role labeling: one based on constraint satisfaction, and several strategies that model the inference as a meta-learning problem using discriminative classifiers. These classifiers are developed with a rich set of novel features that encode proposition and sentence-level information. To our knowledge, this is the first work that: (a) performs a thorough analysis of learning-based inference models for semantic role labeling, and (b) compares several inference strategies in this context. We evaluate the proposed inference strategies in the framework of the CoNLL-2005 shared task using only automatically-generated syntactic information. The extensive experimental evaluation and analysis indicates that all the proposed inference strategies are successful -they all outperform the current best results reported in the CoNLL-2005 evaluation exercise- but each of the proposed approaches has its advantages and disadvantages. Several important traits of a state-of-the-art SRL combination strategy emerge from this analysis: (i) individual models should be combined at the granularity of candidate arguments rather than at the granularity of complete solutions; (ii) the best combination strategy uses an inference model based in learning; and (iii) the learning-based inference benefits from max-margin classifiers and global feedback

    Evaluating the word-expert approach for Named-Entity Disambiguation

    Full text link
    Named Entity Disambiguation (NED) is the task of linking a named-entity mention to an instance in a knowledge-base, typically Wikipedia. This task is closely related to word-sense disambiguation (WSD), where the supervised word-expert approach has prevailed. In this work we present the results of the word-expert approach to NED, where one classifier is built for each target entity mention string. The resources necessary to build the system, a dictionary and a set of training instances, have been automatically derived from Wikipedia. We provide empirical evidence of the value of this approach, as well as a study of the differences between WSD and NED, including ambiguity and synonymy statistics

    Context-Aware Hierarchical Online Learning for Performance Maximization in Mobile Crowdsourcing

    Full text link
    In mobile crowdsourcing (MCS), mobile users accomplish outsourced human intelligence tasks. MCS requires an appropriate task assignment strategy, since different workers may have different performance in terms of acceptance rate and quality. Task assignment is challenging, since a worker's performance (i) may fluctuate, depending on both the worker's current personal context and the task context, (ii) is not known a priori, but has to be learned over time. Moreover, learning context-specific worker performance requires access to context information, which may not be available at a central entity due to communication overhead or privacy concerns. Additionally, evaluating worker performance might require costly quality assessments. In this paper, we propose a context-aware hierarchical online learning algorithm addressing the problem of performance maximization in MCS. In our algorithm, a local controller (LC) in the mobile device of a worker regularly observes the worker's context, her/his decisions to accept or decline tasks and the quality in completing tasks. Based on these observations, the LC regularly estimates the worker's context-specific performance. The mobile crowdsourcing platform (MCSP) then selects workers based on performance estimates received from the LCs. This hierarchical approach enables the LCs to learn context-specific worker performance and it enables the MCSP to select suitable workers. In addition, our algorithm preserves worker context locally, and it keeps the number of required quality assessments low. We prove that our algorithm converges to the optimal task assignment strategy. Moreover, the algorithm outperforms simpler task assignment strategies in experiments based on synthetic and real data.Comment: 18 pages, 10 figure
    corecore