Benchmarking the Utility of w-Event Differential Privacy Mechanisms: When Baselines Become Mighty Competitors
The w-event framework is the current standard for ensuring differential privacy on continuously monitored data streams. Following the proposition of w-event differential privacy, various mechanisms implementing the framework have been proposed. Their comparability in empirical studies is vital both for practitioners to choose a suitable mechanism and for researchers to identify current limitations and propose novel mechanisms. By conducting a literature survey, we observe that the results of existing studies are hardly comparable and partially intrinsically inconsistent.
To this end, we formalize an empirical study of w-event mechanisms as a four-tuple containing recurring elements found in our survey. We introduce requirements on these elements that ensure the comparability of experimental results. Moreover, we propose a benchmark that meets all requirements and establishes a new way to evaluate existing and newly proposed mechanisms. Conducting a large-scale empirical study, we gain valuable new insights into the strengths and weaknesses of existing mechanisms. An unexpected, yet explainable, result is a baseline supremacy: using one of the two baseline mechanisms can be expected to deliver good or even the best utility. Finally, we provide guidelines for practitioners to select suitable mechanisms and improvement options for researchers to break the baseline supremacy.
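The abstract does not name the two baselines; in the w-event literature they are commonly the Uniform and Sample mechanisms of Kellaris et al., so take that identification as an assumption. A minimal sketch of the Uniform baseline, which splits the window budget evenly across the w timestamps of any sliding window:

```python
import numpy as np

def uniform_baseline(stream, w, epsilon, sensitivity=1.0):
    """Sketch of the Uniform w-event DP baseline: each timestamp
    receives budget epsilon / w, so any window of w consecutive
    releases consumes at most epsilon in total."""
    eps_per_ts = epsilon / w
    scale = sensitivity / eps_per_ts  # Laplace scale = w * sensitivity / epsilon
    return [x + np.random.laplace(0.0, scale) for x in stream]

# Hypothetical usage on a count stream
noisy = uniform_baseline([120, 135, 128, 140], w=3, epsilon=1.0)
```

The sketch illustrates why such baselines are hard to beat: they need no data-dependent budget allocation, so their error is predictable regardless of the stream's shape.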
Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)
Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates for string search patterns, in particular for multi-word strings like phrases or text snippets. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose pruning techniques for this setting. Our approaches remove characters with low information value. The variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.
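A minimal sketch of one such variant, assuming a bigram model of the string set: each character's information value is its surprisal given the previous character, and low-surprisal (i.e., predictable) characters are dropped before tree insertion. Function names and the threshold are illustrative, not the authors' exact formulation.

```python
import math
from collections import Counter

def bigram_model(strings):
    """Estimate p(c | prev) over character bigrams in the string set."""
    pair, ctx = Counter(), Counter()
    for s in strings:
        for prev, c in zip(s, s[1:]):
            pair[(prev, c)] += 1
            ctx[prev] += 1
    return {k: v / ctx[k[0]] for k, v in pair.items()}

def prune(s, model, threshold=1.0):
    """Keep the first character; drop later characters whose surprisal
    -log2 p(c | prev) falls below the threshold, i.e., characters that
    their predecessor makes predictable."""
    if not s:
        return s
    out = [s[0]]
    for prev, c in zip(s, s[1:]):
        if -math.log2(model.get((prev, c), 1e-9)) >= threshold:
            out.append(c)
    return "".join(out)
```

Because natural language is redundant (e.g., "u" after "q" carries almost no information), many characters fall below the threshold without hurting cardinality estimates much.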
Visit Places on Your Way: A Skyline Approach in Time-Dependent Networks
Many people take the same path every day, such as taking a specific autobahn to get home from work. However, one frequently needs to divert from this path, e.g., to visit a Point of Interest (POI) from a category like restaurants or ATMs. Usually, people want to minimize not only their overall travel cost but also their detour cost, i.e., they want to return to the known path as fast as possible. Finding a POI that minimizes both costs efficiently is highly challenging when one considers time-dependent road networks, which is the realistic case. In such road networks, time dependency means that the time a user needs to traverse a road heavily depends on the user's arrival time on that road. Prior works have several limitations, such as assuming that travel costs come from a metric space and do not change over time. Both assumptions hardly match real-world requirements: just think of traffic jams at rush hour. To overcome these limitations, we study how to solve this problem in time-dependent road networks relying on linear skylines. Our main contribution is an efficient algorithm called STACY that finds all non-dominated paths. A large-scale empirical evaluation on real-world data reveals that STACY is accurate, efficient, and effective in real-world settings.
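To make "non-dominated" concrete, here is a toy two-criteria dominance filter over candidate (travel cost, detour cost) pairs. Note that STACY uses linear skylines, a stricter notion than the conventional dominance shown here; this sketch only illustrates the basic idea.

```python
def skyline(candidates):
    """Keep the non-dominated (travel_cost, detour_cost) candidates:
    a candidate is dominated if another one is no worse in both costs
    and differs from it."""
    return [c for c in candidates
            if not any(o[0] <= c[0] and o[1] <= c[1] and o != c
                       for o in candidates)]

paths = [(30, 5), (25, 9), (25, 7), (40, 2)]
print(skyline(paths))  # (25, 7) dominates (25, 9); the other three remain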
Swellfish Privacy: Exploiting Time-Dependent Relevance for Continuous Differential Privacy: Technical Report
Today, continuous publishing of differentially private query results is the de-facto standard. The challenge is to add enough noise to satisfy a given privacy level while adding as little noise as necessary to keep data utility high. In this context, we observe that the privacy goals of individuals vary significantly over time. For instance, one might aim to hide being on vacation only during school holidays. This observation, named time-dependent relevance, implies two effects which, properly exploited, allow tuning data utility. The effects are time-variant sensitivity (TEAS) and a time-variant number of affected query results (TINAR). As today's DP frameworks, by design, cannot exploit these effects, we propose Swellfish privacy. There, with policy collections, individuals can specify combinations of time-dependent privacy goals. Query results are then Swellfish-private if the streams are indistinguishable with respect to such a collection. We propose two tools for designing Swellfish-private mechanisms, namely temporal sensitivity and a composition theorem, each allowing to exploit one of the effects. In a realistic case study, we show empirically that exploiting both effects improves data utility by one to three orders of magnitude compared to state-of-the-art w-event DP mechanisms. Finally, we generalize the case study by showing how to estimate the strength of the effects for arbitrary use cases.
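A toy sketch of the budget-allocation intuition, under the assumption that each policy marks a set of relevant timestamps: the privacy budget is composed only over the relevant steps, and irrelevant timestamps can be published unperturbed. This is not the authors' mechanism, merely an illustration of why time-dependent relevance can save noise.

```python
import numpy as np

def relevance_aware_release(stream, relevant, epsilon, sensitivity=1.0):
    """Toy illustration of time-dependent relevance: spend the privacy
    budget only on timestamps covered by some policy's relevance
    interval (naive sequential composition over those steps);
    `relevant` is a set of timestamp indices."""
    k = max(len(relevant), 1)
    scale = sensitivity * k / epsilon  # budget epsilon/k per relevant step
    return [x + np.random.laplace(0.0, scale) if t in relevant else x
            for t, x in enumerate(stream)]

# Only timestamps 2 and 3 (e.g., the school holidays) need protection.
release = relevance_aware_release([10, 12, 11, 13, 12], {2, 3}, epsilon=1.0)
```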
On the Various Semantics of Similarity in Word Embedding Models
Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding-model algorithms and parameters. The second one starts by showing that intuitive similarity thresholds do not exist. We then propose a method stating which similarity values actually are meaningful for a given embedding model. In more abstract terms, our insights should give way to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
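One way to see why fixed thresholds fail is to estimate the model-specific background distribution of similarity values empirically. The percentile rule below is an illustrative stand-in for the paper's method, using a random stand-in embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embedding matrix (vocab x dims); in practice one would load
# trained vectors, e.g., from word2vec or fastText.
E = rng.normal(size=(10_000, 100))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Sample cosine similarities of random word pairs to estimate the
# model-specific "background" distribution of similarity values.
i, j = rng.integers(0, len(E), (2, 50_000))
sims = np.einsum("ij,ij->i", E[i], E[j])

# A model-specific threshold: similarities above, say, the 99.9th
# percentile of random pairs are unlikely to arise by chance.
threshold = np.quantile(sims, 0.999)
print(f"mean={sims.mean():.3f}, std={sims.std():.3f}, thr={threshold:.3f}")
```

Because the background distribution shifts with the algorithm, dimensionality, and training parameters, a threshold that is "intuitive" for one model can be meaningless for another.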
Comparing Predictions of Object Movements
Estimating the future location of moving objects using different estimation models, such as linear or probabilistic models, has been investigated extensively. However, the location estimations of those models are generally not comparable. For instance, one model might return a position for some object, another a Gaussian probability distribution, and a third a uniform distribution. Similar issues arise for query answers. In this paper, we examine the question of how estimations of different models can be compared. To do so, we propose a general model based on the central limit theorem. This allows handling different PDF-based approaches as well as models from other groups (i.e., linear estimations) in a unified manner. Furthermore, we show how to inject privacy into the general model, a fundamental prerequisite for user acceptance. Thus, we support well-known approaches like k-anonymity and spatial obfuscation. Based on our general model, we conduct a comprehensive experimental study considering a real-world road network, comparing models from different groups for the first time. Our results, for instance, reveal that estimation models based on individual velocity profiles are not necessarily better than models that estimate the future location of objects based only on their direction. In more abstract terms, our general model allows comparing estimation models that could not be compared before and gives way to building models that solve the privacy-accuracy challenge.
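A minimal sketch of the unification idea, assuming a CLT-style normal approximation and reduced to one dimension for brevity (the representation and scoring function are illustrative, not the paper's exact model):

```python
from dataclasses import dataclass
import math

@dataclass
class GaussianEstimate:
    """Common representation: every model's output is reduced to a
    mean position and a variance via a normal approximation."""
    mean: float
    var: float

def from_point(x, eps=1e-6):
    return GaussianEstimate(x, eps)                       # near-certain point estimate

def from_uniform(lo, hi):
    return GaussianEstimate((lo + hi) / 2, (hi - lo) ** 2 / 12)

def from_gaussian(mu, sigma):
    return GaussianEstimate(mu, sigma ** 2)

def log_score(est, truth):
    """Once unified, estimates from different model families can be
    scored against the true position with one likelihood-style metric."""
    return -0.5 * (math.log(2 * math.pi * est.var)
                   + (truth - est.mean) ** 2 / est.var)
```

Spatial obfuscation then amounts to inflating the variance of the common representation, which is one way privacy can be injected without breaking comparability.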
CH-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights)
Data science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities includes large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like "freedom" or "democracy," both for today's society and even more for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. Both are currently unclear, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations: As a first step, we collect and structure the various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on this, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation with two objectives. First, we aim at insights regarding the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies. This has allowed us to identify performance bottlenecks.
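To make the notion of word context concrete, here is a toy instance of the kind of query such a benchmark formalizes: counting words that occur within a fixed token window around a concept word. Window size and function name are illustrative, not the benchmark's actual specification.

```python
from collections import Counter

def context_counts(tokens, concept, k=5):
    """Count words occurring within a +/- k token window around each
    occurrence of `concept` in a tokenized corpus."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == concept:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            counts.update(window)
    return counts

corpus = "the press is free and freedom of the press matters".split()
print(context_counts(corpus, "freedom", k=3).most_common(3))
```

A distant reading system must answer many such queries over corpora of billions of tokens, which is what makes the data management technology underneath performance-critical.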
Exploiting Subspace Distance Equalities in High-Dimensional Data for kNN Queries
Efficient k-nearest neighbor computation for high-dimensional data is an important, yet challenging task. The response times of state-of-the-art indexing approaches highly depend on factors like the distribution of the data. For clustered data, such approaches are several factors faster than a sequential scan. However, if various dimensions contain uniform or Gaussian data, they tend to be clearly outperformed by a simple sequential scan. Hence, we need an approach that generally delivers good response times, independent of the data distribution. As a solution, we propose to exploit a novel concept for efficiently computing nearest neighbors, named subspace distance equality, which aims at reducing the number of distance computations independent of the data distribution. We integrate kNN computation algorithms into the Elf index structure, allowing us to study the subspace distance equality concept in isolation and in combination with a main-memory-optimized storage layout. In a large comparative study with twelve data sets, our results indicate that indexes based on subspace distance equalities compute the fewest distances. For clustered data, our Elf kNN algorithm delivers a performance increase of at least a factor of two and up to two orders of magnitude, without losing performance compared to sequential scans for uniform or Gaussian data.
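A minimal sketch of the underlying idea, detached from the Elf index itself: points that agree on a prefix of dimensions share one partial distance to the query, so that partial distance is computed once per group instead of once per point. Grouping by a flat dictionary is an illustrative simplification of what Elf achieves structurally.

```python
from collections import defaultdict

def knn_prefix_sharing(points, query, k, prefix_dims=1):
    """kNN by squared Euclidean distance, computing the distance
    contribution of the first `prefix_dims` coordinates once per
    group of points that agree on them (points are tuples)."""
    groups = defaultdict(list)
    for p in points:
        groups[p[:prefix_dims]].append(p)

    scored = []
    for prefix, members in groups.items():
        shared = sum((a - b) ** 2 for a, b in zip(prefix, query))  # once per group
        for p in members:
            rest = sum((a - b) ** 2
                       for a, b in zip(p[prefix_dims:], query[prefix_dims:]))
            scored.append((shared + rest, p))
    return [p for _, p in sorted(scored)[:k]]

pts = [(0, 1, 2), (0, 2, 9), (3, 1, 1)]
print(knn_prefix_sharing(pts, (0, 1, 1), k=2))  # [(0, 1, 2), (3, 1, 1)]
```

The savings grow with the number of points sharing a prefix, which is why the approach helps regardless of whether the data is clustered, uniform, or Gaussian.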