187 research outputs found

    Hiding Outliers in High-Dimensional Data Spaces

    Detecting outliers in high-dimensional data is crucial in many domains. Due to the curse of dimensionality, one typically does not detect outliers in the full space, but in subspaces of it. More specifically, since the number of subspaces is huge, the detection takes place in only some subspaces. In consequence, one might miss hidden outliers, i.e., outliers only detectable in certain subspaces. In this paper, we take the opposite perspective, which is of practical relevance as well, and study how to hide outliers in high-dimensional data spaces. We formally prove characteristics of hidden outliers. We also propose an algorithm to place them in the data. It focuses on the regions close to existing data objects and is more efficient than an exhaustive approach. In experiments, we both evaluate our formal results and show the usefulness of our algorithm using different subspace selection schemes, outlier detection methods and data sets.
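
    The notion of a hidden outlier can be illustrated with a toy example. The sketch below is not the algorithm from the paper; it assumes a simple nearest-neighbour distance threshold as the detector, and the names `nn_distance`, `outlier_subspaces`, `DATA` and `CANDIDATE` are hypothetical.

```python
from itertools import combinations
import math

def nn_distance(point, data, dims):
    """Distance from `point` to its nearest neighbour in `data`, using only `dims`."""
    projected = [point[d] for d in dims]
    return min(math.dist(projected, [row[d] for d in dims]) for row in data)

def outlier_subspaces(point, data, threshold):
    """Return every subspace (tuple of dimensions) in which `point` looks like an outlier.

    Assumption: a point is an outlier in a subspace if its nearest neighbour
    there is farther away than `threshold`.
    """
    n_dims = len(point)
    flagged = []
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):
            if nn_distance(point, data, dims) > threshold:
                flagged.append(dims)
    return flagged

# Toy data: ten points on the diagonal of a two-dimensional space.
DATA = [(float(i), float(i)) for i in range(10)]
# The candidate matches existing values in each single dimension,
# but lies far away from the diagonal.
CANDIDATE = (2.0, 7.0)
SUBSPACES = outlier_subspaces(CANDIDATE, DATA, threshold=1.0)
print(SUBSPACES)
```

    On this toy data the candidate is flagged only in the subspace `(0, 1)`, so a detector that inspects only the one-dimensional subspaces would miss it.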

    Scenario Discovery via Rule Extraction

    Scenario discovery is the process of finding areas of interest, commonly referred to as scenarios, in data spaces resulting from simulations. For instance, one might search for conditions, which are inputs of the simulation model, under which the system under investigation is unstable. A commonly used algorithm for scenario discovery is PRIM. It yields scenarios in the form of hyper-rectangles, which are human-comprehensible. When the simulation model has many inputs and the simulations are computationally expensive, PRIM may not produce good results, given the affordable volume of data. We therefore propose a new procedure for scenario discovery: we train an intermediate statistical model which generalizes fast, and use it to label (a lot of) data for PRIM. We provide the statistical intuition behind our idea. Our experimental study shows that this method is much better than PRIM itself. Specifically, our method reduces the number of simulation runs necessary by 75% on average.
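
    The two-stage idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the paper: `simulate` is a hypothetical stand-in for an expensive simulation, a 1-nearest-neighbour classifier stands in for the intermediate statistical model, and `prim_peel` implements only a simplified peeling phase of PRIM (no pasting).

```python
import math
import random

def simulate(x):
    """Hypothetical expensive simulation: 1 if the system is unstable at input x."""
    return 1 if x[0] > 0.6 and x[1] < 0.4 else 0

def metamodel_label(x, runs):
    """1-nearest-neighbour stand-in for the intermediate model (an assumption)."""
    return min(runs, key=lambda run: math.dist(x, run[0]))[1]

def prim_peel(points, labels, alpha=0.1, min_support=20):
    """Simplified PRIM peeling: repeatedly cut off the alpha-fraction slice
    whose removal raises the mean label in the remaining box the most."""
    idx = list(range(len(points)))
    dims = len(points[0])
    while True:
        current = sum(labels[i] for i in idx) / len(idx)
        best = None
        for d in range(dims):
            order = sorted(idx, key=lambda i: points[i][d])
            cut = max(1, int(alpha * len(order)))
            # Try peeling from the lower and from the upper end of dimension d.
            for keep in (order[cut:], order[:-cut]):
                if len(keep) < min_support:
                    continue
                mean = sum(labels[i] for i in keep) / len(keep)
                if best is None or mean > best[0]:
                    best = (mean, keep)
        if best is None or best[0] <= current:
            break
        idx = best[1]
    box = [(min(points[i][d] for i in idx), max(points[i][d] for i in idx))
           for d in range(dims)]
    return box, current

rng = random.Random(0)
# Stage 1: a small number of "expensive" simulation runs.
runs = [(p, simulate(p)) for p in ([rng.random(), rng.random()] for _ in range(40))]
# Stage 2: the metamodel labels many cheap points, which are then fed to PRIM.
cheap_points = [[rng.random(), rng.random()] for _ in range(2000)]
cheap_labels = [metamodel_label(p, runs) for p in cheap_points]
box, box_density = prim_peel(cheap_points, cheap_labels)
base_density = sum(cheap_labels) / len(cheap_labels)
print(box, box_density, base_density)
```

    The returned box is a hyper-rectangle given as one `(lower, upper)` interval per input. Only the 40 calls to `simulate` count as simulation runs; the 2000 labels come from the metamodel.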

    Revealing the Suitability of Incentive Mechanisms for the Collaborative Creation of Structured Knowledge

    Improved bibliographic reference parsing based on repeated patterns

    Auction-based traffic management : towards effective concurrent usage of road intersections

    A combining approach to find all taxon names (FAT)

    Most of the literature on natural history is hidden in millions of pages stacked up in our libraries. Various initiatives now aim at making these publications digitally accessible and searchable, applying XML markup technologies. The unique biological names play a crucial role in linking content related to a particular taxon. Thus discovering and marking them up is extremely important. Since their manual extraction and markup is cumbersome and time-intensive, it needs to be automated. In this paper, we present computational linguistics techniques and evaluate how they can help to extract taxonomic names automatically. We build on an existing approach for extraction of such names (Koning et al. 2005) and combine it with several other learning techniques. We apply them to the texts sequentially so that each technique can use the results from the preceding ones. In particular, we use structural rules, dynamic lexica with fuzzy lookups, and word-level language recognition. We use legacy documents from different sources and times as a test bed for our evaluation. The experimental results for our combining approach (FAT) show greater than 99% precision and recall. They reveal the potential of computational linguistics techniques towards an automated markup of biosystematics publications.
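
    A fuzzy lexicon lookup of the kind mentioned above might look like the following sketch. It covers only that one ingredient and is not the FAT implementation: it assumes Python's standard `difflib` similarity as the matching criterion, and `fuzzy_lookup`, `LEXICON` and the cutoff of 0.8 are hypothetical choices.

```python
import difflib

def fuzzy_lookup(token, lexicon, cutoff=0.8):
    """Return the lexicon entry closest to `token`, or None if none is similar enough.

    Similarity is difflib's ratio (case-sensitive); `cutoff` is an assumed threshold.
    """
    matches = difflib.get_close_matches(token, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical lexicon of known genus names.
LEXICON = {"Formica", "Lasius", "Myrmica"}

print(fuzzy_lookup("Formika", LEXICON))  # a misspelt or OCR-damaged name
print(fuzzy_lookup("table", LEXICON))    # an ordinary word
```

    With this lexicon, the damaged spelling "Formika" resolves to "Formica", while the ordinary word "table" finds no match.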

    A Comprehensive Study of k-Portfolios of Recent SAT Solvers

    Hard combinatorial problems such as propositional satisfiability are ubiquitous. The holy grail would be solution methods that show good performance on all problem instances. However, new approaches emerge regularly, some of which are complementary to existing solvers in that they only run faster on some instances but not on many others. While portfolios, i.e., sets of solvers, have been touted as useful, putting together such portfolios also needs to be efficient. In particular, it remains an open question how well portfolios can exploit the complementarity of solvers. This paper features a comprehensive analysis of portfolios of recent SAT solvers, the ones from the SAT Competitions 2020 and 2021. We determine optimal portfolios with exact and approximate approaches and study the impact of portfolio size k on performance. We also investigate how effective off-the-shelf prediction models are for instance-specific solver recommendations. One result is that the portfolios found with an approximate approach are as good as the optimal solution in practice. We also observe that marginal returns decrease very quickly with larger k, and our prediction models do not give way to better performance beyond very small portfolio sizes.
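
    The k-portfolio problem can be sketched on a toy runtime matrix. The code below is an illustration, not the paper's method: it assumes that a portfolio is scored by the summed runtime of its best solver per instance, uses exhaustive search as the exact approach and a greedy heuristic as one possible approximate approach, and the runtimes in `RUNTIMES` are made up.

```python
from itertools import combinations

def portfolio_cost(runtimes, portfolio):
    """Sum over instances of the best runtime among the solvers in the portfolio."""
    return sum(min(row[s] for s in portfolio) for row in runtimes)

def exact_portfolio(runtimes, k):
    """Exhaustive search over all solver subsets of size k."""
    n_solvers = len(runtimes[0])
    return min(combinations(range(n_solvers), k),
               key=lambda p: portfolio_cost(runtimes, p))

def greedy_portfolio(runtimes, k):
    """Greedy approximation: repeatedly add the solver that lowers the cost most."""
    n_solvers = len(runtimes[0])
    chosen = []
    for _ in range(k):
        best = min((s for s in range(n_solvers) if s not in chosen),
                   key=lambda s: portfolio_cost(runtimes, chosen + [s]))
        chosen.append(best)
    return tuple(sorted(chosen))

# Made-up runtimes: one row per instance, one column per solver.
RUNTIMES = [
    [1.0, 9.0, 5.0],
    [8.0, 2.0, 5.0],
    [9.0, 9.0, 1.0],
    [2.0, 8.0, 6.0],
]

for k in (1, 2, 3):
    best = exact_portfolio(RUNTIMES, k)
    print(k, best, portfolio_cost(RUNTIMES, best), greedy_portfolio(RUNTIMES, k))
```

    On this toy matrix the exact and greedy portfolios coincide for every k, and the best cost drops from 17 to 9 to 6 as k grows from 1 to 3, so the gain from each additional solver shrinks. This says nothing about real solver data; it only shows the mechanics.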

    Deploying and Evaluating Pufferfish Privacy for Smart Meter Data (Technical Report)

    Information hiding ensures privacy by transforming personalized data so that certain sensitive information cannot be inferred any more. One state-of-the-art information-hiding approach is the Pufferfish framework. It lets the users specify their privacy requirements as so-called discriminative pairs of secrets, and it perturbs data so that an adversary does not learn about the probability distribution of such pairs. However, deploying the framework on complex data such as time series requires application-specific work. This includes a general definition of the representation of secrets in the data. Another issue is that the tradeoff between Pufferfish privacy and utility of the data is largely unexplored in quantitative terms. In this study, we quantify this tradeoff for smart meter data. Such data contains fine-grained time series of power consumption from private households. Disseminating such data in an uncontrolled way puts privacy at risk. We investigate how time series of energy consumption data must be transformed to facilitate specifying secrets that Pufferfish can use. We ensure the generality of our study by looking at different information-extraction approaches, such as re-identification and non-intrusive appliance load monitoring, in combination with a comprehensive set of secrets. Additionally, we provide quantitative utility results for a real-world application, the so-called local energy market.