6 research outputs found

    Efficient Algorithms for k-Regret Minimizing Sets

    A regret minimizing set Q is a small-size representation of a much larger database P, such that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set minimizes the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors. We show that k-regret minimization is NP-complete for all dimensions d >= 3, settling an open problem from Chester et al. [VLDB 2014]. Our main algorithmic contributions are two approximation algorithms, both with provable guarantees: one based on coresets and another based on hitting sets. We perform an extensive experimental evaluation of our algorithms, using both real-world and synthetic data, and compare their performance against the solution of Chester et al. [VLDB 2014]. The results show that our algorithms are significantly faster than the greedy algorithm of Chester et al. and scale to much larger datasets, while providing answers of comparable quality.
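
    To make the objective concrete, the sketch below (a Python illustration, not the authors' implementation) estimates the maximum k-regret ratio of a candidate subset Q by sampling random nonnegative weight vectors; the function name and the uniform sampling strategy are assumptions for illustration, and sampling only approximates the true worst case over all weight vectors.

```python
import numpy as np

def k_regret_ratio(P, Q, k, num_weights=1000, seed=0):
    """Estimate the maximum k-regret ratio of subset Q w.r.t. dataset P.

    P: (n, d) array of item attribute vectors (assumed nonnegative).
    Q: (m, d) array whose rows are a subset of P's rows.
    k: rank used on the full dataset P.

    For each sampled weight vector w, the score of an item is its inner
    product with w; the regret at w compares the top-1 score in Q with
    the top-k score in P:
        regret(w) = max(0, topk_P(w) - top1_Q(w)) / topk_P(w)
    The function returns the maximum regret over the sampled weights.
    """
    rng = np.random.default_rng(seed)
    d = P.shape[1]
    worst = 0.0
    for _ in range(num_weights):
        w = rng.random(d)                       # nonnegative preference vector
        kth_best = np.partition(P @ w, -k)[-k]  # top-k score in P
        best_in_Q = (Q @ w).max()               # top-1 score in Q
        if kth_best > 0:
            worst = max(worst, max(0.0, kth_best - best_in_Q) / kth_best)
    return worst
```

    The hard part, and the subject of the paper, is choosing a small Q whose regret ratio stays small for every weight vector, not just for sampled ones.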

    Faster Multidimensional Data Queries on Infrastructure Monitoring Systems

    Analytics in online performance monitoring systems are often limited by the query performance of large-scale multidimensional data. In this paper, we introduce a faster query approach using the bit-sliced index (BSI). Our study covers multidimensional grouping and preference top-k queries with the BSI, algorithm design, time-complexity evaluation, and query-time comparisons on a real-time production performance monitoring system. We extend the BSI algorithms to cover attribute filtering and multidimensional grouping, and we evaluate query time with a single attribute, multiple attributes, feature filtering, and multidimensional grouping. To compare against prior work, we benchmark against bitmap indexing, sequential scan, and collection-stream grouping. In our experiments with large-scale production data, the proposed BSI approach outperforms the prior approaches: it is 3 times faster than bitmap indexing on single-attribute top-k queries and 10 times faster than the collection-stream approach on multidimensional grouping. Compared with the baseline sequential scan, our BSI approach is faster by a factor of 10 on multiple-attribute queries and by a factor of 100 on single-attribute queries. Whereas our previous work evaluated the BSI's time and space complexity on simulated data with various distributions, this work further studies and evaluates the query performance of the BSI approach on real production data.
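
    As a rough illustration of the bit-sliced top-k idea (not the paper's production implementation, which also covers filtering, grouping, and real bitmap storage), the sketch below stores each bit position of the attribute values as one bitmap, using Python integers as arbitrary-precision bit vectors, and scans the slices from the most significant bit downward to narrow the candidate set; the tie-breaking rule at the end is an assumption.

```python
def bsi_topk(values, k):
    """Return indices of the k largest values via a bit-sliced index.

    slices[b] is a bitmap (a Python int) whose bit i is set iff bit b
    of values[i] is set. Scanning slices from the most significant bit
    maintains G (records certain to be in the top-k) and E (remaining
    candidates), so the work scales with the number of slices rather
    than with a full sort of the records.
    """
    n = len(values)
    width = max(values).bit_length() if n else 0
    slices = [0] * width
    for i, v in enumerate(values):           # build the bit-sliced index
        for b in range(width):
            if (v >> b) & 1:
                slices[b] |= 1 << i
    G = 0                                    # certainly in the top-k
    E = (1 << n) - 1                         # still candidates
    for b in reversed(range(width)):
        X = E & slices[b]                    # candidates with bit b set
        count = bin(G | X).count("1")
        if count > k:
            E = X                            # top-k all have bit b set
        elif count < k:
            G |= X                           # X is certainly in the top-k
            E &= ~slices[b]
        else:
            G |= X
            E = 0
            break
    result = [i for i in range(n) if (G >> i) & 1]
    for i in range(n):                       # fill remaining slots from ties
        if len(result) >= k:
            break
        if (E >> i) & 1:
            result.append(i)
    return result

# Example: bsi_topk([5, 3, 7, 1], 2) -> [0, 2] (values 5 and 7).
```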

    Similarity-aware query refinement for data exploration


    Algorithms for continuous queries: A geometric approach

    There has been unprecedented growth in both the amount of data and the number of users interested in different types of data. Users often want to keep track of the data that match their interests over a period of time. A continuous query, once issued by a user, maintains the matching results for that user as new data (as well as updates to existing data) continue to arrive in a stream. However, supporting potentially millions of continuous queries is a huge challenge. This dissertation addresses the problem of scalably processing a large number of continuous queries over a wide-area network.

    Conceptually, the task of supporting distributed continuous queries can be divided into two components: event processing (computing the set of affected users for each data update) and notification dissemination (notifying the set of affected users). The first part of this dissertation focuses on event processing. Since interacting with large-scale data can easily frustrate and overwhelm users, top-k queries have attracted considerable interest from the database community, as they allow users to focus on the top-ranked results only. However, it is nearly impossible to find a set of common top-ranked data that everyone is interested in; therefore, users are allowed to specify their interests in different forms of preferences, such as personalized ranking functions and range selections. This dissertation presents geometric frameworks, data structures, and algorithms for answering several types of preference queries efficiently. Experimental evaluations show that our approaches outperform previous ones by orders of magnitude.

    The second part of the dissertation presents comprehensive solutions to the problem of processing and notifying a large number of continuous range top-k queries across a wide-area network. Simple solutions include using a content-driven network to notify all continuous queries whose ranges contain the update (ignoring top-k), or using a server to compute only the affected continuous queries and notify them individually. The former solution generates too much network traffic, while the latter overwhelms the server. This dissertation presents a geometric framework that allows the set of affected continuous queries to be described succinctly with messages that can be efficiently disseminated using content-driven networks. Fast algorithms are also developed to reformulate each update into a set of messages whose number is provably optimal, with or without knowledge of all continuous queries.

    The final component of this dissertation is the design of a wide-area dissemination network for continuous range queries. In particular, it addresses the problem of assigning users to servers in a wide-area content-based publish/subscribe system. A good assignment should consider both users' interests and locations, and balance multiple performance criteria, including bandwidth, delay, and load balance. This dissertation presents a Monte Carlo approximation algorithm as well as a simple greedy algorithm. The Monte Carlo algorithm jointly considers multiple performance criteria to find a broker-subscriber assignment and provides theoretical performance guarantees. Using this algorithm as a yardstick, the greedy algorithm is also shown to work well across a wide range of workloads.
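
    To make the event-processing half concrete, here is a minimal sketch of the naive per-query baseline mentioned above, in which a server checks every continuous range top-k query against each arriving update; the class and function names are illustrative, and the dissertation's geometric framework exists precisely to replace this per-update linear scan with a succinct batch description of the affected queries.

```python
from dataclasses import dataclass, field
import heapq

@dataclass
class RangeTopkQuery:
    """A continuous range top-k query: among items whose key falls in
    [lo, hi], keep the k items with the highest scores."""
    lo: float
    hi: float
    k: int
    topk: list = field(default_factory=list)   # min-heap of (score, item_id)

def affected_queries(queries, key, score, item_id):
    """Naive event processing: update each query's top-k result for the
    arriving item and return the queries whose results changed. Cost is
    linear in the number of continuous queries for every update."""
    affected = []
    for q in queries:
        if not (q.lo <= key <= q.hi):
            continue                            # item outside the range
        if len(q.topk) < q.k:
            heapq.heappush(q.topk, (score, item_id))
            affected.append(q)
        elif score > q.topk[0][0]:              # beats the current k-th best
            heapq.heapreplace(q.topk, (score, item_id))
            affected.append(q)
    return affected
```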

    Top-k Preferences in High Dimensions
