Clustering-Based Materialized View Selection in Data Warehouses

Author: A. Shukla
A.F. Cardenas
H. Gupta
J.R. Smith
Jonathan Goldstein
S. Rizzi
S.B. Yao
X. Baril
Publication date: 01/01/2006

Materialized view selection is a non-trivial task. Hence, its complexity must be reduced. A judicious choice of views must be cost-driven and influenced by the workload experienced by the system. In this paper, we propose a framework for materialized view selection that exploits a data mining technique (clustering) to determine clusters of similar queries. We also propose a view merging algorithm that builds a set of candidate views, as well as a greedy process for selecting a set of views to materialize. This selection is based on cost models that evaluate the cost of accessing data using views and the cost of storing these views. To validate our strategy, we executed a workload of decision-support queries on a test data warehouse, with and without using our strategy. Our experimental results demonstrate its efficiency, even when storage space is limited.
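
The greedy selection step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the cost models or the selection rule, so everything below is an assumption: each candidate view carries a hypothetical `benefit` (estimated query cost saved) and `size` (storage cost), and views are picked by benefit per unit of storage until a storage budget is exhausted.

```python
from typing import List, Tuple

# (name, benefit, size): all three fields are hypothetical illustration values,
# not the cost model of the paper.
Candidate = Tuple[str, float, float]


def greedy_select(candidates: List[Candidate], budget: float) -> List[str]:
    """Pick views by descending benefit/size ratio while they fit in the budget."""
    chosen: List[str] = []
    used = 0.0
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, benefit, size in ranked:
        # Skip views that bring no benefit or that no longer fit.
        if benefit > 0 and used + size <= budget:
            chosen.append(name)
            used += size
    return chosen


if __name__ == "__main__":
    views = [("v1", 100.0, 50.0), ("v2", 60.0, 20.0), ("v3", 30.0, 40.0)]
    print(greedy_select(views, budget=70.0))
```

A ratio-based greedy rule is only one plausible choice; it is simple and respects the storage limit, but it does not guarantee an optimal set of views.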

arXiv.org e-Print Archive

CiteSeerX

Crossref

HAL Descartes

Multi agent collaborative search based on Tchebycheff decomposition

Author: Vasile Massimiliano
Zuiani Federico
Publication venue: Springer Science and Business Media LLC
Publication date: 28/03/2013

This paper presents a novel formulation of Multi Agent Collaborative Search, for multi-objective optimization, based on Tchebycheff decomposition. A population of agents combines heuristics that aim at exploring the search space both globally (social moves) and in a neighborhood of each agent (individualistic moves). In this novel formulation the selection process is based on a combination of Tchebycheff scalarization and Pareto dominance. Furthermore, while in the previous implementation social actions were applied to the whole population of agents and individualistic actions only to an elite sub-population, in this novel formulation this mechanism is inverted. The novel agent-based algorithm is tested first on a standard benchmark of difficult problems and then on two specific problems in space trajectory design. Its performance is compared against a number of state-of-the-art multi-objective optimization algorithms. The results demonstrate that this novel agent-based search performs better than its predecessor in a number of cases and converges better than the other state-of-the-art algorithms, with a better spreading of the solutions.
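
The two selection ingredients named in this abstract, Tchebycheff scalarization and Pareto dominance, can be sketched as follows. This uses the standard weighted Tchebycheff form, g(f | w, z*) = max_i w_i |f_i - z*_i|, and assumes minimization of all objectives; the function names are illustrative and the way the paper combines the two criteria is not reproduced here.

```python
from typing import Sequence


def tchebycheff(f: Sequence[float], weights: Sequence[float],
                ideal: Sequence[float]) -> float:
    """Weighted Tchebycheff scalarization of objective vector f.

    `ideal` is the reference (utopia) point z*; smaller values are better.
    """
    return max(w * abs(fi - zi) for fi, w, zi in zip(f, weights, ideal))


def dominates(f: Sequence[float], g: Sequence[float]) -> bool:
    """True if f Pareto-dominates g (minimization): no worse in every
    objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(f, g))
    strictly_better = any(a < b for a, b in zip(f, g))
    return no_worse and strictly_better


if __name__ == "__main__":
    print(tchebycheff([2.0, 3.0], [0.5, 0.5], [0.0, 0.0]))
    print(dominates([1.0, 2.0], [2.0, 2.0]))
```

Each weight vector defines one scalar subproblem, so a set of well-spread weight vectors tends to push solutions toward different regions of the Pareto front.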

Crossref

University of Strathclyde Institutional Repository

An LSH Index for Computing Kendall's Tau over Top-k Lists

Author: Michel Sebastian
Pal Koninika
Publication date: 01/01/2014

We consider the problem of similarity search within a set of top-k lists under the Kendall's Tau distance function. This distance describes how related two rankings are in terms of concordantly and discordantly ordered items. As top-k lists are usually very short compared to the global domain of possible items to be ranked, creating an inverted index to look up overlapping lists is possible but does not capture the similarity measure tightly enough. In this work, we investigate locality sensitive hashing schemes for the Kendall's Tau distance and evaluate the proposed methods using two real-world datasets. Comment: 6 pages, 8 subfigures, presented in Seventeenth International Workshop on the Web and Databases (WebDB 2014) co-located with ACM SIGMOD201
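
Because two top-k lists need not contain the same items, the classical Kendall's Tau has to be generalized before it can be applied. The sketch below implements one common generalization, the K^(p) distance of Fagin, Kumar and Sivakumar; the abstract does not say which variant the paper uses, so the choice of variant and the penalty parameter `p` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Hashable, Sequence


def kendall_tau_topk(a: Sequence[Hashable], b: Sequence[Hashable],
                     p: float = 0.0) -> float:
    """Generalized Kendall's Tau distance K^(p) between two top-k lists.

    Items missing from a list are treated as ranked below all of its items.
    `p` in [0, 1] is the penalty for a pair that appears in only one list.
    """
    ra = {x: i for i, x in enumerate(a)}
    rb = {x: i for i, x in enumerate(b)}
    dist = 0.0
    for i, j in combinations(set(ra) | set(rb), 2):
        in_a = (i in ra, j in ra)
        in_b = (i in rb, j in rb)
        if all(in_a) and all(in_b):
            # Both items ranked in both lists: count a discordant pair.
            if (ra[i] - ra[j]) * (rb[i] - rb[j]) < 0:
                dist += 1
        elif all(in_a) and any(in_b):
            # Pair ranked in a, one item missing from b (implicitly last in b).
            present, absent = (i, j) if in_b[0] else (j, i)
            if ra[absent] < ra[present]:
                dist += 1
        elif all(in_b) and any(in_a):
            # Symmetric case: pair ranked in b, one item missing from a.
            present, absent = (i, j) if in_a[0] else (j, i)
            if rb[absent] < rb[present]:
                dist += 1
        elif all(in_a) or all(in_b):
            # Pair ranked in one list and entirely absent from the other.
            dist += p
        else:
            # Each item appears in a different list: always discordant.
            dist += 1
    return dist


if __name__ == "__main__":
    print(kendall_tau_topk([1, 2, 3], [1, 3, 4]))
```

This pairwise computation is quadratic in the list length, which is acceptable for short top-k lists; the cost that an index targets is the number of list-to-list comparisons, not the cost of a single comparison.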

arXiv.org e-Print Archive

MPG.PuRe

Rapidash: Efficient Constraint Discovery via Rapid Verification

Author: Deep Shaleen
Fariha Anna
Floratou Avrilia
Liu Zifan
Psallidas Fotis
Tiwari Ashish
Publication date: 21/09/2023

Denial Constraint (DC) is a well-established formalism that captures a wide range of integrity constraints commonly encountered, including candidate keys, functional dependencies, and ordering constraints, among others. Given their significance, there has been considerable research interest in achieving fast verification and discovery of exact DCs within the database community. Despite the significant advancements in the field, prior work exhibits notable limitations when confronted with large-scale datasets. The current state-of-the-art exact DC verification algorithm demonstrates a quadratic (worst-case) time complexity relative to the dataset's number of rows. In the context of DC discovery, existing methodologies rely on a two-step algorithm that commences with an expensive data structure-building phase, often requiring hours to complete even for datasets containing only a few million rows. Consequently, users are left without any insights into the DCs that hold on their dataset until this lengthy building phase concludes. In this paper, we introduce Rapidash, a comprehensive framework for DC verification and discovery. Our work makes a dual contribution. First, we establish a connection between orthogonal range search and DC verification. We introduce a novel exact DC verification algorithm that demonstrates near-linear time complexity, representing a theoretical improvement over prior work. Second, we propose an anytime DC discovery algorithm that leverages our novel verification algorithm to gradually provide DCs to users, eliminating the need for the time-intensive building phase observed in prior work. To validate the effectiveness of our algorithms, we conduct extensive evaluations on four large-scale production datasets. Our results reveal that our DC verification algorithm achieves up to 40 times faster performance compared to state-of-the-art approaches. Comment: comments and suggestions are welcome
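
To give a feel for why DC verification need not compare every pair of rows, here is a small sketch for one specific ordering constraint. It is not the Rapidash algorithm and does not use orthogonal range search; it only shows, under assumptions, that a single two-predicate DC can be checked in O(n log n) by sorting. The example constraint (no row may have a strictly higher salary and a strictly lower tax than another row) and the `(salary, tax)` row layout are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Optional, Tuple

Row = Tuple[float, float]  # (salary, tax): hypothetical schema for illustration


def violates_order_dc(rows: Iterable[Row]) -> bool:
    """Check the DC  not(t.salary > s.salary and t.tax < s.tax).

    Returns True if some pair of rows violates the constraint.
    """
    max_tax_lower: Optional[float] = None  # max tax among strictly lower salaries
    # Process rows in groups of equal salary, in ascending salary order.
    for _, group in groupby(sorted(rows), key=lambda r: r[0]):
        taxes = [tax for _, tax in group]
        # A violation exists if a row here pays less tax than some row
        # with a strictly smaller salary.
        if max_tax_lower is not None and min(taxes) < max_tax_lower:
            return True
        group_max = max(taxes)
        if max_tax_lower is None or group_max > max_tax_lower:
            max_tax_lower = group_max
    return False


if __name__ == "__main__":
    print(violates_order_dc([(10, 1), (20, 2), (30, 3)]))
    print(violates_order_dc([(10, 5), (20, 2)]))
```

Rows with equal salaries are grouped so that they are never compared against each other, since the constraint only concerns strictly higher salaries.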

arXiv.org e-Print Archive

Designing algorithms for big graph datasets: a study of computing bisimulation and joins

Author: Luo Y.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2015

Repository TU/e

Pure OAI Repository


Physical Plan Instrumentation in Databases: Mechanisms and Applications

Author: Psallidas Fotis
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2019

Database management systems (DBMSs) are designed to compile SQL queries to physical plans that, when executed, provide results to the SQL queries. Building on this functionality, an ever-increasing number of application domains (e.g., provenance management, online query optimization, physical database design, interactive data profiling, monitoring, and interactive data visualization) seek to operate on how queries are executed by the DBMS for a wide variety of purposes ranging from debugging and data explanation to optimization and monitoring. Unfortunately, DBMSs provide little, if any, support to facilitate the development of this class of important application domains. As a result, database application developers and database system architects either rewrite the database internals in ad hoc ways; work around the SQL interface, if possible, with inevitable performance penalties; or even build new databases from scratch only to express and optimize their domain-specific application logic over how queries are executed. To address this problem in a principled manner, in this dissertation we introduce a prototype DBMS, namely Smoke, that exposes instrumentation mechanisms in the form of a framework to allow external applications to manipulate physical plans. Intuitively, a physical plan is the underlying representation that DBMSs use to encode how a SQL query will be executed, and providing instrumentation mechanisms at this representation level allows applications to express and optimize their logic on how queries are executed. Having such an instrumentation-enabled DBMS in place, we then consider how to express and optimize applications that base their logic on how queries are executed.
To best demonstrate the expressive and optimization power of instrumentation-enabled DBMSs, we express and optimize applications across several important domains including provenance management, interactive data visualization, interactive data profiling, physical database design, online query optimization, and query discovery. Expressivity-wise, we show that Smoke can express known techniques, introduce novel semantics on known techniques, and introduce new techniques across domains. Performance-wise, we show case by case that Smoke is on par with or up to several orders of magnitude faster than state-of-the-art imperative and declarative implementations of important applications across domains. As such, we believe our contributions provide evidence and form the basis towards a class of instrumentation-enabled DBMSs designed to express and optimize applications across important domains with core logic over how queries are executed by DBMSs.

Columbia University Academic Commons