4,997 research outputs found

    Distributed top-k aggregation queries at large

    Get PDF
    Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network

    Techniques for improving efficiency and scalability for the integration of information retrieval and databases

    Get PDF
    PhDThis thesis is on the topic of integration of Information Retrieval (IR) and Databases (DB), with particular focuses on improving efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data. Our specific interest in this thesis is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrievals for text and data are to be expressed in probabilistic logical programs such as probabilistic relational algebra or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we proposed three optimization techniques that focus on aspects covered logical and physical layers, which include: scoring-driven query optimization using scoring expression, query processing with top-k incorporated pipeline, and indexing with relational inverted index. Specifically, scoring expressions are proposed for expressing the scoring or probabilistic semantics of implied scoring functions of PRA expressions, so that efficient query execution plan can be generated by rule-based scoring-driven optimizer. Secondly, to balance efficiency and effectiveness so that to improve query response time, we studied methods for incorporating topk algorithms into pipelined query execution engine for IR+DB systems. Thirdly, the proposed relational inverted index integrates IR-style inverted index and DB-style tuple-based index, which can be used to support efficient probability estimation and aggregation as well as conventional relational operations. Experiments were carried out to investigate the performances of proposed techniques. Experimental results showed that the efficiency and scalability of an IR+DB prototype have been improved, while the system can handle queries efficiently on considerable large data sets for a number of IR tasks

    Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

    Full text link
    Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.Comment: 61 pages, 9 figures, 2 table

    10381 Summary and Abstracts Collection -- Robust Query Processing

    Get PDF
    Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise for the purpose of fostering discussion and collaboration regarding causes, opportunities, and solutions for achieving robust query processing. The seminar strove to build a unified view across the loosely-coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. In order to pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes of some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers that were contributed by some participants

    An Adaptive Mechanism for Accurate Query Answering under Differential Privacy

    Full text link
    We propose a novel mechanism for answering sets of count- ing queries under differential privacy. Given a workload of counting queries, the mechanism automatically selects a different set of "strategy" queries to answer privately, using those answers to derive answers to the workload. The main algorithm proposed in this paper approximates the optimal strategy for any workload of linear counting queries. With no cost to the privacy guarantee, the mechanism improves significantly on prior approaches and achieves near-optimal error for many workloads, when applied under (\epsilon, \delta)-differential privacy. The result is an adaptive mechanism which can help users achieve good utility without requiring that they reason carefully about the best formulation of their task.Comment: VLDB2012. arXiv admin note: substantial text overlap with arXiv:1103.136

    Rank-aware, Approximate Query Processing on the Semantic Web

    Get PDF
    Search over the Semantic Web corpus frequently leads to queries having large result sets. So, in order to discover relevant data elements, users must rely on ranking techniques to sort results according to their relevance. At the same time, applications oftentimes deal with information needs, which do not require complete and exact results. In this thesis, we face the problem of how to process queries over Web data in an approximate and rank-aware fashion

    New IR & Ranking Algorithm for Top-K Keyword Search on Relational Databases ‘Smart Search’

    Get PDF
    Database management systems are as old as computers, and the continuous research and development in databases is huge and an interest of many database venders and researchers, as many researchers work in solving and developing new modules and frameworks for more efficient and effective information retrieval based on free form search by users with no knowledge of the structure of the database. Our work as an extension to previous works, introduces new algorithms and components to existing databases to enable the user to search for keywords with high performance and effective top-k results. Work intervention aims at introducing new table structure for indexing of keywords, which would help algorithms to understand the semantics of keywords and generate only the correct CN‟s (Candidate Networks) for fast retrieval of information with ranking of results according to user‟s history, semantics of keywords, distance between keywords and match of keywords. In which a three modules where developed for this purpose. We implemented our three proposed modules and created the necessary tables, with the development of a web search interface called „Smart Search‟ to test our work with different users. The interface records all user interaction with our „Smart Search‟ for analyses, as the analyses of results shows improvements in performance and effective results returned to the user. We conducted hundreds of randomly generated search terms with different sizes and multiple users; all results recorded and analyzed by the system were based on different factors and parameters. We also compared our results with previous work done by other researchers on the DBLP database which we used in our research. Our final result analysis shows the importance of introducing new components to the database for top-k keywords search and the performance of our proposed system with high effective results.نظم إدارة قواعد البيانات قديمة مثل أجيزة الكمبيوتر، و البحث والتطوير المستمر في قواعد بيانات ضخم و ىنالك اىتمام من العديد من مطوري قواعد البيانات والباحثين، كما يعمل العديد من الباحثين في حل وتطوير وحدات جديدة و أطر السترجاع المعمومات بطرق أكثر كفاءة وفعالية عمى أساس نموذج البحث الغير مقيد من قبل المستخدمين الذين ليس لدييم معرفة في بنية قاعدة البيانات. ويأتي عممنا امتدادا لألعمال السابقة، ويدخل الخوارزميات و مكونات جديدة لقواعد البيانات الموجودة لتمكين المستخدم من البحث عن الكممات المفتاحية )search Keyword )مع األداء العالي و نتائج فعالة في الحصول عمى أعمى ترتيب لمبيانات .)Top-K( وييدف ىذا العمل إلى تقديم بنية جديدة لفيرسة الكممات المفتاحية )Table Keywords Index ،)والتي من شأنيا أن تساعد الخوارزميات المقدمة في ىذا البحث لفيم معاني الكممات المفتاحية المدخمة من قبل المستخدم وتوليد فقط الشبكات المرشحة (s’CN (الصحيحة السترجاع سريع لممعمومات مع ترتيب النتائج وفقا ألوزان مختمفة مثل تاريخ البحث لممستخدم، ترتيب الكمات المفتاحية في النتائج والبعد بين الكممات المفتاحية في النتائج بالنسبة لما قام المستخدم بأدخالو. قمنا بأقتراح ثالث مكونات جديدة )Modules )وتنفيذىا من خالل ىذه االطروحة، مع تطوير واجية البحث عمى شبكة اإلنترنت تسمى "البحث الذكي" الختبار عممنا مع المستخدمين. وتتضمن واجية البحث مكونات تسجل تفاعل المستخدمين وتجميع تمك التفاعالت لمتحميل والمقارنة، وتحميالت النتائج تظير تحسينات في أداء استرجاع البينات و النتائج ذات صمة ودقة أعمى. أجرينا مئات عمميات البحث بأستخدام جمل بحث تم أنشائيا بشكل عشوائي من مختمف األحجام، باالضافة الى االستعانة بعدد من المستخدمين ليذه الغاية. واستندت جميع النتائج المسجمة وتحميميا بواسطة واجية البحث عمى عوامل و معايير مختمفة .وقمنا بالنياية بعمل مقارنة لنتائجنا مع االعمال السابقة التي قام بيا باحثون آخرون عمى نفس قاعدة البيانات (DBLP (الشييرة التي استخدمناىا في أطروحتنا. وتظير نتائجنا النيائية مدى أىمية أدخال بنية جديدة لفيرسة الكممات المفتاحية الى قواعد البيانات العالئقية، وبناء خوارزميات استنادا الى تمك الفيرسة لمبحث بأستخدام كممات مفتاحية فقط والحصول عمى نتائج أفضل ودقة أعمى، أضافة الى التحسن في وقت البحث
    corecore