
    Scoop: An Adaptive Indexing Scheme for Stored Data in Sensor Networks

    In this paper, we present the design of Scoop, a system for indexing and querying stored data in sensor networks. Scoop works by collecting statistics about the rate of queries and the distribution of sensor readings over a sensor network, and uses those statistics to build an index that tells nodes where in the network to store their readings. Using this index, a user's queries over that stored data can be answered efficiently, without flooding those queries throughout the network. This approach offers a substantial advantage over other solutions that either store all data externally on a basestation (requiring every reading to be collected from all nodes), or store all data locally on the node that produced it (requiring queries to be flooded throughout the network). Our results show that Scoop offers a factor-of-four improvement over existing techniques in a real implementation on a 64-node mote-based sensor network. These results also show that Scoop is able to efficiently adapt to changes in the distribution and rates of data and queries.
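
    The core mechanism can be pictured as a cost-weighted placement decision. The sketch below is a minimal illustration of that idea under assumed inputs; the function name, the per-bucket rate tables, and the choice of node 0 as the basestation are all illustrative assumptions, not Scoop's actual implementation:

```python
# A minimal sketch, assuming simplified inputs: for each bucket of sensor
# values, pick the storage node minimizing the expected per-epoch cost of
# shipping readings to it plus routing queries to it.

def choose_storage_nodes(buckets, nodes, write_rate, query_rate, hop_cost):
    """Return a map bucket-index -> node id chosen to store that bucket.

    write_rate[b][n]: readings per epoch that node n produces in bucket b
    query_rate[b]:    queries per epoch that touch bucket b
    hop_cost[a][b]:   transmission cost (hops) between nodes a and b
    """
    index = {}
    for b in range(len(buckets)):
        def expected_cost(target):
            storage = sum(write_rate[b][n] * hop_cost[n][target] for n in nodes)
            querying = query_rate[b] * hop_cost[0][target]  # queries enter at node 0
            return storage + querying
        index[b] = min(nodes, key=expected_cost)
    return index
```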

    A formal analysis of why heuristic functions work

    Many optimization problems in computer science have been proven to be NP-hard, and it is unlikely that polynomial-time algorithms that solve these problems exist unless P=NP. Alternatively, they are solved using heuristic algorithms, which provide a sub-optimal solution that, hopefully, is arbitrarily close to the optimal. Such problems are found in a wide range of applications, including artificial intelligence, game theory, graph partitioning, database query optimization, etc. Consider a heuristic algorithm, A. Suppose that A could invoke one of two possible heuristic functions. The question of determining which heuristic function is superior has typically demanded a yes/no answer, one which is often substantiated by empirical evidence. In this paper, by using Pattern Classification Techniques (PCT), we propose a formal, rigorous theoretical model that provides a stochastic answer to this problem. We prove that given a heuristic algorithm, A, that could utilize either of two heuristic functions H1 or H2 to find the solution to a particular problem, if the accuracy of evaluating the cost of the optimal solution using H1 is greater than the accuracy of evaluating the cost using H2, then H1 has a higher probability than H2 of leading to the optimal solution. This previously unproven conjecture has been the basis for designing numerous algorithms such as the A* algorithm and its variants. Apart from formally proving the result, we also address the corresponding database query optimization problem, which has been open for at least two decades. To validate our proofs, we report empirical results on database query optimization techniques involving a few well-known histogram estimation methods.
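
    The claim is easy to probe numerically. The following is a small simulation of my own construction (not the paper's proof or its PCT model): two heuristics are modeled as cost estimators with different Gaussian noise levels, and we measure how often the candidate each ranks best is the true optimum:

```python
# Simulation: a more accurate heuristic (smaller estimation noise) leads a
# pick-the-lowest-estimate algorithm to the true optimum more often.

import random

def pick_rate(noise, candidates=10, trials=20000):
    """Fraction of trials in which the lowest-estimated-cost candidate
    is the true optimum, for a heuristic with Gaussian estimation noise."""
    hits = 0
    for _ in range(trials):
        true_costs = [random.uniform(0, 100) for _ in range(candidates)]
        estimates = [c + random.gauss(0, noise) for c in true_costs]
        best_est = min(range(candidates), key=lambda i: estimates[i])
        best_true = min(range(candidates), key=lambda i: true_costs[i])
        hits += (best_est == best_true)
    return hits / trials

print("accurate H1:", pick_rate(noise=5.0))   # higher success probability
print("sloppy   H2:", pick_rate(noise=30.0))  # lower success probability
```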

    Robust Query Optimization Methods With Respect to Estimation Errors: A Survey

    The quality of a query execution plan chosen by a Cost-Based Optimizer (CBO) depends greatly on the estimation accuracy of input parameter values. Many research results have been produced on improving estimation accuracy, but they do not work in every situation. Therefore, "robust query optimization" was introduced, in an effort to minimize the sub-optimality risk by accepting the fact that estimates could be inaccurate. In this survey, we aim to provide an overview of robust query optimization methods by classifying them into different categories, explaining the essential ideas, listing their advantages and limitations, and comparing them using multiple criteria.
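
    One recurring idea in this literature is to cost plans over an interval of plausible parameter values rather than at a single point estimate. The sketch below illustrates that flavor of method with invented plan cost functions; it is a generic illustration, not a method drawn from any specific system in the survey:

```python
# Min-max plan selection: cost each plan across a range of plausible
# cardinalities and choose the plan whose worst case is best.

def robust_choice(plans, card_low, card_high, samples=20):
    """plans: dict name -> cost(cardinality); returns the min-max-cost plan."""
    step = (card_high - card_low) / (samples - 1)
    grid = [card_low + i * step for i in range(samples)]
    return min(plans, key=lambda p: max(plans[p](c) for c in grid))

plans = {
    "index_nested_loop": lambda card: 50 + 12.0 * card,   # cheap at low cardinality
    "hash_join":         lambda card: 900 + 1.5 * card,   # flat, predictable cost
}
print(robust_choice(plans, card_low=10, card_high=5000))  # -> hash_join
```

    A classical point-estimate optimizer given a low estimate would pick the nested-loop plan and risk a large blow-up; the min-max choice trades a little expected cost for a bounded worst case.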

    Join sizes, urn models and normal limiting distributions

    We study some parameters of relational databases (sizes of relations obtained by a join) that can be described by generating functions in three variables, of the kind ϕ(x, y, z)^d. We model these parameters by suitable urn models and give conditions under which they asymptotically follow a Gaussian distribution.
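
    The urn-model view admits a quick empirical check. In the simulation below (my illustration, with assumed relation sizes and d = 50 urn values, not the paper's analytic setup), the join size sum_v R_v * S_v concentrates around n*m/d with an approximately normal spread:

```python
# Urn model for join sizes: throw the join-attribute values of relations
# R and S into d urns; the join size is the sum over urns of R_v * S_v.

import random
from statistics import mean, stdev

def join_size(n, m, d):
    r = [0] * d
    s = [0] * d
    for _ in range(n):
        r[random.randrange(d)] += 1
    for _ in range(m):
        s[random.randrange(d)] += 1
    return sum(rv * sv for rv, sv in zip(r, s))

sizes = [join_size(n=1000, m=1000, d=50) for _ in range(2000)]
print(mean(sizes), stdev(sizes))  # mean close to n*m/d = 20000, roughly Gaussian
```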

    Characteristic sets profile features: Estimation and application to SPARQL query planning

    RDF dataset profiling is the task of extracting a formal representation of a dataset’s features. Such features may cover various aspects of the RDF dataset, ranging from information on licensing and provenance to statistical descriptors of the data distribution and its semantics. In this work, we focus on the characteristic sets profile features, which capture both structural and semantic information of an RDF dataset, making them a valuable resource for different downstream applications. While previous research demonstrated the benefits of characteristic sets in centralized and federated query processing, access to these fine-grained statistics is taken for granted. However, especially in federated query processing, computing this profile feature is challenging, as it can be difficult and/or costly to access and process the entire data from all federation members. We address this shortcoming by introducing the concept of a profile feature estimation and propose a sampling-based approach to generate estimations for the characteristic sets profile feature. In addition, we showcase the applicability of these feature estimations in federated querying by proposing a query planning approach that is specifically designed to leverage them. In our first experimental study, we intrinsically evaluate our approach on the representativeness of the feature estimation. The results show that even small samples of just 0.5% of the original graph’s entities allow for estimating both structural and statistical properties of the characteristic sets profile features. Our second experimental study extrinsically evaluates the estimations by investigating their applicability in our query planner using the well-known FedBench benchmark. The results of the experiments show that the estimated profile features allow for obtaining efficient query plans.
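
    Concretely, a characteristic set is the set of predicates a subject appears with, and the profile counts how many subjects share each set. The sketch below shows the feature and a naive sampling-based estimator; the function names and the uniform entity sample are my assumptions, not the paper's estimator design:

```python
# Characteristic sets profile: count subjects per predicate-set, and
# estimate that profile from a small entity sample scaled to full size.

from collections import defaultdict
import random

def characteristic_sets(triples):
    """Map each subject to its predicate set, then count how many
    subjects share each set (the characteristic sets profile)."""
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)
    profile = defaultdict(int)
    for ps in preds.values():
        profile[frozenset(ps)] += 1
    return dict(profile)

def estimate_profile(triples, fraction=0.005):
    """Estimate the profile from a sample of subjects (e.g. 0.5%)."""
    subjects = list({s for s, _, _ in triples})
    sample = set(random.sample(subjects, max(1, int(len(subjects) * fraction))))
    sampled = [(s, p, o) for s, p, o in triples if s in sample]
    scale = len(subjects) / len(sample)
    return {cs: round(n * scale) for cs, n in characteristic_sets(sampled).items()}
```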

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
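
    The "simpler results" mentioned above are easy to make concrete. The toy profiler below computes null counts, distinct counts, a crude type guess, and the most frequent value patterns for one column; it is an illustration of the task, not a tool from the survey:

```python
# Single-column profiling: nulls, distinct values, a type guess, and
# frequent value patterns (digits -> 9, letters -> A, punctuation kept).

import re
from collections import Counter

def profile_column(values):
    non_null = [v for v in values if v not in (None, "")]
    def pattern(v):
        return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", str(v)))
    is_num = all(re.fullmatch(r"-?\d+(\.\d+)?", str(v)) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "type": "numeric" if non_null and is_num else "string",
        "top_patterns": Counter(pattern(v) for v in non_null).most_common(3),
    }

print(profile_column(["2021-01-05", "2021-02-17", None, "2021-11-30"]))
# {'nulls': 1, 'distinct': 3, 'type': 'string', 'top_patterns': [('9999-99-99', 3)]}
```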

    Optimization of Regular Path Queries in Graph Databases

    Regular path queries offer a powerful navigational mechanism in graph databases. Recently, there has been renewed interest in such queries in the context of the Semantic Web. The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. While eminently useful, such queries are difficult to optimize and evaluate efficiently. We design and implement a cost-based optimizer, which we call Waveguide, for SPARQL queries with property paths. Waveguide builds a query plan, which we call a waveplan (WP), which guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude, with the best plan often offering excellent performance. A WP's cost can be estimated, which opens the way to cost-based optimization. We demonstrate that Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. We analyze the effective plan space which is enabled by Waveguide and design an efficient enumerator for it. We implement a prototype of a Waveguide cost-based optimizer on top of an open-source relational RDF store. Finally, we perform a comprehensive performance study of the state of the art for the evaluation of SPARQL property paths and demonstrate the significant performance gains that Waveguide offers.
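
    For context, the textbook baseline that plan-based evaluators improve upon is a search over the product of the graph and a finite automaton for the path expression. The sketch below implements that baseline; it is my simplification and has none of Waveguide's waveplan machinery or cost model:

```python
# Baseline regular path query evaluation: BFS over the product of the
# graph and an automaton for the path expression.

from collections import deque

def rpq(edges, start, automaton, initial, accepting):
    """edges: set of (u, label, v); automaton: dict (state, label) -> state.
    Returns graph nodes reachable from `start` along an accepted label path."""
    adj = {}
    for u, lbl, v in edges:
        adj.setdefault(u, []).append((lbl, v))
    seen = {(start, initial)}
    queue = deque(seen)
    results = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            results.add(node)
        for lbl, nxt in adj.get(node, []):
            ns = automaton.get((state, lbl))
            if ns is not None and (nxt, ns) not in seen:
                seen.add((nxt, ns))
                queue.append((nxt, ns))
    return results

# Example: evaluate the property path knows+ starting from node "a".
edges = {("a", "knows", "b"), ("b", "knows", "c")}
print(rpq(edges, "a", {(0, "knows"): 1, (1, "knows"): 1}, 0, {1}))  # {'b', 'c'}
```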

    Improving the Accuracy of Cost Calculation for Database Query Optimization by Introducing a Performance Model That Accounts for CPU Architecture

    Non-volatile memory is applied not only to storage subsystems but also to the main memory of computers to improve performance and increase capacity. In the near future, some in-memory database systems will use non-volatile main memory as a durable medium instead of existing storage devices, such as hard disk drives or solid-state drives. In addition, cloud computing is gaining more attention, and users are increasingly demanding performance improvement. In particular, the Database-as-a-Service (DBaaS) market is rapidly expanding. Attempts to improve database performance have led to the development of in-memory databases using non-volatile memory as a durable database medium rather than existing storage devices. For such in-memory database systems, Input/Output (I/O) processing is replaced by much cheaper memory access, so the Central Processing Unit (CPU) cost weighs more heavily in selecting the most suitable access path for a database query. Therefore, a high-precision cost calculation method for query execution is required. In particular, when the database system cannot select the most appropriate join method, the query execution time increases. Moreover, in a cloud computing environment, the CPU architectures of different physical servers may be of different generations. The cost model is therefore also required to be applicable to CPUs of different generations through minor modification, so as not to increase the database administrator's duties. To improve the accuracy of the cost calculation, a cost calculation method based on the CPU architecture, using statistical information measured by a performance monitor embedded within the CPU (hereinafter called the measurement-based cost calculation method), is proposed, and the accuracy of estimating the intersection (hereinafter called the cross point) of the cost calculation formulas for join methods is evaluated. In this calculation method, we concentrate on the instruction-issuing part of the instruction pipeline inside the CPU architecture. The cost of database search processing is classified into three types (data cache access, instruction cache miss penalty, and branch misprediction penalty), and a cost calculation formula is constructed for each. Moreover, each cost calculation formula models the relationship between the statistical information measured by the performance monitor embedded within the CPU and the selectivity of the table while executing join operations. The statistical information measured by the performance monitor comprises counts such as the number of executed instructions and the number of cache hits. In addition, the access path of a join is separated into repeatedly appearing elements, cost calculation formulas are formed into parts for each element, and the cost for an arbitrary number of joined tables is calculated by combining the parts. First, to investigate the feasibility of the proposed method, a cost formula for a two-table join was constructed using a large database, the 100 GB TPC Benchmark(TM) H database. The accuracy of the cost calculation was evaluated by comparing the measured cross point with the estimated cross point. The results indicated that the difference between the predicted cross point and the measured cross point was less than 0.1% selectivity and was reduced by 71% to 94% compared with the difference between the cross point obtained by the conventional method and the measured cross point. Therefore, the proposed cost calculation method can improve the accuracy of join cost calculation.
    Then, to reduce the operating time of database administration, the cost calculation formula was constructed under the condition that the database for measuring the statistical values was reduced to a small scale (5 GB). The accuracy of the cost calculations was also evaluated when joining three or more tables. As a result, the difference between the predicted cross point and the measured cross point was reduced by 74% to 95% compared with the difference between the cross point obtained by the conventional method and the measured cross point, which means the proposed method can improve the accuracy of the cost calculation. Finally, a method is also proposed for updating the cost calculation formula using the measurement-based cost calculation method to support a CPU of a different architecture generation without requiring re-measurement of that CPU's statistical information. Our approach focuses on reflecting architectural changes, such as cache size and associativity, memory latency, and branch misprediction penalty, in the components of the cost calculation formulas. The updated cost calculation formulas estimated the cost of joins on different-generation CPUs accurately in 66% of the test cases. In conclusion, an in-memory database system using the proposed cost calculation method can select the best join method and can be applied to a database system with CPUs from different generations. (Tokyo Metropolitan University, 2019-03-25, Doctor of Engineering)
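
    The cross-point idea can be illustrated schematically. In the sketch below, each join method's CPU cost is the sum of the three component formulas named above, each taken to be linear in selectivity; the coefficients are invented for illustration and are not the thesis's measured model:

```python
# Schematic cross-point estimation: each join method's cost is the sum of
# three components (data cache access, instruction cache miss penalty,
# branch misprediction penalty), each modeled here as linear in selectivity.

def cost(selectivity, data_cache, icache_penalty, branch_penalty):
    return sum(base + slope * selectivity
               for base, slope in (data_cache, icache_penalty, branch_penalty))

def nested_loop(sel):
    return cost(sel, data_cache=(100, 9000), icache_penalty=(20, 500),
                branch_penalty=(30, 1500))

def hash_join(sel):
    return cost(sel, data_cache=(800, 2000), icache_penalty=(150, 100),
                branch_penalty=(200, 300))

# scan selectivities to find where the preferred join method flips
sel = 0.0
while sel <= 1.0 and nested_loop(sel) <= hash_join(sel):
    sel += 0.001
print(f"estimated cross point near selectivity {sel:.3f}")
```

    An optimizer whose formulas place this cross point too far from the measured one will keep choosing the wrong join method in the band between the two points, which is exactly the error the measurement-based method shrinks.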

    Adaptive indexing scheme for stored data in sensor networks

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 53-56). We present the design of Scoop, a system that is designed to efficiently store and query relational data collected by nodes in a bandwidth-constrained sensor network. Sensor networks allow remote environments to be monitored at very fine levels of granularity; often such monitoring deployments generate large amounts of data which may be impractical to collect due to bandwidth limitations, but which can easily be stored in-network for some period of time. Existing approaches to querying stored data in sensor networks have typically assumed that all data either is stored locally, at the node that produced it, or is hashed to some location in the network using a predefined uniform hash function. These two approaches are at the extremes of a trade-off between storage and query costs. In the former case, the costs of storing data are low, since no transmissions are required, but queries must flood the entire network. In the latter case, some queries can be executed efficiently by using the hash function to find the nodes of interest, but storage is expensive, as readings must be transmitted to some (likely far away) location in the network. In contrast, Scoop monitors changes in the distribution of sensor readings, queried values, and network connectivity to determine the best location to store data. We formulate this as an optimization problem and present a practical algorithm that solves this problem in Scoop. We have built a complete implementation of Scoop for TinyOS mote [1] sensor network hardware and evaluated its performance on a 60-node testbed and in the TinyOS simulator, TOSSIM. Our results show that Scoop not only provides substantial performance benefits over alternative approaches on a range of data sets, but is also able to efficiently adapt to changes in the distribution and rates of data and queries. by Thomer M. Gil. S.M.
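
    The storage/query trade-off described above can be made concrete with per-epoch message counts. The numbers below are assumptions chosen for illustration (a 60-node network, 4 average hops to a hashed location), not measurements from the thesis; they show how the query rate flips which extreme wins, which is the gap the adaptive approach targets:

```python
# Per-epoch message cost of the two extreme storage strategies.

def local_cost(writes, queries, n_nodes):
    # readings stay where they are produced; every query floods all nodes
    return queries * n_nodes

def hash_cost(writes, queries, avg_hops):
    # every reading and every query is routed to the hashed location
    return (writes + queries) * avg_hops

for q in (1, 10, 100):  # queries per epoch, with 50 readings per epoch
    print(q, local_cost(writes=50, queries=q, n_nodes=60),
             hash_cost(writes=50, queries=q, avg_hops=4))
# low query rates favor local storage; high query rates favor hashing,
# and an adaptive index can switch between the extremes per value range
```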

    A framework for improving the quality of management information

    Good management information is critical to the success of any organisation. Without proper information, the organisation will starve and management will steer it towards its own destruction. Although management usually receives a constant flow of information every day, it does not necessarily serve its purpose. The information could be useful and contribute to good decision making. It could also be wrong, which could contribute to poor decisions and failure. Finally, it could be available but inappropriate to the management of the organisation. Computer systems produce most of the management information in organisations. There is a variety of computer equipment and applications that could make a significant contribution to improving the quality of the information produced in an organisation. Improving the quality of the information could enhance the decisions and actions of management and therefore improve the results of the organisation. The objective of this study is to develop a framework for enabling managers to understand the attributes of quality information, and to identify appropriate computer equipment and applications to enhance the quality of information. The hypothesis is as follows: "Quality information is information that can have a decisive impact on the decisions and actions of the decision maker. It is feasible to identify the attributes of quality information. This research clarifies the main and supportive attributes of quality information. Computer equipment and applications, collectively known as computer tools, used and managed in an organisation contribute to enhancing or impairing quality information. These tools, and the role they can play in producing quality information, will be explained in this study." To support the hypothesis, the study is conducted in three phases. Phase 1 establishes the attributes of quality information. It commences by considering the historical development of computers and information. It explores the different driving forces that produced the technology of today and tomorrow. It then explores the nature and characteristics of information to develop an appreciation for the complexity and intricacies of information. This phase concludes by identifying the attributes of quality information: relevancy, accuracy, timeliness and comprehensibility. Phase 2 assesses the contribution of computer tools to producing quality information. The contribution that computer equipment and applications make to enhancing the attributes of quality information is described and evaluated. Computer tools are defined, and a method of assessing their contribution to enhancing quality is designed and applied. The phase concludes with a summary of the contribution that the tools could make to enhance specific information components. The final phase produces a framework to evaluate the production of quality information. With a clear understanding, on the one hand, of the attributes of quality information and, on the other, of the contribution that different computer equipment and applications can make to improve the quality of information, a framework is developed to help managers identify appropriate technology to improve the information on which they base their decisions. This framework could be used by information managers to improve the effectiveness of management's actions and decisions.
    The results of the study, it is submitted, support the stated hypothesis and add benefit to the practical application of information management. Dissertation (MCom)--University of Pretoria, 2009. Accounting. Unrestricted.