
    CubiST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries

    Being able to efficiently answer arbitrary OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes has been a long-standing, major concern in data warehousing. In this paper, we introduce a new data structure, called Statistics Tree (ST), together with an efficient algorithm called CubiST, for evaluating ad-hoc OLAP queries on top of a relational data warehouse. We focus on a class of queries called cube queries, which generalize the data cube operator. CubiST represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the familiar view lattice to compute and materialize new views from existing views in some heuristic fashion. CubiST is the first OLAP algorithm that needs only one scan over the detailed data set and can efficiently answer any cube query without additional I/O when the ST fits into memory. We have implemented CubiST, and our experiments have demonstrated significant improvements in performance and scalability over existing ROLAP/MOLAP approaches.
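    A cube query aggregates a measure along any chosen subset of dimensions, down to the empty subset (the grand total). As a minimal illustration of the query class only (the field names and data are hypothetical, and this naive per-query scan is exactly the cost CubiST's single-scan design avoids):

```python
from collections import defaultdict

def cube_query(rows, group_dims, measure):
    """Aggregate a measure over an arbitrary combination of dimensions.

    Illustrative only: one full scan per query, which is what
    CubiST's statistics tree is designed to eliminate.
    """
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in group_dims)
        totals[key] += row[measure]
    return dict(totals)

# Hypothetical detail data.
sales = [
    {"product": "A", "region": "east", "amount": 10},
    {"product": "A", "region": "west", "amount": 5},
    {"product": "B", "region": "east", "amount": 7},
]

by_product = cube_query(sales, ["product"], "amount")   # totals per product
grand_total = cube_query(sales, [], "amount")           # empty grouping
```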

    The scene superiority effect: object recognition in the context of natural scenes

    Four experiments investigate the effect of background scene semantics on object recognition. Although past research has found that semantically consistent scene backgrounds can facilitate recognition of a target object, these claims have been challenged as the result of post-perceptual response bias rather than the perceptual processes of object recognition itself. The current study takes advantage of a paradigm from linguistic processing known as the Word Superiority Effect. Humans can better discriminate letters (e.g., D vs. K) in the context of a word (WORD vs. WORK) than in a non-word context (e.g., WROD vs. WROK), even when the context is non-predictive of the target identity. We apply this paradigm to objects in natural scenes, having subjects discriminate between objects in the context of scenes. Because the target objects were equally semantically consistent with any given scene and could appear in either semantically consistent or inconsistent contexts with equal probability, response bias could not lead to an apparent improvement in object recognition. The current study found a benefit to object recognition from semantically consistent backgrounds, and the effect appeared to be modulated by awareness of background scene semantics.

    CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees

    We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We focus on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called statistics tree (ST) during a one-time scan of the detailed data. In order to optimize queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family that can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems.
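    The core idea can be sketched in a few lines. The toy version below uses a flat dictionary rather than a tree and is feasible only for tiny dimensionality, but it preserves the key property described in the abstract: a single scan of the detail data materializes every aggregate view, after which queries touch no detail data. All names and the data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_aggregate_store(rows, dims, measure):
    """One scan over the detail data updates every grouping at once."""
    store = defaultdict(int)
    for row in rows:
        # Every subset of dimensions (including the empty set) is a view.
        for r in range(len(dims) + 1):
            for subset in combinations(dims, r):
                key = (subset, tuple(row[d] for d in subset))
                store[key] += row[measure]
    return store

def answer(store, group_dims, values):
    """Answer a cube query with a single lookup, no further I/O."""
    return store[(tuple(group_dims), tuple(values))]

rows = [
    {"product": "A", "region": "east", "amount": 10},
    {"product": "A", "region": "west", "amount": 5},
    {"product": "B", "region": "east", "amount": 7},
]
store = build_aggregate_store(rows, ("product", "region"), "amount")
```

The real ST replaces the exponential flat dictionary with a compact tree whose leaves share structure across views, which is what makes the approach practical.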

    Automatic physical database design: recommending materialized views

    This work discusses physical database design while focusing on the problem of selecting materialized views for improving the performance of a database system. We first address the satisfiability and implication problems for mixed arithmetic constraints. The results are used to support the construction of a search space for view selection problems. We propose an approach for constructing a search space based on identifying maximum commonalities among queries and on rewriting queries using views. These commonalities are used to define candidate views for materialization, from which an optimal or near-optimal set can be chosen as a solution to the view selection problem. Using a search space constructed this way, we address a specific instance of the view selection problem that aims at minimizing the view maintenance cost of multiple materialized views using multi-query optimization techniques. Further, we study this same problem in the context of a commercial database management system in the presence of memory and time restrictions. We also suggest a heuristic approach for maintaining the views while guaranteeing that the restrictions are satisfied. Finally, we consider a dynamic version of the view selection problem where the workload is a sequence of query and update statements. In this case, the views can be created (materialized) and dropped during the execution of the workload. We have implemented our approaches to the dynamic view selection problem and performed extensive experimental testing. Our experiments show that, in most cases, our approaches outperform previous ones in terms of both effectiveness and efficiency.
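    The commonality-driven construction of the search space can be sketched under a drastically simplified representation in which each query is a set of (table, predicate) atoms; shared atoms between query pairs become view candidates. Real view selection also requires query rewriting and a cost model, and every name below is hypothetical:

```python
from itertools import combinations

def candidate_views(queries):
    """Propose candidate views from pairwise commonalities among queries.

    Each query is abstracted to a set of (table, predicate) atoms;
    the atoms shared by a pair of queries form one candidate view.
    """
    candidates = set()
    for q1, q2 in combinations(queries, 2):
        common = frozenset(q1) & frozenset(q2)
        if common:
            candidates.add(common)
    return candidates

# Two hypothetical workload queries that share a scan of 2020 orders.
q_a = {("orders", "year=2020"), ("customers", "region='east'")}
q_b = {("orders", "year=2020"), ("products", "cat='toys'")}
views = candidate_views([q_a, q_b])
```

Materializing the shared `("orders", "year=2020")` fragment once would let both queries be rewritten against it, which is the intuition behind defining candidates from maximum commonalities.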

    Exploiting Data Skew for Improved Query Performance

    Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains. In a retail database, some products are typically much more popular than others. In a text database, word frequencies follow a Zipf distribution with a small number of very common words and a long tail of infrequent words. In a geographic database, some regions have much higher populations (and data measurements) than others. Current systems do not make the most of caches when data is skewed. In particular, a whole cache line may remain cache-resident even though only a small part of the cache line corresponds to a popular data item. In this paper, we propose a novel index structure for repositioning data items to concentrate popular items into the same cache lines. The net result is better spatial locality and better utilization of limited cache resources. We develop a theoretical model for analyzing the cache behavior, and implement database operators that are efficient in the presence of skew. Our experiments on real and synthetic data show that exploiting skew can significantly improve in-memory query performance. In some cases, our techniques can speed up queries by over an order of magnitude.
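    The abstract does not spell out the index structure itself, but the spatial-locality payoff of repositioning can be sketched with a hypothetical layout: if each 64-byte cache line holds 8 items, packing the popular items together shrinks the number of lines the hot set touches.

```python
def lines_touched(layout, hot_items, items_per_line=8):
    """Count distinct cache lines needed to reach a set of hot items."""
    pos = {item: i for i, item in enumerate(layout)}
    return len({pos[it] // items_per_line for it in hot_items})

# Hypothetical example: 32 items, 4 of them popular, 8 items per line.
items = [f"item{i}" for i in range(32)]
hot = ["item3", "item11", "item19", "item27"]  # one hot item per line

scattered = lines_touched(items, hot)                      # 4 lines
# Reposition: concentrate the popular items at the front of the array.
repacked = hot + [i for i in items if i not in hot]
packed = lines_touched(repacked, hot)                      # 1 line
```

With the scattered layout, fetching the hot set drags in four lines that are mostly cold data; after repacking, the same hot set fits in one line, freeing cache capacity for other work.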

    Warehousing and Inventory Management in Dual Channel and Global Supply Chains

    More firms are adopting the dual-channel supply chain business model, in which firms offer their products to customers through dual-channel sales (both online and offline). The development periods of innovative products have been shortened, especially for high-tech companies, which leads to products with short life cycles. This means that companies need to put their new products on the market as soon as possible. The dual-channel supply chain is a perfect tool to increase customers' awareness of new products and to keep customers' loyalty; firms can offer new products online to the customer faster than through the traditional retail sales channel. The emergence of dual-channel firms was mainly driven by the expansion of internet use and the advances in information and manufacturing technologies. No existing research has examined inventory strategies, warehouse structure, operations, and capacity in a dual-channel context. Additionally, firms need to integrate their global supplier base, where lower parts costs compensate for the much higher procurement and cross-border costs, into their supply chain operations. The most common method used to integrate the global supplier base is the use of a cross-dock, also known as third-party logistics (3PL). This study is motivated by a real-world problem: no existing research has considered the optimization of cross-dock operations in terms of dock assignment, storage locations, inventory strategies, and lead-time uncertainty in the context of a cross-docking system. In this dissertation, we first study the dual-channel warehouse in the dual-channel supply chain. One of the challenges in running the dual-channel warehouse is how to organize the warehouse and manage inventory to fulfill both online and offline (retailer) orders, where the orders from different channels have different features.
A model for a dual-channel warehouse in a dual-channel supply chain is proposed, and a solution approach is developed for the cases of deterministic and stochastic lead times. Numerical examples highlight the model's validity and its usefulness as a decision-support tool. Second, we extend the first problem to include the global supplier and the cross-border time. The impact of global suppliers and the effect of the cross-border time on the dual-channel warehouse are studied. A cross-border dual-channel warehouse model in a dual-channel supply chain context is proposed. In addition to demand and lead-time uncertainty, the cross-border time is included as a stochastic parameter. Numerical results and managerial insights are also presented for this problem. Third, motivated by a real-world cross-dock problem, we perform a study at one of the Big Three automotive companies in the USA. The company faces the challenges of optimizing its operations and managing the items in the 3PL when introducing new products. Thus, we investigate a dock assignment problem that considers the dock capacity and storage space, together with a cross-dock layout. We propose an integrated model that combines the cross-dock assignment problem with the cross-dock layout problem so that cross-dock operations can be coordinated effectively. In addition to lead-time uncertainty, the cross-border time is included as a stochastic parameter. A real case study, numerical results, and managerial insights are also presented for this problem, highlighting the cross-border effect. Solution methodologies, managerial insights, and numerical analysis, as well as conclusions and potential future study topics, are also provided in this dissertation.
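    As a hypothetical illustration of just one ingredient of these models, stochastic lead time, a Monte Carlo estimate of the reorder point that covers lead-time demand at a target service level might look like the sketch below. The demand and lead-time distributions are invented for the example, not taken from the dissertation.

```python
import random

def reorder_point(mean_daily_demand, lead_time_choices, service_level,
                  trials=10_000, seed=42):
    """Monte Carlo reorder point under stochastic lead time.

    Hypothetical illustration only: daily demand is exponential and
    the lead time is drawn uniformly from a set of observed values.
    """
    rng = random.Random(seed)
    lead_demands = []
    for _ in range(trials):
        lt = rng.choice(lead_time_choices)  # stochastic lead time (days)
        lead_demands.append(sum(rng.expovariate(1 / mean_daily_demand)
                                for _ in range(lt)))
    lead_demands.sort()
    # Quantile of lead-time demand at the target service level.
    return lead_demands[int(service_level * (trials - 1))]

# Cross-border sourcing lengthens and widens the lead-time distribution,
# which raises the reorder point needed for the same service level.
domestic = reorder_point(10, [2, 3], 0.95)
cross_border = reorder_point(10, [6, 8, 12], 0.95)
```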

    Business analytics in industry 4.0: a systematic review

    Recently, the term “Industry 4.0” has emerged to characterize several Information and Communication Technology (ICT) adoptions in production processes (e.g., Internet-of-Things, implementation of digital production support information technologies). Business Analytics is often used within Industry 4.0, thus incorporating its data intelligence (e.g., statistical analysis, predictive modelling, optimization) expert system component. In this paper, we perform a Systematic Literature Review (SLR) on the usage of Business Analytics within the Industry 4.0 concept, covering a selection of 169 papers obtained from six major scientific publication sources from 2010 to March 2020. The selected papers were first classified into three major types, namely Practical Application, Reviews and Framework Proposal. Then, we analysed in more detail the practical application studies, which were further divided into the three main categories of the Gartner analytical maturity model: Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. In particular, we characterized the distinct analytics studies in terms of the industry application and data context used, impact (in terms of their Technology Readiness Level) and selected data modelling method. Our SLR analysis provides a mapping of how data-based Industry 4.0 expert systems are currently used, disclosing also research gaps and future research opportunities. The work of P. Cortez was supported by FCT - Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020. We would like to thank the three anonymous reviewers for their helpful suggestions.

    An Approach to Designing Clusters for Large Data Processing

    Cloud computing is increasingly being adopted due to its cost savings and ability to scale. As data volumes continue to grow rapidly, an increasing number of institutions are adopting NoSQL clusters to address the storage and processing demands of large data sets. However, evaluating and modelling NoSQL clusters presents many challenges. In order to address some of these challenges, this thesis proposes a methodology for designing and modelling large-scale processing configurations that respond to end-user requirements. First, goals are established for the big-data cluster; in this thesis, we use performance and cost as our goals. Second, the data is transformed from a relational schema to an appropriate HBase schema. In the third step, we iteratively deploy different clusters. We then model the clusters and evaluate different topologies (size of instances, number of instances, number of clusters, etc.). We use HBase as the large-data processing cluster, and we evaluate our methodology on traffic data from a large city and on a distributed community cloud infrastructure.
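    The second step, mapping a relational schema onto HBase, hinges mainly on row-key design. As a sketch under assumed names (the thesis's actual traffic schema is not given here), a composite (sensor, timestamp) key keeps each sensor's readings lexicographically contiguous, so per-sensor time-range scans stay sequential:

```python
def hbase_row_key(reading):
    """Compose an HBase row key from relational columns.

    Hypothetical mapping: zero-padded sensor id plus fixed-width epoch
    milliseconds, so byte-wise key order matches (sensor, time) order.
    """
    return f"{reading['sensor_id'].rjust(8, '0')}|{reading['ts']:013d}"

def to_hbase_cells(reading):
    """Flatten the remaining relational columns into one column family."""
    key = hbase_row_key(reading)
    return {(key, f"m:{col}"): str(val)
            for col, val in reading.items()
            if col not in ("sensor_id", "ts")}

# One hypothetical relational row of traffic data.
row = {"sensor_id": "s42", "ts": 1620000000000, "speed": 57, "count": 12}
cells = to_hbase_cells(row)
```

A monotonically increasing timestamp prefix would hotspot one region server; leading with the sensor id spreads writes across regions while preserving scan locality per sensor, which is the usual trade-off this step must weigh.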

    A Taxonomy of Security Threats and Solutions for RFID Systems

    RFID (Radio Frequency Identification) is a wireless data-collection technology that uses RFID tags, or transponders, to electronically store and retrieve data. RFID tags are quickly replacing barcodes as the “identification system of choice” [1]. Since RFID devices are electronic, they can be hacked by an outsider, and their data can be accessed or modified without the user's knowledge. New threats to RFID-enabled systems are always on the horizon, and a systematic classification should be used to categorize these threats and reduce confusion. This paper examines the problem of security threats to RFID systems and provides a taxonomy for these threats.