
    Schism: a Workload-Driven Approach to Database Replication and Partitioning

    We present Schism, a novel workload-aware approach to database partitioning and replication designed to improve the scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase, and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner). The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relationships, typical of social-network databases, and iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E) to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions by up to 30%.
    Quanta Computer (Firm) (T-Party Project)
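
    As a rough illustration of Schism's first phase, the sketch below builds a co-access graph from a toy transaction trace and splits it with a Kernighan-Lin bisection; the trace, the tuple identifiers, and the 2-way split are illustrative assumptions, not the paper's METIS-style k-way partitioner or its decision-tree explanation phase.

    # Sketch: one node per tuple, edge weight = number of transactions touching both tuples.
    from itertools import combinations
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    # Hypothetical trace: each transaction is the set of tuple ids it reads or writes.
    workload = [
        {"cust:1", "order:7", "item:3"},
        {"cust:1", "order:8"},
        {"cust:2", "order:9", "item:3"},
        {"cust:2", "item:4"},
    ]

    g = nx.Graph()
    for txn in workload:
        for a, b in combinations(sorted(txn), 2):
            w = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=w + 1)

    # Balanced 2-way split that heuristically minimizes the cut (cross-partition) weight.
    part_a, part_b = kernighan_lin_bisection(g, weight="weight")

    def is_distributed(txn):
        """A transaction is distributed if it touches tuples in both partitions."""
        return bool(txn & part_a) and bool(txn & part_b)

    print("partition A:", sorted(part_a))
    print("partition B:", sorted(part_b))
    print("distributed transactions:", sum(is_distributed(t) for t in workload))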

    Low overhead concurrency control for partitioned main memory databases

    Database partitioning is a technique for improving the performance of distributed OLTP databases, since "single partition" transactions that access data on one partition do not need coordination with other partitions. For workloads that are amenable to partitioning, some argue that transactions should be executed serially on each partition without any concurrency at all. This strategy makes sense for a main memory database where there are no disk or user stalls, since the CPU can be fully utilized and the overhead of traditional concurrency control, such as two-phase locking, can be avoided. Unfortunately, many OLTP applications have some transactions that access multiple partitions. This introduces network stalls in order to coordinate distributed transactions, which limits the performance of a database that does not allow concurrency. In this paper, we compare two low overhead concurrency control schemes that allow partitions to work on other transactions during network stalls, yet have little cost in the common case when concurrency is not needed. The first is a light-weight locking scheme, and the second is an even lighter-weight form of speculative concurrency control that avoids the overhead of tracking reads and writes, but sometimes performs work that eventually must be undone. We quantify the range of workloads over which each technique is beneficial, showing that speculative concurrency control generally outperforms locking as long as there are few aborts or few distributed transactions that involve multiple rounds of communication. On a modified TPC-C benchmark, speculative concurrency control can improve throughput relative to the other schemes by up to a factor of two.
    National Science Foundation (U.S.) (Grant number IIS-0704424); National Science Foundation (U.S.) (Grant number IIS-0845643)
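
    A minimal sketch of the speculation idea follows: while a multi-partition transaction stalls waiting for its coordinator's commit decision, queued single-partition transactions run speculatively against the same partition and are undone if the pending transaction aborts. The key/value store, the transaction format, and the undo log are illustrative assumptions, not the paper's implementation.

    # One partition's state, modeled as a simple key/value store.
    store = {"x": 10, "y": 20}

    def apply(txn, undo_log):
        """Apply a transaction's writes, recording prior values so they can be undone."""
        for key, value in txn["writes"].items():
            undo_log.append((key, store[key]))
            store[key] = value

    # A multi-partition transaction finished its local work and now stalls on the network.
    pending = {"writes": {"x": 99}}
    pending_undo = []
    apply(pending, pending_undo)

    # Speculatively execute single-partition transactions queued behind it.
    speculative = [{"writes": {"y": 21}}, {"writes": {"x": 100}}]
    spec_undo = []
    for txn in speculative:
        apply(txn, spec_undo)

    decision = "abort"  # assume the coordinator eventually aborts the distributed transaction
    if decision == "abort":
        # Cascading abort: undo all work in reverse order of application.
        for key, old in reversed(pending_undo + spec_undo):
            store[key] = old

    print(store)  # {'x': 10, 'y': 20} -- back to the pre-transaction state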

    Visual and computational analysis of structure-activity relationships in high-throughput screening data

    Novel analytic methods are required to assimilate the large volumes of structural and bioassay data generated by combinatorial chemistry and high-throughput screening programmes in the pharmaceutical and agrochemical industries. This paper reviews recent work in visualisation and data mining that can be used to develop structure-activity relationships from such chemical/biological datasets.

    Data mining techniques application for prediction in OLAP cube

    Data warehouses are collections of data organized to support decision-making, and provide an appropriate solution for managing large volumes of data. On-line analytical processing (OLAP) is a technology that complements data warehouses to make data usable and understandable by users, providing tools for the visualization, exploration, and navigation of data cubes. Data mining, on the other hand, allows the extraction of knowledge from data with different methods of description, classification, explanation, and prediction. In this work, we propose new ways to improve existing approaches in the decision-support process. Building on prior work that couples online analysis with data mining to integrate prediction into OLAP, we propose an approach based on machine learning with clustering that partitions an initial data cube into dense sub-cubes, each of which serves as a learning set for building a prediction model. Regression trees are then applied to each sub-cube to predict the value of a cell.
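
    A rough sketch of the proposed coupling, under the assumption that it can be approximated with k-means clustering and per-cluster regression trees (the column layout and the synthetic cube are invented for illustration):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    # Flattened cube: each row is a cell (numerically encoded dimension coordinates) plus its measure.
    cells = rng.integers(0, 10, size=(500, 3)).astype(float)   # e.g. store, product, month
    measure = cells @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, 500)

    # Step 1: partition the cube into k dense sub-cubes.
    k = 4
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(cells)

    # Step 2: fit one regression tree per sub-cube.
    models = {
        c: DecisionTreeRegressor(max_depth=5).fit(cells[km.labels_ == c], measure[km.labels_ == c])
        for c in range(k)
    }

    # Predict the measure of a new (or empty) cell via its sub-cube's model.
    new_cell = np.array([[3.0, 7.0, 2.0]])
    c = int(km.predict(new_cell)[0])
    print("predicted cell value:", models[c].predict(new_cell)[0])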

    Integration of Data Mining and Data Warehousing: a practical methodology

    The ever-growing repositories of data in all fields pose new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only the numeric data, and little research has been carried out on clustering high-cardinality nominal variables to gain better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with warehousing to discover knowledge from data. For seamless integration, the mined data has to be modeled in the form of a data warehouse schema. Schema generation is a complex manual task that requires domain and warehousing expertise, so automated techniques are needed to generate warehouse schemas and remove this dependency. To meet the growing analytical needs and overcome these limitations, we propose a novel methodology that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and the integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. Results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.
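
    As an illustration of one way to cluster mixed numeric and nominal data before feeding the result into a warehouse schema (the columns, the pipeline, and the one-hot encoding are assumptions for the sketch, not the paper's own methodology):

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "revenue": [120.0, 340.5, 90.2, 410.0, 55.3, 300.1],
        "units":   [12, 30, 9, 41, 5, 28],
        "region":  ["north", "south", "north", "west", "north", "south"],  # nominal (high cardinality in practice)
        "channel": ["web", "store", "web", "web", "store", "store"],
    })

    # Scale the numeric columns, one-hot encode the nominal ones, then cluster.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["revenue", "units"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region", "channel"]),
    ])
    pipeline = make_pipeline(preprocess, KMeans(n_clusters=2, n_init=10, random_state=0))
    df["cluster"] = pipeline.fit_predict(df)

    # The cluster label could then feed schema generation, e.g. as a dimension of a star schema
    # whose fact table keeps revenue/units as measures.
    print(df)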

    A workload-driven approach for view selection in large dimensional datasets

    The information explosion the world has witnessed in the last two decades has forced businesses to adopt a data-driven culture in order to remain competitive. These data-driven businesses have access to countless sources of information and face the challenge of making sense of overwhelming amounts of data in an efficient and reliable manner, which implies the execution of read-intensive operations. In the context of this challenge, a framework for the dynamic read-optimization of large dimensional datasets has been designed, and on top of it a workload-driven mechanism for automatic materialized view selection and creation has been developed. This paper presents an extensive description of this mechanism, along with a proof-of-concept implementation and its corresponding performance evaluation. Results show that the proposed mechanism is able to derive a limited but comprehensive set of views, leading to a drop in query latency ranging from 80% to 99.99% at the expense of 13% of the disk space used by the base dataset. In this way, the devised mechanism speeds up query execution by building materialized views that match the actual demand of the query workload.
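
    A minimal sketch of a workload-driven view selector in this spirit: count the grouping patterns seen in the query log and greedily materialize the most frequent ones within a space budget. The log format, view sizes, and the budget are hypothetical; the paper's mechanism is more elaborate.

    from collections import Counter

    # Each entry: the dimension attributes a query aggregates on, plus the estimated
    # size (MB) of the corresponding materialized view.
    query_log = [
        (("region", "month"), 120),
        (("region", "month"), 120),
        (("product",), 40),
        (("region", "month"), 120),
        (("product", "month"), 300),
    ]

    space_budget_mb = 200
    frequency = Counter(pattern for pattern, _ in query_log)
    view_size = {pattern: mb for pattern, mb in query_log}

    selected, used = [], 0
    for pattern, hits in frequency.most_common():       # most frequently requested views first
        if used + view_size[pattern] <= space_budget_mb:
            selected.append(pattern)
            used += view_size[pattern]

    print("materialize:", selected, f"({used} MB of {space_budget_mb} MB budget)")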

    Online fulfillment: F-warehouse order consolidation and BOPS store picking problems

    Fulfillment of online retail orders is a critical challenge for retailers, since legacy infrastructure and control methods are ill suited for online retail. The primary performance goal of online fulfillment is speed, or fast fulfillment, requiring received orders to be shipped or ready for pickup within a few hours. Several novel numerical problems characterize fast-fulfillment operations, and this research solves two such problems.

    Order fulfillment warehouses (F-warehouses) are a critical component of the physical internet behind online retail supply chains. Two key features distinguish an F-warehouse: (i) an explosive storage policy, in which a unique item can be stored simultaneously in multiple bin locations dispersed throughout the warehouse, and (ii) commingled bins, in which a bin can stock several different items simultaneously. The inventory dispersion profile of an item is therefore temporal and non-repetitive. The order arrival process is continuous, and each order consists of one or more items. From the set of pending orders, efficient picking lists of 10-15 items are generated. A picklist of items is collected in a tote, which is then transported to one of several packaging stations, where items belonging to the same order are consolidated into a shipment package. This research formulates and solves the order consolidation problem. At any time, a batch of totes is to be processed through the available packaging stations, and the station a tote is assigned to determines whether an order ships in a single package or in multiple packages. Reduced shipping costs are a key operational goal of an online retailer, and the number of packages is a determining factor. The decision variable is the station to which each tote is assigned, and the performance objective is to minimize the number of packages while balancing the packaging-station workload. This research first formulates the order consolidation problem as a mixed integer programming model, and then develops two fast heuristics (#1 and #2) plus two clustering-based algorithms. For small problems, heuristic #2 is on average within 4.1% of the optimal solution; for larger problems, heuristic #2 outperforms all other algorithms. The performance of heuristic #2 is further studied as a function of several characteristics.

    S-Strategy fulfillment is a store-based solution for fulfilling online customer orders. It is driven by two key motivations: first, retailers have a network of stores across which inventory is already dispersed, and second, forward-positioned inventory is expected to be faster and more economical than a warehouse-based F-Strategy. Orders are picked from store inventory and then picked up by the customer at the store (buy online, pick up in store, or BOPS). A BOPS store has two distinguishing features: (i) in addition to shelf stock, the layout includes a space-constrained back stock of selected items, and (ii) a set of dedicated pickers is scheduled to fulfill orders. This research solves two BOPS-related problems: (i) back-stock strategy, i.e., which items to locate in the back stock, and (ii) picker scheduling, i.e., the effect of the number of pickers and their work hours. A continuous flow of incoming orders is assumed for both problems, and the objective is to minimize fulfillment time and labor cost. For the back-stock problem, an assignment rule based on order frequency, forward location, and order-basket correlations achieves a 17.6% improvement over a store with no back stock, while a rule based only on order frequency achieves a 12.4% improvement. Additional experiments across a range of order baskets are reported.
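
    To make the tote-to-station decision concrete, here is a generic greedy sketch (not the thesis's heuristic #1/#2 or its MIP formulation): each tote goes to the station that already holds the most items from the same orders, with ties broken toward the least-loaded station, and the package count is derived from how many stations each order ends up spanning.

    # Each tote is represented by the set of order ids whose items it contains (toy data).
    totes = [{"o1", "o2"}, {"o1", "o3"}, {"o2"}, {"o4", "o3"}]
    num_stations = 2

    station_orders = [set() for _ in range(num_stations)]   # orders already seen at each station
    station_load = [0] * num_stations                       # items assigned to each station
    assignment = []

    for tote in totes:
        # Prefer the station with the most overlapping orders; break ties by lighter load.
        best = max(range(num_stations),
                   key=lambda s: (len(tote & station_orders[s]), -station_load[s]))
        assignment.append(best)
        station_orders[best] |= tote
        station_load[best] += len(tote)

    # An order picked at more than one station requires more than one package.
    all_orders = {o for t in totes for o in t}
    packages = sum(sum(o in station_orders[s] for s in range(num_stations)) for o in all_orders)

    print("tote -> station:", assignment)
    print("packages:", packages, "station loads:", station_load)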