Qd-tree: Learning Data Layouts for Big Data Analytics
Corporations today collect data at an unprecedented and accelerating scale,
making the need to run queries on large datasets increasingly important.
Technologies such as columnar block-based data organization and compression
have become standard practice in most commercial database systems. However, the
problem of best assigning records to data blocks on storage is still open. For
example, today's systems usually partition data by arrival time into row
groups, or range/hash partition the data based on selected fields. For a given
workload, however, such techniques are unable to optimize for the important
metric of the number of blocks accessed by a query. This metric directly
relates to the I/O cost, and therefore performance, of most analytical queries.
Further, they are unable to exploit additional available storage to drive this
metric down further.
In this paper, we propose a new framework called a query-data routing tree,
or qd-tree, to address this problem, and propose two algorithms for its
construction based on greedy and deep reinforcement learning techniques.
Experiments over benchmark and real workloads show that a qd-tree can provide
physical speedups of more than an order of magnitude compared to current
blocking schemes, and can reach within 2X of the lower bound for data skipping
based on selectivity, while providing complete semantic descriptions of created
blocks.
Comment: ACM SIGMOD 202
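The metric the abstract focuses on, the number of blocks a query must access, can be illustrated with a small sketch. This is not the qd-tree algorithm: it only compares two simple layouts (arrival order versus sorting on the queried column) and assumes each block keeps min/max metadata so that non-overlapping blocks can be skipped. The names `make_blocks` and `blocks_accessed` and the toy data are hypothetical.

```python
from typing import List


def make_blocks(values: List[int], block_size: int) -> List[List[int]]:
    """Split a column into fixed-size blocks, preserving the given order."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def blocks_accessed(blocks: List[List[int]], lo: int, hi: int) -> int:
    """Count blocks whose [min, max] range overlaps the query range [lo, hi].

    A block can be skipped only if its min/max metadata proves that no
    record inside it can satisfy the range predicate.
    """
    return sum(1 for b in blocks if min(b) <= hi and max(b) >= lo)


# Toy column values in arrival order (assumed data, for illustration only).
arrival = [5, 91, 12, 78, 33, 64, 18, 50, 8, 99, 21, 70]

by_arrival = make_blocks(arrival, 4)         # partition by arrival time
by_value = make_blocks(sorted(arrival), 4)   # partition by the queried column

# Range query: value BETWEEN 0 AND 20
print(blocks_accessed(by_arrival, 0, 20))  # 3: every block holds a small value
print(blocks_accessed(by_value, 0, 20))    # 1: only the first block overlaps
```

In this toy case the same records and the same query touch three blocks under one layout and one block under the other, which is the gap a workload-aware layout tries to exploit.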
A Workload-Specific Memory Capacity Configuration Approach for In-Memory Data Analytic Platforms
We propose WSMC, a workload-specific memory capacity configuration approach
for Spark workloads. It guides users in configuring memory capacity by
accurately predicting a workload's memory requirement under various input data
sizes and parameter settings. First, WSMC classifies in-memory computing
workloads into four categories according to their Data Expansion Ratio. Second,
WSMC establishes a memory requirement prediction model that considers the input
data size, the shuffle data size, the parallelism of the workload, and the data
block size. Finally, for each workload category, WSMC calculates the shuffle
data size in the prediction model in a workload-specific way. For an ad-hoc
workload, WSMC can profile its Data Expansion Ratio with small-sized input data
and decide which category the workload falls into. Users can then determine an
accurate configuration from the corresponding memory requirement prediction.
In comprehensive evaluations with SparkBench workloads, we found that, compared
with the default configuration, a WSMC-guided configuration saves over 40% of
memory capacity at the cost of a slight performance degradation (only 5%).
Compared with a proper configuration found manually, the WSMC-guided
configuration increases memory waste by only 7% while slightly improving
workload performance (about 1%).
Dynamic reactive assignment of tasks in real-time automated guided vehicle environments with potential interruptions
Efficient management of production plants has to consider several external and internal factors, such as potential interruptions of ongoing processes. Automated guided vehicles (AGVs) are becoming a widespread technology that offers many advantages. These AGVs can perform complex tasks autonomously. However, an inefficient schedule of the tasks assigned to an AGV can suffer from unwanted interruptions and idle times, which in turn increase the total time the AGV needs to complete its assigned tasks. To avoid these issues, this paper proposes a heuristic-based approach that: (i) uses a delay matrix to estimate circuit delays at different times of day; (ii) employs these estimates to define an initial itinerary of tasks for an AGV; and (iii) dynamically adjusts the initial agenda as the system obtains new information on actual delays. The objective is to minimize the total time required for the AGV to complete all assigned tasks, taking into account situations that generate unexpected disruptions along the circuits that the AGV follows. To test and validate the proposed approach, a series of computational experiments using real-life data is carried out. These experiments allow us to measure the improvement gap with respect to the former policy used by the system managers.
This work has been partially supported by the Spanish Ministry of Industry, Commerce and Tourism (AEI-010500-2021b-54), the EU Commission (HORIZON-CL4-2021-TWIN-TRANSITION-01-07, 101057294 AIDEAS), and the Generalitat Valenciana (PROMETEO/2021/065).
Peer Reviewed. Postprint (published version)
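The three steps in the abstract (estimate delays from a delay matrix, build an initial itinerary, re-plan as actual delays are observed) can be sketched roughly as follows. The abstract does not specify the heuristic, so the greedy rule here (always run the remaining circuit with the smallest estimated duration at the current hour) is a stand-in assumption, and the names `plan`, `base`, and `delay` and all the numbers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def plan(tasks: Iterable[str], base: Dict[str, float],
         delay: Dict[str, List[float]], start: float) -> Tuple[List[str], float]:
    """Greedy itinerary: repeatedly pick the circuit that is cheapest right now.

    base[c]     -- undisturbed duration of circuit c, in hours (assumed)
    delay[c][h] -- estimated extra delay of circuit c when started in hour h
    Returns the task order and the total elapsed time in hours.
    """
    t = start
    remaining = set(tasks)
    order: List[str] = []
    while remaining:
        slot = int(t) % 24
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda c: base[c] + delay[c][slot])
        order.append(nxt)
        t += base[nxt] + delay[nxt][slot]
        remaining.remove(nxt)
    return order, t - start


base = {"A": 1.0, "B": 1.0, "C": 1.0}
delay = {c: [0.0] * 24 for c in base}
delay["A"][8] = 0.5                               # circuit A is congested at 08:00
delay["B"] = [0.25] * 24; delay["B"][8] = 0.0     # circuit B is clear only at 08:00
delay["C"] = [0.25] * 24                          # circuit C has a constant delay

# Steps (i)-(ii): initial itinerary starting at 08:00.
order, total = plan(base, base, delay, start=8.0)
print(order, total)   # ['B', 'A', 'C'] 3.25

# Step (iii): after finishing B at 09:00, an unexpected disruption is observed
# on circuit A, so the estimate is updated and the remaining tasks re-planned.
delay["A"][9] = 1.0
order2, total2 = plan(["A", "C"], base, delay, start=9.0)
print(order2, total2)  # ['C', 'A'] 2.25
```

Re-planning swaps A and C so the AGV avoids the disrupted circuit until the following hour; a real system would derive the matrix from measured data rather than hand-set values.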