Dynamic Physiological Partitioning on a Shared-nothing Database Cluster
Traditional DBMS servers are usually over-provisioned for most of their daily
workloads and, because they do not show good-enough energy proportionality,
waste a lot of energy while underutilized. A cluster of small (wimpy) servers,
whose size can be dynamically adjusted to the current workload, offers
better energy characteristics for these workloads. Yet, data migration,
necessary to balance utilization among the nodes, is a non-trivial and
time-consuming task that may consume the energy saved. For this reason, a
sophisticated and easy-to-adjust partitioning scheme fostering dynamic
reorganization is needed. In this paper, we adapt a technique originally
created for SMP systems, called physiological partitioning, to distribute data
among nodes in a way that allows data to be repartitioned easily without interrupting
transactions. We dynamically partition DB tables based on the nodes'
utilization and given energy constraints and compare our approach with physical
partitioning and logical partitioning methods. To quantify possible energy
saving and its conceivable drawback on query runtimes, we evaluate our
implementation on an experimental cluster and compare the results w.r.t.
performance and energy consumption. Depending on the workload, we can
substantially save energy without sacrificing too much performance.
Improving Pipelining Tools for Pre-processing Data
The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, the use of pipelining schemes is the most reliable way of analysing data and, as a result, several major companies now offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, early error detection techniques and developer support mechanisms are very limited in these frameworks. In this context, this study introduces several improvements, such as the design of different types of constraints for the early detection of errors, the creation of functions to facilitate debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances and/or the introduction of the burst-processing scheme. Incorporating these features, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features.
Funding: Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R; Xunta de Galicia | Ref. ED481D-2021/024; Xunta de Galicia | Ref. ED431C2018/55-GR
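The two ideas named in the abstract above, constraints for early error detection and invalidation of erroneous instances, can be illustrated with a minimal Python sketch. This is not BDP4J's API (BDP4J is a Java framework); the names `Task`, `Instance`, `check_pipeline`, and `run_pipeline` are hypothetical, and the type-compatibility check is only one assumed form such a constraint could take.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Task:
    """One pipeline step with declared input and output types (hypothetical)."""
    name: str
    input_type: type
    output_type: type
    fn: Callable[[Any], Any]


@dataclass
class Instance:
    """A data item flowing through the pipeline, with a validity flag."""
    data: Any
    valid: bool = True


def check_pipeline(tasks: List[Task]) -> None:
    """Early error detection: reject incompatible task chains before running."""
    for prev, nxt in zip(tasks, tasks[1:]):
        if not issubclass(prev.output_type, nxt.input_type):
            raise TypeError(
                f"{prev.name} produces {prev.output_type.__name__}, "
                f"but {nxt.name} expects {nxt.input_type.__name__}"
            )


def run_pipeline(tasks: List[Task], instances: List[Instance]) -> List[Instance]:
    """Run every task over every valid instance, invalidating failures."""
    check_pipeline(tasks)
    for task in tasks:
        for inst in instances:
            if not inst.valid:
                continue  # invalidated instances are skipped by later tasks
            try:
                inst.data = task.fn(inst.data)
            except Exception:
                # Broad catch for illustration: any failure invalidates the instance.
                inst.valid = False
    return instances


if __name__ == "__main__":
    tasks = [Task("strip", str, str, str.strip), Task("to_int", str, int, int)]
    for inst in run_pipeline(tasks, [Instance("  4 "), Instance("x")]):
        print(inst)
```

Checking the chain before any data is processed means a mis-wired pipeline fails immediately rather than partway through a long run, while per-instance invalidation lets one bad record drop out without aborting the rest.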
Event Stream Processing with Multiple Threads
Current runtime verification tools seldom make use of multi-threading to
speed up the evaluation of a property on a large event trace. In this paper, we
present an extension to the BeepBeep 3 event stream engine that allows the use
of multiple threads during the evaluation of a query. Various parallelization
strategies are presented and described on simple examples. The implementation
of these strategies is then evaluated empirically on a sample of problems.
Compared to the previous, single-threaded version of the BeepBeep engine, the
allocation of just a few threads to specific portions of a query provides
a dramatic improvement in running time.
Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinder
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training.
DRS: Dynamic Resource Scheduling for Real-Time Analytics over Fast Streams
In a data stream management system (DSMS), users register continuous queries,
and receive result updates as data arrive and expire. We focus on applications
with real-time constraints, in which the user must receive each result update
within a given period after the update occurs. To handle fast data, the DSMS is
commonly placed on top of a cloud infrastructure. Because stream properties
such as arrival rates can fluctuate unpredictably, cloud resources must be
dynamically provisioned and scheduled accordingly to ensure real-time response.
Both existing and future systems therefore need to schedule resources
dynamically according to the current workload, so that they neither waste
resources nor fail to deliver correct results on time.
Motivated by this, we propose DRS, a novel dynamic
resource scheduler for cloud-based DSMSs. DRS overcomes three fundamental
challenges: (a) how to model the relationship between the provisioned resources
and query response time; (b) where to best place resources; and (c) how to
measure system load with minimal overhead. In particular, DRS includes an
accurate performance model based on the theory of Jackson open queueing
networks and is capable of handling arbitrary operator topologies,
possibly with loops, splits and joins. Extensive experiments with real data
confirm that DRS achieves real-time response with close to optimal resource
consumption.
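To make the queueing-theoretic idea in this abstract concrete, here is a minimal sketch of how a Jackson open network yields an end-to-end response time estimate. It is not DRS's actual model: it assumes each operator behaves as a single-server M/M/1 queue, and the function names, arrival rates, service rates, and routing probabilities are all hypothetical. The traffic equations λ = γ + Pᵀλ are solved by fixed-point iteration, and the mean time in the system follows from Little's law.

```python
def solve_traffic(gamma, routing, iters=10_000, tol=1e-12):
    """Solve the traffic equations lambda = gamma + routing^T * lambda.

    gamma[i]      external arrival rate into operator i
    routing[i][j] probability that output of operator i goes to operator j
    """
    n = len(gamma)
    lam = list(gamma)
    for _ in range(iters):
        new = [
            gamma[j] + sum(lam[i] * routing[i][j] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new
        lam = new
    raise RuntimeError("traffic equations did not converge")


def mean_sojourn_time(gamma, routing, mu):
    """Mean time a tuple spends in the network (assumes M/M/1 operators)."""
    lam = solve_traffic(gamma, routing)
    jobs_in_system = 0.0
    for rate, service in zip(lam, mu):
        if rate >= service:
            raise ValueError("operator is overloaded; the network is unstable")
        rho = rate / service
        jobs_in_system += rho / (1.0 - rho)  # mean queue length of M/M/1
    return jobs_in_system / sum(gamma)  # Little's law over the whole network


if __name__ == "__main__":
    # Two operators with a loop: 20% of operator 1's output returns to operator 0.
    gamma = [10.0, 0.0]
    routing = [[0.0, 1.0], [0.2, 0.0]]
    mu = [25.0, 25.0]
    print(mean_sojourn_time(gamma, routing, mu))
```

A scheduler built on such a model could, in principle, evaluate candidate resource allocations by how each one changes the service rates and therefore the predicted response time, without having to run the topology under every configuration.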
Saturn: An Optimized Data System for Large Model Deep Learning Workloads
Large language models such as GPT-3 & ChatGPT have transformed deep learning
(DL), powering applications that have captured the public's imagination. These
models are rapidly being adopted across domains for analytics on various
modalities, often by finetuning pre-trained base models. Such models need
multiple GPUs due to both their size and computational load, driving the
development of a bevy of "model parallelism" techniques & tools. Navigating
such parallelism choices, however, is a new burden for end users of DL such as
data scientists, domain scientists, etc. who may lack the necessary systems
knowhow. The need for model selection, which leads to many models to train due
to hyper-parameter tuning or layer-wise finetuning, compounds the situation
with two more burdens: resource apportioning and scheduling. In this work, we
tackle these three burdens for DL users in a unified manner by formalizing them
as a joint problem that we call SPASE: Select a Parallelism, Allocate
resources, and SchedulE. We propose a new information system architecture to
tackle the SPASE problem holistically, representing a key step toward enabling
wider adoption of large DL models. We devise an extensible template for
existing parallelism schemes and combine it with an automated empirical
profiler for runtime estimation. We then formulate SPASE as an MILP.
We find that direct use of an MILP-solver is significantly more effective
than several baseline heuristics. We optimize the system runtime further with
an introspective scheduling approach. We implement all these techniques into a
new data system we call Saturn. Experiments with benchmark DL workloads show
that Saturn achieves 39-49% lower model selection runtimes than typical current
DL practice.
Comment: Under submission at VLDB. Code available:
https://github.com/knagrecha/saturn. 12 pages + 3 pages references + 2 pages
appendi
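The joint SPASE problem described in this abstract can be illustrated on a toy instance. The sketch below is not Saturn's implementation and does not use an MILP solver: it swaps in exhaustive search with greedy in-order placement, which is only feasible for a handful of jobs. The job names, parallelism labels, GPU counts, and runtimes are invented for illustration; in the paper such runtimes come from an empirical profiler.

```python
from itertools import permutations, product


def makespan(order, choice, runtimes, total_gpus):
    """Start jobs in the given order, each as soon as enough GPUs are free."""
    running = []  # (end_time, gpus) for jobs that hold GPUs
    now = 0.0
    finish = 0.0
    for job in order:
        config = choice[job]
        gpus = config[1]
        # Wait for running jobs to end until this job's GPU demand fits.
        while total_gpus - sum(g for _, g in running) < gpus:
            running.sort()
            end, _ = running.pop(0)
            now = max(now, end)
        end = now + runtimes[job][config]
        running.append((end, gpus))
        finish = max(finish, end)
    return finish


def solve_spase(runtimes, total_gpus):
    """Pick a (parallelism, gpus) config per job and a job order.

    Returns (best_makespan, choice, order) found by exhaustive search.
    """
    jobs = list(runtimes)
    options = [[c for c in runtimes[j] if c[1] <= total_gpus] for j in jobs]
    best = None
    for configs in product(*options):
        choice = dict(zip(jobs, configs))
        for order in permutations(jobs):
            span = makespan(order, choice, runtimes, total_gpus)
            if best is None or span < best[0]:
                best = (span, choice, order)
    return best


if __name__ == "__main__":
    # Hypothetical profiled runtimes: job -> {(parallelism, gpus): hours}
    runtimes = {
        "A": {("ddp", 1): 8.0, ("fsdp", 2): 5.0, ("fsdp", 4): 3.0},
        "B": {("ddp", 1): 6.0, ("pipeline", 2): 4.0},
    }
    print(solve_spase(runtimes, total_gpus=4))
```

In this instance, giving job A all four GPUs makes A itself fastest but forces B to wait, so the search prefers running both jobs side by side on two GPUs each. That interaction between selection, allocation, and scheduling is what motivates treating the three decisions jointly.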