90,486 research outputs found
Parallel Online Aggregation in Action
Online aggregation provides continuous estimates of the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing terminate and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation does not incur overhead on top of the actual processing. We define a generic interface to express any estimation model that completely abstracts away the execution details. We design multiple sampling-based estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants are shown how estimates of general SQL aggregation queries over terabytes of TPC-H data are generated during the entire processing. Due to parallel execution, the estimate converges to the correct result in a matter of seconds even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration.
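The idea of a sampling-based online aggregation estimator can be illustrated with a minimal single-node sketch. It is not the framework's interface or any of its estimators: the function name `online_sum_estimates`, the parameters `report_every` and `z`, and the normal-approximation confidence interval are all assumptions made for illustration. The sketch scans the data in random order and periodically reports a running estimate of SUM together with an interval half-width that shrinks to zero once every row has been seen.

```python
import math
import random


def online_sum_estimates(values, report_every=1000, z=1.96, seed=0):
    """Yield (rows_seen, estimate, half_width) for SUM(values).

    Hypothetical helper: rows are visited in random order, so the rows
    seen so far form a simple random sample without replacement.
    """
    n_total = len(values)
    order = list(range(n_total))
    random.Random(seed).shuffle(order)

    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for idx in order:
        x = values[idx]
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)

        if count % report_every == 0 or count == n_total:
            variance = m2 / (count - 1) if count > 1 else 0.0
            # Finite population correction: the interval closes at full scan.
            fpc = 1.0 - count / n_total
            half_width = z * n_total * math.sqrt(variance * fpc / count)
            yield count, n_total * mean, half_width


if __name__ == "__main__":
    data = [float(v) for v in range(1, 10001)]
    for seen, estimate, half_width in online_sum_estimates(data):
        print(f"{seen:6d} rows: SUM ~ {estimate:.1f} +/- {half_width:.1f}")
```

A caller could stop iterating as soon as `half_width` falls below a chosen tolerance, which mirrors the early-stopping behavior described in the abstract.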
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to stop the execution accurately and promptly. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
Aggregation kinetics of stiff polyelectrolytes in the presence of multivalent salt
Using molecular dynamics simulations, the kinetics of bundle formation for
stiff polyelectrolytes such as actin is studied in a solution of multivalent
salt. The dominant kinetic mode of aggregation is found to be the case of one
end of one rod meeting others at a right angle due to electrostatic interactions.
The kinetic pathway to bundle formation involves a hierarchical structure of
small clusters forming initially and then feeding into larger clusters, which
is reminiscent of the flocculation dynamics of colloids. For the first few
cluster sizes, the Smoluchowski formula for the time evolution of the cluster
size gives a reasonable account of the results of our simulation without a
single fitting parameter. The description using the Smoluchowski formula provides
evidence that the aggregation time scale is controlled by diffusion, with no
appreciable energy barrier to overcome. Comment: 6 pages, 5 figures, Phys. Rev. E (Accepted
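For reference, the general Smoluchowski coagulation equation is given below in its textbook form. The notation is an assumption, not taken from the paper: $n_k(t)$ is the concentration of clusters of size $k$ and $K_{ij}$ is the rate kernel for merging clusters of sizes $i$ and $j$. The abstract does not state which kernel the authors use, so the closed-form solution shown is only the standard special case of a constant kernel $K$ with monodisperse initial concentration $n_0$.

```latex
% General Smoluchowski coagulation equation (textbook form)
\frac{\mathrm{d} n_k}{\mathrm{d} t}
  = \frac{1}{2} \sum_{i+j=k} K_{ij}\, n_i\, n_j
  \;-\; n_k \sum_{j \ge 1} K_{kj}\, n_j

% Constant kernel K_{ij} = K, initial condition n_k(0) = n_0 \delta_{k,1}
% (assumed special case, not necessarily the one used in the paper)
n_k(t) = n_0\, \frac{(t/\tau)^{k-1}}{(1 + t/\tau)^{k+1}},
\qquad \tau = \frac{2}{K n_0}
```

In this special case the only time scale is $\tau$, which for diffusion-limited aggregation is set by the diffusive encounter rate of the clusters.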
Aggregation of self-propelled colloidal rods near confining walls
Non-equilibrium collective behavior of self-propelled colloidal rods in a
confining channel is studied using Brownian dynamics simulations and dynamical
density functional theory. We observe an aggregation process in which rods
self-organize into transiently jammed clusters at the channel walls. In the
early stage of the process, fast-growing hedgehog-like clusters are formed
which are largely immobile. At later stages, most of these clusters dissolve
and mobilize into nematized aggregates sliding past the walls. Comment: 5 pages, 4 figure
Benchmarking Distributed Stream Data Processing Systems
The need for scalable and efficient stream analysis has led to the
development of many open-source streaming data processing systems (SDPSs) with
highly diverging capabilities and performance characteristics. While first
initiatives try to compare the systems for simple workloads, there is a clear
lack of detailed analyses of the systems' performance characteristics. In this
paper, we propose a framework for benchmarking distributed stream processing
engines. We use our suite to evaluate the performance of three widely used
SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our
evaluation focuses in particular on measuring the throughput and latency of
windowed operations, which are the basic type of operations in stream
analytics. For this benchmark, we design workloads based on real-life,
industrial use-cases inspired by the online gaming industry. The contribution
of our work is threefold. First, we give a definition of latency and throughput
for stateful operators. Second, we carefully separate the system under test and
the driver in order to correctly represent the open world model of typical stream
processing deployments, and can therefore measure system performance under
realistic conditions. Third, we build the first benchmarking framework to
define and test the sustainable performance of streaming systems.
Our detailed evaluation highlights the individual characteristics and
use-cases of each system. Comment: Published at ICDE 201
A collective intelligence approach for building student's trustworthiness profile in online learning
(c) 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Information and communication technologies have been widely adopted in most educational institutions to support e-Learning through different learning methodologies such as computer-supported collaborative learning, which has become one of the most influential learning paradigms. In this context, e-Learning stakeholders are increasingly demanding new requirements; among them, information security is considered a critical factor in on-line collaborative processes. Information security determines the accurate development of learning activities, especially when a group of students carries out an on-line assessment that leads to grades or certificates; in these cases, information security (IS) is an essential issue that has to be considered. To date, even the most advanced technological security solutions have drawbacks that impede the development of overall security e-Learning frameworks. For this reason, this paper suggests enhancing technological security models with functional approaches; namely, we propose a functional security model based on trustworthiness and collective intelligence. Both of these topics are closely related to on-line collaborative learning and on-line assessment models. Therefore, the main goal of this paper is to discover how security can be enhanced with trustworthiness in an on-line collaborative learning scenario through the study of the collective intelligence processes that occur in on-line assessment activities.
To this end, a peer-to-peer public student profile model based on trustworthiness is proposed, and the main collective intelligence processes involved in collaborative on-line assessment activities are presented. Peer Reviewed. Postprint (author's final draft
Reinforcement machine learning for predictive analytics in smart cities
The digitization of our lives causes a shift in data production as well as in the required data management. Numerous nodes are capable of producing huge volumes of data in our everyday activities. Sensors, personal smart devices, and the Internet of Things (IoT) paradigm lead to a vast infrastructure that covers all aspects of activity in modern societies. In most cases, the critical issue for public authorities (usually local ones, such as municipalities) is the efficient management of data to support novel services. The reason is that analytics provided on top of the collected data could help in the delivery of new applications that will facilitate citizens’ lives. However, the provision of analytics demands intelligent techniques for the underlying data management. The best-known technique is the separation of huge volumes of data into a number of parts and their parallel management to limit the time required for the delivery of analytics. Afterwards, analytics requests in the form of queries can be executed to derive the knowledge needed to support intelligent applications. In this paper, we define the concept of a Query Controller (QC) that receives queries for analytics and assigns each of them to a processor placed in front of each data partition. We discuss an intelligent process for query assignments that adopts Machine Learning (ML). We adopt two learning schemes, i.e., Reinforcement Learning (RL) and clustering. We report on the comparison of the two schemes and elaborate on their combination. Our aim is to provide an efficient framework to support the decision making of the QC, which should swiftly select the appropriate processor for each query. We provide mathematical formulations for the discussed problem and present simulation results.
Through a comprehensive experimental evaluation, we reveal the advantages of the proposed models and describe the outcomes while comparing them with a deterministic framework.