90,486 research outputs found

    Parallel Online Aggregation in Action

    Get PDF
    ABSTRACT Online aggregation provides continuous estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing terminate and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation does not incur overhead on top of the actual processing. We define a generic interface to express any estimation model that abstracts completely the execution details. We design multiple samplingbased estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants are shown how estimates to general SQL aggregation queries over terabytes of TPC-H data are generated during the entire processing. Due to parallel execution, the estimate converges to the correct result in a matter of seconds even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration

    Speculative Approximations for Terascale Analytics

    Full text link
    Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th1/20^{\text{th}} fraction of the time

    Learning Scheduling Algorithms for Data Processing Clusters

    Full text link
    Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load

    Aggregation kinetics of stiff polyelectrolytes in the presence of multivalent salt

    Full text link
    Using molecular dynamics simulations, the kinetics of bundle formation for stiff polyelectrolytes such as actin is studied in the solution of multivalent salt. The dominant kinetic mode of aggregation is found to be the case of one end of one rod meeting others at right angle due to electrostatic interactions. The kinetic pathway to bundle formation involves a hierarchical structure of small clusters forming initially and then feeding into larger clusters, which is reminiscent of the flocculation dynamics of colloids. For the first few cluster sizes, the Smoluchowski formula for the time evolution of the cluster size gives a reasonable account for the results of our simulation without a single fitting parameter. The description using Smoluchowski formula provides evidence for the aggregation time scale to be controlled by diffusion, with no appreciable energy barrier to overcome.Comment: 6 pages, 5 figures, Phys. Rev. E (Accepted

    Aggregation of self-propelled colloidal rods near confining walls

    Full text link
    Non-equilibrium collective behavior of self-propelled colloidal rods in a confining channel is studied using Brownian dynamics simulations and dynamical density functional theory. We observe an aggregation process in which rods self-organize into transiently jammed clusters at the channel walls. In the early stage of the process, fast-growing hedgehog-like clusters are formed which are largely immobile. At later stages, most of these clusters dissolve and mobilize into nematized aggregates sliding past the walls.Comment: 5 pages, 4 figure

    Benchmarking Distributed Stream Data Processing Systems

    Full text link
    The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap of detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use-cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test and driver, in order to correctly represent the open world model of typical stream processing deployments and can, therefore, measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use-cases of each system.Comment: Published at ICDE 201

    A collective intelligence approach for building student's trustworthiness profile in online learning

    Get PDF
    (c) 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.Information and communication technologies have been widely adopted in most of educational institutions to support e-Learning through different learning methodologies such as computer supported collaborative learning, which has become one of the most influencing learning paradigms. In this context, e-Learning stakeholders, are increasingly demanding new requirements, among them, information security is considered as a critical factor involved in on-line collaborative processes. Information security determines the accurate development of learning activities, especially when a group of students carries out on-line assessment, which conducts to grades or certificates, in these cases, IS is an essential issue that has to be considered. To date, even most advances security technological solutions have drawbacks that impede the development of overall security e-Learning frameworks. For this reason, this paper suggests enhancing technological security models with functional approaches, namely, we propose a functional security model based on trustworthiness and collective intelligence. Both of these topics are closely related to on-line collaborative learning and on-line assessment models. Therefore, the main goal of this paper is to discover how security can be enhanced with trustworthiness in an on-line collaborative learning scenario through the study of the collective intelligence processes that occur on on-line assessment activities. To this end, a peer-to-peer public student's profile model, based on trustworthiness is proposed, and the main collective intelligence processes involved in the collaborative on-line assessments activities, are presented.Peer ReviewedPostprint (author's final draft

    Reinforcement machine learning for predictive analytics in smart cities

    Get PDF
    The digitization of our lives cause a shift in the data production as well as in the required data management. Numerous nodes are capable of producing huge volumes of data in our everyday activities. Sensors, personal smart devices as well as the Internet of Things (IoT) paradigm lead to a vast infrastructure that covers all the aspects of activities in modern societies. In the most of the cases, the critical issue for public authorities (usually, local, like municipalities) is the efficient management of data towards the support of novel services. The reason is that analytics provided on top of the collected data could help in the delivery of new applications that will facilitate citizens’ lives. However, the provision of analytics demands intelligent techniques for the underlying data management. The most known technique is the separation of huge volumes of data into a number of parts and their parallel management to limit the required time for the delivery of analytics. Afterwards, analytics requests in the form of queries could be realized and derive the necessary knowledge for supporting intelligent applications. In this paper, we define the concept of a Query Controller ( QC ) that receives queries for analytics and assigns each of them to a processor placed in front of each data partition. We discuss an intelligent process for query assignments that adopts Machine Learning (ML). We adopt two learning schemes, i.e., Reinforcement Learning (RL) and clustering. We report on the comparison of the two schemes and elaborate on their combination. Our aim is to provide an efficient framework to support the decision making of the QC that should swiftly select the appropriate processor for each query. We provide mathematical formulations for the discussed problem and present simulation results. Through a comprehensive experimental evaluation, we reveal the advantages of the proposed models and describe the outcomes results while comparing them with a deterministic framework
    • …
    corecore