A Comparison of Big Data Frameworks on a Layered Dataflow Model
In the world of Big Data analytics, there is a series of tools aiming at
simplifying the programming of applications to be executed on clusters. Although each
tool claims to provide better programming, data and execution models, for which
only informal (and often confusing) semantics is generally provided, all share
a common underlying model, namely, the Dataflow model. The Dataflow model we
propose shows how various tools share the same expressiveness at different
levels of abstraction. The contribution of this work is twofold: first, we show
that the proposed model is (at least) as general as existing batch and
streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to
understand high-level data-processing applications written in such frameworks.
Second, we provide a layered model that can represent tools and applications
following the Dataflow paradigm and we show how the analyzed tools fit in each
level.

Comment: 19 pages, 6 figures, 2 tables. In Proc. of the 9th Intl Symposium on
High-Level Parallel Programming and Applications (HLPP), July 4-5, 2016,
Muenster, Germany.
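To make the shared structure concrete, here is a minimal sketch (ours, not the paper's) of a word-count pipeline written against Spark's Python API; Flink exposes an analogous chain of transformations, and it is exactly this common graph of operators that the layered Dataflow model captures. File names are placeholders.

    # Minimal sketch: a dataflow of source -> transformations -> sink in PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("input.txt")               # source node
                .flatMap(lambda line: line.split())  # one token per element
                .map(lambda word: (word, 1))         # key-value pairs
                .reduceByKey(lambda a, b: a + b))    # aggregation node
    counts.saveAsTextFile("counts")                  # sink node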
Constraints for behavioural specifications
Behavioural specifications with constraints for the incremental development of algebraic specifications are presented. The behavioural constraints correspond to the completely defined subparts of a given incomplete behavioural specification. Moreover, the local observability criteria used within a behavioural constraint need not coincide with the global criteria used in the behavioural specification. This is necessary because, otherwise, some constraints could involve only non-observable sorts and therefore have trivial semantics. Finally, extension operations and completion operations for refining specifications are defined. The extension operations correspond to horizontal refinements: they build larger specifications on top of existing ones in a conservative way. The completion operations correspond to vertical refinements: they add detail to an incomplete behavioural specification and, unlike extensions, they do restrict the class of models.
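As a hedged illustration of why local and global observability may differ (the stack example is ours, not the paper's): take a specification of stacks of naturals in which only the sort nat is observable. Two stacks are then behaviourally equal exactly when all observable experiments on them agree,

    s \approx s' \;\iff\; \forall n \ge 0.\;
        \mathit{top}(\mathit{pop}^n(s)) = \mathit{top}(\mathit{pop}^n(s'))

whereas a constraint whose sorts are all non-observable would identify all of its values and thus have trivial semantics; giving such a constraint its own local observability criteria avoids this collapse.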
Compile-Time Query Optimization for Big Data Analytics
Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch, as any Scala code can be seamlessly mixed with SQL-like syntax without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system.
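DIQL itself is embedded in Scala; as a hedged Python illustration (using PySpark, not DIQL) of the impedance mismatch that string-based query embedding creates and that a compile-time deep embedding is designed to remove:

    # Sketch only: PySpark for illustration; DIQL is a Scala compile-time embedding.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.read.json("employees.json")   # placeholder input file
    df.createOrReplaceTempView("employees")

    # String-embedded SQL: the query is opaque to the host compiler, so typos
    # and type errors surface only at run time -- the impedance mismatch.
    high = spark.sql("SELECT name, salary FROM employees WHERE salary > 100000")

A deep embedding instead checks and optimizes the query during compilation, while letting host-language code appear anywhere inside it.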
Proto-Plasm: parallel language for adaptive and scalable modelling of biosystems
This paper discusses the design goals and the first developments of
Proto-Plasm, a novel computational environment to produce libraries
of executable, combinable and customizable computer models of natural and
synthetic biosystems, aiming to provide a supporting framework for predictive
understanding of structure and behaviour through multiscale geometric modelling
and multiphysics simulations. Admittedly, the Proto-Plasm platform is
still in its infancy. Its computational framework (language, model library,
integrated development environment and parallel engine) intends to provide
patient-specific computational modelling and simulation of organs and biosystems,
exploiting novel functionalities resulting from the symbolic combination of
parametrized models of parts at various scales. Proto-Plasm may define
the model equations, but it is currently focused on the symbolic description of
model geometry and on the parallel support of simulations. Conversely, CellML
and SBML could be viewed as defining the behavioural functions (the model
equations) to be used within a Proto-Plasm program. Here we exemplify
the basic functionalities of Proto-Plasm, by constructing a schematic
heart model. We also discuss multiscale issues with reference to the geometric
and physical modelling of neuromuscular junctions.
Shape-based cost analysis of skeletal parallel programs
This work presents an automatic cost-analysis system for an implicitly parallel skeletal
programming language.
Although deducing interesting dynamic characteristics of parallel programs (and in
particular, run time) is well known to be an intractable problem in the general case,
the problem can be alleviated by placing restrictions upon the programs which can be expressed.
By combining two research threads, the 'skeletal' and 'shapely' paradigms which
take this route, we produce a completely automated, computation- and
communication-sensitive cost analysis system. This builds on earlier work in the area by quantifying
communication as well as computation costs, with the former being derived for the
Bulk Synchronous Parallel (BSP) model.
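For reference, the communication model used here is standard BSP (the formula below is the textbook one, not a contribution of the thesis): a superstep in which each process computes for at most $w$ time units and sends or receives at most $h$ words costs

    w + h \cdot g + l,

where $g$ is the per-word communication cost and $l$ the barrier synchronisation latency; a program's cost is the sum of its superstep costs, $\sum_i (w_i + h_i g + l)$.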
We present details of our shapely skeletal language and its BSP implementation strategy
together with an account of the analysis mechanism by which program behaviour
information (such as shape and cost) is statically deduced. This information can be
used at compile-time to optimise a BSP implementation and to analyse computation
and communication costs. The analysis has been implemented in Haskell. We consider
different algorithms expressed in our language for some example problems and
illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional,
intuitive methods with that achieved by our cost calculator. The accuracy of
the cost calculator's predictions is tested experimentally against the run times of
real parallel programs.
Previous shape-based cost analysis required all elements of a vector (our nestable bulk
data structure) to have the same shape. We partially relax this strict requirement on data
structure regularity by introducing new shape expressions in our analysis framework.
We demonstrate that this allows us to achieve the first automated analysis of a complete
derivation, the well-known maximum segment sum algorithm of Skillicorn and Cai.
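For context, maximum segment sum asks for the largest sum attained by a contiguous segment of a sequence. A minimal sequential Python version (ours, for illustration; the thesis analyses a skeleton-based parallel derivation of the algorithm) is:

    # Maximum segment sum, sequential illustration (Kadane's scheme).
    def mss(xs):
        best = ending_here = 0
        for x in xs:
            ending_here = max(0, ending_here + x)  # best segment ending here
            best = max(best, ending_here)          # best segment seen so far
        return best

    assert mss([3, -4, 5, -1, 2, -6, 4]) == 6      # the segment [5, -1, 2]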
Chain-Based Representations for Solid and Physical Modeling
In this paper we show that the (co)chain complex associated with a
decomposition of the computational domain, commonly called a mesh in
computational science and engineering, can be represented by a block-bidiagonal
matrix that we call the Hasse matrix. Moreover, we show that
topology-preserving mesh refinements, produced by the action of (the simplest)
Euler operators, can be reduced to multilinear transformations of the Hasse
matrix representing the complex. Our main result is a new representation of the
(co)chain complex underlying field computations, a representation that provides
new insights into the transformations induced by local mesh refinements. Our
approach is based on first principles and is general in that it applies to most
representational domains that can be characterized as cell complexes, without
any restrictions on their type, dimension, codimension, orientability,
manifoldness, or connectedness.
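As a hedged illustration of the block-bidiagonal shape (our reading of the layout; sign and ordering conventions are omitted): for a two-dimensional mesh with boundary maps $\partial_1 : C_1 \to C_0$ and $\partial_2 : C_2 \to C_1$, the Hasse matrix stacks the (co)chain groups along the diagonal and places the boundary operators in the superdiagonal blocks,

    H = \begin{pmatrix}
          0 & \partial_1 & 0          \\
          0 & 0          & \partial_2 \\
          0 & 0          & 0
        \end{pmatrix},
    \qquad \partial_1 \partial_2 = 0,

so for a single triangle (3 vertices, 3 edges, 1 face) the blocks have sizes $3 \times 3$ and $3 \times 1$.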
PySke: Algorithmic Skeletons for Python
PySke is a library of parallel algorithmic skeletons in Python designed for list and tree data structures. Such algorithmic skeletons are higher-order functions implemented in parallel. An application developed with PySke is a composition of skeletons. To ease the writing of parallel programs, PySke does not follow the Single Program Multiple Data (SPMD) paradigm but offers a global view of parallel programs to users. This approach aims at writing scalable programs easily. In addition to the library, we present experiments performed on a high-performance computing cluster (distributed memory) on a set of example applications developed with PySke.
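A hedged sketch of the global-view style (the names PList, init, map and reduce follow the PySke paper's examples, but the import path and exact signatures are assumptions here, not verified API):

    # Illustrative sketch; import path and signatures are assumptions.
    from pyske.core import PList

    # One program text, one global view: no explicit process ranks or
    # message passing, unlike SPMD codes.
    data = PList.init(lambda i: i + 1, 1000)   # distributed list [1..1000]
    total = data.map(lambda x: x * x).reduce(lambda a, b: a + b)
    print(total)                               # sum of squares, computed in parallel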