1,453 research outputs found

    DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge

    The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for processing large astronomical datasets at the scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components, and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. Execution in DALiuGE is data-activated: each individual data item autonomously triggers the processing on itself. Such decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from fewer than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph, and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities. Comment: 31 pages, 12 figures, currently under review by Astronomy and Computin
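    The data-activated idea in this abstract can be illustrated with a minimal sketch. It is not DALiuGE's API: the names `DataItem`, `Task`, and `run_example` are hypothetical, and the sketch runs in a single process. It only shows the principle that completing a data item, rather than a central scheduler, is what triggers the tasks consuming it.

```python
from typing import Callable, List


class DataItem:
    """Hypothetical data item that notifies its consumers once it is complete."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = None
        self.consumers: List["Task"] = []

    def complete(self, value) -> None:
        # Completing the item drives execution; there is no central scheduler.
        self.value = value
        for task in self.consumers:
            task.input_ready(self)


class Task:
    """Hypothetical task that runs `func` once all of its inputs are complete."""

    def __init__(self, func: Callable, inputs: List[DataItem], output: DataItem) -> None:
        self.func = func
        self.inputs = inputs
        self.output = output
        self.pending = len(inputs)
        for item in inputs:
            item.consumers.append(self)

    def input_ready(self, item: DataItem) -> None:
        # Count down the outstanding inputs; fire when none remain.
        self.pending -= 1
        if self.pending == 0:
            result = self.func(*(i.value for i in self.inputs))
            self.output.complete(result)


def run_example() -> int:
    """Build a two-task pipeline: total = a + b, then doubled = 2 * total."""
    a, b, total, doubled = (DataItem(n) for n in ("a", "b", "total", "doubled"))
    Task(lambda x, y: x + y, [a, b], total)
    Task(lambda t: 2 * t, [total], doubled)
    a.complete(3)  # nothing runs yet: `b` is still missing
    b.complete(4)  # triggers the sum, whose output triggers the doubling
    return doubled.value
```

    Calling `run_example()` completes the two source items, and the rest of the pipeline follows from those two events alone.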

    Modern web-programming language concurrency

    This Master's thesis compares Elixir, Go and JavaScript (Node.js) as programming language candidates for writing concurrent RESTful web service backends. First, we describe each of the languages. Next, we compare the functional concurrency characteristics of the languages with each other. Finally, we perform scalability testing for each of the languages using the Locust.io framework. For testing purposes, we introduce simple REST API implementations for each of the languages. The tests showed that JavaScript performed the worst of the three languages and that Go was the most verbose language to program with

    An All-in-One Debugging Approach: Java Debugging, Execution Visualization and Verification

    We devise a widely applicable debugging approach to deal with the prevailing issue that bugs cannot be precisely reproduced in nondeterministic, complex concurrent programs. A distinct, efficient record-and-playback mechanism records all the internal states of execution, including intermediate results, by injecting our own bytecode, which does not affect the source code. Through a two-step data processing mechanism, these data are aggregated, structured and processed in parallel so that they can be replayed with high fidelity while keeping the overhead at a satisfactory level. Docker and Git are employed to create a clean environment in which the execution can be repeated so that bugs are reproduced as precisely as possible. In our development, several other leading technologies, such as MongoDB, Spark and Node.js, are utilized and smoothly integrated for easy implementation. Altogether, we develop a system for Java Debugging Execution Visualization and Verification (JDevv), a debugging tool for Java, although our debugging approach can apply to other languages as well. JDevv also offers an aggregated and interactive visualization to ease users’ code verification

    Joining and aggregating datasets using CouchDB

    Data mining typically requires operations that cut across entity boundaries and are awkward to implement in document-oriented databases. CouchDB, for example, models entities as documents with highly isolated entity boundaries, on which joins cannot be directly performed. This project shows how join and aggregation can be achieved across entity boundaries in such systems, as encountered for example in the pre-processing and exploration stages of educational data mining. A software stack is presented as a means by which this can be achieved: first, datasets are processed via ETL operations, then MapReduce is used to create indices of ordered and aggregated data. Finally, a CouchDB list function is used to iterate through these indices and perform joins, and to compute aggregated values on joined datasets, such as variance and correlations. In terms of the case study, it is shown that the proposed approach to implementing cross-document joins and aggregation is effective and scalable. In addition, it was discovered that for the 2014-2016 UCT cohorts, NBT scores correlate better with final grades for the CSC1015F course than do Grade 12 results for English, Science and Mathematics
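    The final step described above, iterating through ordered indices to join them and then aggregating the joined rows, can be sketched as follows. This is an illustration in Python, not the project's code: a real CouchDB list function is written in JavaScript and runs inside the database. The function names and the sample scores are hypothetical, and the sketch assumes each index is sorted by key with at most one row per key.

```python
from math import sqrt
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, float]  # (key, value), e.g. (student id, score)


def merge_join(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[str, float, float]]:
    """Join two key-sorted streams in one pass (assumes unique keys per side)."""
    left_it, right_it = iter(left), iter(right)
    l, r = next(left_it, None), next(right_it, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]
            l, r = next(left_it, None), next(right_it, None)
        elif l[0] < r[0]:
            l = next(left_it, None)  # key only present on the left: skip it
        else:
            r = next(right_it, None)  # key only present on the right: skip it


def pearson(pairs: List[Tuple[float, float]]) -> float:
    """Pearson correlation coefficient of the joined (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


# Made-up sample data for illustration only (not results from the study).
entrance_scores = [("s1", 50.0), ("s2", 60.0), ("s3", 70.0), ("s4", 80.0)]
final_grades = [("s1", 55.0), ("s3", 75.0), ("s4", 85.0)]

joined = list(merge_join(entrance_scores, final_grades))
correlation = pearson([(x, y) for _, x, y in joined])
```

    Because both inputs are already ordered by key, the join needs only a single pass over each index, which is the property the ordered MapReduce indices provide.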

    Optimizing sequences traversal and extensibility

    Dissertation submitted for the degree of Master in Informatics and Computer Engineering. Yield generators are a well-known programming feature available in most widely used programming environments, such as JavaScript, Python and many others. They allow easy and compact extensibility of stream operations, such as on iterators or enumerable types. Yet two questions arise about their use: 1) Are generators the most efficient choice for extending sequences with new user-defined operations? 2) What if the development language does not provide the yield feature, as in Java? The research work that I describe in this dissertation aims to answer these two questions. To that end, I analyzed the designs of a sequence type in two different programming languages, Java and JavaScript. I also studied the state-of-the-art alternatives to the out-of-the-box sequences included in each language across a set of features, devising benchmarks to analyze their performance with real-world use cases; these benchmarks are available for developers to use when choosing a sequence type according to their needs. In addition, I propose my own sequence type, based on a minimalist design that allows both concise extension of its API and fluent chaining of user-defined operations. My proposal aims to be as simple and transparent as possible, so that developers may clearly understand what they are using. Finally, I answer the question "When should you use parallelism?" with a set of benchmarks that compare the sequential processing of Java Streams with its parallel counterpart.
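    As a small illustration of the extensibility discussed in this abstract, the sketch below defines a new lazy sequence operation with a yield generator and chains it fluently. It is written in Python rather than Java or JavaScript, and it is not the dissertation's proposed sequence type: `collapse`, `Seq`, and `then` are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def collapse(source: Iterable[T]) -> Iterator[T]:
    """User-defined operation: lazily drop consecutive duplicate items."""
    sentinel = object()  # marks "no previous item yet"
    previous = sentinel
    for item in source:
        if previous is sentinel or item != previous:
            yield item
        previous = item


class Seq:
    """Hypothetical minimal wrapper so custom operations chain fluently."""

    def __init__(self, source: Iterable) -> None:
        self._source = source

    def then(self, operation: Callable[..., Iterable], *args) -> "Seq":
        # Any function taking an iterable and returning one can be plugged in.
        return Seq(operation(self._source, *args))

    def __iter__(self) -> Iterator:
        return iter(self._source)


def times_ten(source: Iterable[int]) -> Iterator[int]:
    """A second user-defined operation, also written as a generator."""
    for item in source:
        yield item * 10


result = list(Seq([1, 1, 2, 2, 2, 3, 1]).then(collapse).then(times_ten))
```

    Nothing is computed until `list` pulls items through the chain, so each operation sees one element at a time.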

    Continuation-Passing C: compiling threads to events through continuations

    In this paper, we introduce Continuation Passing C (CPC), a programming language for concurrent systems in which native and cooperative threads are unified and presented to the programmer as a single abstraction. The CPC compiler uses a compilation technique, based on the CPS transform, that yields efficient code and an extremely lightweight representation for contexts. We provide a proof of the correctness of our compilation scheme. We show in particular that lambda-lifting, a common compilation technique for functional languages, is also correct in an imperative language like C, under some conditions enforced by the CPC compiler. The current CPC compiler is mature enough to write substantial programs such as Hekate, a highly concurrent BitTorrent seeder. Our benchmark results show that CPC is as efficient, while using significantly less space, as the most efficient thread libraries available. Comment: Higher-Order and Symbolic Computation (2012). arXiv admin note: substantial text overlap with arXiv:1202.324
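    For readers unfamiliar with the CPS transform mentioned in this abstract, the sketch below shows the general idea in Python. It is not CPC and does not reflect how the CPC compiler works internally; the function names are hypothetical. A direct-style recursive function is rewritten so that "the rest of the computation" is passed explicitly as a continuation `k`, and a small driver loop (a trampoline) runs the resulting steps one at a time, much as an event loop would.

```python
def sum_to_direct(n: int) -> int:
    """Direct style: the pending additions live on the call stack."""
    return 0 if n == 0 else n + sum_to_direct(n - 1)


def sum_to_cps(n: int, k):
    """Continuation-passing style: returns the next step as a zero-argument thunk."""
    if n == 0:
        return lambda: k(0)
    # The continuation records what remains to be done once the sub-result
    # `acc` is known: add `n`, then hand the total to the outer continuation.
    return lambda: sum_to_cps(n - 1, lambda acc: lambda: k(acc + n))


def trampoline(step):
    """Run thunks until a non-callable final value is produced."""
    while callable(step):
        step = step()
    return step


# The pending work is held in heap-allocated closures rather than stack
# frames, so the call stack stays shallow however large `n` is.
total = trampoline(sum_to_cps(10000, lambda value: value))
```

    Each pass through the trampoline loop is a small, resumable unit of work, which is the property that lets a thread-like computation be driven as a sequence of events.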

    ImageJ2: ImageJ for the next generation of scientific image data

    ImageJ is an image analysis program extensively used in the biological sciences and beyond. Due to its ease of use, recordable macro language, and extensible plug-in architecture, ImageJ enjoys contributions from non-programmers, amateur programmers, and professional developers alike. Enabling such a diversity of contributors has resulted in a large community that spans the biological and physical sciences. However, a rapidly growing user base, diverging plugin suites, and technical limitations have revealed a clear need for a concerted software engineering effort to support emerging imaging paradigms, to ensure the software's ability to handle the requirements of modern science. Due to these new and emerging challenges in scientific imaging, ImageJ is at a critical development crossroads. We present ImageJ2, a total redesign of ImageJ offering a host of new functionality. It separates concerns, fully decoupling the data model from the user interface. It emphasizes integration with external applications to maximize interoperability. Its robust new plugin framework allows everything from image formats, to scripting languages, to visualization to be extended by the community. The redesigned data model supports arbitrarily large, N-dimensional datasets, which are increasingly common in modern image acquisition. Despite the scope of these changes, backwards compatibility is maintained such that this new functionality can be seamlessly integrated with the classic ImageJ interface, allowing users and developers to migrate to these new methods at their own pace. ImageJ2 provides a framework engineered for flexibility, intended to support these requirements as well as accommodate future needs