
    DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge

    The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for processing large astronomical datasets at a scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components, and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. The execution in DALiuGE is data-activated, where each individual data item autonomously triggers the processing on itself. Such decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from fewer than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl G. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph, and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide an in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities. (Comment: 31 pages, 12 figures, currently under review by Astronomy and Computing.)
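
    The data-activated execution model described in the abstract can be illustrated with a minimal, hypothetical sketch; this is not the DALiuGE API, and all class and drop names below are invented. A data item ("drop") tracks its own completion and, once complete, triggers the application components registered as its consumers.

```python
# Minimal illustration of data-activated execution (hypothetical, not the DALiuGE API):
# a data drop triggers its registered consumers as soon as it becomes complete.

class DataDrop:
    def __init__(self, uid):
        self.uid = uid
        self.data = None
        self.consumers = []          # application components to trigger

    def add_consumer(self, consumer):
        self.consumers.append(consumer)

    def write(self, data):
        self.data = data
        self.set_completed()         # completing the drop activates processing

    def set_completed(self):
        for consumer in self.consumers:
            consumer.run(self)       # each consumer pulls its input drop


class AppDrop:
    def __init__(self, uid, func, output):
        self.uid, self.func, self.output = uid, func, output

    def run(self, input_drop):
        self.output.write(self.func(input_drop.data))


if __name__ == "__main__":
    raw = DataDrop("raw_visibilities")
    calibrated = DataDrop("calibrated")
    calibrated.add_consumer(AppDrop("imager", lambda d: f"image({d})", DataDrop("image")))
    raw.add_consumer(AppDrop("calibrator", lambda d: f"cal({d})", calibrated))
    raw.write("vis.ms")              # writing the first drop activates the whole chain
```

    Writing the first drop activates the rest of the chain without any central scheduler, which is the decentralised behaviour the abstract attributes to the framework.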

    ARM Wrestling with Big Data: A Study of Commodity ARM64 Server for Big Data Workloads

    ARM processors have dominated the mobile device market in the last decade due to their favorable computing-to-energy ratio. In this age of Cloud data centers and Big Data analytics, the focus is increasingly on power-efficient processing, rather than just high-throughput computing. ARM's first commodity server-grade processor is the recent AMD A1100-series processor, based on a 64-bit ARM Cortex A57 architecture. In this paper, we study the performance and energy efficiency of a server based on this ARM64 CPU, relative to a comparable server running an AMD Opteron 3300-series x64 CPU, for Big Data workloads. Specifically, we study these for Intel's HiBench suite of web, query and machine learning benchmarks on Apache Hadoop v2.7 in a pseudo-distributed setup, for data sizes of up to 20 GB files, 5M web pages and 500M tuples. Our results show that the ARM64 server's runtime performance is comparable to the x64 server for integer-based workloads like Sort and Hive queries, and only lags behind for floating-point intensive benchmarks like PageRank, when they do not exploit data parallelism adequately. We also see that the ARM64 server takes about 1/3rd the energy, and has an Energy Delay Product (EDP) that is 50-71% lower than the x64 server. These results hold promise for ARM64 data centers hosting Big Data workloads to reduce their operational costs, while opening up opportunities for further analysis. (Comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2017.)
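
    The quoted figures are internally consistent: with the Energy Delay Product defined as energy multiplied by runtime, a server drawing roughly one third of the energy at a comparable or moderately longer runtime lands in the reported 50-71% EDP reduction. A back-of-the-envelope check with illustrative numbers (not measurements from the paper):

```python
# Back-of-the-envelope check of the reported EDP reduction (illustrative numbers only).
def edp(energy_joules, runtime_seconds):
    """Energy Delay Product: energy x runtime, lower is better."""
    return energy_joules * runtime_seconds

x64_energy, x64_runtime = 900.0, 100.0      # hypothetical x64 baseline
arm_energy = x64_energy / 3                 # ~1/3rd the energy, as reported
for arm_runtime in (100.0, 150.0):          # comparable vs. somewhat slower run
    reduction = 1 - edp(arm_energy, arm_runtime) / edp(x64_energy, x64_runtime)
    print(f"ARM64 runtime {arm_runtime:.0f}s -> EDP reduction {reduction:.0%}")
# Prints roughly 67% and 50%, bracketing the 50-71% range in the abstract.
```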

    Towards a big data reference architecture


    Big Data Testing Techniques: Taxonomy, Challenges and Future Trends

    Big Data is reforming many industrial domains by providing decision support through the analysis of large data volumes. Big Data testing aims to ensure that Big Data systems run smoothly and error-free while maintaining performance and data quality. However, because of the diversity and complexity of the data, testing Big Data is challenging. Though numerous research efforts deal with Big Data testing, a comprehensive review addressing the testing techniques and challenges of Big Data is not yet available. Therefore, we have systematically reviewed the evidence on Big Data testing techniques published in the period 2010-2021. This paper discusses the testing of data processing by highlighting the techniques used in each processing phase. Furthermore, we discuss the challenges and future directions. Our findings show that diverse functional, non-functional and combined (functional and non-functional) testing techniques have been used to solve specific problems related to Big Data. At the same time, most of the testing challenges are faced during the MapReduce validation phase. In addition, combinatorial testing is one of the most applied techniques, often in combination with other techniques (i.e., random testing, mutation testing, input space partitioning and equivalence testing), to find various functional faults through Big Data testing. (Comment: 32 pages.)
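
    As a rough illustration of the combinatorial testing highlighted in the findings, the sketch below builds a small pairwise test suite over a handful of MapReduce-style configuration options. The parameters, values and greedy construction are assumptions made for illustration, not taken from the surveyed papers.

```python
# Pairwise (all-pairs) combinatorial testing sketch over hypothetical job parameters.
from itertools import combinations, product

parameters = {
    "input_format": ["text", "sequence", "avro"],
    "compression": ["none", "snappy", "gzip"],
    "num_reducers": [1, 8, 64],
    "speculative_execution": [True, False],
}
names = list(parameters)

# Every pair of values, across every pair of parameters, that must be exercised.
required = {((a, va), (b, vb))
            for a, b in combinations(names, 2)
            for va, vb in product(parameters[a], parameters[b])}

def pairs_of(config):
    """All parameter-value pairs that one concrete configuration covers."""
    return {((a, config[a]), (b, config[b])) for a, b in combinations(names, 2)}

# Greedy covering: repeatedly pick the configuration covering the most uncovered
# pairs -- a standard (non-minimal) way to build an all-pairs test suite.
exhaustive = [dict(zip(names, values)) for values in product(*parameters.values())]
suite, uncovered = [], set(required)
while uncovered:
    best = max(exhaustive, key=lambda cfg: len(pairs_of(cfg) & uncovered))
    suite.append(best)
    uncovered -= pairs_of(best)

print(f"{len(suite)} pairwise test cases cover all {len(required)} value pairs "
      f"(exhaustive testing would need {len(exhaustive)} configurations)")
```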

    Development of an Application to Run Integration Tests on a Data Pipeline

    Internship report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. As new technologies continue to surge, the ever-growing complexity of modern software products and architectures significantly increases the difficulty and development cost of a fully tested application. As agile methodologies push for faster delivery of new features, a tool to automatically test the different components of an application has become an essential prerequisite in many software development teams. In this context, this report describes the development of an application for the automatic execution of integration tests. The application was developed during a data engineering internship at Xing, a well-established German career-oriented professional networking platform. At the time of writing, Xing counts more than 19 million users, most of them from Germany, Austria, and Switzerland. The internship project was carried out in a team focused on data engineering projects and, following the Kanban methodology, an application was developed to automatically perform integration tests on the different components involved in the creation of a type of update on the platform. The application was also coupled to the tool the team uses for its continuous integration and delivery practices. This report describes the developed project, which successfully achieved the proposed objectives and delivered, as a final product, an application that will serve as a framework to perform integration tests, in an automated way, on the data pipelines for the creation of updates on the platform.
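
    The report itself does not reproduce the internship code; as a loose, hypothetical sketch of the shape such an automated integration test can take, the example below feeds a known event into a single stubbed pipeline stage and asserts on the update message it emits. The stage, field names and message contract are invented for illustration.

```python
# Hypothetical integration test for one stage of an update-creation pipeline.
# The stage, topics and message shape are invented; a real test would run against
# the deployed services rather than an in-process stub.
import json

def enrich_update(raw_event: dict) -> dict:
    """Toy pipeline stage: turns a raw profile-change event into an 'update' message."""
    return {
        "user_id": raw_event["user_id"],
        "type": "profile_update",
        "payload": {"field": raw_event["field"], "new_value": raw_event["new_value"]},
    }

def test_profile_change_produces_update():
    raw = {"user_id": 42, "field": "job_title", "new_value": "Data Engineer"}
    update = enrich_update(raw)

    # Integration-style assertions: the downstream consumer contract must hold.
    assert update["type"] == "profile_update"
    assert update["user_id"] == raw["user_id"]
    assert json.dumps(update)            # message must be serialisable for the queue

if __name__ == "__main__":
    test_profile_change_produces_update()
    print("integration test passed")
```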

    Large Scale Feature Extraction from Linked Web Data

    Data available on the web is evolving, and the way it is represented is changing as well. Linked data has made information on the web understandable to machines. In this thesis we develop a proof-of-concept pipeline that extracts linked data from web crawls and performs feature extraction on it. The end goal of this pipeline is to provide input to machine learning models that are used for credit scoring of companies. The use case focuses on extracting product linked data and connecting it with the company that offers the product. The built solution attempts to detect whether two products from different web sites are the same, so that a single representation can be used for both. Information about companies and their products is represented as a graph, on which network metrics are calculated. Network metrics from multiple web crawls are stored as time series that show how the graph changes over time, and we then calculate derivative features on the values in these time series. The developed pipeline is designed to handle data at the terabyte scale and is built with scalability in mind: we use Apache Spark to process large amounts of data quickly and to be ready if the input data grows a hundredfold.
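
    A minimal PySpark sketch of the idea, with an assumed schema, toy data and a deliberately simple metric (per-company product degree) rather than the thesis' actual code: compute the metric per crawl, then take the crawl-to-crawl difference as a time-series feature.

```python
# Sketch of per-crawl graph metrics and their change over time (assumed schema).
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("linked-data-features").getOrCreate()

# Edges extracted from crawls: one row per (crawl date, company, product) link.
edges = spark.createDataFrame(
    [("2017-01", "acme", "p1"), ("2017-01", "acme", "p2"),
     ("2017-02", "acme", "p1"), ("2017-02", "acme", "p2"), ("2017-02", "acme", "p3")],
    ["crawl", "company", "product"],
)

# Network metric per crawl: how many products a company is linked to (degree).
degree = edges.groupBy("crawl", "company").agg(F.countDistinct("product").alias("degree"))

# Time-series feature: change of the metric between consecutive crawls.
w = Window.partitionBy("company").orderBy("crawl")
features = degree.withColumn("degree_delta", F.col("degree") - F.lag("degree").over(w))

features.show()
```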

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015. As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets selected according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed the necessary social context information for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN and make novel use of customized indices in developing sophisticated analysis algorithms. In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, traditional synchronization methods become expensive and severely impact the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters, to achieve scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
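
    The synchronization strategy of the streaming module can be sketched independently of the actual implementation. In a sparse, high-dimensional setting, broadcasting only the coordinates that changed since the last synchronization keeps communication proportional to the size of the update rather than to the dimensionality; the sketch below (data structures and learning rate are assumptions, not the dissertation's code) shows the idea.

```python
# Sketch of delta-based centroid synchronization for sparse, high-dimensional data.
# Centroids are dicts {dimension_index: weight}; only changed entries are broadcast.

def local_update(centroid, point, lr=0.1):
    """Move a sparse centroid toward a sparse point and return only the delta."""
    delta = {}
    for dim, value in point.items():
        new = centroid.get(dim, 0.0) + lr * (value - centroid.get(dim, 0.0))
        delta[dim] = new - centroid.get(dim, 0.0)
        centroid[dim] = new
    return delta                      # typically far smaller than the full centroid

def apply_broadcast(centroid, delta):
    """Peers apply the broadcast delta instead of receiving the whole centroid."""
    for dim, change in delta.items():
        centroid[dim] = centroid.get(dim, 0.0) + change

if __name__ == "__main__":
    worker_a = {0: 1.0, 7: 0.5}       # local replica of one cluster centroid
    worker_b = dict(worker_a)         # another worker's replica of the same centroid
    delta = local_update(worker_a, {7: 1.5, 12: 2.0})   # sparse incoming point
    apply_broadcast(worker_b, delta)                    # sync via the small delta only
    assert worker_a == worker_b
    print(f"broadcast {len(delta)} changed dimensions instead of {len(worker_a)} total")
```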

    An All-in-One Debugging Approach: Java Debugging, Execution Visualization and Verification

    We devise a widely applicable debugging approach to deal with the prevailing issue that bugs cannot be precisely reproduced in nondeterministic, complex concurrent programs. A distinct, efficient record-and-playback mechanism is designed to record all internal states of an execution, including intermediate results, by injecting our own bytecode, which does not affect the source code. Through a two-step data processing mechanism, these data are aggregated, structured and processed in parallel for high-fidelity replay while keeping the overhead at a satisfactory level. Docker and Git are employed to create a clean environment so that the execution can be repeated with maximum precision when reproducing bugs. In our development, several other forefront technologies, such as MongoDB, Spark and Node.js, are utilized and smoothly integrated for ease of implementation. Altogether, we develop a system for Java Debugging, Execution Visualization and Verification (JDevv), a debugging tool for Java, although our debugging approach can apply to other languages as well. JDevv also offers aggregated and interactive visualization to ease users' code verification.
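
    The record-and-playback idea can be illustrated with a small, language-agnostic sketch (written in Python for brevity; JDevv itself instruments Java bytecode, which is not shown here): nondeterministic calls are recorded on the first run and replayed verbatim afterwards, so a failing execution can be reproduced exactly.

```python
# Concept sketch of record-and-replay for nondeterministic calls (illustrative only).
import functools, json, os, random
from collections import defaultdict

LOG_FILE = "execution_log.json"

def record_replay(func):
    """Record each call's result on the first run; replay the same results later."""
    calls = defaultdict(int)                       # per-function call counter
    log = json.load(open(LOG_FILE)) if os.path.exists(LOG_FILE) else {}

    @functools.wraps(func)
    def wrapper(*args):
        key = f"{func.__name__}#{calls[func.__name__]}"
        calls[func.__name__] += 1
        if key not in log:                         # record mode: capture the real result
            log[key] = func(*args)
            with open(LOG_FILE, "w") as fh:
                json.dump(log, fh)
        return log[key]                            # replay mode: return the recorded result
    return wrapper

@record_replay
def pick_worker(n_workers):
    return random.randrange(n_workers)             # nondeterministic step to reproduce

if __name__ == "__main__":
    print([pick_worker(8) for _ in range(3)])      # same sequence on every run
```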