782 research outputs found

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging big data platforms such as Apache Spark are a performant alternative for many-task scientific applications.
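Kira's code is not reproduced in this listing, but the many-task pattern the abstract describes (mapping an unmodified single-process extractor over a collection of images) can be sketched in plain Python. The `extract_sources` function and its threshold are illustrative stand-ins for a tool such as SExtractor, and the local thread pool stands in for Spark's distributed executors.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_sources(image):
    # Stand-in for a single-process extraction tool: "detect" the
    # pixels that rise above a fixed brightness threshold.
    return [p for p in image if p > 10]

def run_many_task(images, workers=4):
    # Each image is one independent task; mapping the extractor over
    # the collection mirrors the dataflow Kira expresses over a Spark
    # RDD of images (threads here; Spark uses distributed executors).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_sources, images))
```

In the real system the input collection is a distributed dataset and data locality determines which executor processes which image, which is where the reported speedup comes from.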

    Water Distribution System Monitoring and Decision Support Using a Wireless Sensor Network

    Water distribution systems comprise labyrinthine networks of pipes, often in poor states of repair, that are buried beneath our city streets and relatively inaccessible. Engineers who manage these systems need reliable data to understand and detect water losses due to leaks or burst events, anomalies in the control of water quality and the impacts of operational activities (such as pipe isolation, maintenance or repair) on water supply to customers. WaterWiSe is a platform that manages and analyses data from a network of wireless sensor nodes, continuously monitoring hydraulic, acoustic and water quality parameters. WaterWiSe supports many applications including rolling predictions of water demand and hydraulic state, online detection of events such as pipe bursts, and data mining for identification of longer-term trends. This paper illustrates the advantage of the WaterWiSe platform in resolving operational decisions.

    Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World

    This report documents the program and the outcomes of GI-Dagstuhl Seminar 16394 "Software Performance Engineering in the DevOps World". The seminar addressed the problem of performance-aware DevOps. Both DevOps and performance engineering have been growing trends over the past one to two years, in no small part due to the rise in importance of identifying performance anomalies in the operations (Ops) of cloud and big data systems and feeding these back to the development (Dev). However, so far, the research community has treated software engineering, performance engineering, and cloud computing mostly as individual research areas. We aimed to identify cross-community collaboration, and to set the path for long-lasting collaborations towards performance-aware DevOps. The main goal of the seminar was to bring together young researchers (PhD students in a later stage of their PhD, as well as PostDocs or Junior Professors) in the areas of (i) software engineering, (ii) performance engineering, and (iii) cloud computing and big data to present their current research projects, to exchange experience and expertise, to discuss research challenges, and to develop ideas for future collaborations.

    Artificial intelligence for the support of regulator decision making


    Distributed microservices evaluation in edge computing

    Abstract. Current Internet of Things applications rely on centralized cloud computing when processing data. Future applications, such as smart cities, homes, and vehicles, however, generate so much data that cloud computing is unable to provide the required Quality of Service. Thus, edge computing, which pulls data and related computation from distant data centers to the network edge, is seen as the way forward in the evolution of the Internet of Things. Traditional cloud applications, implemented as centralized server-side monoliths, may prove unfavorable for edge systems due to the distributed nature of the network edge. On the other hand, the recent development practices of containerization and microservices seem like an attractive choice for edge application development. Containerization enables edge computing to use lightweight virtualized resources, while microservices modularize an application at the functional level into small, independent packages. This thesis studies the impact of containers and distributed microservices on edge computing, based on service execution latency and energy consumption. Evaluation is done by developing a monolithic and a distributed microservice version of a user-mobility analysis service. Both services are containerized with Docker and deployed on resource-constrained edge devices to conduct measurements in real-world settings. The collected results show that centralized monoliths provide lower latencies for small amounts of data, while distributed microservices are faster for large amounts of data. Partitioning services onto multiple edge devices is shown to increase energy consumption significantly.
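The latency crossover this thesis reports can be illustrated with a toy cost model (all constants below are assumptions for illustration, not measurements from the thesis): a monolith processes everything on one device, while distributed microservices split the work across devices but pay a fixed inter-service network overhead per request.

```python
def monolith_latency(n, per_item=1.0):
    # All work runs on one device: no network hops between services.
    return per_item * n

def microservice_latency(n, per_item=1.0, devices=4, hop_overhead=100.0):
    # Work is split across `devices` edge nodes, but each request pays
    # a fixed network cost (serialization + transfer between services).
    return per_item * n / devices + hop_overhead
```

For small inputs the fixed hop overhead dominates and the monolith wins; once the per-item work saved by parallel devices exceeds that overhead, the distributed version is faster, matching the reported results.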

    BOSS-LDG: A Novel Computational Framework that Brings Together Blue Waters, Open Science Grid, Shifter and the LIGO Data Grid to Accelerate Gravitational Wave Discovery

    We present a novel computational framework that connects Blue Waters, the NSF-supported, leadership-class supercomputer operated by NCSA, to the Laser Interferometer Gravitational-Wave Observatory (LIGO) Data Grid via Open Science Grid technology. To enable this computational infrastructure, we configured, for the first time, a LIGO Data Grid Tier-1 Center that can submit heterogeneous LIGO workflows using Open Science Grid facilities. In order to enable a seamless connection between the LIGO Data Grid and Blue Waters via Open Science Grid, we utilize Shifter to containerize LIGO's workflow software. This work represents the first time Open Science Grid, Shifter, and Blue Waters are unified to tackle a scientific problem and, in particular, it is the first time a framework of this nature is used in the context of large scale gravitational wave data analysis. This new framework has been used in the last several weeks of LIGO's second discovery campaign to run the most computationally demanding gravitational wave search workflows on Blue Waters, and accelerate discovery in the emergent field of gravitational wave astrophysics. We discuss the implications of this novel framework for a wider ecosystem of High Performance Computing users.Comment: 10 pages, 10 figures. Accepted as a Full Research Paper to the 13th IEEE International Conference on eScience

    Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

    Data collection for scientific applications is increasing exponentially and is forecasted to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets.Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages
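The abstract does not state which feature selection method is combined with RandomForest, but the general idea, shrinking the feature set so the classifier has less work per candidate, can be sketched with a minimal filter-style selector. The variance-threshold ranking below is a generic illustration, not the paper's implementation.

```python
def select_features(rows, k):
    """Keep the k columns with the highest variance, a cheap
    filter-style feature selection step: low-variance columns carry
    little information, so dropping them shrinks the input a
    classifier such as RandomForest must process."""
    n = len(rows)
    cols = list(zip(*rows))

    def variance(col):
        mean = sum(col) / n
        return sum((x - mean) ** 2 for x in col) / n

    ranked = sorted(range(len(cols)), key=lambda i: variance(cols[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # preserve original column order
    return [[row[i] for i in keep] for row in rows], keep
```

In a pipeline like the one described, the reduced rows would then be fed to the classifier, trading a small accuracy loss for faster training and scoring.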

    Sensor Networks for Monitoring and Control of Water Distribution Systems

    Water distribution systems present a significant challenge for structural monitoring. They comprise a complex network of pipelines buried underground that are relatively inaccessible. Maintaining the integrity of these networks is vital for providing clean drinking water to the general public. There is a need for in-situ, on-line monitoring of water distribution systems in order to facilitate efficient management and operation. In particular, it is important to detect and localize pipe failures soon after they occur, and pre-emptively identify ‘hotspots’, or areas of the distribution network that are more likely to be susceptible to structural failure. These capabilities are vital for reducing the time taken to identify and repair failures and hence, mitigating impacts on water supply. WaterWiSe is a platform that manages and analyses data from a network of intelligent wireless sensor nodes, continuously monitoring hydraulic, acoustic and water quality parameters. WaterWiSe supports many applications including dynamic prediction of water demand and hydraulic state, online detection of events such as pipe bursts, and data mining for identification of longer-term trends. This paper describes the WaterWiSe@SG project in Singapore, focusing on the use of WaterWiSe as a tool for monitoring, detecting and predicting abnormal events that may be indicative of structural pipe failures, such as bursts or leaks. Singapore-MIT Alliance for Research and Technology, Center for Environmental Sensing and Modeling
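The online burst detection such a platform performs can be sketched at its simplest as a rolling-mean filter that flags a sudden pressure drop. The detector below is an assumed illustration of the general technique, not WaterWiSe's actual detection logic, and its window size and threshold are arbitrary.

```python
from collections import deque

def burst_detector(window=5, drop_threshold=3.0):
    """Return a streaming check(pressure) function that flags a sample
    falling more than drop_threshold below the rolling mean of the
    last `window` readings, a minimal stand-in for online pipe-burst
    detection on a sensor node."""
    recent = deque(maxlen=window)

    def check(pressure):
        alarm = (len(recent) == window and
                 sum(recent) / window - pressure > drop_threshold)
        recent.append(pressure)
        return alarm

    return check
```

A production detector would also account for diurnal demand patterns and sensor noise; the point here is only the streaming, fixed-memory structure suited to a resource-constrained node.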

    Optimised Method of Resource Allocation for Hadoop on Cloud

    Many case studies have shown that the data generated in industry and academia are growing rapidly and are difficult to store using existing database systems. Internet usage has spawned many applications in sectors such as finance and health care, which are also sources of massive data. The smart grid is a technology that delivers energy in an optimal manner; phasor measurement units (PMUs) installed in a smart grid monitor critical power paths and also generate massive sample data. Using the parallel detrended fluctuation analysis (PDFA) algorithm, events can be detected quickly from PMU samples. Storing and analyzing these events is made easy using the MapReduce model; Hadoop is an open-source implementation of the MapReduce framework. Many cloud service providers (CSPs) now offer Hadoop services, making it easy for users to run their Hadoop applications on the cloud. The major task, however, remains the user's responsibility: estimating the time and resources required to complete a job within its deadline. In this paper, machine learning techniques, namely locally weighted linear regression and the parallel glowworm swarm optimization (GSO) algorithm, are used to estimate the required resources and the job completion time.
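Locally weighted linear regression, one of the two techniques named above, can be sketched in a few lines for the one-dimensional case, for example predicting job completion time from input size. The closed-form weighted least squares solution below is a generic textbook version under a Gaussian kernel, not the paper's implementation.

```python
import math

def lwlr_predict(xs, ys, x_query, tau=1.0):
    """Locally weighted linear regression (1-D): fit a line by
    weighted least squares, weighting training points near x_query
    (e.g. jobs with similar input sizes) more heavily, then evaluate
    the fitted line at x_query."""
    # Gaussian kernel: nearby jobs get weight close to 1, distant ~0.
    w = [math.exp(-(x - x_query) ** 2 / (2 * tau ** 2)) for x in xs]
    # Weighted sums for the normal equations of y = a + b*x.
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x_query
```

Because the fit is recomputed per query, the estimator adapts to local trends in the profiling data, which is why it suits job-time estimation where cost rarely scales linearly over the whole input range.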