Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Scientific analyses commonly compose multiple single-process programs into a
dataflow. An end-to-end dataflow of single-process programs is known as a
many-task application. Typically, tools from the HPC software stack are used to
parallelize these analyses. In this work, we investigate an alternate approach
that uses Apache Spark -- a modern big data platform -- to parallelize
many-task applications. We present Kira, a flexible and distributed astronomy
image processing toolkit using Apache Spark. We then use the Kira toolkit to
implement a Source Extractor application for astronomy images, called Kira SE.
With Kira SE as the use case, we study the programming flexibility, dataflow
richness, scheduling capacity and performance of Apache Spark running on the
EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an
equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon
EC2 cloud. Furthermore, we show that by leveraging software originally designed
for big data infrastructure, Kira SE achieves performance competitive with the
C implementation running on the NERSC Edison supercomputer. Our experience with
Kira indicates that emerging big data platforms such as Apache Spark are a
performant alternative for many-task scientific applications.
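The many-task pattern described above (independent single-process tasks composed into a dataflow) can be sketched with Python's standard library. This is an illustrative stand-in, not Kira or Spark code, and `extract_sources` is a hypothetical per-image task:

```python
# Illustrative stand-in for the many-task pattern, NOT Kira/Spark code:
# each "image" is processed by an independent task; a thread pool plays
# the role of the cluster scheduler.
from concurrent.futures import ThreadPoolExecutor

def extract_sources(image):
    # Hypothetical per-image task: count "bright" pixels above a threshold,
    # standing in for a real source-extraction step.
    return sum(1 for px in image if px > 128)

def run_many_task(images, workers=4):
    # The tasks share no state, which is what lets a platform like Spark
    # schedule them freely across a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_sources, images))

print(run_many_task([[0, 200, 255], [10, 20, 30], [130, 140, 7]]))  # -> [2, 0, 2]
```

Because the tasks are independent, data locality (placing each task on the node holding its input file) is the main lever for the speedups the abstract reports.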
Water Distribution System Monitoring and Decision Support Using a Wireless Sensor Network
Water distribution systems comprise labyrinthine networks of pipes, often in poor states of repair, that are buried beneath our city streets and relatively inaccessible. Engineers who manage these systems need reliable data to understand and detect water losses due to leaks or burst events, anomalies in the control of water quality, and the impacts of operational activities (such as pipe isolation, maintenance or repair) on water supply to customers. WaterWiSe is a platform that manages and analyses data from a network of wireless sensor nodes, continuously monitoring hydraulic, acoustic and water quality parameters. WaterWiSe supports many applications, including rolling predictions of water demand and hydraulic state, online detection of events such as pipe bursts, and data mining for identification of longer-term trends. This paper illustrates the advantages of the WaterWiSe platform in resolving operational decisions.
Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World
This report documents the program and the outcomes of GI-Dagstuhl Seminar
16394 "Software Performance Engineering in the DevOps World".
The seminar addressed the problem of performance-aware DevOps. Both DevOps
and performance engineering have been growing trends over the past one to two
years, in no small part due to the rise in importance of identifying
performance anomalies in the operations (Ops) of cloud and big data systems and
feeding these back to the development (Dev). However, so far, the research
community has treated software engineering, performance engineering, and cloud
computing mostly as individual research areas. We aimed to identify
opportunities for cross-community collaboration and to set the path for
long-lasting collaborations towards performance-aware DevOps.
The main goal of the seminar was to bring together young researchers (PhD
students in a later stage of their PhD, as well as PostDocs or Junior
Professors) in the areas of (i) software engineering, (ii) performance
engineering, and (iii) cloud computing and big data to present their current
research projects, to exchange experience and expertise, to discuss research
challenges, and to develop ideas for future collaborations.
Distributed microservices evaluation in edge computing
Current Internet of Things applications rely on centralized cloud computing when processing data. Future applications, such as smart cities, homes, and vehicles, however, generate so much data that cloud computing is unable to provide the required Quality of Service. Thus, edge computing, which pulls data and related computation from distant data centers to the network edge, is seen as the way forward in the evolution of the Internet of Things.
Traditional cloud applications, implemented as centralized server-side monoliths, may prove unfavorable for edge systems due to the distributed nature of the network edge. On the other hand, the recent development practices of containerization and microservices seem like an attractive choice for edge application development. Containerization enables edge computing to use lightweight virtualized resources. Microservices modularize an application at the functional level into small, independent packages.
This thesis studies the impact of containers and distributed microservices on edge computing, based on service execution latency and energy consumption. Evaluation is done by developing a monolithic and a distributed microservice version of a user mobility analysis service. Both services are containerized with Docker and deployed on resource-constrained edge devices to conduct measurements in real-world settings.
Collected results show that centralized monoliths provide lower latencies for small amounts of data, while distributed microservices are faster for large amounts of data. Partitioning services onto multiple edge devices is shown to increase energy consumption significantly.
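The latency crossover reported above can be pictured with a toy model (not the thesis's measurements): a monolith pays no inter-service overhead, while microservices split the work across devices but add a fixed network cost. All the constants below are illustrative assumptions:

```python
# Toy latency model, NOT the thesis's measured data. A monolith processes
# everything locally; k microservices split the work but pay a fixed
# per-request network overhead. Constants are illustrative assumptions.
def monolith_latency(n_items, cost_per_item=1.0):
    return n_items * cost_per_item

def microservice_latency(n_items, k=4, cost_per_item=1.0, hop_overhead=50.0):
    # Work is partitioned across k services; each request pays hop_overhead.
    return hop_overhead + (n_items / k) * cost_per_item

for n in (10, 1000):
    print(n, monolith_latency(n), microservice_latency(n))
# Small n: the fixed hop overhead dominates, so the monolith wins.
# Large n: the k-way partitioning dominates, so microservices win.
```

This matches the qualitative finding: the break-even point moves with network overhead and the degree of partitioning.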
BOSS-LDG: A Novel Computational Framework that Brings Together Blue Waters, Open Science Grid, Shifter and the LIGO Data Grid to Accelerate Gravitational Wave Discovery
We present a novel computational framework that connects Blue Waters, the
NSF-supported, leadership-class supercomputer operated by NCSA, to the Laser
Interferometer Gravitational-Wave Observatory (LIGO) Data Grid via Open Science
Grid technology. To enable this computational infrastructure, we configured,
for the first time, a LIGO Data Grid Tier-1 Center that can submit
heterogeneous LIGO workflows using Open Science Grid facilities. In order to
enable a seamless connection between the LIGO Data Grid and Blue Waters via
Open Science Grid, we utilize Shifter to containerize LIGO's workflow software.
This work represents the first time Open Science Grid, Shifter, and Blue Waters
are unified to tackle a scientific problem and, in particular, it is the first
time a framework of this nature is used in the context of large scale
gravitational wave data analysis. This new framework has been used in the last
several weeks of LIGO's second discovery campaign to run the most
computationally demanding gravitational wave search workflows on Blue Waters,
and accelerate discovery in the emergent field of gravitational wave
astrophysics. We discuss the implications of this novel framework for a wider
ecosystem of High Performance Computing users.
Comment: 10 pages, 10 figures. Accepted as a Full Research Paper to the 13th IEEE International Conference on eScience.
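Shifter's role in the framework can be pictured with a short sketch. The image name and script below are hypothetical; only the `shifterimg pull` and `shifter --image=` invocations follow Shifter's documented CLI:

```shell
# Hypothetical sketch: staging a Docker image with Shifter and running a
# containerized workflow step inside it on an HPC compute node.
shifterimg pull docker:example/ligo-workflow:v1        # stage the image on the system
shifter --image=docker:example/ligo-workflow:v1 ./run_search.sh
```

On a real system these commands would typically be wrapped in the batch scheduler's job script; the point is that the containerized LIGO software stack runs unmodified on the supercomputer.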
Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Data collection for scientific applications is increasing exponentially and
is forecasted to soon reach peta- and exabyte scales. Applications which
process and analyze scientific data must be scalable and focus on execution
performance to keep pace. In the field of radio astronomy, in addition to
increasingly large datasets, tasks such as the identification of transient
radio signals from extrasolar sources are computationally expensive. We present
a scalable approach to radio pulsar detection written in Scala that
parallelizes candidate identification to take advantage of in-memory task
processing using Apache Spark on a YARN distributed system. Furthermore, we
introduce a novel automated multiclass supervised machine learning technique
that we combine with feature selection to reduce the time required for
candidate classification. Experimental testing on a Beowulf cluster with 15
data nodes shows that the parallel implementation of the identification
algorithm offers a speedup of up to 5x over a similar multithreaded
implementation. Further, we show that the combination of automated multiclass
classification and feature selection speeds up the execution performance of the
RandomForest machine learning algorithm by an average of 54% with less than a
2% average reduction in the algorithm's ability to correctly classify pulsars.
The generalizability of these results is demonstrated by using two real-world
radio astronomy data sets.
Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages.
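The paper pairs feature selection with RandomForest classification to cut training time. As a generic illustration of the selection half of that idea (not the paper's actual pipeline), the sketch below ranks features by a simple class-separation score and keeps the top k; the scoring rule and all names are assumptions:

```python
# Generic feature-selection illustration, NOT the paper's pipeline:
# score each feature by how far apart its per-class means are, keep the
# k highest-scoring features, and train the classifier on those alone.
def feature_scores(X, y):
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        # Assumed scoring rule: absolute difference of per-class means.
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def select_top_k(X, y, k):
    scores = feature_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    keep = sorted(ranked[:k])  # indices of the k most separating features
    return [[row[j] for j in keep] for row in X], keep
```

Training on the reduced matrix is what yields the runtime savings; the paper's result is that a careful choice of k costs under 2% in classification accuracy.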
Sensor Networks for Monitoring and Control of Water Distribution Systems
Water distribution systems present a significant challenge for structural monitoring. They comprise a complex network of pipelines buried underground that are relatively inaccessible. Maintaining the integrity of these networks is vital for providing clean drinking water to the general public. There is a need for in-situ, on-line monitoring of water distribution systems in order to facilitate efficient management and operation. In particular, it is important to detect and localize pipe failures soon after they occur, and pre-emptively identify ‘hotspots’, or areas of the distribution network that are more likely to be susceptible to structural failure. These capabilities are vital for reducing the time taken to identify and repair failures and hence, mitigating impacts on water supply.
WaterWiSe is a platform that manages and analyses data from a network of intelligent wireless sensor nodes, continuously monitoring hydraulic, acoustic and water quality parameters. WaterWiSe supports many applications including dynamic prediction of water demand and hydraulic state, online detection of events such as pipe bursts, and data mining for identification of longer-term trends. This paper describes the WaterWiSe@SG project in Singapore, focusing on the use of WaterWiSe as a tool for monitoring, detecting and predicting abnormal events that may be indicative of structural pipe failures, such as bursts or leaks.
Singapore-MIT Alliance for Research and Technology, Center for Environmental Sensing and Modeling.
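Online event detection of the kind described above can be sketched generically as a rolling anomaly test on a sensor stream. This is an illustrative sketch, not WaterWiSe's detector; the window size and threshold are assumptions:

```python
# Illustrative sketch only, NOT WaterWiSe's detector: flag a reading that
# deviates sharply from the rolling mean of recent history, the generic
# shape of online burst detection on a pressure-sensor stream.
from collections import deque
import math

def burst_alarms(readings, window=5, threshold=3.0):
    history = deque(maxlen=window)
    alarms = []
    for i, x in enumerate(readings):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((v - mean) ** 2 for v in history) / window
            std = math.sqrt(var)
            # Alarm when the reading is many standard deviations away
            # from the recent baseline (a sudden pressure drop, say).
            if std > 0 and abs(x - mean) / std > threshold:
                alarms.append(i)
        history.append(x)
    return alarms

print(burst_alarms([50, 50, 51, 49, 50, 50, 50, 20, 50]))  # -> [7]
```

A production system would combine several such signals (hydraulic, acoustic) and add localization, but the rolling-baseline test captures the core idea of detecting events soon after they occur.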
Optimised Method of Resource Allocation for Hadoop on Cloud
Many case studies have shown that the data generated in industry and academia are growing rapidly and are difficult to store in existing database systems. Internet-based applications in sectors such as finance and health care are a further source of massive data. The smart grid is a technology that delivers energy in an optimal manner; phasor measurement units (PMUs) installed in the smart grid monitor critical power paths and also generate massive volumes of sample data. The parallel detrended fluctuation analysis (PDFA) algorithm enables fast detection of events from PMU samples. The MapReduce model simplifies storing and analyzing these events, and Hadoop is an open-source implementation of the MapReduce framework. Many cloud service providers (CSPs) now offer Hadoop services, making it easy for users to run Hadoop applications in the cloud. However, it remains the user's responsibility to estimate the time and resources required to complete a job within its deadline. In this paper, machine learning techniques, namely locally weighted linear regression and the parallel glowworm swarm optimization (GSO) algorithm, are used to estimate the resources and job completion time.
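Locally weighted linear regression, one of the two estimators mentioned above, has a compact closed form in one dimension. The sketch below is a generic illustration rather than the paper's implementation; the Gaussian bandwidth `tau` and the toy data are assumptions:

```python
# Generic locally weighted linear regression in 1-D, NOT the paper's code:
# fit a weighted least-squares line around the query point, with Gaussian
# weights so that nearby training points (e.g. similar past Hadoop jobs)
# dominate the prediction.
import math

def lwlr_predict(xs, ys, x_query, tau=1.0):
    # Gaussian kernel weights centered on the query point.
    w = [math.exp(-((x - x_query) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    num = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    den = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    slope = num / den if den else 0.0
    return ybar + slope * (x_query - xbar)

# Toy data: job completion time grows linearly with input size.
print(lwlr_predict([1, 2, 3, 4], [2, 4, 6, 8], 2.5))  # ≈ 5.0 for linear data
```

The bandwidth `tau` controls the locality of the fit: small values track nonlinear trends in past job runs, large values approach ordinary least squares.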