A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines
In the past few years, neuroimaging has entered the Big Data era due to the
joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engine has emerged in this field, and several alternatives remain available. We compare two popular Big Data engines with
Python APIs, Apache Spark and Dask, for their runtime performance in processing
neuroimaging pipelines. Our evaluation uses two synthetic pipelines processing
the 81GB BigBrain image, and a real pipeline processing anatomical data from
more than 1,000 subjects. We benchmark these pipelines using various
combinations of task durations, data sizes, and numbers of workers, deployed on
an 8-node (8 cores each) compute cluster in Compute Canada's Arbutus cloud. We evaluate PySpark's RDD API against Dask's Bag, Delayed, and Futures APIs. Results
show that despite slight differences between Spark and Dask, both engines
perform comparably. However, Dask pipelines risk being limited by Python's GIL
depending on task type and cluster configuration. In all cases, the major
limiting factor was data transfer. While either engine is suitable for neuroimaging pipelines, more effort should be invested in reducing data transfer time.

Comment: 10 pages, 15 figures, 1 table. To appear in the proceedings of the 14th WORKS Workshop on Topics in Workflows in Support of Large-Scale Science, 17 November 2019, Denver, CO, US.
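To make the compared APIs concrete, here is a minimal sketch, under assumptions, of the same embarrassingly parallel map over image blocks written once with PySpark's RDD API and once with Dask's Bag. This is not the paper's benchmark code; `load_block` and `process_block` are hypothetical stand-ins for the pipeline's I/O and compute stages.

```python
# Minimal sketch (not the paper's benchmark code): one embarrassingly
# parallel map expressed against both engines. `load_block` and
# `process_block` are hypothetical stand-ins for the pipeline's I/O
# and compute stages.

def load_block(i):
    # Stand-in for reading one image block from shared storage.
    return list(range(i * 1000, (i + 1) * 1000))

def process_block(block):
    # Stand-in for a CPU-bound processing step on one block.
    return sum(block)

# --- PySpark, RDD API ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), numSlices=8)
spark_results = rdd.map(load_block).map(process_block).collect()

# --- Dask, Bag API ---
import dask.bag as db

bag = db.from_sequence(range(8), npartitions=8)
dask_results = bag.map(load_block).map(process_block).compute()

assert spark_results == dask_results
```

Dask Bag defaults to a process-based scheduler, which sidesteps the GIL for pure-Python tasks; thread-based configurations are where the GIL contention noted in the abstract can appear.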
CapillaryX: A Software Design Pattern for Analyzing Medical Images in Real-time using Deep Learning
Recent advances in digital imaging, e.g., an increased number of pixels
captured, have meant that the volume of data to be processed and analyzed from
these images has also increased. Deep learning algorithms are state-of-the-art
for analyzing such images, given their high accuracy when trained with a large volume of data. Nevertheless, such analysis requires considerable
computational power, making such algorithms time- and resource-demanding. Such
high demands can be met by using third-party cloud service providers. However,
analyzing medical images using such services raises several legal and privacy
challenges and does not necessarily provide real-time results. This paper presents a computing architecture that analyzes medical images locally, in parallel, and in real time using deep learning, thus avoiding the legal and privacy challenges stemming from uploading data to a third-party cloud provider. To make local image processing efficient on modern multi-core
processors, we utilize parallel execution to offset the resource-intensive
demands of deep neural networks. We focus on a specific medical-industrial case
study, namely the quantifying of blood vessels in microcirculation images for
which we have developed a working system. It is currently used in an
industrial, clinical research setting as part of an e-health application. Our
results show that our system is approximately 78% faster than its serial system
counterpart and 12% faster than a master-slave parallel system architecture.
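As a rough illustration of the local, parallel approach described above, the sketch below fans images out to one worker process per core, so no data leaves the machine. It is not the CapillaryX implementation, and `quantify_vessels` is a hypothetical placeholder for the deep-learning inference step.

```python
# Minimal sketch of local, parallel image analysis (not the CapillaryX
# implementation). Each worker process handles one image independently,
# so nothing is uploaded to a third-party cloud.
from multiprocessing import Pool

def quantify_vessels(image_path):
    # Hypothetical stand-in for deep-learning inference on one
    # microcirculation image; a real system would load a trained
    # network and return vessel metrics.
    return image_path, len(image_path)

if __name__ == "__main__":
    images = [f"frame_{i:03d}.png" for i in range(16)]
    # One worker per core; each result is consumed as soon as its image
    # finishes, rather than after a serial pass over the whole batch.
    with Pool() as pool:
        for path, metrics in pool.imap_unordered(quantify_vessels, images):
            print(path, metrics)
```

The pool keeps all cores busy by handing each idle worker the next pending image, which offsets the per-image cost of neural-network inference.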
Data redundancy reduction for energy-efficiency in wireless sensor networks: a comprehensive review
Wireless Sensor Networks (WSNs) provide an extraordinary infrastructure for monitoring environmental variations such as climate change, volcanoes, and other natural disasters. In a hostile environment, the sensors' limited energy is one of the crucial concerns in collecting and analyzing accurate data. Moreover, varying environmental conditions, closely spaced neighboring devices, and heavy use of resources such as battery power make redundant data highly likely in WSNs. Reducing this redundancy is therefore necessary both to conserve resources and to preserve information accuracy. In this context, this paper presents a comprehensive review of existing energy-efficient data redundancy reduction schemes for WSNs, along with their benefits and limitations. Data redundancy reduction is classified into three levels: node, cluster head, and sink. Additionally, the paper highlights key open issues and challenges and suggests directions for future research on reducing data redundancy.
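As one concrete example of a node-level scheme of the kind this survey covers, the sketch below implements dead-band filtering, where a node transmits a reading only when it deviates from the last transmitted value by more than a tolerance; the names and threshold are illustrative, not taken from the paper.

```python
# Illustrative node-level redundancy reduction via dead-band filtering:
# a sensor transmits only when a new reading differs from the last
# transmitted value by more than `tolerance`, suppressing near-duplicate
# samples and saving radio energy. Names are illustrative, not from the paper.
def deadband_filter(readings, tolerance):
    """Yield only the readings worth transmitting."""
    last_sent = None
    for value in readings:
        if last_sent is None or abs(value - last_sent) > tolerance:
            last_sent = value
            yield value  # transmit; near-duplicates are suppressed

samples = [20.0, 20.1, 20.05, 21.5, 21.6, 23.0]
print(list(deadband_filter(samples, tolerance=1.0)))  # -> [20.0, 21.5, 23.0]
```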