The DeepHealth Toolkit: A Unified Framework to Boost Biomedical Applications
Given the overwhelming impact of machine learning over the last decade, several libraries and frameworks have been developed in recent years to simplify the design and training of neural networks, providing array-based programming, automatic differentiation, and user-friendly access to hardware accelerators. None of those tools, however, was designed with native and transparent support for Cloud Computing or heterogeneous High-Performance Computing (HPC). The DeepHealth Toolkit is an open-source Deep Learning toolkit aimed at boosting the productivity of data scientists operating in the medical field by providing a unified framework for the distributed training of neural networks, one able to leverage hybrid HPC and cloud environments in a way that is transparent to the user. The toolkit is composed of a Computer Vision library, a Deep Learning library, and a front-end for non-expert users; all of the components are focused on the medical domain, but they are general purpose and can be applied to any other field. In this paper, the principles driving the design of the DeepHealth libraries are described, along with details about the implementation and the interaction between the different elements composing the toolkit. Finally, experiments on common benchmarks demonstrate the efficiency of each separate component and of the DeepHealth Toolkit overall.
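For illustration, the sketch below composes and builds a small network in the style of the EDDL public examples; the header path, layer constructors, and build signature follow those examples and should be read as assumptions rather than the toolkit's definitive API.

    #include <eddl/apis/eddl.h>  // header path as in the public EDDL examples (assumed)

    using namespace eddl;

    int main() {
        // Compose a small multilayer perceptron from layer constructors.
        layer in  = Input({784});
        layer h   = ReLu(Dense(in, 128));
        layer out = Softmax(Dense(h, 10));
        model net = Model({in}, {out});

        // Bind optimizer, losses, metrics, and a computing service; the
        // last argument selects the hardware target.
        build(net,
              sgd(0.01f),                 // optimizer
              {"soft_cross_entropy"},     // losses
              {"categorical_accuracy"},   // metrics
              CS_CPU());                  // computing service: multi-core CPU

        summary(net);                     // print the resulting topology
        return 0;
    }

The computing service passed to build is what decouples the model definition from the hardware back end, which is the transparency the abstract refers to.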
Analyzing European Deep-Learning libraries with Industry Standard Benchmark
Over the past decade, machine learning (ML) has revolutionized numerous domains of our daily life. Nowadays, deep learning (DL) algorithms are the central focus of modern ML systems. As a result, we are witnessing an impressive surge of DL libraries and customized hardware designs that can efficiently support these compute-intensive techniques. The ever-increasing development of new libraries and hardware designs raises a new challenge: how to create a fair comparison between the various implementations of DL libraries exploiting different hardware designs. Being able to compare these solutions is key to driving the design of new features and to better exploiting different hardware designs. To improve DL libraries, we need to adopt a common benchmark for evaluating the many DL frameworks in terms of accuracy and performance. A standardized benchmark would produce a fair and comprehensive evaluation of all the available methods, paving the way for better hardware-specific designs. However, benchmarking ML methods presents a number of unique challenges. First, some optimization techniques improve training accuracy at the cost of increasing "time to solution". Also, training is a stochastic process, so "time to solution" exhibits high variability. Last, there are many software and hardware systems, and it is not straightforward to evaluate them fairly under the same parameters. The MLPerf training benchmark defines guidelines to mitigate these challenges. MLPerf's mission is to build fair and useful benchmarks for measuring the performance of ML hardware, software, and services. The benchmark measures how fast systems can train models to a target quality. The first release (v0.5) contains seven DL suites across four domains: (1) vision, (2) language, (3) commerce, and (4) research. Each suite is defined by a dataset, a quality target, and a reference implementation model. In this work, we present an image-classification evaluation based on the MLPerf benchmark for three DL libraries: Keras-TF, PyTorch, and the European Distributed Deep Learning Library (EDDLL). Keras and PyTorch have become very popular and widely used libraries in the DL field, while EDDLL is a novel library being developed as part of the European DeepHealth project. Our goal is to perform a unified performance evaluation and characterization of these DL frameworks using the High-Performance Computing (HPC) infrastructures of the BSC. The present work aims to help improve the EDDL library: we dive into its core implementation to identify potential bottlenecks and explore optimization opportunities. In this way, this work lays the foundations for the EDDL CPU-optimized version, adapted to the HPC hardware infrastructure without compromising the accuracy results of the benchmark. As a result, the EDDL CPU-optimized version achieves a 2x speedup over the baseline version executed on Intel Xeon Platinum processors.
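To make the "time to solution" metric concrete, the sketch below times training until a target quality is first reached, rather than for a fixed number of epochs. The training and evaluation hooks are simulated stand-ins written for this summary, not MLPerf or EDDLL code, and the 0.749 top-1 target is an assumed example value.

    #include <chrono>
    #include <cstdio>

    // Simulated stand-ins for a framework's training hooks (hypothetical):
    // each "epoch" nudges the validation accuracy upward.
    static double g_accuracy = 0.0;
    static void   train_epoch() { g_accuracy += 0.057; }
    static double evaluate()    { return g_accuracy; }

    // Time-to-solution: wall-clock seconds until the model first reaches
    // the target quality, which is what the benchmark reports.
    static double time_to_quality(double target, int max_epochs) {
        auto start = std::chrono::steady_clock::now();
        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            train_epoch();
            if (evaluate() >= target) {
                std::chrono::duration<double> dt =
                    std::chrono::steady_clock::now() - start;
                return dt.count();
            }
        }
        return -1.0;  // target not reached within the epoch budget
    }

    int main() {
        // Because training is stochastic, a real benchmark averages over
        // several runs; one simulated run suffices to illustrate the metric.
        double t = time_to_quality(0.749, 100);  // assumed top-1 accuracy target
        std::printf("time to quality: %.6f s\n", t);
        return 0;
    }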
GenArchBench: A genomics benchmark suite for Arm HPC processors
Arm usage has grown substantially in the High-Performance Computing (HPC) community. The Japanese supercomputer Fugaku, powered by Arm-based A64FX processors, held the top position on the Top500 list between June 2020 and June 2022, and currently sits in fourth position. The recently released 7th generation of Amazon EC2 instances for compute-intensive workloads (C7g) is also powered by Arm Graviton3 processors. Projects like the European Mont-Blanc and the U.S. DOE/NNSA Astra are further examples of Arm's expansion into HPC. In parallel, over the last decade, the rapid improvement of genomic sequencing technologies and the exponential growth of sequencing data have created a significant bottleneck on the computational side. While most genomics applications have been thoroughly tested and optimized for x86 systems, only a few are prepared to perform efficiently on Arm machines. Moreover, these applications do not exploit the newly introduced Scalable Vector Extension (SVE).
This paper presents GenArchBench, the first genome analysis benchmark suite targeting Arm architectures. We have selected computationally demanding kernels from the most widely used tools in genome data analysis and ported them to the Arm-based A64FX and Graviton3 processors. Overall, the GenArch benchmark suite comprises 13 multi-core kernels from critical stages of widely used genome analysis pipelines, including base-calling, read mapping, variant calling, and genome assembly. Our benchmark suite includes different input data sets per kernel (small and large), each with a corresponding regression test to automatically verify the correctness of each execution. Moreover, the ports feature the use of the novel Arm SVE instructions, algorithmic and code optimizations, and the exploitation of Arm-optimized libraries. We present the optimizations implemented in each kernel, together with a detailed evaluation and comparison of their performance on four different HPC machines (i.e., A64FX, Graviton3, Intel Xeon Skylake Platinum, and AMD EPYC Rome). Overall, the experimental evaluation shows that Graviton3 outperforms the other machines on average. Moreover, we observed that the performance of the A64FX is significantly constrained by its small memory hierarchy and its latencies. Additionally, as a proof of concept, we study the performance of a production-ready tool that exploits two of the ported and optimized genomic kernels.
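To give a flavor of the vector-length-agnostic style that SVE porting encourages, here is a minimal sketch using Arm ACLE intrinsics; it is an illustrative kernel written for this summary, not code from GenArchBench. It counts the positions at which two DNA sequences carry the same base; the predicate from svwhilelt covers the loop tail, so the same source runs unchanged on the 512-bit vectors of the A64FX and the 256-bit vectors of Graviton3.

    #include <arm_sve.h>
    #include <stdint.h>
    #include <stdio.h>

    // Illustrative vector-length-agnostic kernel (not from GenArchBench):
    // count positions where two DNA sequences carry the same base.
    uint64_t count_matches(const uint8_t *a, const uint8_t *b, int64_t n) {
        uint64_t matches = 0;
        for (int64_t i = 0; i < n; i += svcntb()) {  // svcntb(): bytes per SVE vector
            svbool_t  pg = svwhilelt_b8(i, n);       // active lanes; handles the tail
            svuint8_t va = svld1_u8(pg, a + i);
            svuint8_t vb = svld1_u8(pg, b + i);
            svbool_t  eq = svcmpeq_u8(pg, va, vb);   // lanes where the bases agree
            matches += svcntp_b8(pg, eq);            // count the matching lanes
        }
        return matches;
    }

    int main(void) {
        const uint8_t r1[] = "ACGTACGTAC";
        const uint8_t r2[] = "ACGTTCGAAC";
        printf("%llu matching bases\n",
               (unsigned long long)count_matches(r1, r2, 10));
        return 0;
    }

Because the vector length is queried at run time rather than hard-coded, no scalar epilogue or per-machine rebuild is needed, which is precisely what makes SVE ports portable across the machines evaluated here.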
The DeepHealth HPC Infrastructure
This chapter presents the DeepHealth HPC toolkit for the efficient execution of deep learning (DL) medical applications on HPC and cloud-computing infrastructures featuring many-core, GPU, and FPGA acceleration devices. The toolkit provides the European Computer Vision Library and the European Distributed Deep Learning Library (EDDL), both also developed in the DeepHealth project, with the mechanisms to distribute and parallelize DL operations on HPC and cloud infrastructures in a fully transparent way. The toolkit implements workflow managers that orchestrate HPC workloads for the efficient parallelization of EDDL training operations on HPC and cloud infrastructures, and it includes parallel programming models for the efficient execution of EDDL inference and training operations on many-core, GPU, and FPGA acceleration devices.
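To make the "fully transparent" claim concrete, the sketch below retargets the same EDDL network across hardware by changing only the computing-service argument. The CS_CPU/CS_GPU/CS_FPGA constructors are named as in the public EDDL headers and should be read as assumptions; the workflow managers the chapter describes sit above this call.

    #include <eddl/apis/eddl.h>
    #include <cstring>

    using namespace eddl;

    int main(int argc, char **argv) {
        layer in  = Input({784});
        layer out = Softmax(Dense(ReLu(Dense(in, 128)), 10));
        model net = Model({in}, {out});

        // The network definition above never changes; only the computing
        // service does (constructor names assumed from the EDDL headers).
        compserv cs = CS_CPU();                                            // many-core CPU
        if (argc > 1 && std::strcmp(argv[1], "gpu")  == 0) cs = CS_GPU({1});   // one GPU
        if (argc > 1 && std::strcmp(argv[1], "fpga") == 0) cs = CS_FPGA({1});  // one FPGA

        build(net, sgd(0.01f),
              {"soft_cross_entropy"}, {"categorical_accuracy"}, cs);
        summary(net);
        return 0;
    }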