7 research outputs found

    Multi-Node Advanced Performance and Power Analysis with Paraver

    Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes. Due to the increasing interest of the High Performance Computing (HPC) community in energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a preliminary performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. Moreover, we show how the same analysis techniques are applicable on different architectures, analyzing the same HPC application running on two clusters, based respectively on Intel Haswell and Arm Cortex-A57 CPUs. The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects, grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara - dichiarazione dei redditi dell’anno 2014”. Peer Reviewed. Postprint (author's final draft).

    Performance and Power Analysis of HPC Workloads on Heterogenous Multi-Node Clusters

    Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, enabling application optimizations. Due to the increasing interest of the High Performance Computing (HPC) community in energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applied on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU. The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara - dichiarazione dei redditi dell’anno 2014”. We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group for supporting us in the smooth integration and testing of our setup within Extrae and Paraver. Peer Reviewed. Postprint (published version).
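The energy-efficiency comparison described above boils down to relating a run's useful work to the power drawn while it executed. A minimal sketch of that correlation, using purely hypothetical power traces and work counts (the function names and numbers are illustrative, not from the study):

```python
# Minimal sketch (hypothetical numbers): correlating performance and power
# samples collected for the same run, as a combined profiling tool would.

def energy_to_solution(power_samples_w, interval_s):
    """Integrate evenly spaced power samples (W) over time -> energy in joules."""
    return sum(power_samples_w) * interval_s

def efficiency_gflops_per_w(total_gflop, power_samples_w, interval_s):
    """Useful work per watt: (GFLOP / runtime) / average power."""
    runtime_s = len(power_samples_w) * interval_s
    avg_power_w = energy_to_solution(power_samples_w, interval_s) / runtime_s
    return (total_gflop / runtime_s) / avg_power_w

# Hypothetical traces: a high-end node vs. a low-power node running the same job.
haswell_power = [180.0] * 100   # 100 samples at 1 s intervals, fast but power-hungry
jetson_power = [9.0] * 800      # 8x slower, but far lower power draw
work_gflop = 5000.0

print(round(efficiency_gflops_per_w(work_gflop, haswell_power, 1.0), 2))  # 0.28
print(round(efficiency_gflops_per_w(work_gflop, jetson_power, 1.0), 2))   # 0.69
```

The toy numbers illustrate the trade-off the abstract studies: the low-power node finishes later yet can come out ahead in GFLOPS/W.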

    Mixed convolutional and long short-term memory network for the detection of lethal ventricular arrhythmia

    Early defibrillation by an automated external defibrillator (AED) is key for the survival of out-of-hospital cardiac arrest (OHCA) patients. ECG feature extraction and machine learning have been successfully used to detect ventricular fibrillation (VF) in AED shock decision algorithms. Recently, deep learning architectures based on 1D Convolutional Neural Networks (CNN) have been proposed for this task. This study introduces a deep learning architecture based on 1D-CNN layers and a Long Short-Term Memory (LSTM) network for the detection of VF. Two datasets were used, one from public repositories of Holter recordings captured at the onset of the arrhythmia, and a second from OHCA patients obtained minutes after the onset of the arrest. Data were partitioned patient-wise into a training set (80%), used to design the classifiers, and a test set (20%), used to report the results. The proposed architecture was compared to 1D-CNN-only deep learners, and to a classical approach based on VF-detection features and a support vector machine (SVM) classifier. The algorithms were evaluated in terms of balanced accuracy (BAC), the unweighted mean of the sensitivity (Se) and specificity (Sp). The BAC, Se, and Sp of the architecture for 4-s ECG segments were 99.3%, 99.7%, and 98.9% for the public data, and 98.0%, 99.2%, and 96.7% for the OHCA data. The proposed architecture outperformed all other classifiers by at least 0.3 points of BAC in the public data, and by 2.2 points in the OHCA data. The architecture met the 95% Sp and 90% Se requirements of the American Heart Association in both datasets for segment lengths as short as 3 s.
This is, to the best of our knowledge, the most accurate VF detection algorithm to date, especially on OHCA data, and it would enable an accurate shock/no-shock diagnosis in a very short time. This study was supported by the Ministerio de Economía, Industria y Competitividad, Gobierno de España (ES) (TEC-2015-64678-R) to UI and EA and by Euskal Herriko Unibertsitatea (ES) (GIU17/031) to UI and EA. The funders, Tecnalia Research and Innovation and Banco Bilbao Vizcaya Argentaria (BBVA), provided support in the form of salaries for authors AP, AA, FAA, CF, EG, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the author contributions section.
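The evaluation metrics above are simple to state precisely. A sketch of how Se, Sp, and BAC follow from a confusion matrix, using illustrative counts (not the study's actual data) chosen to echo the reported percentages:

```python
# Sensitivity (Se), specificity (Sp), and balanced accuracy (BAC, the
# unweighted mean of Se and Sp), computed from confusion-matrix counts.
# The counts below are illustrative, not from the study.

def se_sp_bac(tp, fn, tn, fp):
    se = tp / (tp + fn)   # shockable (VF) segments correctly detected
    sp = tn / (tn + fp)   # non-shockable segments correctly rejected
    bac = (se + sp) / 2.0
    return se, sp, bac

se, sp, bac = se_sp_bac(tp=997, fn=3, tn=989, fp=11)
print(round(se, 3), round(sp, 3), round(bac, 3))  # 0.997 0.989 0.993
```

BAC is the right summary here because shockable and non-shockable segments are imbalanced in practice, and plain accuracy would overweight the majority class.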

    Development of advanced control strategies for Adaptive Optics systems

    Atmospheric turbulence is a fast disturbance that requires a high control frequency. At the same time, celestial objects are faint sources of light, and thus wavefront sensors (WFSs) often work in a low-photon-count regime. These two conditions require a trade-off between a high closed-loop control frequency, to improve disturbance rejection, and a long WFS exposure time, to gather enough photons for the integrated signal to increase the signal-to-noise ratio (SNR); this makes control a delicate yet fundamental aspect of Adaptive Optics (AO) systems. The AO plant and the atmospheric turbulence were formalized as state-space linear time-invariant systems. The full AO system model is the ground upon which a model-based controller can be designed. A Shack-Hartmann wavefront sensor was used to measure the horizontal atmospheric turbulence. The experimental measurements yielded the Cn2 atmospheric structure parameter, which is key to describing the turbulence statistics, and the time series of the Zernike terms. Experimental validation shows that the centroid extraction algorithm implemented on the Jetson GPU outperforms (i.e., is faster than) the CPU implementation on the same hardware. In fact, due to the construction of the Shack-Hartmann wavefront sensor, the intensity image captured by its camera is partitioned into several sub-images, each related to a point of the incoming wavefront. Such sub-images are independent of each other and can be processed concurrently. The AO model is exploited to automatically design an advanced linear-quadratic Gaussian controller with integral action. Experimental evidence shows that the system augmentation approach outperforms both the simple integrator and the integrator filtered with the Kalman predictor, and that it requires fewer parameters to tune.
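The parallelism argument above is worth making concrete: each lenslet sub-image is independent, so its intensity-weighted centroid can be computed concurrently with all the others. A minimal stdlib sketch on toy 3x3 sub-images (not the thesis's GPU implementation):

```python
# Sketch (pure Python, toy data) of why Shack-Hartmann centroiding
# parallelizes well: sub-images are independent, so the intensity-weighted
# centroid of every lenslet spot can be computed concurrently.
from concurrent.futures import ThreadPoolExecutor

def centroid(subimage):
    """Intensity-weighted centroid (x, y) of one lenslet sub-image."""
    total = sum(v for row in subimage for v in row)
    cx = sum(x * v for row in subimage for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(subimage) for v in row) / total
    return cx, cy

# Two toy 3x3 sub-images: a centered spot, and one shifted right by a pixel.
subimages = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
]
with ThreadPoolExecutor() as pool:
    slopes = list(pool.map(centroid, subimages))
print(slopes)  # [(1.0, 1.0), (2.0, 1.0)]
```

The spot displacement in each sub-image is proportional to the local wavefront slope, which is why a per-sub-image centroid is the natural unit of parallel work on a GPU.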

    Configuration and Execution of Computer Vision Algorithms on the Nvidia Jetson TK1 DevKit

    This final degree thesis (Trabajo Fin de Grado, TFG) addresses the evaluation of the NVidia Jetson TK1 development board, a board oriented to the execution of computer vision algorithms through parallel computing on its Graphics Processing Unit (GPU). The board features a Tegra K1 SOC (System on a Chip) that integrates an NVidia Tegra GPU and an ARM Cortex A-15 microprocessor, among other peripherals. The evaluation of the board is carried out from two different perspectives. First, a hardware-level analysis is made to identify the advantages and limitations of the board for computer vision applications; in particular, the use of the OpenCV libraries for stereo vision is evaluated, combined with a graphical environment developed in OpenGL. Then, the execution times of different algorithms are compared to evaluate the respective performance of the board's GPU and CPU (Central Processing Unit). Grado en Ingeniería en Electrónica y Automática Industrial

    Large Scale Constrained Trajectory Optimization Using Indirect Methods

    State-of-the-art direct and indirect methods face significant challenges when solving large-scale constrained trajectory optimization problems. Two main challenges when using indirect methods to solve such problems are difficulties in handling path inequality constraints, and the exponential increase in computation time as the number of states and constraints in the problem increases. The latter challenge affects both direct and indirect methods. A methodology called the Integrated Control Regularization Method (ICRM) is developed for incorporating path constraints into optimal control problems when using indirect methods. ICRM removes the need for multiple constrained and unconstrained arcs and converts constrained optimal control problems into two-point boundary value problems. Furthermore, it also addresses the issue of transcendental control law equations by re-formulating the problem so that it can be solved by existing numerical solvers for two-point boundary value problems (TPBVP). The capabilities of ICRM are demonstrated by using it to solve some representative constrained trajectory optimization problems as well as a five-vehicle problem with path constraints. Regularizing path constraints using ICRM represents a first step towards obtaining high-quality solutions for highly constrained trajectory optimization problems which would generally be considered practically impossible to solve using indirect or direct methods. The Quasilinear Chebyshev Picard Iteration (QCPI) method builds on prior work and uses Chebyshev polynomial series and the Picard iteration combined with the Modified Quasi-linearization Algorithm. The method is developed specifically to utilize parallel computational resources for solving large TPBVPs. The capabilities of the numerical method are validated by solving some representative nonlinear optimal control problems.
The performance of QCPI is benchmarked against single shooting and parallel shooting methods using a multi-vehicle optimal control problem. The results demonstrate that QCPI is capable of leveraging parallel computing architectures and can greatly benefit from implementation on highly parallel architectures such as GPUs. The capabilities of ICRM and QCPI are explored further using a five-vehicle constrained optimal control problem. The scenario models a co-operative, simultaneous engagement of two targets by five vehicles. The problem involves 3DOF dynamic models, control constraints for each vehicle, and a no-fly-zone path constraint. Trade studies are conducted by varying different parameters in the problem to demonstrate smooth transition between constrained and unconstrained arcs. Such transitions would be highly impractical to study using existing indirect methods. The study serves as a demonstration of the capabilities of ICRM and QCPI for solving large-scale trajectory optimization problems. An open-source, indirect trajectory optimization framework is developed with the goal of being a viable contender to state-of-the-art direct solvers such as GPOPS and DIDO. The framework, named beluga, leverages ICRM and QCPI along with traditional indirect optimal control theory. In its current form, as illustrated by the various examples in this dissertation, it has made significant advances in automating the use of indirect methods for trajectory optimization. Following on the path of popular and widely used scientific software projects such as SciPy [1] and Numpy [2], beluga is released under the permissive MIT license [3]. Being an open-source project allows the community to contribute freely to the framework, further expanding its capabilities and allowing faster integration of new advances to the state-of-the-art.
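The Picard iteration at the core of QCPI is a fixed-point sweep: each pass re-integrates the dynamics evaluated on the previous iterate. A toy scalar sketch (a deliberately simple IVP, not the authors' Chebyshev-based implementation) under the assumption of a uniform grid and trapezoidal quadrature:

```python
# Minimal sketch of the Picard iteration that QCPI builds on, applied to the
# toy ODE y' = y, y(0) = 1 on [0, 1], whose fixed point is y(t) = e^t.
# Each sweep computes y_{k+1}(t) = y(0) + integral_0^t f(y_k(s)) ds.

def picard_sweep(y, ts, f):
    """One Picard update via a running trapezoidal integral of f(y_k)."""
    out = [y[0]]
    for i in range(1, len(ts)):
        dt = ts[i] - ts[i - 1]
        out.append(out[-1] + 0.5 * dt * (f(y[i - 1]) + f(y[i])))
    return out

n = 101
ts = [i / (n - 1) for i in range(n)]   # uniform grid on [0, 1]
y = [1.0] * n                          # initial guess: constant
for _ in range(20):                    # sweeps converge toward e^t
    y = picard_sweep(y, ts, lambda v: v)
print(round(y[-1], 3))                 # ~ e = 2.718
```

Each sweep is a quadrature over known values, with no sequential nonlinear solve inside it, which is the property that lets QCPI farm the work out to parallel hardware.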

    Efficient Computing for Three-Dimensional Quantitative Phase Imaging

    Quantitative Phase Imaging (QPI) is a powerful imaging technique for measuring the refractive index distribution of transparent objects such as biological cells and optical fibers. The quantitative, non-invasive approach of QPI provides preeminent advantages in biomedical applications and the characterization of optical fibers. Tomographic Deconvolution Phase Microscopy (TDPM) is a promising 3D QPI method that combines diffraction tomography, deconvolution, and through-focal scanning with object rotation to achieve isotropic spatial resolution. However, due to the large data size, 3D TDPM has a drawback in that it requires extensive computation power and time. In order to overcome this shortcoming, CPU/GPU parallel computing and application-specific embedded systems can be utilized. In this research, OpenMP Tasking and CUDA Streaming with Unified Memory (TSUM) is proposed to speed up the tomographic angle computations in 3D TDPM. TSUM leverages CPU multithreading and GPU computing on a System on a Chip (SoC) with unified memory. Unified memory eliminates data transfer between CPU and GPU memories, which is a major bottleneck in GPU computing. This research presents a speedup of 3D TDPM with TSUM for a large dataset and demonstrates the potential of TSUM in realizing real-time 3D TDPM. M.S.
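The task decomposition described above rests on the observation that each tomographic rotation angle can be processed independently. A stdlib-only sketch of that idea (the per-angle workload and sizes are stand-ins, not the TSUM implementation, which uses OpenMP tasks feeding CUDA streams):

```python
# Sketch (stdlib only, hypothetical workload) of the per-angle task
# decomposition: each tomographic rotation angle is an independent unit of
# work, so angles can be farmed out to a pool of parallel tasks.
from concurrent.futures import ThreadPoolExecutor
import math

def process_angle(angle_deg):
    """Stand-in for one angle's deconvolution work; a real implementation
    would filter and back-project this angle's through-focal stack."""
    return round(math.cos(math.radians(angle_deg)), 6)

angles = [i * 180 / 8 for i in range(8)]   # 8 projection angles over 180 deg
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_angle, angles))
print(len(results))  # 8 independent per-angle results, computed concurrently
```

On a unified-memory SoC the payoff is that each task's input and output buffers are visible to both CPU and GPU, so dispatching a task carries no host-to-device copy cost.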