
    ALOJA: A framework for benchmarking and predictive analytics in Hadoop deployments

    This article presents the ALOJA project and its analytics tools, which leverage machine learning to interpret Big Data benchmark performance data and guide tuning. ALOJA is part of a long-term collaboration between BSC and Microsoft to automate the characterization of the cost-effectiveness of Big Data deployments, currently focusing on Hadoop. Hadoop presents a complex run-time environment, where costs and performance depend on a large number of configuration choices. The ALOJA project has created an open, vendor-neutral repository featuring over 40,000 Hadoop job executions and their performance details. The repository is accompanied by a test-bed and tools to deploy and evaluate the cost-effectiveness of different hardware configurations, parameters and Cloud services. Despite early success within ALOJA, a comprehensive study requires automation of the modeling procedures to allow analysis of large and resource-constrained search spaces. The predictive analytics extension, ALOJA-ML, provides an automated system for knowledge discovery by modeling environments from observed executions. The resulting models can forecast execution behaviors, predicting execution times for new configurations and hardware choices. This also enables model-based anomaly detection and efficient benchmark guidance by prioritizing executions. In addition, the community can benefit from the ALOJA data sets and framework to improve the design and deployment of Big Data applications. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595). This work is partially supported by the Ministry of Economy of Spain under contracts TIN2012-34557 and 2014SGR1051.
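
    The abstract does not show ALOJA-ML's modeling code; as a hedged sketch of the general idea, the following Python snippet trains a regression model on hypothetical execution records (configuration features mapped to execution time) and uses it to rank unseen configurations. The CSV file, column names and model choice are illustrative assumptions, not the actual ALOJA schema or method.

```python
# Hypothetical sketch of configuration-to-runtime modeling in the spirit of ALOJA-ML.
# The CSV file and column names are assumptions, not the real ALOJA data layout.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("hadoop_executions.csv")           # assumed: one row per benchmark execution
features = ["mappers", "io_buffer_mb", "block_size_mb", "disk_type", "network"]
X = pd.get_dummies(df[features])                     # one-hot encode categorical choices
y = df["exec_time_s"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("MAE (s):", mean_absolute_error(y_te, model.predict(X_te)))

# Rank distinct configurations by predicted runtime to prioritize which benchmarks to run next.
candidates = X.drop_duplicates()
ranked = candidates.assign(pred_s=model.predict(candidates)).sort_values("pred_s")
print(ranked.head())
```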

    Main memory in HPC: do we need more, or could we live with less?

    An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of the memory footprints of production HPC applications is complex and requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward the adoption of this novel technology in the HPC domain. This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, by the Spanish Government through the Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union's Horizon 2020 research and innovation programme under the ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from the High Performance Computing Center Stuttgart for their technical support.
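
    To make the capacity argument concrete, here is a minimal back-of-the-envelope sketch; the node capacity, reserve, and per-core footprints are illustrative placeholder values, not figures reported in the study. It checks how many ranks with a given per-core footprint would fit into an assumed 16 GB of 3D-stacked memory per node.

```python
# Illustrative capacity check for an assumed 16 GB of 3D-stacked memory per node.
# All numbers below are placeholders, not measurements from the study.
NODE_3D_MEMORY_GB = 16.0

def max_ranks_per_node(footprint_mb_per_core: float, system_reserve_gb: float = 2.0) -> int:
    """Number of ranks whose footprints fit in the 3D memory after an OS/runtime reserve."""
    usable_gb = NODE_3D_MEMORY_GB - system_reserve_gb
    return int(usable_gb * 1024 // footprint_mb_per_core)

for name, footprint_mb in [("hundreds-of-MB-per-core app", 300), ("GB-per-core app", 2048)]:
    print(f"{name}: up to {max_ranks_per_node(footprint_mb)} ranks per node")
```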

    Development of an oceanographic application in HPC

    High Performance Computing (HPC) is used for running advanced application programs efficiently, reliably, and quickly. In earlier decades, the performance of HPC applications was evaluated based on speed, thread scalability, and the memory hierarchy. Now, it is essential to also consider the energy or power consumed by the system while executing an application. In fact, high power consumption is one of the biggest problems for the High Performance Computing (HPC) community and one of the major obstacles for exascale system design. The new generations of HPC systems aim to achieve exaflop performance and will demand even more energy for processing and cooling. Nowadays, the growth of HPC systems is limited by energy issues. Recently, many research centers have focused their attention on the automatic tuning of HPC applications, which requires a wide study of HPC applications in terms of power efficiency. In this context, this work studies an oceanographic application, named OceanVar, which implements a Domain Decomposition based 4D Variational model (DD-4DVar), one of the most commonly used HPC applications, evaluating not only the classic aspects of performance but also aspects related to power efficiency in different case studies. This work was carried out at the Barcelona Supercomputing Center (BSC), Spain, within the Mont-Blanc project, performing the tests first on an HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology. The thesis first explains the concept of data assimilation, the context in which it is developed, and gives a brief description of the 4DVar mathematical model. After a close examination of the problem, the Matlab description of the data-assimilation problem was ported to a sequential version in the C language. Secondly, after identifying the most time-consuming computational kernels, a parallel version of the application was developed in a multiprocessor programming style using the MPI (Message Passing Interface) protocol. The experimental results show that, when running on the HCA server (an Intel architecture), the efficiency of the two most expensive functions remains approximately 80% as the number of processes grows. When running on the ARM architecture, specifically on the Thunder mini-cluster, the observed trend is instead a superlinear speedup, which in our case can be explained by a more efficient use of resources (cache memory accesses) compared with the sequential case. The second part of this work presents an analysis of some aspects of the application that impact energy efficiency. After a brief discussion of the energy consumption characteristics of the Thunder chip in the current technological landscape, the energy consumption of the Thunder mini-cluster was measured with a power meter, the Yokogawa Power Meter, in order to obtain an overview of the power-to-solution of this application, to be used as a baseline for subsequent analyses with other parallel programming styles. Finally, a comprehensive performance evaluation, aimed at assessing the quality of the MPI parallelization, is conducted using a suitable performance tool named Paraver, developed by BSC.
Paraver is a performance analysis and visualisation tool that can be used to analyse MPI, threaded or mixed-mode programmes, and it is key to performing parallel profiling and optimising code for High Performance Computing. A set of graphical representations of these statistics makes it easy for a developer to identify performance problems. Some of the problems that can be easily identified are load-imbalanced decompositions, excessive communication overheads, and a poor average number of floating-point operations per second. Paraver can also report statistics based on hardware counters, which are provided by the underlying hardware. This project aimed to use Paraver configuration files to allow certain metrics to be analysed for this application. To explain the performance trend obtained on the Thunder mini-cluster, traces were extracted from various case studies, and the results achieved match what was expected, namely a drastic drop in cache misses from the ppn (processes per node) = 1 case to the ppn = 16 case. This, in some way, explains the more efficient use of cluster resources as the number of processes increases.
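
    As a minimal illustration of the scalability metrics discussed above (the runtimes below are invented placeholders, not measurements from the thesis), the following Python sketch computes speedup and parallel efficiency per process count and flags superlinear behaviour:

```python
# Speedup/efficiency bookkeeping; all timing values are placeholders, not thesis results.
serial_time_s = 100.0
parallel_times_s = {2: 52.0, 4: 26.5, 8: 12.0, 16: 5.8}   # assumed runtime per process count

for procs, t in sorted(parallel_times_s.items()):
    speedup = serial_time_s / t
    efficiency = speedup / procs
    tag = "superlinear" if speedup > procs else ""
    print(f"p={procs:2d}  speedup={speedup:5.2f}  efficiency={efficiency:6.1%}  {tag}")
```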

    “Dust in the wind...”, deep learning application to wind energy time series forecasting

    To balance electricity production and demand, it is necessary to use prediction techniques extensively. Renewable energy, due to its intermittency, increases the complexity and uncertainty of forecasting, and the resulting accuracy impacts all the players in electricity systems around the world, such as generators, distributors, retailers, and consumers. Wind forecasting follows two major approaches: using numerical weather prediction models or working from pure time series input. Deep learning is emerging as a new method that can be used for wind energy prediction. This work develops several deep learning architectures and shows their performance when applied to wind time series. The models have been tested with the most extensive wind dataset available, the National Renewable Energy Laboratory Wind Toolkit, a dataset with 126,692 wind points in North America. The architectures designed are based on different approaches: Multi-Layer Perceptron Networks (MLP), Convolutional Networks (CNN), and Recurrent Networks (RNN). These deep learning architectures have been tested to obtain predictions over a 12-hour-ahead horizon, and the accuracy is measured with the coefficient of determination (R²). The application of the models to wind sites evenly distributed across North America allows us to draw several conclusions on the relationships between methods, terrain, and forecasting complexity. The results show differences between the models and confirm the superior capabilities of deep learning techniques for wind speed forecasting from wind time series data.
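
    The paper's model code is not included in the abstract; as a hedged sketch of the general setup, a minimal MLP for 12-step-ahead forecasting from a wind speed series might look as follows. The window length, layer sizes, training settings, and synthetic series are assumptions, not the architectures evaluated in the paper.

```python
# Minimal MLP sketch for multi-step wind speed forecasting; all sizes are illustrative.
import numpy as np
from tensorflow import keras

LOOKBACK, HORIZON = 48, 12           # assumed: 48 past steps in, 12 future steps out

def make_windows(series, lookback=LOOKBACK, horizon=HORIZON):
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + horizon])
    return np.array(X), np.array(y)

series = np.random.rand(5000).astype("float32")      # placeholder for a real wind speed series
X, y = make_windows(series)

model = keras.Sequential([
    keras.layers.Input(shape=(LOOKBACK,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(HORIZON),                      # one output per forecast step
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2, verbose=0)
```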

    Integration of a parallel efficiency monitoring tool into an HPC production system

    This thesis presents the design, implementation, and evaluation of an extension of a library called TALP for tracing useful computation and performance metrics in applications based on MPI (Message Passing Interface). The extension also integrates the tool with a web portal called UserPortal. Both are developed at the Barcelona Supercomputing Center (BSC). The library captures information about communication patterns and the computation performed by MPI applications, and makes this information available to users. The extension developed in this project adds the functionality of reading PAPI (Performance Application Programming Interface) counters, allowing users to know the instructions per cycle of their application and identify bottlenecks in their code. UserPortal provides an easy-to-use interface for visualizing and analyzing the captured information and allows users to easily monitor the status of their jobs, such as memory usage, CPU usage, and their evolution over time. The integration of the library with the BSC system involves several stages of design and development, including a software wrapper, a modulefile, scripts for retrieving and processing data, and web development to display the data on UserPortal. However, users must be educated in performance analysis in order to read and interpret the reported metrics effectively and optimize their codes. Public documentation has been developed, as well as a reference on how to use these tools on BSC machines, along with links to educational resources on related topics. Overall, this work provides a valuable tool for developers and researchers working with MPI-based applications, making performance optimization more approachable and efficient.
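
    TALP's internals are not shown in the abstract; as a hedged illustration of the kind of derived metrics the extension reports, the sketch below computes instructions per cycle (IPC) and a useful-time ratio from hypothetical per-rank counter values. The numbers, field names, and dataclass are invented for illustration and are not the TALP data model.

```python
# Illustrative post-processing of PAPI-style counters; values and field names are invented.
from dataclasses import dataclass

@dataclass
class RankCounters:
    rank: int
    total_instructions: int   # e.g. from the PAPI_TOT_INS counter
    total_cycles: int         # e.g. from the PAPI_TOT_CYC counter
    useful_time_s: float      # time spent computing outside MPI calls
    elapsed_time_s: float     # wall-clock time of the measured region

def report(counters):
    for c in counters:
        ipc = c.total_instructions / c.total_cycles
        useful_fraction = c.useful_time_s / c.elapsed_time_s
        print(f"rank {c.rank}: IPC={ipc:.2f}  useful time={useful_fraction:.1%}")

report([
    RankCounters(0, 9_200_000_000, 6_100_000_000, 41.3, 50.0),
    RankCounters(1, 9_100_000_000, 7_800_000_000, 35.0, 50.0),
])
```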

    Detecting Irregular Patterns in IoT Streaming Data for Fall Detection

    Detecting patterns in real-time streaming data is an interesting and challenging data analytics problem. With the proliferation of a variety of sensor devices, real-time analytics of data from the Internet of Things (IoT) to learn regular and irregular patterns has become an important machine learning problem for enabling predictive analytics for automated notification and decision support. In this work, we address the problem of learning an irregular human activity pattern, the fall, from streaming IoT data produced by wearable sensors. We present a deep neural network model for detecting falls from accelerometer data, achieving 98.75 percent accuracy on an online physical activity monitoring dataset called "MobiAct", published by Vavoulas et al. The initial model was developed using IBM Watson Studio and was later transferred and deployed on IBM Cloud with the streaming analytics service supported by IBM Streams for monitoring real-time IoT data. We also present the system architecture of the real-time fall detection framework that we intend to use with mbientlabs wearable health monitoring sensors for real-time patient monitoring at retirement homes or rehabilitation clinics.
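
    The exact network is not specified in the abstract; as a hedged sketch of a fall/no-fall classifier over accelerometer windows, the snippet below trains a small dense network on synthetic data. The window size, channel count, layer sizes, and the random stand-in for MobiAct-style labelled windows are all assumptions.

```python
# Illustrative fall/no-fall classifier over accelerometer windows; shapes are assumptions.
import numpy as np
from tensorflow import keras

WINDOW, CHANNELS = 200, 3            # assumed: 200 samples of (x, y, z) acceleration

# Placeholder data standing in for labelled MobiAct-style windows.
X = np.random.randn(1000, WINDOW, CHANNELS).astype("float32")
y = np.random.randint(0, 2, size=1000)    # 1 = fall, 0 = other activity

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, CHANNELS)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # probability of a fall
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.2, verbose=0)
```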

    Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

    Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms were and are being developed with the aim of accelerating this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will be the best performing in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations from widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under the Marie Skłodowska-Curie grant agreement No 749516, and in part by the Spanish Juan de la Cierva grant IJCI-2017-33511.
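
    The benchmark harness itself is not given in the abstract; as a hedged sketch of the kind of measurement involved, the following PyTorch snippet times one convolution configuration in FP32 and FP16 with cuDNN's internal algorithm autotuning enabled. The layer shape and iteration counts are arbitrary examples, a CUDA GPU is required, and this is not the paper's own methodology.

```python
# Illustrative timing of a single convolution in FP32 vs FP16; the shape is an arbitrary example.
import torch

assert torch.cuda.is_available(), "requires a CUDA GPU with cuDNN"
torch.backends.cudnn.benchmark = True      # let cuDNN pick its fastest algorithm for this shape

def time_conv(dtype):
    x = torch.randn(32, 256, 56, 56, device="cuda", dtype=dtype)
    conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).to("cuda", dtype)
    for _ in range(5):                     # warm-up (also triggers algorithm selection)
        conv(x)
    torch.cuda.synchronize()
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(50):
        conv(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 50    # milliseconds per forward pass

print("FP32:", time_conv(torch.float32), "ms")
print("FP16:", time_conv(torch.float16), "ms")
```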

    Forum Session at the First International Conference on Service Oriented Computing (ICSOC03)

    The First International Conference on Service Oriented Computing (ICSOC) was held in Trento, December 15-18, 2003. The focus of the conference, Service Oriented Computing (SOC), is the new emerging paradigm for distributed computing and e-business processing that has evolved from object-oriented and component computing to enable building agile networks of collaborating business applications distributed within and across organizational boundaries. Of the 181 papers submitted to the ICSOC conference, 10 were selected for the forum session, which took place on December 16, 2003. The papers were chosen based on their technical quality, originality, relevance to SOC, and for being best suited to a poster presentation or a demonstration. This technical report contains the 10 papers presented during the forum session at the ICSOC conference. In particular, the last two papers in the report were submitted as industrial papers.