
    Learning workload behaviour models from monitored time-series for resource estimation towards data center optimization

    In recent years there has been an extraordinary growth in the demand for Cloud Computing resources executed in Data Centers. Modern Data Centers are complex systems that need management. As distributed computing systems grow and workloads benefit from such computing environments, the management of such systems increases in complexity. The complexity of resource usage and power consumption in cloud-based applications makes it difficult to understand application behavior through expert examination. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the wide variety of scenarios and applications, automation is required. To deal with such complexity, Machine Learning methods become crucial to facilitate tasks that can be learned automatically from data. Firstly, this thesis proposes an unsupervised learning technique to learn high-level representations from workload traces. This technique provides a fast methodology to characterize workloads as sequences of abstract phases. The learned phase representation is validated on a variety of datasets and used in an auto-scaling task, where we show that it can be applied in a production environment, achieving better performance than other state-of-the-art techniques. Secondly, this thesis proposes a neural architecture, based on Sequence-to-Sequence models, that predicts the expected resource usage of applications sharing hardware resources. The proposed technique gives resource managers the ability to predict resource usage over time as well as the completion time of the running applications, and it achieves lower prediction error than other popular Machine Learning methods. Thirdly, this thesis proposes a technique for auto-tuning Big Data workloads from the available tunable parameters. The proposed technique gathers information from the logs of an application, generating a feature descriptor that captures relevant information about the application to be tuned. Using this information we demonstrate that performance models can generalize up to 34% better than other state-of-the-art solutions. Moreover, the search time to find a suitable solution can be drastically reduced, with up to a 12x speedup while obtaining results of almost equal quality to modern solutions. These results show that modern learning algorithms, given the right feature information, provide powerful techniques to manage resource allocation for applications running in cloud environments. This thesis demonstrates that learning algorithms enable relevant optimizations in Data Center environments, where applications are externally monitored and careful resource management is paramount to using computing resources efficiently. We demonstrate this thesis in three areas that revolve around resource management in server environments.

    Weighted Contrastive Divergence

    Learning algorithms for energy-based Boltzmann architectures that rely on gradient descent are in general computationally prohibitive, typically due to the exponential number of terms involved in computing the partition function. One therefore has to resort to approximation schemes for the evaluation of the gradient. This is the case of Restricted Boltzmann Machines (RBMs) and their learning algorithm, Contrastive Divergence (CD). It is well known that CD has a number of shortcomings, and its approximation to the gradient has several drawbacks. Overcoming these defects has been the basis of much research, and new algorithms have been devised, such as persistent CD. In this manuscript we propose a new algorithm that we call Weighted CD (WCD), built from small modifications of the negative phase in standard CD. However small these modifications may be, the experimental work reported in this paper suggests that WCD provides a significant improvement over standard CD and persistent CD at a small additional computational cost.
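    To make the negative-phase modification concrete, here is a minimal sketch of CD-1 for a binary RBM in which the negative-phase statistics are reweighted per sample. The weighting rule used below (a softmax over negative free energies) is an illustrative assumption, not the WCD rule from the manuscript.

```python
# Sketch: CD-1 with a hypothetical reweighting of negative-phase samples.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, b, c):
    # F(v) = -b^T v - sum_j log(1 + exp(c_j + v^T W_j))
    return -(v @ b) - np.logaddexp(0.0, v @ W + c).sum(axis=1)

def cd1_weighted(v0, W, b, c, lr=0.01):
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step to obtain reconstructions.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Hypothetical per-sample weights for the negative statistics
    # (softmax over negative free energies; an assumption for illustration).
    w = np.exp(-free_energy(v1, W, b, c))
    w = w / w.sum()
    n = v0.shape[0]
    pos = v0.T @ ph0 / n                       # standard positive statistics
    neg = (v1 * w[:, None]).T @ ph1            # weighted negative statistics
    W += lr * (pos - neg)
    b += lr * (v0.mean(0) - (v1 * w[:, None]).sum(0))
    c += lr * (ph0.mean(0) - (ph1 * w[:, None]).sum(0))
    return W, b, c

# Toy usage: 64 visible units, 32 hidden units, one random binary mini-batch.
W = 0.01 * rng.standard_normal((64, 32))
b, c = np.zeros(64), np.zeros(32)
batch = (rng.random((16, 64)) < 0.5).astype(float)
W, b, c = cd1_weighted(batch, W, b, c)
```

    Setting w to the uniform vector 1/n recovers standard CD, which is why the extra computational cost of a reweighting scheme of this kind is small.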

    Discovering ship navigation patterns towards environmental impact modeling

    In this work, a data pipeline to manage and extract patterns from time series is described. The patterns, found with a combination of the Conditional Restricted Boltzmann Machine (CRBM) and k-means algorithms, are then validated using a visualization tool. The motivation for finding these patterns is to leverage them in future emission models.

    Sequence-to-sequence models for workload interference prediction on batch processing datacenters

    Co-scheduling of jobs in data centers is a challenging scenario where jobs can compete for resources, leading to severe slowdowns or failed executions. Efficient job placement in environments where resources are shared requires awareness of how jobs interfere during execution, going far beyond ineffective resource overbooking techniques. Current techniques, most of which already involve machine learning and job modeling, are based on summarizing workload behavior over time rather than focusing on effective job requirements at each instant of the execution. In this work, we propose a methodology for modeling co-scheduling of jobs on data centers, based on their behavior towards resources and execution time, using sequence-to-sequence models built on recurrent neural networks. The goal is to forecast the footprint of co-executed jobs on resources throughout their execution time, from the profiles shown by the individual jobs, in order to enhance resource manager and scheduler placement decisions. The methods presented herein are validated using High Performance Computing benchmarks based on different frameworks (such as Hadoop and Spark) and applications (CPU bound, IO bound, machine learning, SQL queries...). Experiments show that the model can correctly identify the resource usage trends from previously seen and even unseen co-scheduled jobs. This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 639595); the Generalitat de Catalunya under contract 2014SGR1051; the ICREA Academia program; the BSC-CNS Severo Ochoa program (SEV-2015-0493); and the Spanish Ministry of Economy under contract TIN2015-65316-P.
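    The sketch below illustrates the sequence-to-sequence idea in PyTorch: an LSTM encoder reads the observed profiles of co-scheduled jobs and an autoregressive LSTM decoder forecasts their joint resource footprint over a future horizon. The dimensions, feature layout, and layer sizes are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: LSTM encoder-decoder for forecasting co-scheduled resource usage.
import torch
import torch.nn as nn

class Seq2SeqInterference(nn.Module):
    def __init__(self, in_dim=6, out_dim=3, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(out_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, past, horizon):
        # past: (batch, T_in, in_dim) per-step features of the co-scheduled jobs.
        _, state = self.encoder(past)
        # Autoregressive decoding, starting from a zero "start" frame.
        y = torch.zeros(past.size(0), 1, self.head.out_features)
        outputs = []
        for _ in range(horizon):
            dec_out, state = self.decoder(y, state)
            y = self.head(dec_out)           # predicted usage for the next step
            outputs.append(y)
        return torch.cat(outputs, dim=1)     # (batch, horizon, out_dim)

# Toy usage: 8 traces, 30 observed steps, forecast 10 steps of CPU/memory/IO.
model = Seq2SeqInterference()
past = torch.randn(8, 30, 6)
pred = model(past, horizon=10)
loss = nn.functional.mse_loss(pred, torch.randn(8, 10, 3))
loss.backward()
```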

    Improving Maritime Traffic Emission Estimations on Missing Data with CRBMs

    Maritime traffic emissions are a major concern to governments as they heavily impact the air quality in coastal cities. Ships use the Automatic Identification System (AIS) to continuously report position and speed, among other features, so this data is suitable for estimating emissions when combined with engine data. However, important ship features are often inaccurate or missing. State-of-the-art complex systems, like CALIOPE at the Barcelona Supercomputing Center, are used to model air quality. These systems can benefit from AIS-based emission models as they are very precise in positioning the pollution. Unfortunately, these models are sensitive to missing or corrupted data, and therefore they need data curation techniques to significantly improve the estimation accuracy. In this work, we propose a methodology for treating ship data using Conditional Restricted Boltzmann Machines (CRBMs) plus machine learning methods to improve the quality of the data passed to emission models. Results show that we can improve on the default methods proposed to cover missing data. In our experiments, the models boosted their accuracy to detect otherwise undetectable emissions. In particular, using a real dataset of AIS data provided by the Spanish Port Authority, we estimate that thanks to our method the model was able to detect 45% additional emissions, representing 152 tonnes of pollutants per week in Barcelona. We also propose new features that may enhance emission modeling. Comment: 12 pages, 7 figures. Postprint of the accepted manuscript; the full version is available at Engineering Applications of Artificial Intelligence (https://doi.org/10.1016/j.engappai.2020.103793).
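    To illustrate the imputation idea, here is a minimal sketch of conditioning a Boltzmann-style model on the observed ship features and sampling the missing ones: observed visible units are clamped while the model's reconstruction fills the gaps. It uses a plain Gaussian-Bernoulli RBM with random placeholder weights rather than the paper's trained CRBM pipeline; feature names and dimensions are assumptions.

```python
# Sketch: clamped Gibbs sampling to impute missing (standardized) ship features.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def impute(v, missing_mask, W, b, c, gibbs_steps=20):
    """v: (n_features,) record with arbitrary values where missing_mask is True."""
    v = v.copy()
    v[missing_mask] = 0.0                       # crude initialization of the gaps
    for _ in range(gibbs_steps):
        ph = sigmoid(v @ W + c)                 # hidden given visible
        h = (rng.random(ph.shape) < ph).astype(float)
        v_mean = h @ W.T + b                    # Gaussian visibles: use the mean
        v[missing_mask] = v_mean[missing_mask]  # observed features stay clamped
    return v

# Toy usage: 5 features, e.g. length, beam, draught, speed, gross tonnage;
# draught and gross tonnage are missing in this record.
n_vis, n_hid = 5, 16
W = 0.1 * rng.standard_normal((n_vis, n_hid))   # placeholder, untrained weights
b, c = np.zeros(n_vis), np.zeros(n_hid)
record = np.array([1.2, 0.8, np.nan, -0.3, np.nan])
mask = np.isnan(record)
print(impute(record, mask, W, b, c))
```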

    Head-to-head comparison of contemporary heart failure risk scores

    Abstract. Aims: Several heart failure (HF) web-based risk scores are currently used in clinical practice, but we lack a head-to-head comparison of their accuracy. This study aimed to assess the correlation and mortality prediction performance of the Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC-HF) risk score, which includes clinical variables + medications; the Seattle Heart Failure Model (SHFM), which includes clinical variables + treatments + analytes; and the PARADIGM Risk of Events and Death in the Contemporary Treatment of Heart Failure (PREDICT-HF) and Barcelona Bio-Heart Failure (BCN-Bio-HF) risk calculators, which also include biomarkers such as N-terminal pro B-type natriuretic peptide (NT-proBNP). Methods and results: A total of 1166 consecutive patients with HF of different aetiologies who had an NT-proBNP measurement at the first visit were included. Discrimination for all-cause mortality was compared by Harrell's C-statistic from 1 to 5 years, when possible. Calibration was assessed by calibration plots and the Hosmer-Lemeshow test, and global performance by Nagelkerke's R2. Correlation between scores was assessed by the Spearman rank test. Correlation between the scores was relatively poor (rho values from 0.66 to 0.79). Discrimination analyses showed better results for 1-year mortality than for longer follow-up (SHFM 0.817, MAGGIC-HF 0.801, PREDICT-HF 0.799, BCN-Bio-HF 0.830). MAGGIC-HF showed the best calibration; BCN-Bio-HF overestimated risk, while SHFM and PREDICT-HF underestimated it. BCN-Bio-HF provided the best discrimination and overall performance at every time point. Conclusions: None of the contemporary risk scores examined showed a clear superiority over the rest. The BCN-Bio-HF calculator provided the best discrimination and overall performance, with overestimation of risk. MAGGIC-HF showed the best calibration, and SHFM and PREDICT-HF tended to underestimate risk. Regular updating and recalibration of online web calculators seems necessary to improve their accuracy as HF management evolves at an unprecedented pace. Funding: The NT-proBNP assays were partially provided by Roche Diagnostics. Roche Diagnostics had no role in the design of the study or the collection, management, analysis, or interpretation of the data. Article signed by 24 authors: Pau Codina, Josep Lupón, Andrea Borrellas, Giosafat Spitaleri, Germán Cediel, Mar Domingo, Joanne Simpson, Wayne C. Levy, Evelyn Santiago-Vacas, Elisabet Zamora, David Buchaca, Isaac Subirana, Javier Santesmases, Crisanto Diez-Quevedo, Maria I. Troya, Maria Boldo, Salvador Altmir, Nuria Alonso, Beatriz González, Carmen Rivas, Julio Nuñez, John McMurray, Antoni Bayes-Genis.
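    For readers less familiar with the statistics involved, the sketch below shows how two risk scores could be compared on simulated survival data using Harrell's C-statistic (lifelines) and the Spearman rank correlation (scipy). It is purely illustrative: the data, variable names, and scores are assumptions, not the study's analysis code.

```python
# Sketch: discrimination (C-statistic) and score correlation on simulated data.
import numpy as np
from scipy.stats import spearmanr
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 500
risk_a = rng.random(n)                       # hypothetical risk score A
risk_b = 0.7 * risk_a + 0.3 * rng.random(n)  # hypothetical correlated score B
time = rng.exponential(scale=5.0 / (0.5 + risk_a), size=n)  # years to event
event = rng.random(n) < 0.6                  # True = died, False = censored

# Higher risk should mean shorter survival, so the negated score is passed.
c_a = concordance_index(time, -risk_a, event)
c_b = concordance_index(time, -risk_b, event)
rho, _ = spearmanr(risk_a, risk_b)
print(f"C-statistic A: {c_a:.3f}  C-statistic B: {c_b:.3f}  Spearman rho: {rho:.2f}")
```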

    Stopping criteria in contrastive divergence: Alternatives to the reconstruction error

    Restricted Boltzmann Machines (RBMs) are general unsupervised learning devices for learning generative models of data distributions. RBMs are often trained using the Contrastive Divergence (CD) learning algorithm, an approximation to the gradient of the data log-likelihood. A simple reconstruction error is often used to decide whether the approximation provided by the CD algorithm is good enough, though several authors (Schulz et al., 2010; Fischer & Igel, 2010) have raised doubts concerning the feasibility of this procedure. However, not many alternatives to the reconstruction error have been used in the literature. In this manuscript we investigate simple alternatives to the reconstruction error in order to detect, as soon as possible, the decrease in the log-likelihood during learning.
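    As a point of reference, the sketch below computes two quantities one can monitor during CD training of a binary RBM: the usual reconstruction error and a stochastic pseudo-log-likelihood proxy. It illustrates the monitoring problem the paper studies; it is not the paper's proposed stopping criterion.

```python
# Sketch: reconstruction error vs. a pseudo-log-likelihood monitor for an RBM.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, b, c):
    return -(v @ b) - np.logaddexp(0.0, v @ W + c).sum(axis=1)

def reconstruction_error(v, W, b, c):
    ph = sigmoid(v @ W + c)
    pv = sigmoid(ph @ W.T + b)          # mean-field reconstruction
    return np.mean((v - pv) ** 2)

def pseudo_log_likelihood(v, W, b, c):
    # Flip one random bit per sample and compare free energies.
    n, d = v.shape
    idx = rng.integers(0, d, size=n)
    v_flip = v.copy()
    v_flip[np.arange(n), idx] = 1.0 - v_flip[np.arange(n), idx]
    fe, fe_flip = free_energy(v, W, b, c), free_energy(v_flip, W, b, c)
    return d * np.mean(np.log(sigmoid(fe_flip - fe)))

# Toy usage with random weights and a random binary batch.
W = 0.01 * rng.standard_normal((20, 10))
b, c = np.zeros(20), np.zeros(10)
batch = (rng.random((32, 20)) < 0.5).astype(float)
print(reconstruction_error(batch, W, b, c), pseudo_log_likelihood(batch, W, b, c))
```

    Tracking both curves during training makes the paper's concern concrete: the reconstruction error can keep decreasing even after a likelihood-based monitor has started to degrade.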

    A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL

    © 2020 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/). The Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that the CRBM is a suitable mechanism for modeling multidimensional time series such as human motion, workload characterization, and city traffic analysis. The learning and inference process of these systems relies on linear algebra functions like matrix-matrix multiplication, and for larger data sets it is very compute-intensive. In this paper, we present a configurable framework for CRBM-based workloads with arbitrarily large models. We show how to accelerate the learning process of the CRBM with FPGAs and OpenCL, and we conduct an extensive scalability study for different model sizes and system configurations. We show a significant improvement in performance/Watt for large models and batch sizes (from 1.51x up to 5.71x depending on the host configuration) when we use FPGA and OpenCL for the acceleration, and limited benefits for small models compared to the state-of-the-art CPU solution. This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595); the Ministry of Economy of Spain under contract TIN2015-65316-P and the Generalitat de Catalunya, Spain under contract 2014SGR1051; the ICREA, Spain Academia program; the BSC-CNS Severo Ochoa program, Spain (SEV-2015-0493); and Intel Corporation, United States.
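    The sketch below shows the compute pattern that dominates CRBM learning and inference: large matrix-matrix products between a batch of history windows and the model's weight matrices. The shapes are illustrative assumptions; it is this kind of GEMM-heavy kernel that an FPGA/OpenCL backend would offload, not this numpy code itself.

```python
# Sketch: the GEMM-dominated core of CRBM inference (illustrative shapes).
import numpy as np

rng = np.random.default_rng(0)

n_vis, n_hid, delay, batch = 32, 256, 10, 4096
W = rng.standard_normal((n_vis, n_hid))              # visible-to-hidden weights
A = rng.standard_normal((n_vis * delay, n_vis))      # history-to-visible weights
B = rng.standard_normal((n_vis * delay, n_hid))      # history-to-hidden weights
b, c = np.zeros(n_vis), np.zeros(n_hid)

v = rng.standard_normal((batch, n_vis))                  # current frames
history = rng.standard_normal((batch, n_vis * delay))    # concatenated past frames

# Hidden activations: two GEMMs dominate, O(batch * delay * n_vis * n_hid).
h = 1.0 / (1.0 + np.exp(-(v @ W + history @ B + c)))
# Reconstruction of the visibles adds two more GEMMs of similar size.
v_recon = h @ W.T + history @ A + b
print(h.shape, v_recon.shape)   # (4096, 256) (4096, 32)
```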

    Automatic generation of workload profiles using unsupervised learning pipelines

    The complexity of resource usage and power consumption in cloud-based applications makes it difficult to understand application behavior through expert examination. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the wide variety of scenarios and applications, automation is required. Here we examine and model application behavior by finding behavior phases. We use Conditional Restricted Boltzmann Machines (CRBMs) to model time series containing resource trace measurements such as CPU, memory and IO. CRBMs can be used to map a given historic window of trace behaviour into a single vector. This low-dimensional and time-aware vector can be passed through clustering methods, from simplistic ones like k-means to more complex ones like those based on Hidden Markov Models (HMMs). We use these methods to find phases of similar behaviour in the workloads. Our experimental evaluation shows that the proposed method is able to identify different phases of resource consumption across different workloads. We show that the distinct phases contain specific resource patterns that distinguish them.
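    A minimal sketch of the phase-detection pipeline follows: each sliding window of a resource trace is turned into a single vector and the vectors are clustered with k-means to obtain abstract phases. Here a flattened window stands in for the CRBM hidden activations used in the paper; the toy trace and its two regimes are assumptions.

```python
# Sketch: window embedding (CRBM stand-in) + k-means to label workload phases.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy trace: 300 time steps x 3 metrics (CPU, memory, IO), with two regimes.
trace = np.concatenate([
    rng.normal([0.8, 0.3, 0.1], 0.05, size=(150, 3)),   # compute-heavy phase
    rng.normal([0.2, 0.7, 0.6], 0.05, size=(150, 3)),   # IO/memory-heavy phase
])

def embed_windows(trace, window=10):
    # Placeholder embedding: flatten each window. A trained CRBM would map
    # the window to its hidden-unit activations instead.
    return np.stack([trace[t:t + window].ravel()
                     for t in range(len(trace) - window)])

vectors = embed_windows(trace)
phases = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(phases[:5], phases[-5:])   # each window labeled with a phase id
```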