    D5.1: Accelerator Deployment Models

    In this deliverable, we explore this question by studying accelerator deployment models. By accelerator we mean, for example, application-specific GPUs or specially programmed FPGAs. A deployment specifies the types, number, and connectivity of accelerators in a datacenter. With these definitions in mind, we created a theoretical model of the datacenter, its components, expected workloads, and finally its possible deployments. Based on this theoretical model, we developed VineSim, a software simulator of a datacenter. VineSim takes as input a workload and a deployment description and outputs performance metrics of interest, such as job latency and resource utilization. In VineSim, one can configure several parameters, including how tasks are allocated to nodes and estimates of how fast they execute on different accelerators. VineSim can be used to explore how different deployments respond to different kinds of workloads, thus allowing one to determine how best to compose a datacenter for particular workload, performance, or budgeting requirements.
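
    The deliverable itself contains no code, but the input/output relationship described above can be illustrated with a small sketch. The structures, names, and numbers below are hypothetical and not taken from VineSim; they only show the kind of deployment and workload description such a simulator consumes, with a naive latency estimate standing in for the actual simulation.

        # Hypothetical sketch of a VineSim-style input: a deployment (accelerator
        # kinds, counts, speedups) plus a workload, with a naive latency estimate
        # in place of the real simulation. All values are illustrative.
        from dataclasses import dataclass

        @dataclass
        class Accelerator:
            kind: str          # e.g. "gpu" or "fpga"
            count: int         # units of this kind in the deployment
            speedup: float     # assumed speedup over a baseline CPU

        @dataclass
        class Job:
            name: str
            cpu_seconds: float # baseline execution time on a CPU
            accel_kind: str    # accelerator kind the job can exploit

        def estimate_latency(job: Job, deployment: list) -> float:
            """Run on a matching accelerator if the deployment provides one,
            otherwise fall back to the baseline CPU time."""
            for acc in deployment:
                if acc.kind == job.accel_kind and acc.count > 0:
                    return job.cpu_seconds / acc.speedup
            return job.cpu_seconds

        deployment = [Accelerator("gpu", 4, 8.0), Accelerator("fpga", 2, 5.0)]
        workload = [Job("training", 120.0, "gpu"), Job("encoding", 60.0, "fpga")]
        for job in workload:
            print(job.name, estimate_latency(job, deployment), "s")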

    Commodity single board computer clusters and their applications

    Current commodity Single Board Computers (SBCs) are sufficiently powerful to run mainstream operating systems and workloads. Many of these boards may be linked together to create small, low-cost clusters that replicate some features of large data center clusters. The Raspberry Pi Foundation produces a series of SBCs with a price/performance ratio that makes SBC clusters viable, perhaps even expendable. These clusters are an enabler for Edge/Fog Compute, where processing is pushed out towards data sources, reducing bandwidth requirements and decentralizing the architecture. In this paper we investigate the use cases driving the growth of SBC clusters, examine the trends in future hardware developments, and discuss the potential of SBC clusters as a disruptive technology. Compared to traditional clusters, SBC clusters have a reduced footprint, are low-cost, and have low power requirements. This enables different models of deployment, particularly outside traditional data center environments. We discuss the applicability of existing software and management infrastructure to support exotic deployment scenarios and anticipate the next generation of SBCs. We conclude that the SBC cluster is a new and distinct computational deployment paradigm, applicable to a wider range of scenarios than current clusters. It facilitates Internet of Things and Smart City systems and is potentially a game changer in pushing application logic out towards the network edge.

    HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges

    High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of the on-premise and cloud resources: steady (and sensitive) workloads can run on on-premise resources and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance of an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision on what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast-increasing wave of new HPC applications coming from big data and artificial intelligence. Comment: 29 pages, 5 figures. Published in ACM Computing Surveys (CSUR).
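
    The hybrid on-premise/cloud pattern described above (steady work stays local, peaks burst to pay-as-you-go resources) can be made concrete with a toy admission rule. The capacities and prices below are invented for illustration and do not come from the survey.

        # Toy illustration of hybrid bursting: fill on-premise capacity first,
        # send the overflow to the cloud, and report the pay-as-you-go cost.
        # Capacity and price figures are assumptions made for this example.
        ON_PREM_CAPACITY = 100            # node-hours available locally
        CLOUD_PRICE_PER_NODE_HOUR = 0.90  # assumed on-demand price in dollars

        def place_jobs(demands):
            remaining = ON_PREM_CAPACITY
            placement, cloud_hours = [], 0.0
            for name, node_hours in demands:
                if node_hours <= remaining:      # steady/sensitive work stays local
                    remaining -= node_hours
                    placement.append((name, "on-premise"))
                else:                            # peak demand bursts to the cloud
                    cloud_hours += node_hours
                    placement.append((name, "cloud"))
            return placement, cloud_hours * CLOUD_PRICE_PER_NODE_HOUR

        placement, cost = place_jobs([("baseline-analytics", 80), ("quarter-end-peak", 50)])
        print(placement, f"cloud spend: ${cost:.2f}")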

    Improving the Performance of Cloud-based Scientific Services

    Cloud computing provides access to a large-scale set of readily available computing resources at the click of a button. The cloud paradigm has commoditised computing capacity and is often touted as a low-cost model for executing and scaling applications. However, there are significant technical challenges associated with selecting, acquiring, configuring, and managing cloud resources which can restrict the efficient utilisation of cloud capabilities. Scientific computing is increasingly hosted on cloud infrastructure, in which scientific capabilities are delivered to the broad scientific community via Internet-accessible services. This migration from on-premise to on-demand cloud infrastructure is motivated by the sporadic usage patterns of scientific workloads and the associated potential cost savings without the need to purchase, operate, and manage compute infrastructure, a task that few scientific users are trained to perform. However, cloud platforms are not an automatic solution. Their flexibility is derived from an enormous number of services and configuration options, which in turn result in significant complexity for the user. In fact, naïve cloud usage can result in poor performance and excessive costs, which are then directly passed on to researchers. This thesis presents methods for developing efficient cloud-based scientific services. Three real-world scientific services are analysed and a set of common requirements is derived. To address these requirements, this thesis explores automated and scalable methods for inferring network performance, considers various trade-offs (e.g., cost and performance) when provisioning instances, and profiles application performance, all in heterogeneous and dynamic cloud environments. Specifically, network tomography provides the mechanisms to infer network performance in dynamic and opaque cloud networks; cost-aware automated provisioning approaches enable services to consider, in real time, various trade-offs such as cost, performance, and reliability; and automated application profiling allows a huge search space of applications, instance types, and configurations to be analysed to determine resource requirements and application performance. Finally, these contributions are integrated into an extensible and modular cloud provisioning and resource management service called SCRIMP. Cloud-based scientific applications and services can subscribe to SCRIMP to outsource the provisioning, usage, and management of cloud infrastructure. Collectively, the approaches presented in this thesis are shown to provide order-of-magnitude cost savings and significant performance improvements when employed by production scientific services.
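
    SCRIMP's cost-aware provisioning is only summarised above; a minimal sketch of the underlying trade-off, scoring candidate instance types by a weighted combination of cost and profiled runtime, is shown below. The instance names, prices, runtimes, and weighting are hypothetical and not taken from the thesis.

        # Hypothetical cost/performance trade-off for instance selection:
        # score each candidate by a weighted, normalised combination of total
        # job cost and runtime, and pick the lowest score. Values are invented.
        candidates = [
            # (instance type, price per hour in $, profiled runtime in hours)
            ("small",  0.10, 4.0),
            ("medium", 0.20, 2.2),
            ("large",  0.40, 1.2),
        ]

        def pick_instance(candidates, cost_weight=0.5):
            """cost_weight=1.0 optimises purely for money, 0.0 purely for speed."""
            max_cost = max(p * t for _, p, t in candidates)  # worst-case total cost
            max_time = max(t for _, _, t in candidates)      # worst-case runtime
            def score(entry):
                _, price, runtime = entry
                cost = price * runtime
                return cost_weight * (cost / max_cost) + (1 - cost_weight) * (runtime / max_time)
            return min(candidates, key=score)

        print(pick_instance(candidates, cost_weight=0.3))    # favour performance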

    Optimization of deep learning algorithms for an autonomous RC vehicle

    Master's dissertation in Informatics Engineering. This dissertation aims to evaluate and improve the performance of deep learning (DL) algorithms to autonomously drive a vehicle, using a Remo Car (an RC vehicle) as a testbed. The RC vehicle was built from a 1:10-scale remote-controlled car fitted with an embedded system and a video camera to capture and process real-time image data. Two different embedded systems were comparatively evaluated: a homogeneous system, a Raspberry Pi 4, and a heterogeneous system, an NVidia Jetson Nano. The Raspberry Pi 4, with an advanced 4-core ARM device, supports multiprocessing, while the Jetson Nano, also with a 4-core ARM device, has an integrated accelerator, a 128 CUDA-core NVidia GPU. The captured video is processed with convolutional neural networks (CNNs), which interpret image data of the vehicle's surroundings and predict critical data, such as lane view and steering angle, providing the mechanisms for the vehicle to drive on its own, following a predefined path. To improve the driving performance of the RC vehicle, this work analysed the programmed DL algorithms, namely different computer vision approaches for object detection and image classification, aiming to explore DL techniques and improve their performance at the inference phase. The work also analysed the computational efficiency of the control software while running intense and complex deep learning tasks on the embedded devices, and fully explored the advanced characteristics and instructions provided by the two embedded systems in the vehicle. Different machine learning (ML) libraries and frameworks were analysed and evaluated: TensorFlow, TensorFlow Lite, Arm NN, PyArmNN and TensorRT. They play a key role in deploying the relevant algorithms and fully engaging the hardware capabilities. The original algorithm was successfully optimized and both embedded systems handled this workload with ease. To understand the computational limits of both devices, an additional, heavier DL algorithm was developed to detect traffic signs. The homogeneous system, the Raspberry Pi 4, could not deliver feasible low-latency values, hence the detection of traffic signs was not possible in real time. However, a great performance improvement was achieved using the heterogeneous system, the Jetson Nano, by enabling its CUDA cores to process the additional workload.
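
    TensorFlow Lite is one of the frameworks evaluated in the dissertation; a generic on-device inference loop of the kind a steering-prediction pipeline would use is sketched below. The model file name and the zero-filled input frame are placeholders, and the snippet assumes the tflite_runtime package is installed on the board.

        # Generic TensorFlow Lite inference sketch for on-device prediction.
        # The model path and input frame are placeholders; the actual pipeline
        # in the dissertation may differ.
        import numpy as np
        from tflite_runtime.interpreter import Interpreter

        interpreter = Interpreter(model_path="steering_model.tflite")  # hypothetical file
        interpreter.allocate_tensors()
        input_details = interpreter.get_input_details()[0]
        output_details = interpreter.get_output_details()[0]

        # Stand-in for a camera frame resized to the model's expected input shape.
        frame = np.zeros(input_details["shape"], dtype=input_details["dtype"])

        interpreter.set_tensor(input_details["index"], frame)
        interpreter.invoke()                                  # one inference pass
        prediction = interpreter.get_tensor(output_details["index"])
        print("model output:", prediction)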

    EDEN: A high-performance, general-purpose, NeuroML-based neural simulator

    Modern neuroscience employs in silico experimentation on ever-increasing and more detailed neural networks. The high modelling detail goes hand in hand with the need for high model reproducibility, reusability and transparency. Besides, the size of the models and the long timescales under study mandate the use of a simulation system with high computational performance, so as to provide an acceptable time to result. In this work, we present EDEN (Extensible Dynamics Engine for Networks), a new general-purpose, NeuroML-based neural simulator that achieves both high model flexibility and high computational performance, through an innovative model-analysis and code-generation technique. The simulator runs NeuroML v2 models directly, eliminating the need for users to learn yet another simulator-specific model-specification language. EDEN's functional correctness and computational performance were assessed through NeuroML models available on the NeuroML-DB and Open Source Brain model repositories. In qualitative experiments, the results produced by EDEN were verified against the established NEURON simulator, for a wide range of models. At the same time, computational-performance benchmarks reveal that EDEN runs up to two orders of magnitude faster than NEURON on a typical desktop computer, and does so without additional effort from the user. Finally, and without added user effort, EDEN has been built from scratch to scale seamlessly over multiple CPUs and across computer clusters, when available. Comment: 29 pages, 9 figures.

    Architecting Data Centers for High Efficiency and Low Latency

    Modern data centers, housing remarkably powerful computational capacity, are built in massive scales and consume a huge amount of energy. The energy consumption of data centers has mushroomed from virtually nothing to about three percent of the global electricity supply in the last decade, and will continuously grow. Unfortunately, a significant fraction of this energy consumption is wasted due to the inefficiency of current data center architectures, and one of the key reasons behind this inefficiency is the stringent response latency requirements of the user-facing services hosted in these data centers such as web search and social networks. To deliver such low response latency, data center operators often have to overprovision resources to handle high peaks in user load and unexpected load spikes, resulting in low efficiency. This dissertation investigates data center architecture designs that reconcile high system efficiency and low response latency. To increase the efficiency, we propose techniques that understand both microarchitectural-level resource sharing and system-level resource usage dynamics to enable highly efficient co-locations of latency-critical services and low-priority batch workloads. We investigate the resource sharing on real-system simultaneous multithreading (SMT) processors to enable SMT co-locations by precisely predicting the performance interference. We then leverage historical resource usage patterns to further optimize the task scheduling algorithm and data placement policy to improve the efficiency of workload co-locations. Moreover, we introduce methodologies to better manage the response latency by automatically attributing the source of tail latency to low-level architectural and system configurations in both offline load testing environment and online production environment. We design and develop a response latency evaluation framework at microsecond-level precision for data center applications, with which we construct statistical inference procedures to attribute the source of tail latency. Finally, we present an approach that proactively enacts carefully designed causal inference micro-experiments to diagnose the root causes of response latency anomalies, and automatically correct them to reduce the response latency. PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144144/1/yunqi_1.pd
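
    The latency evaluation framework described above starts from per-request latency measurements; the basic step of summarising such samples into tail percentiles can be sketched as follows. The sample data here is synthetic and only illustrates the shape of a tail-dominated distribution.

        # Minimal illustration of the starting point for tail-latency analysis:
        # summarise per-request latency samples (microseconds) into percentiles.
        # The samples are synthetic; a real framework collects them from the service.
        import numpy as np

        rng = np.random.default_rng(0)
        latencies_us = np.concatenate([
            rng.normal(500, 50, 9900),    # typical fast requests
            rng.normal(5000, 800, 100),   # rare slow requests that dominate the tail
        ])

        p50, p99, p999 = np.percentile(latencies_us, [50, 99, 99.9])
        print(f"p50={p50:.0f}us  p99={p99:.0f}us  p99.9={p999:.0f}us")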

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to exciting new application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately foster the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.