
    Performance Modeling of Virtualized Custom Logic Computations

    Virtualization of custom logic computations (i.e., sharing a fixed function across distinct data streams) provides a means of reusing hardware resources, particularly when resources are limited. This is common practice in traditional processors, where more than one user can share processor resources. In this paper, we virtualize a custom logic block using C-slow techniques to support fine-grained context switching. We then develop and present an analytic model for several performance measures (throughput, latency, input queue occupancy) for both fine-grained and coarse-grained context switching (to a secondary memory). Next, we calibrate the analytic performance model with empirical measurements. We then validate the model via discrete-event simulation and use it to predict performance and develop optimal schedules for virtualized logic computations. We present results for a Taylor series expansion of a cosine function with added feedback and for an AES encryption cipher.
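    The C-slow virtualization sketched above can be pictured with a small scheduling example. The code below is a hypothetical sketch, not the authors' implementation: the function name c_slow_schedule and the 4-context, 8-stage example are assumptions, used only to show how round-robin, fine-grained context switching keeps a deep pipeline full while each individual stream still observes the full pipeline latency.

        # Hypothetical sketch (not the paper's code): fine-grained context switching
        # on a C-slowed pipeline. C independent streams share one P-stage pipeline;
        # a different stream issues on every cycle, so the pipeline stays full even
        # though each individual stream observes the full P-cycle latency.

        from collections import deque

        def c_slow_schedule(streams, pipeline_depth):
            """Round-robin issue of C streams into a pipeline of the given depth.

            streams: list of deques, one per context, holding pending inputs.
            Returns a list of (completion_cycle, stream_id, item) events.
            """
            C = len(streams)
            in_flight = deque()          # (completion_cycle, stream_id, item)
            completions = []
            cycle = 0
            while any(streams) or in_flight:
                # Retire anything that has reached the end of the pipeline.
                while in_flight and in_flight[0][0] <= cycle:
                    completions.append(in_flight.popleft())
                # Fine-grained context switch: stream (cycle mod C) owns this slot.
                ctx = cycle % C
                if streams[ctx]:
                    item = streams[ctx].popleft()
                    in_flight.append((cycle + pipeline_depth, ctx, item))
                cycle += 1
            return completions

        # Example: 4 contexts sharing an 8-stage pipeline.
        events = c_slow_schedule([deque(range(3)) for _ in range(4)], pipeline_depth=8)
        print(events[:4])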

    Utilizing Magnetic Tunnel Junction Devices in Digital Systems

    The research described in this dissertation is motivated by the desire to effectively utilize magnetic tunnel junctions (MTJs) in digital systems. We explore two aspects of this: (1) a read circuit useful for global clocking and magnetologic, and (2) hardware virtualization that exploits the deeply-pipelined nature of magnetologic. In the first aspect, a read circuit is used to sense the state of an MTJ (low or high resistance) and produce a logic output that represents this state. With global clocking, an external magnetic field combined with on-chip MTJs is used as an alternative mechanism for distributing the clock signal across the chip. With magnetologic, logic is evaluated with MTJs that must be sensed by a read circuit and used to drive downstream logic. For these two uses, we develop a resistance-to-voltage (R2V) read circuit to sense MTJ resistance and produce a logic voltage output. We design and fabricate a prototype test chip in a 3-metal, 2-poly 0.5 µm process to test the R2V read circuit and experimentally validate its correctness. Using a clocked low/high resistor pair, we show that the read circuit correctly detects the input resistance and produces the desired square-wave output. The read circuit is measured to operate correctly up to 48 MHz. The input node is relatively insensitive to node capacitance and can tolerate up to tens of pF without changing the bandwidth of the circuit. In the second aspect, hardware virtualization is a technique by which deeply-pipelined circuits that have feedback can be utilized. MTJs have the potential to act as state in a magnetologic circuit, which may result in a deep pipeline. Streams of computation are then context-switched into the hardware logic, allowing them to share hardware resources and more fully utilize the pipeline stages of the logic. While applicable to magnetologic using MTJs, virtualization also applies to traditional logic technologies such as CMOS. Our investigation targets MTJs, FPGAs, and ASICs. We develop M/D/1 and M/G/1 queueing models of the performance of virtualized hardware with secondary memory under a fixed, hierarchical, round-robin schedule; the models predict average throughput, latency, and queue occupancy in the system. We develop three C-slow applications and calibrate them to a clock and resource model for FPGA and ASIC technologies. Lastly, using the M/G/1 model, we predict throughput, latency, and resource usage for MTJ, FPGA, and ASIC technologies. We show three design scenarios illustrating ways in which to use the model.
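    For reference, a model of this kind typically rests on the classical Pollaczek-Khinchine mean-value relations for an M/G/1 queue; the statement below is the standard textbook form, with generic notation (arrival rate λ, service time S) that is not taken from the dissertation.

        % Pollaczek-Khinchine mean-value relations for an M/G/1 queue
        % (standard textbook form; the notation is generic, not the dissertation's).
        \begin{align}
          \rho &= \lambda\,\mathbb{E}[S] < 1                    && \text{server utilization,}\\
          W_q  &= \frac{\lambda\,\mathbb{E}[S^2]}{2\,(1-\rho)}  && \text{mean wait in the input queue,}\\
          L_q  &= \lambda\,W_q                                  && \text{mean queue occupancy (Little's law),}\\
          T    &= W_q + \mathbb{E}[S]                           && \text{mean latency through the shared logic.}
        \end{align}

    The M/D/1 case follows by taking a deterministic service time, so that \mathbb{E}[S^2] = \mathbb{E}[S]^2.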

    Next generation of Exascale-class systems: ExaNeSt project and the status of its interconnect and storage development

    The ExaNeSt project started in December 2015 and is funded by the EU H2020 research framework (call H2020-FETHPC-2014, n. 671553) to study the adoption of clusters of low-cost, power-efficient, Linux-based 64-bit ARM processors for Exascale-class systems. The ExaNeSt consortium pools partners with industrial and academic research expertise in storage, interconnects, and applications that share a vision of a European Exascale-class supercomputer. The common goal is to design and implement a physical rack prototype together with its cooling system, the non-volatile memory (NVM) architecture, and a unified low-latency interconnect able to test different options for network and storage. Furthermore, the consortium aims to provide real HPC applications to validate the system. In this paper, we describe the unified data and storage network architecture, report on the status of development of the different testbeds, and highlight preliminary benchmark results obtained through the execution of scientific, engineering, and data analytics scalable application kernels.

    A Distributed Workflow Platform for Simulation

    This paper presents an approach to design, implement, and deploy a simulation platform based on distributed workflows. It supports the smooth integration of existing software, e.g., Matlab, Scilab, Python, OpenFOAM, Paraview, and user-defined programs. The contribution of the paper is a new feature that supports application-level fault tolerance and exception handling, i.e., resilience.
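    The resilience feature, application-level fault tolerance and exception handling, can be pictured with a small generic sketch. The code below is a generic illustration, not the platform's actual API; the names resilient_task, retries, and fallback are hypothetical.

        # Generic sketch of application-level resilience for a workflow task:
        # retry on transient failure, then fall back to an alternative task.
        # Illustrative only; this is not the platform's actual API.

        import time

        def resilient_task(primary, fallback=None, retries=3, delay_s=1.0):
            """Run `primary`; retry on exception, then try `fallback` if provided."""
            for attempt in range(1, retries + 1):
                try:
                    return primary()
                except Exception as exc:       # application-level exception handling
                    print(f"attempt {attempt} failed: {exc}")
                    time.sleep(delay_s)
            if fallback is not None:
                return fallback()              # degraded but successful completion
            raise RuntimeError("task failed after all retries and no fallback was given")

        # Example: a solver step with a coarser fallback model.
        result = resilient_task(lambda: 1.0 / 1.0, fallback=lambda: 0.0)
        print(result)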

    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell, and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g., unroll factor, vector length) using our performance tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3-182× for a Xilinx Virtex-5 LX330T, 1.3-33× for an IBM Cell, and 3-131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
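    The performance-tuner step above can be pictured with a small sketch of a configuration sweep. The code below is a hypothetical illustration, not the compiler's generated SPICE code: the stand-in kernel evaluate_models, the diode-like expression, and the swept block sizes are assumptions, and only one parameter (an effective vector length) is swept for brevity.

        # Hypothetical sketch of the kind of sweep a performance tuner performs:
        # time a data-parallel model-evaluation kernel under different settings
        # (here, an effective vector length) and keep the fastest configuration.
        # The kernel is a stand-in, not the compiler-generated SPICE code.

        import time
        import numpy as np

        def evaluate_models(v, block):
            """Stand-in device-model evaluation: the same nonlinear function is
            applied to every device's state, processed in chunks of `block`."""
            out = np.empty_like(v)
            for i in range(0, len(v), block):
                x = v[i:i + block]
                out[i:i + block] = np.exp(x / 0.025) - 1.0   # diode-like I(V) shape
            return out

        def autotune(n_devices=1 << 16):
            v = np.random.rand(n_devices).astype(np.float32)
            best = None
            for block in (64, 256, 1024, 4096):
                t0 = time.perf_counter()
                evaluate_models(v, block)
                dt = time.perf_counter() - t0
                if best is None or dt < best[1]:
                    best = (block, dt)
            return best

        print("best (block size, seconds):", autotune())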

    Revisiting Actor Programming in C++

    The actor model of computation has gained significant popularity over the last decade. Its high level of abstraction makes it appealing for concurrent applications in parallel and distributed systems. However, designing a real-world actor framework that subsumes full scalability, strong reliability, and high resource efficiency requires many conceptual and algorithmic additions to the original model. In this paper, we report on designing and building CAF, the "C++ Actor Framework". CAF aims at providing a concurrent and distributed native environment for scaling up to very large, high-performance applications, and equally well down to small constrained systems. We present the key specifications and design concepts, in particular a message-transparent architecture, type-safe message interfaces, and pattern-matching facilities, that make native actors a viable approach for many robust, elastic, and highly distributed developments. We demonstrate the feasibility of CAF in three scenarios: first for elastic, upscaling environments, second for including heterogeneous hardware like GPGPUs, and third for distributed runtime systems. Extensive performance evaluations indicate ideal runtime behaviour for up to 64 cores at very low memory footprint, or in the presence of GPUs. In these tests, CAF consistently outperforms the competing actor environments Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even OpenMPI.
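    For readers unfamiliar with the model, the sketch below illustrates the core idea of an actor, a private mailbox drained by a message-handling loop. It is deliberately not CAF's C++ API; the class name Actor, the send/reply plumbing, and the adder example are assumptions used only to show the message-passing style.

        # Minimal, generic illustration of the actor model: a mailbox plus a
        # message-handling loop. This is NOT CAF's C++ API; it only sketches the
        # message-passing style such frameworks are built around.

        import queue, threading

        class Actor:
            def __init__(self, behavior):
                self._mailbox = queue.Queue()
                self._behavior = behavior            # message -> optional reply
                threading.Thread(target=self._run, daemon=True).start()

            def send(self, msg, reply_to=None):
                self._mailbox.put((msg, reply_to))   # asynchronous, non-blocking send

            def _run(self):
                while True:
                    msg, reply_to = self._mailbox.get()
                    result = self._behavior(msg)
                    if reply_to is not None:
                        reply_to.put(result)

        # Example: an "adder" actor that matches on the shape of the message.
        def adder(msg):
            if isinstance(msg, tuple) and len(msg) == 2:
                return msg[0] + msg[1]
            return None

        a = Actor(adder)
        reply = queue.Queue()
        a.send((40, 2), reply_to=reply)
        print(reply.get())                            # prints 42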

    Two's Company, Three's a Cloud: Challenges to Implementing Service Models

    Although three models are currently in use in cloud computing (Software as a Service, Platform as a Service, and Infrastructure as a Service), many challenges remain before most businesses accept cloud computing as a reality. Virtualization in cloud computing has many advantages but carries a penalty because of state configurations, kernel drivers, and user interface environments. In addition, many non-standard architectures exist to power cloud models that are often incompatible. Another issue is adequately provisioning the resources required for a multi-tier cloud-based application in such a way that on-demand elasticity is present at vastly different scales yet is carried out efficiently. For networks that have large geographical footprints, another problem arises from bottlenecks between elements supporting virtual machines and their control. While many solutions have been proposed to alleviate these problems, some of which are already commercial, much remains to be done to determine whether these solutions will be practicable at scale and will address business concerns.

    Will SDN be part of 5G?

    For many, this is no longer a valid question and the case is considered settled, with SDN/NFV (Software Defined Networking/Network Function Virtualization) providing the inevitable innovation enablers solving many outstanding management issues regarding 5G. However, given the monumental task of softwarizing the radio access network (RAN) while 5G is just around the corner, and with some companies having already started unveiling their 5G equipment, there is a very realistic concern that we may see only point solutions involving SDN technology instead of a fully SDN-enabled RAN. This survey paper identifies all the important obstacles in the way and looks at the state of the art of the relevant solutions. This survey differs from previous surveys on SDN-based RAN in that it focuses on the salient problems and discusses solutions proposed within and outside the SDN literature. Our main focus is on fronthaul, backward compatibility, the supposedly disruptive nature of SDN deployment, business cases and monetization of SDN-related upgrades, latency of general-purpose processors (GPP), and the additional security vulnerabilities that softwarization brings to the RAN. We have also provided a summary of the architectural developments in the SDN-based RAN landscape, as not all work can be covered under the focused issues. This paper provides a comprehensive survey of the state of the art of SDN-based RAN and clearly points out the gaps in the technology.