103 research outputs found
MPSoCBench : um framework para avaliação de ferramentas e metodologias para sistemas multiprocessados em chip
Orientador: Rodolfo Jardim de AzevedoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Recentes metodologias e ferramentas de projetos de sistemas multiprocessados em chip (MPSoC) aumentam a produtividade por meio da utilização de plataformas baseadas em simuladores, antes de definir os últimos detalhes da arquitetura. No entanto, a simulação só é eficiente quando utiliza ferramentas de modelagem que suportem a descrição do comportamento do sistema em um elevado nÃvel de abstração. A escassez de plataformas virtuais de MPSoCs que integrem hardware e software escaláveis nos motivou a desenvolver o MPSoCBench, que consiste de um conjunto escalável de MPSoCs incluindo quatro modelos de processadores (PowerPC, MIPS, SPARC e ARM), organizado em plataformas com 1, 2, 4, 8, 16, 32 e 64 núcleos, cross-compiladores, IPs, interconexões, 17 aplicações paralelas e estimativa de consumo de energia para os principais componentes (processadores, roteadores, memória principal e caches). Uma importante demanda em projetos MPSoC é atender à s restrições de consumo de energia o mais cedo possÃvel. Considerando que o desempenho do processador está diretamente relacionado ao consumo, há um crescente interesse em explorar o trade-off entre consumo de energia e desempenho, tendo em conta o domÃnio da aplicação alvo. Técnicas de escalabilidade dinâmica de freqüência e voltagem fundamentam-se em gerenciar o nÃvel de tensão e frequência da CPU, permitindo que o sistema alcance apenas o desempenho suficiente para processar a carga de trabalho, reduzindo, consequentemente, o consumo de energia. Para explorar a eficiência energética e desempenho, foram adicionados recursos ao MPSoCBench, visando explorar escalabilidade dinâmica de voltaegem e frequência (DVFS) e foram validados três mecanismos com base na estimativa dinâmica de energia e taxa de uso de CPUAbstract: Recent design methodologies and tools aim at enhancing the design productivity by providing a software development platform before the definition of the final Multiprocessor System on Chip (MPSoC) architecture details. However, simulation can only be efficiently performed when using a modeling and simulation engine that supports system behavior description at a high abstraction level. The lack of MPSoC virtual platform prototyping integrating both scalable hardware and software in order to create and evaluate new methodologies and tools motivated us to develop the MPSoCBench, a scalable set of MPSoCs including four different ISAs (PowerPC, MIPS, SPARC, and ARM) organized in platforms with 1, 2, 4, 8, 16, 32, and 64 cores, cross-compilers, IPs, interconnections, 17 parallel version of software from well-known benchmarks, and power consumption estimation for main components (processors, routers, memory, and caches). An important demand in MPSoC designs is the addressing of energy consumption constraints as early as possible. Whereas processor performance comes with a high power cost, there is an increasing interest in exploring the trade-off between power and performance, taking into account the target application domain. Dynamic Voltage and Frequency Scaling techniques adaptively scale the voltage and frequency levels of the CPU allowing it to reach just enough performance to process the system workload while meeting throughput constraints, and thereby, reducing the energy consumption. To explore this wide design space for energy efficiency and performance, both for hardware and software components, we provided MPSoCBench features to explore dynamic voltage and frequency scalability (DVFS) and evaluated three mechanisms based on energy estimation and CPU usage rateDoutoradoCiência da ComputaçãoDoutora em Ciência da Computaçã
Fast Simulation of Programmable Network Forwarding Plane Devices
With the evolution of the Internet, the processing of packets at the routers while providing
flexibility in deploying new protocols and services at the same time has become a major concern.
Programmable forwarding elements with high processing capability have emerged as a solution.
But the main challenge is to find the optimal hardware architecture while taking into account
constraints such as different packet processing functions, task scheduling options, electrical
power consumption and providing quality-of-service (QoS) guarantees. Therefore, it is essential
to investigate methods that help in identifying limitations and bottlenecks before physical
fabrication. Having an appropriate model provides designers a progressive path to narrow the
design space and establish credible and feasible alternatives before deciding on an
implementation.
In this thesis, we propose a flexible and fast instruction accurate host-compiled simulator to
make it possible to explore wide ranges of architectures and application scenarios to find the
optimal configuration that meets given performance, throughput and latency for programmable
forwarding elements. Application developers can use the simulator as a virtual prototype to
simulate and debug their applications before hardware availability. Moreover, forwarding device
architects can use simulator to evaluate the trade-offs between different hardware/software
design decisions
SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads
In recent years, there has been tremendous advances in hardware acceleration
of deep neural networks. However, most of the research has focused on
optimizing accelerator microarchitecture for higher performance and energy
efficiency on a per-layer basis. We find that for overall single-batch
inference latency, the accelerator may only make up 25-40%, with the rest spent
on data movement and in the deep learning software framework. Thus far, it has
been very difficult to study end-to-end DNN performance during early stage
design (before RTL is available) because there are no existing DNN frameworks
that support end-to-end simulation with easy custom hardware accelerator
integration. To address this gap in research infrastructure, we present SMAUG,
the first DNN framework that is purpose-built for simulation of end-to-end deep
learning applications. SMAUG offers researchers a wide range of capabilities
for evaluating DNN workloads, from diverse network topologies to easy
accelerator modeling and SoC integration. To demonstrate the power and value of
SMAUG, we present case studies that show how we can optimize overall
performance and energy efficiency for up to 1.8-5x speedup over a baseline
system, without changing any part of the accelerator microarchitecture, as well
as show how SMAUG can tune an SoC for a camera-powered deep learning pipeline.Comment: 14 pages, 20 figure
SdrLift: A Domain-Specific Intermediate Hardware Synthesis Framework for Prototyping Software-Defined Radios
Modern design of Software-Defined Radio (SDR) applications is based on Field Programmable Gate Arrays (FPGA) due to their ability to be configured into solution architectures that are well suited to domain-specific problems while achieving the best trade-off between performance, power, area, and flexibility. FPGAs are well known for rich computational resources, which traditionally include logic, register, and routing resources. The increased technological advances have seen FPGAs incorporating more complex components that comprise sophisticated memory blocks, Digital Signal Processing (DSP) blocks, and high-speed interfacing to Gigabit Ethernet (GbE) and Peripheral Component Interconnect Express (PCIe) bus. Gateware for programming FPGAs is described at a lowlevel of design abstraction using Register Transfer Language (RTL), typically using either VHSIC-HDL (VHDL) or Verilog code. In practice, the low-level description languages have a very steep learning curve, provide low productivity for hardware designers and lack readily available open-source library support for fundamental designs, and consequently limit the design to only hardware experts. These limitations have led to the adoption of High-Level Synthesis (HLS) tools that raise design abstraction using syntax, semantics, and software development notations that are well-known to most software developers. However, while HLS has made programming of FPGAs more accessible and can increase the productivity of design, they are still not widely adopted in the design community due to the low-level skills that are still required to produce efficient designs. Additionally, the resultant RTL code from HLS tools is often difficult to decipher, modify and optimize due to the functionality and micro-architecture that are coupled together in a single High-Level Language (HLL). In order to alleviate these problems, Domain-Specific Languages (DSL) have been introduced to capture algorithms at a high level of abstraction with more expressive power and providing domain-specific optimizations that factor in new transformations and the trade-off between resource utilization and system performance. The problem of existing DSLs is that they are designed around imperative languages with an instruction sequence that does not match the hardware structure and intrinsics, leading to hardware designs with system properties that are unconformable to the high-level specifications and constraints. The aim of this thesis is, therefore, to design and implement an intermediatelevel framework namely SdrLift for use in high-level rapid prototyping of SDR applications that are based on an FPGA. The SdrLift input is a HLL developed using functional language constructs and design patterns that specify the structural behavior of the application design. The functionality of the SdrLift language is two-fold, first, it can be used directly by a designer to develop the SDR applications, secondly, it can be used as the Intermediate Representation (IR) step that is generated by a higher-level language or a DSL. The SdrLift compiler uses the dataflow graph as an IR to structurally represent the accelerator micro-architecture in which the components correspond to the fine-level and coarse-level Hardware blocks (HW Block) which are either auto-synthesized or integrated from existing reusable Intellectual Property (IP) core libraries. Another IR is in the form of a dataflow model and it is used for composition and global interconnection of the HW Blocks while making efficient interfacing decisions in an attempt to satisfy speed and resource usage objectives. Moreover, the dataflow model provides rules and properties that will be used to provide a theoretical framework that formally analyzes the characteristics of SDR applications (i.e. the throughput, sample rate, latency, and buffer size among other factors). Using both the directed graph flow (DFG) and the dataflow model in the SdrLift compiler provides two benefits: an abstraction of the microarchitecture from the high-level algorithm specifications and also decoupling of the microarchitecture from the low-level RTL implementation. Following the IR creation and model analyses is the VHDL code generation which employs the low-level optimizations that ensure optimal hardware design results. The code generation process per forms analysis to ensure the resultant hardware system conforms to the high-level design specifications and constraints. SdrLift is evaluated by developing representative SDR case studies, in which the VHDL code for eight different SDR applications is generated. The experimental results show that SdrLift achieves the desired performance and flexibility, while also conserving the hardware resources utilized
Domain Specific Language for Magnetic Measurements at CERN
CERN, the European Organization for Nuclear Research, is one of the world’s largest and most respected centres for scientific research. Founded in 1954, the CERN Laboratory sits astride the Franco–Swiss border near Geneva. It was one of Europe’s first joint ventures and now has 20 Member States. Its main purpose is fundamental research in partcle physics, namely investigating what the Universe is made of and how it works. At CERN, the design and realization of the new particle accelerator, the Large Hadron Collider (LHC), has required a remarkable technological effort in many areas of engineering. In particular, the tests of LHC superconducting magnets disclosed new horizons to magnetic measurements. At CERN, the objectively large R&D effort of the Technolgy Department/Magnets, Superconductors and Cryostats (TE/MSC) group identified areas where further work is required in order to assist the LHC commissioning and start-up, to provide continuity in the instrumentation for the LHC magnets maintenance, and to achieve more accurate magnet models for the LHC exploitation. In view of future projects, a wide range of software requirements has been recently satisfied by the Flexible Framework for Magnetic Measurements (FFMM), designed also for integrating more performing flexible hardware. FFMM software applications control several devices, such as encoder boards, digital integrators, motor controllers, transducers. In addition, they synchronize and coordinate different measurement tasks and actions
Recommended from our members
Scalable System-on-Chip Design
The crisis of technology scaling led the industry of semiconductors towards the adoption of disruptive technologies and innovations to sustain the evolution of microprocessors and keep under control the timing of the design cycle. Multi-core and many-core architectures sought more energy-efficient computation by replacing a power-hungry processor with multiple simpler cores exploiting parallelism. Multi-core processors alone, however, turned out to be insufficient to sustain the ever growing demand for energy and power-efficient computation without compromising performance. Therefore, designers were pushed to drift from homogeneous architectures towards more complex heterogeneous systems that employ the large number of available transistors to incorporate a combination of customized energy-efficient accelerators, along with the general-purpose processor cores. Meanwhile, enhancements in manufacturing processes allowed designers to move a variety of peripheral components and analog devices into the chip. This paradigm shift defined the concept of {\em system-on-chip} (SoC) as a single-chip design that integrates several heterogeneous components. The rise of SoCs corresponds to a rapid decrease of the opportunity cost for integrating accelerators. In fact, on one hand, employing more transistors for powerful cores is not feasible anymore, because transistors cannot be active all at once within reasonable power budgets. On the other hand, increasing the number of homogeneous cores incurs more and more diminishing returns. The availability of cost effective silicon area for specialized hardware creates an opportunity to enter the market of semiconductors for new small players: engineers from several different scientific areas can develop competitive algorithms suitable for acceleration for domain-specific applications, such as multimedia systems, self-driving vehicles, robotics, and more. However, turning these algorithms into SoC components, referred to as {\em intellectual property}, still requires expert hardware designers who are typically not familiar with the specific domain of the target application. Furthermore, heterogeneity makes SoC design and programming much more difficult, especially because of the challenges of the integration process. This is a fine art in the hands of few expert engineers who understand system-level trade-offs, know how to design good hardware, how to handle memory and power management, how to shape and balance the traffic over an interconnect, and are able to deal with many different hardware-software interfaces. Designers need solutions enabling them to build scalable and heterogeneous SoCs. My thesis is that {\em the key to scalable SoC designs is a regular and flexible architecture that hides the complexity of heterogeneous integration from designers, while helping them focus on the important aspects of domain-specific applications through a companion system-level design methodology.} I open a path towards this goal by proposing an architecture that mitigates heterogeneity with regularity and addresses the challenges of heterogeneous component integration by implementing a set of {\em platform services}. These are hardware and software interfaces that from a system-level viewpoint give the illusion of working with a homogeneous SoC, thus making it easier to reuse accelerators and port applications across different designs, each with its own target workload and cost-performance trade-off point. A companion system-level design methodology exploits the regularity of the architecture to guide designers in implementing their intellectual property and enables an extensive design-space exploration across multiple levels of abstraction. Throughout the dissertation, I present a fully automated flow to deploy heterogeneous SoCs on single or multiple field-programmable-gate-array devices. The flow provides non-expert designers with a set of knobs for tuning system-level features based on the given mix of accelerators that they have integrated. Many contributions of my dissertation have already influenced other research projects as well as the content of an advanced course for graduate and senior undergraduate students, which aims to form a new generation of system-level designers. These new professionals need not to be circuit or register-transfer level design experts, and not even gurus of operating systems. Instead, they are trained to design efficient intellectual property by considering system-level trade-offs, while the architecture and the methodology that I describe in this dissertation empower them to integrate their components into an SoC. Finally, with the open-source release of the entire infrastructure, including the SoC-deployment flow and the software stack, I hope I will be able to inspire other research groups and help them implement ideas that further reduce the cost and design-time of future heterogeneous systems
Recommended from our members
Logical partitioning of parallel system simulations
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate
new ideas to continue improving system performance. However, increasing levels
of processor parallelism and heterogeneity have introduced additional
constraints when evaluating new designs. The work embodied in this dissertation
explores how to leverage novel ideas in simulator partitioning to improve
simulator speed and flexibility for simulating these new types of systems.
The contribution of this work includes the introduction of optimistic
partitioned simulation to improve parallelization, and the introduction of
warped partitioned simulation for improved flexibility. These ideas are refined
and demonstrated through the use of prototypes to demonstrate their benefits
compared to state-of-the-art approaches. By leveraging partitioning in a
structured manner, it is possible to design simulators that better address the
open challenges of parallel and heterogeneous systems design.Electrical and Computer Engineerin
Recommended from our members
Performance Debugging Frameworks for FPGA High-Level Synthesis
Using high-level synthesis (HLS) tools for field-programmable gate array (FPGA) design is becoming an increasingly popular choice because HLS tools can generate a high-quality design in a short development time. However, current HLS tools still cannot adequately support users in understanding and fixing the performance issues of the current design. That is, current HLS tools lack in performance debugging capability. Previous work on performance debugging automates the process of inserting hardware monitors in low-level register-transfer level (RTL) languages which limits the comprehensibility of the obtained result. Instead, our HLS-based flows offer analysis on a function or loop level and provide more intuitive feedback that can be used to pinpoint the performance bottleneck of a design. In this dissertation, we present a collection of HLS-based debugging frameworks for various purposes and characteristics of the design. First, we address the problem in the HLS synthesis step, where an inaccurate cycle estimation is provided if the program has input-dependent behavior. We propose a new performance estimator that automatically instruments code that models the hardware execution behavior and interprets the information from the HLS software simulation. However, the performance estimation result of this flow may not be accurate for a type of designs that cannot be simulated correctly by existing HLS software simulators. To handle such cases, we propose a new software simulator that provides cycle-accurate result based on the HLS scheduling information. If the input dataset is not available for software simulation or high-level models do not exist for all components of the FPGA design, we also present an on-board monitoring flow for automated cycle extraction and stall analysis. Finally, we address the needs of HLS programmers to automatically find the best set of directives for FPGA designs. We propose a design space exploration (DSE) framework to optimize applications with variable loop bounds in Polybench benchmark. A quantitative comparison among the proposed frameworks is shown using the sparse matrix-vector multiplication benchmark
Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland
Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015
- …