133 research outputs found
Integrated shared-memory and message-passing communication in the Alewife multiprocessor
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 237-246) and index.by John David Kubiatowicz.Ph.D
Recommended from our members
Design and Optimization of Networks-on-Chip for Future Heterogeneous Systems-on-Chip
Due to the tight power budget and reduced time-to-market, Systems-on-Chip (SoC) have emerged as a power-efficient solution that provides the functionality required by target applications in embedded systems. To support a diverse set of applications such as real-time video/audio processing and sensor signal processing, SoCs consist of multiple heterogeneous components, such as software processors, digital signal processors, and application-specific hardware accelerators. These components offer different flexibility, power, and performance values so that SoCs can be designed by mix-and-matching them.
With the increased amount of heterogeneous cores, however, the traditional interconnects in an SoC exhibit excessive power dissipation and poor performance scalability. As an alternative, Networks-on-Chip (NoC) have been proposed. NoCs provide modularity at design-time because
communications among the cores are isolated from their computations via standard interfaces. NoCs also exploit communication parallelism at run-time because multiple data can be transferred simultaneously.
In order to construct an efficient NoC, the communication behaviors of various heterogeneous components in an SoC must be considered with the large amount of NoC design parameters. Therefore, providing an efficient NoC design and optimization framework is critical to reduce the design
cycle and address the complexity of future heterogeneous SoCs. This is the thesis of my dissertation.
Some existing design automation tools for NoCs support very limited degrees of automation that cannot satisfy the requirements of future heterogeneous SoCs. First, these tools only support a limited number of NoC design parameters. Second, they do not provide an integrated environment for software-hardware co-development.
Thus, I propose FINDNOC, an integrated framework for the generation, optimization, and validation of NoCs for future heterogeneous SoCs. The proposed framework supports software-hardware co-development, incremental NoC design-decision model, SystemC-based NoC customization and generation, and fast system protyping with FPGA emulations.
Virtual channels (VC) and multiple physical (MP) networks are the two main alternative methods to provide better performance, support quality-of-service, and avoid protocol deadlocks in packet-switched NoC design. To examine the effect of using VCs and MPs with other NoC architectural
parameters, I completed a comprehensive comparative analysis that combines an analytical model, synthesis-based designs for both FPGAs and standard-cell libraries, and system-level simulations.
Based on the results of this analysis, I developed VENTTI, a design and simulation environment that combines a virtual platform (VP), a NoC synthesis tool, and four NoC models characterized at different abstraction levels. VENTTI facilitates an incremental decision-making process with four
NoC abstraction models associated with different NoC parameters. The selected NoC parameters can be validated by running simulations with the corresponding model instantiated in the VP.
I augmented this framework to complete FINDNOC by implementing ICON, a NoC generation and customization tool that dynamically combines and customizes synthesizable SystemC components from a predesigned library. Thanks to its flexibility and automatic network interface generation
capabilities, ICON can generate a rich variety of NoCs that can be then integrated into any Embedded Scalable Platform (ESP) architectures for fast prototying with FPGA emulations.
I designed FINDNOC in a modular way that makes it easy to augmenting it with new capabilities. This, combined with the continuous progress of the ESP design methodology, will provide a seamless SoC integration framework, where the hardware accelerators, software applications, and
NoCs can be designed, validated, and integrated simultaneously, in order to reduce the design cycle of future SoC platforms
Off-chip Communications Architectures For High Throughput Network Processors
In this work, we present off-chip communications architectures for line cards to increase the throughput of the currently used memory system. In recent years there is a significant increase in memory bandwidth demand on line cards as a result of higher line rates, an increase in deep packet inspection operations and an unstoppable expansion in lookup tables. As line-rate data and NPU processing power increase, memory access time becomes the main system bottleneck during data store/retrieve operations. The growing demand for memory bandwidth contrasts the notion of indirect interconnect methodologies. Moreover, solutions to the memory bandwidth bottleneck are limited by physical constraints such as area and NPU I/O pins. Therefore, indirect interconnects are replaced with direct, packet-based networks such as mesh, torus or k-ary n-cubes. We investigate multiple k-ary n-cube based interconnects and propose two variations of 2-ary 3-cube interconnect called the 3D-bus and 3D-mesh. All of the k-ary n-cube interconnects include multiple, highly efficient techniques to route, switch, and control packet flows in order to minimize congestion spots and packet loss. We explore the tradeoffs between implementation constraints and performance. We also developed an event-driven, interconnect simulation framework to evaluate the performance of packet-based off-chip k-ary n-cube interconnect architectures for line cards. The simulator uses the state-of-the-art software design techniques to provide the user with a flexible yet robust tool, that can emulate multiple interconnect architectures under non-uniform traffic patterns. Moreover, the simulator offers the user with full control over network parameters, performance enhancing features and simulation time frames that make the platform as identical as possible to the real line card physical and functional properties. By using our network simulator, we reveal the best processor-memory configuration, out of multiple configurations, that achieves optimal performance. Moreover, we explore how network enhancement techniques such as virtual channels and sub-channeling improve network latency and throughput. Our performance results show that k-ary n-cube topologies, and especially our modified version of 2-ary 3-cube interconnect - the 3D-mesh, significantly outperform existing line card interconnects and are able to sustain higher traffic loads. The flow control mechanism proved to extensively reduce hot-spots, load-balance areas of high traffic rate and achieve low transmission failure rate. Moreover, it can scale to adopt more memories and/or processors and as a result to increase the line card\u27s processing power
Embedded dynamic programming networks for networks-on-chip
PhD ThesisRelentless technology downscaling and recent technological advancements
in three dimensional integrated circuit (3D-IC) provide a promising
prospect to realize heterogeneous system-on-chip (SoC) and homogeneous
chip multiprocessor (CMP) based on the networks-onchip
(NoCs) paradigm with augmented scalability, modularity and
performance. In many cases in such systems, scheduling and managing
communication resources are the major design and implementation
challenges instead of the computing resources. Past research
efforts were mainly focused on complex design-time or simple heuristic
run-time approaches to deal with the on-chip network resource
management with only local or partial information about the network.
This could yield poor communication resource utilizations and amortize
the benefits of the emerging technologies and design methods.
Thus, the provision for efficient run-time resource management in
large-scale on-chip systems becomes critical. This thesis proposes a
design methodology for a novel run-time resource management infrastructure
that can be realized efficiently using a distributed architecture,
which closely couples with the distributed NoC infrastructure. The
proposed infrastructure exploits the global information and status
of the network to optimize and manage the on-chip communication
resources at run-time.
There are four major contributions in this thesis. First, it presents a
novel deadlock detection method that utilizes run-time transitive closure
(TC) computation to discover the existence of deadlock-equivalence
sets, which imply loops of requests in NoCs. This detection scheme,
TC-network, guarantees the discovery of all true-deadlocks without
false alarms in contrast to state-of-the-art approximation and heuristic
approaches. Second, it investigates the advantages of implementing
future on-chip systems using three dimensional (3D) integration and
presents the design, fabrication and testing results of a TC-network
implemented in a fully stacked three-layer 3D architecture using a
through-silicon via (TSV) complementary metal-oxide semiconductor
(CMOS) technology. Testing results demonstrate the effectiveness
of such a TC-network for deadlock detection with minimal computational
delay in a large-scale network. Third, it introduces an adaptive
strategy to effectively diffuse heat throughout the three dimensional
network-on-chip (3D-NoC) geometry. This strategy employs a dynamic
programming technique to select and optimize the direction of data
manoeuvre in NoC. It leads to a tool, which is based on the accurate
HotSpot thermal model and SystemC cycle accurate model, to simulate
the thermal system and evaluate the proposed approach. Fourth, it
presents a new dynamic programming-based run-time thermal management
(DPRTM) system, including reactive and proactive schemes, to
effectively diffuse heat throughout NoC-based CMPs by routing packets
through the coolest paths, when the temperature does not exceed
chip’s thermal limit. When the thermal limit is exceeded, throttling is
employed to mitigate heat in the chip and DPRTM changes its course
to avoid throttled paths and to minimize the impact of throttling on
chip performance.
This thesis enables a new avenue to explore a novel run-time resource
management infrastructure for NoCs, in which new methodologies
and concepts are proposed to enhance the on-chip networks for
future large-scale 3D integration.Iraqi Ministry of Higher Education and Scientific Research (MOHESR)
Cloud-Based Multi-Agent Cooperation for IoT Devices Using Workflow-Nets
Most Internet of Things (IoT)-based service requests require excessive computation which exceeds an IoT device's capabilities. Cloud-based solutions were introduced to outsource most of the computation to the data center. The integration of multi-agent IoT systems with cloud computing technology makes it possible to provide faster, more efficient and real-time solutions. Multi-agent cooperation for distributed systems such as fog-based cloud computing has gained popularity in contemporary research areas such as service composition and IoT robotic systems. Enhanced cloud computing performance gains and fog site load distribution are direct achievements of such cooperation. In this article, we pro- pose a work ow-net based framework for agent cooperation to enable collaboration among fog computing devices and form a cooperative IoT service delivery system. A cooperation operator is used to find the topology and structure of the resulting cooperative set of fog computing agents. The operator shifts the problem defined as a set of work ow-nets into algebraic representations to provide a mechanism for solving the optimization problem mathematically. IoT device resource and collaboration capabilities are properties which are considered in the selection process of the cooperating IoT agents from di_erent fog computing sites. Experimental results in the form of simulation and implementation show that the cooperation process increases the number of achieved tasks and is performed in a timely manner
OSCAR. A Noise Injection Framework for Testing Concurrent Software
“Moore’s Law” is a well-known observable phenomenon in computer science that describes a
visible yearly pattern in processor’s die increase. Even though it has held true for the last 57
years, thermal limitations on how much a processor’s core frequencies can be increased, have
led to physical limitations to their performance scaling. The industry has since then shifted
towards multicore architectures, which offer much better and scalable performance, while in
turn forcing programmers to adopt the concurrent programming paradigm when designing new
software, if they wish to make use of this added performance. The use of this paradigm comes
with the unfortunate downside of the sudden appearance of a plethora of additional errors in
their programs, stemming directly from their (poor) use of concurrency techniques.
Furthermore, these concurrent programs themselves are notoriously hard to design and to
verify their correctness, with researchers continuously developing new, more effective and effi-
cient methods of doing so. Noise injection, the theme of this dissertation, is one such method. It
relies on the “probe effect” — the observable shift in the behaviour of concurrent programs upon
the introduction of noise into their routines. The abandonment of ConTest, a popular proprietary
and closed-source noise injection framework, for testing concurrent software written using the
Java programming language, has left a void in the availability of noise injection frameworks for
this programming language.
To mitigate this void, this dissertation proposes OSCAR — a novel open-source noise injection
framework for the Java programming language, relying on static bytecode instrumentation for
injecting noise. OSCAR will provide a free and well-documented noise injection tool for research,
pedagogical and industry usage. Additionally, we propose a novel taxonomy for categorizing new
and existing noise injection heuristics, together with a new method for generating and analysing
concurrent software traces, based on string comparison metrics.
After noising programs from the IBM Concurrent Benchmark with different heuristics, we
observed that OSCAR is highly effective in increasing the coverage of the interleaving space, and
that the different heuristics provide diverse trade-offs on the cost and benefit (time/coverage) of
the noise injection process.Resumo
A “Lei de Moore” é um fenómeno, bem conhecido na área das ciências da computação, que
descreve um padrão evidente no aumento anual da densidade de transístores num processador.
Mesmo mantendo-se válido nos últimos 57 anos, o aumento do desempenho dos processadores
continua garrotado pelas limitações térmicas inerentes `a subida da sua frequência de funciona-
mento. Desde então, a industria transitou para arquiteturas multi núcleo, com significativamente
melhor e mais escalável desempenho, mas obrigando os programadores a adotar o paradigma
de programação concorrente ao desenhar os seus novos programas, para poderem aproveitar o
desempenho adicional que advém do seu uso. O uso deste paradigma, no entanto, traz consigo,
por consequência, a introdução de uma panóplia de novos erros nos programas, decorrentes
diretamente da utilização (inadequada) de técnicas de programação concorrente.
Adicionalmente, estes programas concorrentes são conhecidos por serem consideravelmente
mais difíceis de desenhar e de validar, quanto ao seu correto funcionamento, incentivando investi-
gadores ao desenvolvimento de novos métodos mais eficientes e eficazes de o fazerem. A injeção
de ruído, o tema principal desta dissertação, é um destes métodos. Esta baseia-se no “efeito sonda”
(do inglês “probe effect”) — caracterizado por uma mudança de comportamento observável em
programas concorrentes, ao terem ruído introduzido nas suas rotinas. Com o abandono do Con-
Test, uma framework popular, proprietária e de código fechado, de análise dinâmica de programas
concorrentes através de injecção de ruído, escritos com recurso `a linguagem de programação Java,
viu-se surgir um vazio na oferta de framework de injeção de ruído, para esta mesma linguagem.
Para mitigar este vazio, esta dissertação propõe o OSCAR — uma nova framework de injeção de
ruído, de código-aberto, para a linguagem de programação Java, que utiliza manipulação estática
de bytecode para realizar a introdução de ruído. O OSCAR pretende oferecer uma ferramenta
livre e bem documentada de injeção de ruído para fins de investigação, pedagógicos ou até para
a indústria. Adicionalmente, a dissertação propõe uma nova taxonomia para categorizar os dife-
rentes tipos de heurísticas de injecção de ruídos novos e existentes, juntamente com um método
para gerar e analisar traces de programas concorrentes, com base em métricas de comparação de
strings.
Após inserir ruído em programas do IBM Concurrent Benchmark, com diversas heurísticas, ob-
servámos que o OSCAR consegue aumentar significativamente a dimensão da cobertura do espaço de estados de programas concorrentes. Adicionalmente, verificou-se que diferentes heurísticas
produzem um leque variado de prós e contras, especialmente em termos de eficácia versus
eficiência
A semantically aware transactional concurrency control for GPGPU computing
PhD ThesisThe increased parallel nature of the GPU a ords an opportunity for
the exploration of multi-thread computing. With the availability of
GPU has recently expanded into the area of general purpose program-
ming, a concurrency control is required to exploit parallelism as well
as maintaining the correctness of program. Transactional memory,
which is a generalised solution for concurrent con
ict, meanwhile allow
application programmers to develop concurrent code in a more intu-
itive manner, is superior to pessimistic concurrency control for general
use. The most important component in any transactional memory
technique is the policy to solve con
icts on shared data, namely the
contention management policy.
The work presented in this thesis aims to develop a software trans-
actional memory model which can solve both concurrent con
ict and
semantic con
ict at the same time for the GPU. The technique di ers
from existing CPU approaches on account of the di erent essential ex-
ecution paths and hardware basis, plus much more parallel resources
are available on the GPU. We demonstrate that both concurrent con-
icts and semantic con
icts can be resolved in a particular contention
management policy for the GPU, with a di erent application of locks
and priorities from the CPU.
The basic problem and a software transactional memory solution idea
is proposed rst. An implementation is then presented based on the
execution mode of this model. After that, we extend this system to re-
solve semantic con
ict at the same time. Results are provided nally,
which compare the performance of our solution with an established
GPU software transactional memory and a famous CPU transactional
memory, at varying levels of concurrent and semantic con
icts
A Scalable and Adaptive Network on Chip for Many-Core Architectures
In this work, a scalable network on chip (NoC) for future many-core architectures is proposed and investigated. It supports different QoS mechanisms to ensure predictable communication. Self-optimization is introduced to adapt the energy footprint and the performance of the network to the communication requirements. A fault tolerance concept allows to deal with permanent errors. Moreover, a template-based automated evaluation and design methodology and a synthesis flow for NoCs is introduced
On Fault Tolerance Methods for Networks-on-Chip
Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit.
This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels.
The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model.
The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated.
At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levelsSiirretty Doriast
- …