Polymorphic computing abstraction for heterogeneous architectures
Integrating multiple computing paradigms onto a system on chip (SoC) has pushed the boundaries of design space exploration for hardware architectures and the computing system software stack. The heterogeneity of computing styles in SoCs has created a new class of architectures referred to as heterogeneous architectures. Novel applications developed to exploit the different computing styles are user centric for embedded SoCs. Software and hardware designers face several challenges in harnessing the full potential of heterogeneous architectures. Applications have to execute on more than one compute style to increase overall SoC resource utilization. The implication of such an abstraction is that application threads need to be polymorphic. The operating system layer is thus faced with the problem of scheduling polymorphic threads. Resource allocation is another important problem for the OS to deal with. Morphism evolution of application threads is constrained by the availability of heterogeneous computing resources. Traditional design optimization goals such as computational power and lower energy per computation are inadequate to satisfy user-centric application resource needs. Resource allocation decisions at the application layer need to permeate to the architectural layer to avoid conflicting demands, which may affect the energy-delay characteristics of application threads. We propose the polymorphic computing abstraction as a unified computing model for heterogeneous architectures to address the above issues. A simulation environment for polymorphic applications is developed and evaluated under various scheduling strategies to determine the effectiveness of the polymorphism abstraction on resource allocation. A user satisfaction model is also developed to complement polymorphism and is used to optimize resource utilization at the application and network layers of embedded systems.
A time-predictable many-core processor design for critical real-time embedded systems
Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded systems, e.g. energy-harvesting solar panels in satellites, steering and braking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response to its inputs; the latter ensures that its operations are performed within a predefined time budget.
CRTES aim at increasing the number and complexity of functions. Examples include the incorporation of "smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to an exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality, and to integrate multiple systems into a single platform.
The use of multi- and many-core processor architectures is increasingly seen in the CRTES industry as the solution to cope with the performance demand and cost constraints of future CRTES. Many-cores supply higher performance by exploiting the parallelism of applications while providing a better performance per watt as cores are maintained simpler with respect to complex single-core processors. Moreover, the parallelization capabilities allow scheduling multiple functions into the same processor, maximizing the hardware utilization.
However, the use of multi- and many-cores in CRTES also brings a number of challenges related to providing evidence of the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are nowadays a reality in the embedded domain (e.g. Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient ways of providing reliable evidence of the correct operation of the system.
This thesis investigates the use of many-core processors in CRTES as a means to satisfy the performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis advances the state of the art towards the exploitation of the parallel capabilities of many-cores in CRTES, contributing in two different computing domains. In the hardware domain, this thesis proposes new many-core designs that enable deriving reliable and tight timing guarantees. In the software domain, it presents efficient scheduling and timing analysis techniques to exploit the parallelization capabilities of many-core architectures and to derive tight and trustworthy Worst-Case Execution Time (WCET) estimates of CRTES.
Run-time management for future MPSoC platforms
In recent years, we have been witnessing the dawning of the Multi-Processor System-on-Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications, while reducing the overall cost of embedded (handheld) devices. This cost will mainly be determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform will partly depend on its production volume. In turn, this means that flexible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires flexibility, but should also combine high performance with low power consumption. To this end, MPSoC devices integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure. This helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior. The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with a number of services or experiences. So from an application viewpoint, MPSoCs are designed to efficiently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a significant amount of memory. To keep up with ever-evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be easily programmable. Application scalability, i.e. 
the ability to use just enough platform resources according to the user requirements and with respect to the device capabilities, is also an important factor. Hence scalability, flexibility, real-time behavior, high performance, low power consumption and, finally, programmability are key components in realizing the success of MPSoC platforms. The run-time manager is logically located between the application layer and the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time, based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to flexibility, scalability and low power operation. As it has an arbiter function between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services. First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a definition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion on how these trade-offs affect the key MPSoC requirements. This design space definition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions. Consequently, we introduce a fast and efficient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. 
In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores, denoted as hierarchical configuration. Hierarchical configuration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom. In turn, this improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment supported mechanism that enables moving tasks between an ISP and fine-grained reconfigurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors. We propose a reactive on-chip communication management mechanism. We show that by exploiting an injection rate control mechanism it is possible to provide a communication management system capable of providing a soft (reactive) QoS in a NoC. We introduce a novel, platform-independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a realistic, reproducible and flexible run-time manager testbench with respect to applications with multiple quality levels and implementation trade-offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). 
The PSFF tool is, to the best of our knowledge, the first tool that generates multiple realistic application operating points either based on profiling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate in real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems.
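To illustrate the quality-management step described in this abstract, the following minimal sketch picks, from a set of application operating points, the highest-quality point whose resource demand fits what the resource manager reports as available. The `OperatingPoint` fields, the single scalar resource dimension, and the `select_operating_point` function are assumptions made purely for illustration; the abstract does not specify the thesis's actual algorithm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical operating point: a quality level and its resource cost."""
    quality: float   # higher is better (assumed scalar quality score)
    resources: int   # abstract resource units required (assumed single dimension)


def select_operating_point(
    points: Sequence[OperatingPoint],
    available: int,
    min_quality: float = 0.0,
) -> Optional[OperatingPoint]:
    """Return the highest-quality point that fits in `available` resources.

    Returns None when no point satisfies both the resource limit and the
    user's minimum quality requirement.
    """
    feasible = [
        p for p in points
        if p.resources <= available and p.quality >= min_quality
    ]
    if not feasible:
        return None
    # Prefer higher quality; break ties by lower resource usage.
    return max(feasible, key=lambda p: (p.quality, -p.resources))
```

A real run-time manager would likely consider several resource types and several competing applications at once; this sketch only shows the single-application selection idea.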
Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high-throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve the memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve the cache efficiency and the memory throughput. This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads to share the cache and the minimum number of threads bypassing the cache that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline. This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks based on the scheduling priority of their fetching warp at replacement. Dynamic-AgeLRU selects the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization. 
Compared to the LRU algorithm, the above-mentioned three variants of the AgeLRU algorithm enable increases in performance of 4%, 8% and 28% respectively across a set of cache-sensitive benchmarks. This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority together with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize the cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that the RPR algorithm results in a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks. The techniques proposed in this dissertation are able to alleviate the cache thrashing, cache pollution and resource saturation problems effectively. We believe when these techniques are combined, they will synergistically further improve GPU cache efficiency and the overall memory system throughput.
Computer Science
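The replacement-only idea behind AgeLRU, as summarized in this abstract, can be sketched roughly as follows: at eviction time, prefer to evict the block whose fetching warp has the lowest scheduling priority, and fall back to LRU order among equals. The `CacheLine` fields, the convention that a larger `warp_priority` means a higher scheduling priority, and the `choose_victim` function are all assumptions for illustration, not the dissertation's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheLine:
    tag: int
    warp_priority: int  # assumed: larger value = fetching warp has higher scheduling priority
    last_use: int       # logical timestamp of the most recent access


def choose_victim(cache_set: List[CacheLine]) -> int:
    """Return the index of the line to evict from one cache set.

    Evicts the line fetched by the lowest-priority warp; among lines with
    equal priority, falls back to the least recently used one.
    """
    if not cache_set:
        raise ValueError("cache set is empty")
    return min(
        range(len(cache_set)),
        key=lambda i: (cache_set[i].warp_priority, cache_set[i].last_use),
    )
```

The bypassing variants and the Dynamic-AgeLRU selection between AgeLRU and plain LRU mentioned in the abstract are not modeled here.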
Fast cycle-approximate simulation techniques for many-core NoC architectures
Doctoral thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. 하순회.
Simulation is a software technique that uses currently available architectures to prototype a future architecture. In computer architecture research, simulation is one of the most important techniques. Simulation techniques enable us to obtain important performance indicators of new architectures and to perform design space exploration using these metrics. Furthermore, a simulator enables rapid software development and optimization on an architecture that does not yet exist. Despite various known problems, such as slow speed and coverage issues, the reliance on simulation technology in computer architecture research continues to increase.
As transistor density increases and the performance improvement of the single core hits the ceiling, newly constructed architectures usually consist of multiple or many cores connected by a network-on-chip, which enables scalable communication. In addition, the implementation of applications themselves has become more complicated in order to utilize these parallel architectures effectively. Thus, simulators for parallel architectures and parallel applications have become extremely complex, and existing sequential simulators can no longer simulate these systems in a realistic amount of time.
While many parallel simulation techniques are being developed to solve these problems, they suffer from poor simulation performance or accuracy. In this thesis, we propose and evaluate a novel many-core simulation technique that can obtain the best simulation performance at the cost of minimum simulation error.
The proposed parallel many-core simulator is divided into three parts: 1) core simulator, 2) network-on-chip simulator, and 3) simulation backplane. Each core is executed by a core simulator, which communicates with the external simulation backplane via the Interprocess Communication (IPC). Each core simulation is performed individually in a separate host processor. The simulation backplane arranges messages from each core into chronological order, passes them to destination modules, and simulates hardware components other than cores. If the simulation backplane generates a request requiring NoC communication, this request is forwarded to the network simulator and is simulated at the most accurate accuracy level.
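The backplane's job of arranging messages from the cores into chronological order can be pictured with a small priority-queue sketch. The `Backplane` class, its `post` and `drain` methods, and the `safe_time` bound (the latest simulated time up to which it is assumed that no earlier message can still arrive) are hypothetical names introduced for illustration; the thesis's actual backplane is built on SystemC and is not reproduced here.

```python
import heapq
import itertools
from typing import Any, Iterator, List, Tuple


class Backplane:
    """Toy message queue that releases core messages in timestamp order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, int, Any]] = []
        self._seq = itertools.count()  # tie-breaker keeps insertion order stable

    def post(self, timestamp: int, core_id: int, payload: Any) -> None:
        """Queue a message sent by `core_id` at simulated time `timestamp`."""
        heapq.heappush(self._heap, (timestamp, next(self._seq), core_id, payload))

    def drain(self, safe_time: int) -> Iterator[Tuple[int, int, Any]]:
        """Yield (timestamp, core_id, payload) for messages up to `safe_time`."""
        while self._heap and self._heap[0][0] <= safe_time:
            timestamp, _, core_id, payload = heapq.heappop(self._heap)
            yield timestamp, core_id, payload
```

Releasing only messages at or before `safe_time` is one simple way to avoid delivering a message before an earlier one from a slower core simulator has arrived.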
In this thesis, we propose a novel core simulation model that combines analytical and sampled simulation. The core simulator achieves 11.36 to 44.31 MIPS, while the simulation error is approximately 8 percent. The standalone core simulator is released as open source.
We confirmed that NoC simulation has a great effect on the reliability of outputs generated from many-core simulation. First, existing flit-level NoC simulators were analyzed at the source-code level. Based on the observations, various implementations were evaluated and various software optimizations were applied to improve the network simulation performance. The proposed NoC simulator achieves more than 100 KCycles/s unless the packet injection rate exceeds 0.00625, which is at least twice as fast as state-of-the-art NoC simulators.
The speed of the simulation backplane depends greatly on the IPC overhead and SystemC scheduling overhead. To reduce the IPC overhead, the trace-driven co-simulation technique is used, faster IPC is introduced, and the segmented L1 data cache is embedded in a core simulator. In addition, to reduce SystemC scheduling overhead, it is important to reduce the number of modules that are simultaneously awakened. To this end, slave modules are redesigned to be activated only based on an event. A new scheduler parallelization technique is also studied. Although the newly developed SystemC parallel scheduler showed good performance under limited conditions, we also confirmed that no performance improvement was found in the TLM level many-core simulator developed in this thesis.
While the proposed many-core simulator uses the conservative synchronization technique which is free from causality errors and performs an accurate flit-level NoC simulation, the simulation performance is still acceptable, thanks to parallelism and optimizations. Additionally, the simulator is highly scalable to add other modules because the simulation backplane is developed to be compatible with SystemC TLM 2.0 standard. Although extensive experiments on accuracy are not conducted, it will be complemented when a detailed specification of the target architecture is given.
This dissertation can be a reference to the development of a many-core simulator, which will be more essential in the future.
Chapter 1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Dissertation Organization
Chapter 2 Background and Existing Research
2.1 Terminologies
2.1.1 Simulation Host / Simulation Target
2.1.2 Simulated Time / Simulation Time
2.1.3 User-level Simulation / Full-system Simulation
2.1.4 Execution-driven Simulation / Trace-driven Simulation
2.2 State-of-the-arts Many-core Simulators
2.2.1 Gem5
2.2.2 Marss
2.2.3 Sniper
2.2.4 Zsim
2.2.5 Manifold
2.2.6 Hornet
2.2.7 Summary
2.3 Host and Target Architecture
Chapter 3 Core Simulation
3.1 Overview
3.2 Related Works
3.2.1 Timing Models
3.2.2 Analytical Model: Interval Simulation
3.3 Sampling Mechanism
3.3.1 Sampling Configuration
3.3.2 Parameter Extraction
3.4 Trace Analyzer
3.4.1 Dependency Analysis
3.4.2 Life Cycle of An Instruction
3.5 Experimental Results
3.5.1 Time-accuracy Trade-off
3.5.2 Simulation Accuracy
3.5.3 Simulation Performance
3.6 Discussion
Chapter 4 NoC Simulation
4.1 Network-on-chip
4.2 Motivation
4.3 Related Works
4.3.1 Noxim
4.3.2 Booksim2
4.3.3 Garnet
4.4 Proposed Approach
4.4.1 Implementations
4.4.2 Optimizations
4.5 Experimental Results
4.5.1 Impact of Implementations and Optimizations
4.5.2 Comparison with Other State-Of-The-Arts
4.5.3 Performance Evaluation For Various Configurations
4.5.4 Full-System Simulation Accuracy Impact
4.5.5 Accuracy
4.6 Discussion
Chapter 5 Simulation Backplane
5.1 Overview
5.2 Background
5.2.1 SystemC
5.2.2 OSCI Transaction Level Modeling Standard 2.0
5.2.3 Synchronization Techniques
5.3 SystemC Models for the Target Architecture
5.4 Reducing the Cost of Interprocess Communications
5.4.1 Trace-driven Co-simulation
5.4.2 Better Interprocess Communication
5.4.3 Virtually embedding modules to core simulator
5.5 Reducing SystemC Scheduling Overhead
5.5.1 Event-based Slave Module Activation
5.5.2 SystemC Scheduler Parallelization
5.6 Evaluation
5.6.1 Scalability Test
5.6.2 Simulation Performance
5.6.3 Simulation Accuracy
Chapter 6 Simulation Backplane Parallelization
6.1 Background: OSCI SystemC Scheduler
6.2 Related Work: SystemC Parallelization Techniques
6.2.1 Fully-synchronous Approach
6.2.2 Parallel Distributed Event Scheduling (PDES) Approach
6.2.3 Out-of-order Execution with Dependency Analysis
6.2.4 Dynamic Offloading Approach
6.3 Proposed Technique
6.3.1 Basic Synchronization
6.3.2 Relaxed Synchronization
6.3.3 Modeling Restrictions
6.4 Experimental Results
6.4.1 Performance
6.4.2 Accuracy
6.5 Discussion and Limitation
Chapter 7 Conclusion
Bibliography
Abstract (in Korean)
Design, implementation and experimental evaluation of a network-slicing aware mobile protocol stack
International Mention in the doctoral degree
With the arrival of new generation mobile networks, we currently observe a paradigm
shift, where monolithic network functions running on dedicated hardware are now
implemented as software pieces that can be virtualized on general purpose hardware
platforms. This paradigm shift stands on the softwarization of network functions and
the adoption of virtualization techniques. Network Function Virtualization (NFV)
comprises softwarization of network elements and virtualization of these components.
It brings multiple advantages: (i) Flexibility, allowing an easy management of the virtual
network functions (VNFs) (deploy, start, stop or update); (ii) efficiency, resources can be
adequately consumed due to the increased flexibility of the network infrastructure; and
(iii) reduced costs, due to the ability of sharing hardware resources. To this end, multiple
challenges must be addressed to effectively leverage all these benefits.
Network Function Virtualization envisioned the concept of virtual network, resulting in
a key enabler of 5G networks flexibility, Network Slicing. This new paradigm represents
a new way to operate mobile networks where the underlying infrastructure is "sliced"
into logically separated networks that can be customized to the specific needs of the
tenant. This approach also enables the ability to instantiate VNFs at different locations
of the infrastructure, choosing their optimal placement based on parameters such as the
requirements of the service traversing the slice or the available resources. This decision
process is called orchestration and involves all the VNFs within the same network slice.
The orchestrator is the entity in charge of managing network slices. Hands-on experiments
on network slicing are essential to understand its benefits and limits, and to validate the
design and deployment choices. While some network slicing prototypes have been built
for Radio Access Networks (RANs), leveraging on the wide availability of radio hardware
and open-source software, there is currently no open-source suite for end-to-end network
slicing available to the research community. Similarly, orchestration mechanisms must
be evaluated as well to properly validate theoretical solutions addressing diverse aspects
such as resource assignment or service composition.
This thesis contributes to the study of the evolution of mobile networks regarding its
softwarization and cloudification. We identify software patterns for network function
virtualization, including the definition of a novel mobile architecture that squeezes the virtualization architecture by splitting functionality in atomic functions.
Then, we effectively design, implement and evaluate an open-source network
slicing implementation. Our results show a per-slice customization without paying the
price in terms of performance, also providing a slicing implementation to the research
community. Moreover, we propose a framework to flexibly re-orchestrate a virtualized
network, allowing on-the-fly re-orchestration without disrupting ongoing services. This
framework can greatly improve performance under changing conditions. We evaluate
the resulting performance in a realistic network slicing setup, showing the feasibility and
advantages of flexible re-orchestration.
Lastly and following the required re-design of network functions envisioned during
the study of the evolution of mobile networks, we present a novel pipeline architecture
specifically engineered for 4G/5G Physical Layers virtualized over clouds. The proposed
design follows two objectives, resiliency upon unpredictable computing and parallelization
to increase efficiency in multi-core clouds. To this end, we employ techniques such as tight
deadline control, jitter-absorbing buffers, predictive Hybrid Automatic Repeat Request,
and congestion control. Our experimental results show that our cloud-native approach
attains > 95% of the theoretical spectrum efficiency in hostile environments where
state-of-the-art architectures collapse.
This work has been supported by IMDEA Networks Institute.
Doctoral Program in Telematics Engineering, Universidad Carlos III de Madrid.
Chair: Francisco Valera Pintor. Secretary: Vincenzo Sciancalepore. Member: Xenofon Fouka
Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality
Visual understanding of 3D environments in real-time, at low power, is a huge
computational challenge. Often referred to as SLAM (Simultaneous Localisation
and Mapping), it is central to applications spanning domestic and industrial
robotics, autonomous vehicles, virtual and augmented reality. This paper
describes the results of a major research effort to assemble the algorithms,
architectures, tools, and systems software needed to enable delivery of SLAM,
by supporting applications specialists in selecting and configuring the
appropriate algorithm and the appropriate hardware, and compilation pathway, to
meet their performance, accuracy, and energy consumption goals. The major
contributions we present are (1) tools and methodology for systematic
quantitative evaluation of SLAM algorithms, (2) automated,
machine-learning-guided exploration of the algorithmic and implementation
design space with respect to multiple objectives, (3) end-to-end simulation
tools to enable optimisation of heterogeneous, accelerated architectures for
the specific algorithmic requirements of the various SLAM algorithmic
approaches, and (4) tools for delivering, where appropriate, accelerated,
adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
Comment: Proceedings of the IEEE 201
Quarc: an architecture for efficient on-chip communication
The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems.
Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in the system on chip realm.
By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, prevents multicast messages from being delivered as efficiently by networks on chip. Relying on extensive research on multicast communication in parallel computers, several network on chip architectures have offered mechanisms to perform the operation, while conforming to the resource constraints of the network on chip paradigm. Multicast communication in the majority of these networks on chip is implemented by establishing a connection between the source and all multicast destinations before the message transmission
commences. Establishing the connections incurs an overhead and, therefore, is not desirable; in particular in latency sensitive services such as cache coherence.
To address high performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low power, high performance implementation. The thesis covers a detailed representation of
the building blocks of the architecture, including topology, router and network interface.
The cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that the Quarc architecture is a highly efficient architecture.
Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality of service-aware communication