    Development and Evaluation of an FPGA Platform for Building Custom Processors

    Demand for higher processor performance persists, and further gains require improvements to the processor architecture. In processor architecture research, ideas are commonly validated through software simulation; however, growing circuit sizes and increasingly complex processing have made evaluation times problematically long. A promising solution is emulation using a Field Programmable Gate Array (FPGA). An FPGA realizes a desired circuit far more easily than designing a hardware chip and can mimic hardware behavior faster than software can; because its contents can be changed and corrected any number of times, functional tests are also easy to run. Accelerating architecture research with FPGAs is therefore a useful approach. With the ultimate goal of independently realizing an FPGA platform for building custom processors based on the ARM ISA, this thesis implements an original processor and verifies and evaluates its operation; to that end, we implemented and evaluated the FPGA platform. First, we verified that the designed processor was implemented as intended. Second, we confirmed that the processor can run a variety of programs. Third, we added a mechanism for accessing the processor's memory over serial communication. Fourth, we added a mechanism that outputs the processor's performance metrics over PCIe. Finally, we compared the execution time required to verify the processor on a simulator and on the FPGA. These results confirmed the extensibility and high-speed operation afforded by the FPGA, showing that it is useful as an FPGA platform for building custom processors. In addition, we summarized improvements to the proposed system as directions for future work. The University of Electro-Communications, 201
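
    As a rough illustration of the fourth mechanism, the C sketch below shows how a host program might read performance counters from the FPGA over a memory-mapped PCIe BAR. The device path, register offsets, and counter layout are all assumptions for illustration; the abstract does not specify this interface.

        /* Minimal host-side sketch: reading hypothetical performance counters
         * from the FPGA over a memory-mapped PCIe BAR. The device path and
         * register offsets are illustrative assumptions, not the thesis's API. */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define BAR_SIZE   0x1000
        #define REG_CYCLES 0x00   /* assumed offset: elapsed cycle count  */
        #define REG_INSTRS 0x08   /* assumed offset: retired instructions */

        int main(void) {
            int fd = open("/dev/uio0", O_RDWR);  /* assumed UIO mapping of the BAR */
            if (fd < 0) { perror("open"); return 1; }

            volatile uint64_t *bar =
                mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            uint64_t cycles = bar[REG_CYCLES / 8];
            uint64_t instrs = bar[REG_INSTRS / 8];
            printf("cycles=%llu instrs=%llu IPC=%.3f\n",
                   (unsigned long long)cycles, (unsigned long long)instrs,
                   cycles ? (double)instrs / (double)cycles : 0.0);

            munmap((void *)bar, BAR_SIZE);
            close(fd);
            return 0;
        }

    Reading counters through a memory-mapped BAR keeps the measurement path simple: the host sees the FPGA registers as ordinary memory, so no driver-specific ioctl protocol is needed.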

    A Reconfigurable Processor for Heterogeneous Multi-Core Architectures

    A reconfigurable processor is a general-purpose processor coupled with an FPGA-like reconfigurable fabric. By deploying application-specific accelerators, such a system can improve performance for a wide range of applications. In this work, concepts are developed for using reconfigurable processors in multi-tasking scenarios and as part of multi-core systems.
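
    As a conceptual sketch of what multi-tasking support might look like, the C snippet below dispatches a kernel to the reconfigurable fabric when it is free and falls back to a plain-software path otherwise. The stubbed single-slot fabric and all names are illustrative assumptions, not the work's actual design.

        /* Conceptual sketch: accelerator dispatch with a software fallback,
         * as a runtime for a reconfigurable processor might behave under
         * multi-tasking. The stubbed fabric and all names are assumptions. */
        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>

        static bool fabric_busy = false;     /* stands in for the shared fabric */

        static bool fabric_try_claim(void) { /* claim the single fabric slot */
            if (fabric_busy) return false;
            fabric_busy = true;
            return true;
        }
        static void fabric_release(void) { fabric_busy = false; }

        /* Stub "accelerator": in a real system this would drive the fabric. */
        static uint32_t accel_run(const uint32_t *in, size_t n) {
            uint32_t s = 0;
            for (size_t i = 0; i < n; i++) s += in[i];
            return s;
        }
        static uint32_t software_fallback(const uint32_t *in, size_t n) {
            uint32_t s = 0;
            for (size_t i = 0; i < n; i++) s += in[i];
            return s;
        }

        /* Use the accelerator when the fabric is free; never block other tasks. */
        static uint32_t run_kernel(const uint32_t *in, size_t n) {
            if (fabric_try_claim()) {
                uint32_t r = accel_run(in, n);
                fabric_release();
                return r;
            }
            return software_fallback(in, n);
        }

        int main(void) {
            uint32_t data[4] = {1, 2, 3, 4};
            printf("%u\n", run_kernel(data, 4));
            return 0;
        }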

    Design of a distributed memory unit for clustered microarchitectures

    Power constraints led to the end of the exponential growth in single-processor performance that characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed performance growth to continue so far. Yet Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance: in a multiprocessor, even a small gain in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency, and ultimately the performance, of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the clusters' smaller size and complexity. Because the clusters together process a single instruction stream, communication between clusters is necessary and introduces an additional cost.

    This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention.

    The first proposal consists of a set of cache bank predictors; eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit itself. The load and store queues are split into smaller queues for distributed disambiguation, and the mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues, and a bank predictor is used to map instructions that consume memory data near the data's origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues, mechanisms that avoid load/store queue overflows resulting from the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues; deadlocks can result from certain queue allocations because entries are allocated out of order rather than in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries: architectures with weak memory ordering, such as Alpha, PowerPC, or ARMv7, can use this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties.

    Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
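
    To make the first proposal concrete, below is a minimal PC-indexed, last-outcome cache-bank predictor in C: one plausible design in the spirit of the eight the thesis compares. The table size, hash, and bank-selection bits are illustrative assumptions, not the thesis's design.

        /* Minimal PC-indexed, last-outcome cache-bank predictor sketch.
         * Table size, hash, and bank-selection bits are assumptions. */
        #include <stdint.h>
        #include <stdio.h>

        #define TABLE_BITS 10
        #define TABLE_SIZE (1u << TABLE_BITS)
        #define NUM_BANKS  4u

        static uint8_t bank_table[TABLE_SIZE];  /* last bank seen per PC hash */

        static uint32_t hash_pc(uint64_t pc) {
            return (uint32_t)(pc >> 2) & (TABLE_SIZE - 1);  /* drop byte offset */
        }

        /* Predict the bank before the address is known, so the instruction
         * can be steered toward the cluster holding that cache bank. */
        static unsigned predict_bank(uint64_t pc) {
            return bank_table[hash_pc(pc)];
        }

        /* After address calculation, record the true bank for future lookups. */
        static void update_bank(uint64_t pc, uint64_t addr) {
            bank_table[hash_pc(pc)] = (uint8_t)((addr >> 6) & (NUM_BANKS - 1));
        }

        int main(void) {
            uint64_t pc = 0x400100, addr = 0x10080;  /* bank = (addr>>6)&3 = 2 */
            printf("before: %u\n", predict_bank(pc));
            update_bank(pc, addr);
            printf("after:  %u\n", predict_bank(pc));
            return 0;
        }

    A last-outcome predictor like this exploits the tendency of a given static load to touch the same bank repeatedly; richer designs trade more table state for accuracy on less stable access patterns.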