    First Application of Lattice QCD to Pezy-SC Processor

    Abstract: The Pezy-SC processor is a novel architecture developed by Pezy Computing K.K. that achieves large computational power with low electric power consumption. It works as an accelerator device, similarly to GPGPUs, and a programming environment that resembles OpenCL is provided. Using the hybrid parallel system “Suiren” installed at KEK, we port and tune a simulation code for lattice QCD, a branch of computational elementary particle physics based on the Monte Carlo method. We offload the iterative solver of the linear equation for the fermion matrix, which is in general the most time-consuming part of lattice QCD simulations. On single and multiple Pezy-SC devices, we measure the sustained performance of the matrix multiplications and of a BiCGStab solver, and examine how the data layout affects the performance. The results demonstrate that Pezy-SC processors provide a feasible environment for numerical lattice QCD simulations.
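    The BiCGStab solver named above is a standard Krylov-subspace method for the non-Hermitian linear systems that fermion matrices give rise to. As a point of reference, the following is a minimal textbook BiCGStab in Python; it is a sketch of the algorithm only, not the authors' tuned Pezy-SC kernel, and matvec merely stands in for the fermion-matrix multiplication that the paper offloads to the accelerator.

        import numpy as np

        def bicgstab(matvec, b, x0=None, tol=1e-8, max_iter=1000):
            """Solve A x = b where matvec(v) computes A v (A may be complex
            and non-Hermitian, like a lattice fermion matrix)."""
            x = np.zeros_like(b) if x0 is None else x0.copy()
            r = b - matvec(x)
            r_hat = r.copy()                      # fixed shadow residual
            rho = alpha = omega = 1.0
            v = p = np.zeros_like(b)
            b_norm = np.linalg.norm(b)
            for _ in range(max_iter):
                rho_new = np.vdot(r_hat, r)       # conjugated inner product
                beta = (rho_new / rho) * (alpha / omega)
                rho = rho_new
                p = r + beta * (p - omega * v)
                v = matvec(p)
                alpha = rho / np.vdot(r_hat, v)
                s = r - alpha * v
                t = matvec(s)
                omega = np.vdot(t, s) / np.vdot(t, t)
                x = x + alpha * p + omega * s
                r = s - omega * t
                if np.linalg.norm(r) < tol * b_norm:   # converged
                    break
            return x

    Each iteration costs two matvec calls, which in the paper's setting dominate the runtime; this is why the data layout of the fermion fields on the device matters so much for sustained performance.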

    A Performance Model for GPU Architectures: Analysis and Design of Fundamental Algorithms

    Ph.D. thesis, University of Hawaiʻi at Mānoa, 2018.

    Advanced management techniques for many-core communication systems

    The way computer processors are built is changing. Processor performance is now increased by adding more processing cores to a single chip rather than by making individual processors larger and faster, because the traditional approach has hit the limits of transistor scaling. Both industry and academia agree that scaling the number of processing cores to hundreds or thousands on a single chip is the only way to keep scaling processor performance. Consequently, the performance of future many-core systems with thousands of cores will depend heavily on the Network-on-Chip (NoC) architecture to provide scalable communication, and as the number of cores increases, communication locality only becomes more important: it is essential for reducing latency and increasing performance. Many-core systems should therefore be designed so that cores communicate mainly with their neighbouring cores, minimising communication cost. We investigate the network performance of different topologies using the ITRS physical data for the year 2023. To this end, we propose abstract synthetic traffic generation models to explore locality behaviour in many-core NoC systems; with these models, a group clustering model and a ring clustering model, traffic distance metrics can be adjusted through locality parameters. We choose two many-core NoC architectures, a distributed memory architecture and a shared memory architecture, to examine whether enforcing locality affects the network performance of different topologies differently.

    The distributed memory architecture uses message passing to communicate between cores. Our results show that the degree of locality and the choice of clustering model strongly affect network performance. Scale-invariant topologies, such as the fat quadtree, perform worse than flat ones because the reduced hop count is outweighed by the longer wire delays.

    In the shared memory architecture, threads communicate by storing data in shared cache lines. We design a hierarchical cache model that benefits from communication locality, since a many-core cache hierarchy that fails to exploit locality leaves more cores delayed and thereby degrades network performance. Our results show that the locality model of thread placement, and the distance over which threads are placed, significantly affect NoC performance; here, scale-invariant topologies perform better than flat ones. We then demonstrate that directory-based cache coherency adds only a small overhead to the cache size, and that using a cache coherency protocol in our proposed hierarchical cache model decreases network performance only slightly. Hence cache coherency scales, and a shared memory architecture with thousands of cores is feasible.
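    The two synthetic traffic generators named above bias how far packets travel via locality parameters. Their exact definitions are not reproduced in this abstract, so the sketch below is only a plausible illustration of the general idea under stated assumptions: a distance-biased source/destination generator on a square 2D mesh, with a hypothetical locality parameter where 0 yields uniform random traffic and values near 1 yield mostly nearest-neighbour traffic.

        import random

        def mesh_coords(core, width):
            """Map a linear core index to (x, y) on a width x width mesh."""
            return core % width, core // width

        def hop_distance(a, b, width):
            """Manhattan hop count between two cores on the mesh."""
            ax, ay = mesh_coords(a, width)
            bx, by = mesh_coords(b, width)
            return abs(ax - bx) + abs(ay - by)

        def synthetic_traffic(n_cores, width, locality, n_packets):
            """Generate (src, dst) pairs; the acceptance probability decays
            with hop distance, so a higher `locality` keeps traffic closer."""
            assert n_cores == width * width        # assume a square mesh
            pairs = []
            while len(pairs) < n_packets:
                src = random.randrange(n_cores)
                dst = random.randrange(n_cores)
                if src == dst:
                    continue
                d = hop_distance(src, dst, width)
                if random.random() < (1.0 - locality) ** (d - 1):
                    pairs.append((src, dst))
            return pairs

    For example, synthetic_traffic(64, 8, locality=0.7, n_packets=1000) produces a workload in which most packets stay within a hop or two of their source, the neighbour-dominated regime the thesis argues many-core designs should target.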

    Construction of a Monkey-Scale Artificial Cerebellum Capable of Online Learning and Its Application to Machine Control

    The brain is an organ essential to life, yet much about its mechanisms remains unknown. Numerical simulation is an effective tool for elucidating those mechanisms, but modelling the brain requires a great deal of information. For the cerebellum there is an extensive literature on both function and structure: the cerebellum is held to perform motor control of the body and to have a learning capability, and the network of its neural circuits is known in detail. A mathematical model can therefore be assembled from these data and computed numerically, creating an artificial cerebellum on a computer.

    The foremost reason artificial brains are needed is to replace animal experiments, which place a burden on animals, with experiments on artificial brains. For an artificial brain to be an acceptable substitute, it must faithfully reproduce the numbers of neurons and synapses and the circuit structure, that is, how the individual neurons are connected, and it must run in real time. Real animal brains are enormous, however; the human brain is said to consist of roughly 100 billion neurons, and the vast computational resources needed for a real-time artificial cerebellum mean that such models have been implemented on supercomputers.

    Prompted by the installation at the Japan Agency for Marine-Earth Science and Technology (JAMSTEC) of the supercomputer "Gyoukou", built from PEZY-SC2 processors (the successor to the PEZY-SC accelerator), this work ported a cat-scale cerebellum model to the PEZY-SC2, combining optimisation for the new processor with a scale-up. Using 7,920 of Gyoukou's 10,000 processors, we constructed an artificial cerebellum of about eight billion neurons, simulated a simple reflexive eye movement, and confirmed that the model learns correctly. Because this neuron count matches that of the cerebella of two monkeys, we call the constructed model a monkey-scale artificial cerebellum. Furthermore, to confirm that the artificial cerebellum can handle not only a simple ocular reflex but also more complex motor control, we controlled a multi-joint robot arm and a robot hand. We verified that control can be performed with an Echo State Network (ESN), which has the same computational capability as the cerebellar circuit and a similar structure, showing that motor control can be performed with an artificial cerebellum.

    These results suggest that the constructed artificial cerebellum can be used for real motor control; in the future it is expected to be applied to, for example, assisting the motor control of patients whose cerebellum has been damaged in an accident.

    The University of Electro-Communications, 201
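    The robot-control experiments above use an Echo State Network (ESN), a reservoir-computing model with a fixed random recurrent layer and a trained linear readout, which the abstract describes as structurally close to the cerebellar circuit. The following is a minimal generic ESN in Python; the reservoir size, leak rate, spectral radius, and ridge-regression readout are illustrative assumptions, not the configuration used in the thesis.

        import numpy as np

        rng = np.random.default_rng(0)

        class EchoStateNetwork:
            """Fixed random reservoir with a trained linear readout."""

            def __init__(self, n_in, n_res, n_out,
                         spectral_radius=0.9, leak=0.3):
                self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
                W = rng.uniform(-0.5, 0.5, (n_res, n_res))
                # Rescale so the spectral radius stays below 1
                # (the "echo state" condition).
                W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
                self.W = W
                self.leak = leak
                self.W_out = np.zeros((n_out, n_res))
                self.x = np.zeros(n_res)

            def step(self, u):
                """Advance the leaky reservoir one step, emit the readout."""
                pre = self.W_in @ u + self.W @ self.x
                self.x = (1 - self.leak) * self.x + self.leak * np.tanh(pre)
                return self.W_out @ self.x

            def fit(self, inputs, targets, ridge=1e-6):
                """Drive the reservoir with a teacher sequence, then solve
                the readout weights by ridge regression."""
                states = []
                for u in inputs:
                    self.step(u)
                    states.append(self.x.copy())
                X = np.array(states)              # (T, n_res) states
                Y = np.array(targets)             # (T, n_out) teacher
                self.W_out = Y.T @ X @ np.linalg.inv(
                    X.T @ X + ridge * np.eye(X.shape[1]))

    In a control loop, step would be fed sensor readings and its readout used as motor commands; only W_out is learned, which is what makes online training of such a readout cheap.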