573 research outputs found

    High performance communication on reconfigurable clusters

    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. 
For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm, which shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that our design outperforms CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the µs/day range with a commodity cloud.

    Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers

    The internal representation of numerical data, their speed of manipulation to generate the desired result through efficient utilisation of central processing unit, memory, and communication links are essential steps of all high performance scientific computations. Machine parameters, in particular, reveal accuracy and error bounds of computation, required for performance tuning of codes. This paper reports diagnosis of machine parameters, measurement of computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache and register-blocking technique results in their optimum utilisation with consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces cache inefficiency loss, which is known to be proportional to the number of processors. Of the two Linux clusters-ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from application benchmark of multi-block Euler code test run that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers with added advantage of speed and high degree of parallelism
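The "diagnosis of machine parameters" mentioned in this abstract can be illustrated with a minimal sketch that estimates machine epsilon, the gap between 1.0 and the next representable floating-point number. The function name `machine_epsilon` is hypothetical and the paper's actual test procedure is not reproduced here; the sketch assumes IEEE 754 double precision, which is what Python floats use on common platforms.

```python
import sys


def machine_epsilon() -> float:
    """Estimate machine epsilon by repeated halving (illustrative helper, not from the paper)."""
    eps = 1.0
    # Keep halving while adding half of eps to 1.0 still changes the result.
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


if __name__ == "__main__":
    # Compare the measured value with the one reported by the interpreter.
    print(machine_epsilon(), sys.float_info.epsilon)
```

A value like this bounds the relative rounding error of a single arithmetic operation, which is why such parameters matter when judging the accuracy of a numerical code.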

    MPI parallelization of fast algorithm codes developed using SIE/VIE and P-FFT method

    Master's thesis, Master of Engineering

    The spatiotemporal organization of cerebellar network activity resolved by two-photon imaging of multiple single neurons

    To investigate the spatiotemporal organization of neuronal activity in local microcircuits, techniques allowing simultaneous recording from multiple single neurons are required. To this end, we implemented an advanced spatial-light modulator two-photon microscope (SLM-2PM). A critical issue for cerebellar theory is the organization of granular layer activity in the cerebellum, which has been predicted by single-cell recordings and computational models. With SLM-2PM, calcium signals could be recorded from different network elements in acute cerebellar slices, including granule cells (GrCs), Purkinje cells (PCs) and molecular layer interneurons. By combining whole-cell recordings (WCRs) with SLM-2PM, the spike/calcium relationship in GrCs and PCs could be extrapolated toward the detection of single spikes. The SLM-2PM technique made it possible to monitor the activity of tens to hundreds of neurons simultaneously. GrC activity depended on the number of spikes in the input mossy fiber bursts. PC and molecular layer interneuron activity paralleled that in the underlying GrC population, revealing the spread of activity through the cerebellar cortical network. Moreover, circuit activity was increased by the GABA-A receptor blocker gabazine and reduced by the AMPA and NMDA receptor blockers NBQX and APV. The SLM-2PM analysis of spatiotemporal patterns lent experimental support to the time-window and center-surround organizing principles of the granular layer.

    DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments

    We describe the development of an FX-style correlator for Very Long Baseline Interferometry (VLBI), implemented in software and intended to run in multi-processor computing environments, such as large clusters of commodity machines (Beowulf clusters) or computers specifically designed for high performance computing, such as multi-processor shared-memory machines. We outline the scientific and practical benefits for VLBI correlation, which are chiefly due to the inherent flexibility of software and the fact that the highly parallel and scalable nature of the correlation task is well suited to a multi-processor computing environment. We suggest scientific applications where such an approach to VLBI correlation is most suited and will give the best returns. We report detailed results from the Distributed FX (DiFX) software correlator, running on the Swinburne supercomputer (a Beowulf cluster of approximately 300 commodity processors), including measures of the performance of the system. For example, correlating all Stokes products for a 10-antenna array, with an aggregate bandwidth of 64 MHz per station and typical time and frequency resolution, presently requires on the order of 100 desktop-class compute nodes. Due to the effect of Moore's Law on commodity computing performance, the total number and cost of compute nodes required to meet a given correlation task continues to decrease rapidly with time. We show detailed comparisons between DiFX and two existing hardware-based correlators: the Australian Long Baseline Array (LBA) S2 correlator, and the NRAO Very Long Baseline Array (VLBA) correlator. In both cases, excellent agreement was found between the correlators. Finally, we describe plans for the future operation of DiFX on the Swinburne supercomputer, for both astrophysical and geodetic science. Comment: 41 pages, 10 figures, accepted for publication in PAS
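To make the "FX" idea in this abstract concrete, the sketch below shows the two steps in their simplest form: Fourier transform each station's samples (F), then cross-multiply the spectra of every station pair and average (X). This is only a toy model under stated assumptions, not the DiFX implementation: the names `dft`, `n_baselines` and `fx_correlate` are hypothetical, the naive transform stands in for a real FFT, and delay tracking, fringe rotation and polarisation products are all omitted.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT in this sketch)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def n_baselines(n_stations):
    """Number of distinct station pairs (cross-correlations only)."""
    return n_stations * (n_stations - 1) // 2


def fx_correlate(streams, nchan):
    """Toy FX correlation.

    streams: dict mapping station name -> list of samples, all of equal length.
    nchan:   number of samples (and spectral channels) per transform segment.
    Returns a dict mapping (station_a, station_b) -> averaged cross-spectrum.
    """
    names = sorted(streams)
    nseg = len(streams[names[0]]) // nchan
    # F step: transform each segment of each station's data.
    spectra = {
        name: [dft(streams[name][s * nchan:(s + 1) * nchan]) for s in range(nseg)]
        for name in names
    }
    # X step: multiply one spectrum by the conjugate of the other, then average.
    vis = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            acc = [0j] * nchan
            for s in range(nseg):
                for k in range(nchan):
                    acc[k] += spectra[a][s][k] * spectra[b][s][k].conjugate()
            vis[(a, b)] = [v / nseg for v in acc]
    return vis
```

Each baseline's work is independent of every other baseline's, which is one reason the task parallelises well across compute nodes. For the 10-antenna array mentioned above, `n_baselines(10)` gives 45 station pairs.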

    The human VGF-derived bioactive peptide TLQP-21 binds heat shock 71 kDa protein 8 (HSPA8) on the surface of SH-SY5Y cells

    VGF (non-acronymic) is a secreted chromogranin/secretogranin that gives rise to a number of bioactive peptides by a complex proteolysis mechanism. VGF-derived peptides exert an extensive array of biological effects in energy metabolism, mood regulation, pain, gastric secretion function, reproduction and, perhaps, cancer. It is therefore surprising that very little is known about receptors and binding partners of VGF-derived peptides and their downstream molecular mechanisms of action. Here, using affinity chromatography and mass spectrometry-based protein identification, we have identified the heat shock cognate 71 kDa protein A8 (HSPA8) as a binding partner of human TLQP-21 on the surface of human neuroblastoma SH-SY5Y cells. Binding of TLQP-21 to membrane-associated HSPA8 in live SH-SY5Y cells was further supported by cross-linking to live cells. Interaction between HSPA8 and TLQP-21 was confirmed in vitro by label-free Dynamic Mass Redistribution (DMR) studies. Furthermore, molecular modeling studies show that TLQP-21 can be docked into the HSPA8 peptide binding pocket. Identification of HSPA8 as a cell surface binding partner of TLQP-21 opens new avenues to explore the molecular mechanisms of its physiological actions, and of pharmacological modulation thereof. This work was supported by ERA-NET NEURON grant DISCover through the Spanish funding partner Instituto de Salud Carlos III (www.isciii.es), grant: PI09/2688 to JRRS

    The Performance of the Robo-AO Laser Guide Star Adaptive Optics System at the Kitt Peak 2.1-m Telescope

    Robo-AO is an autonomous laser guide star adaptive optics system recently commissioned at the Kitt Peak 2.1-m telescope. Now operating every clear night, Robo-AO at the 2.1-m telescope is the first dedicated adaptive optics observatory. This paper presents the imaging performance of the adaptive optics system in its first eighteen months of operations. For a median seeing value of $1.31^{\prime\prime}$, the average Strehl ratio is 4% in the $i^\prime$ band and 29% in the J band. After post-processing, the contrast ratio under sub-arcsecond seeing for a $2 \leq i^{\prime} \leq 16$ primary star is five and seven magnitudes at radial offsets of $0.5^{\prime\prime}$ and $1.0^{\prime\prime}$, respectively. The data processing and archiving pipelines run automatically at the end of each night. The first stage of the processing pipeline shifts and adds the data using techniques alternately optimized for stars with high and low SNRs. The second "high contrast" stage of the pipeline is eponymously well suited to finding faint stellar companions. Comment: 12 pages, 16 figures, to be submitted to PAS

    Parallel cryptanalysis

    Most of today’s cryptographic primitives are based on computations that are hard to perform for a potential attacker but easy to perform for somebody who is in possession of some secret information, the key, that opens a back door in these hard computations and allows them to be solved in a small amount of time. To estimate the strength of a cryptographic primitive it is important to know how hard it is to perform the computation without knowledge of the secret back door and to get an understanding of how much money or time the attacker has to spend. Usually a cryptographic primitive allows the cryptographer to choose parameters that make an attack harder at the cost of making the computations using the secret key harder as well. Therefore designing a cryptographic primitive imposes the dilemma of choosing the parameters strong enough to resist an attack up to a certain cost while choosing them small enough to allow usage of the primitive in the real world, e.g. on small computing devices like smart phones. This thesis investigates three different attacks on particular cryptographic systems: Wagner’s generalized birthday attack is applied to the compression function of the hash function FSB. Pollard’s rho algorithm is used for attacking Certicom’s ECC Challenge ECC2K-130. The implementation of the XL algorithm has not been specialized for an attack on a specific cryptographic primitive but can be used for attacking some cryptographic primitives by solving multivariate quadratic systems. All three attacks are general attacks, i.e. they apply to various cryptographic systems; the implementations of Wagner’s generalized birthday attack and Pollard’s rho algorithm can be adapted for attacking other primitives than those given in this thesis. The three attacks have been implemented on different parallel architectures. XL has been parallelized using the Block Wiedemann algorithm on a NUMA system using OpenMP and on an Infiniband cluster using MPI. 
Wagner’s attack was performed on a distributed system of 8 multi-core nodes connected by an Ethernet network. The work on Pollard’s Rho algorithm is part of a large research collaboration with several research groups; the computations are embarrassingly parallel and are executed in a distributed fashion in several facilities with almost negligible communication cost. This dissertation presents implementations of the iteration function of Pollard’s Rho algorithm on Graphics Processing Units and on the Cell Broadband Engine
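For readers unfamiliar with Pollard's rho algorithm, the following minimal sketch solves a discrete logarithm in a small prime-order subgroup of the integers modulo a prime. It is only an analogue under stated assumptions: the thesis attacks an elliptic-curve challenge (ECC2K-130) on GPUs and the Cell Broadband Engine, whereas this toy uses modular integers, a single sequential walk with Floyd cycle detection, and illustrative parameters (`p = 1019`, `g = 4`, `n = 509`) that do not come from the source. The function name `pollard_rho_dlog` is likewise hypothetical.

```python
import random


def pollard_rho_dlog(g, h, p, n, seed=0):
    """Find x with g**x == h (mod p), where g has prime order n modulo p.

    Illustrative sketch only; requires Python 3.8+ for pow(db, -1, n).
    """
    rng = random.Random(seed)

    def step(x, a, b):
        # Pseudo-random walk that keeps the invariant x == g**a * h**b (mod p).
        s = x % 3
        if s == 0:
            return (x * x) % p, (2 * a) % n, (2 * b) % n
        if s == 1:
            return (x * g) % p, (a + 1) % n, b
        return (x * h) % p, a, (b + 1) % n

    while True:
        a0, b0 = rng.randrange(n), rng.randrange(n)
        start = (pow(g, a0, p) * pow(h, b0, p) % p, a0, b0)
        tortoise, hare = start, start
        # Floyd cycle detection: the hare moves two steps per tortoise step.
        while True:
            tortoise = step(*tortoise)
            hare = step(*step(*hare))
            if tortoise[0] == hare[0]:
                break
        # Collision: g**a1 * h**b1 == g**a2 * h**b2, so x = (a1 - a2) / (b2 - b1) mod n.
        db = (hare[2] - tortoise[2]) % n
        if db == 0:
            continue  # Useless collision; retry from a new random start.
        da = (tortoise[1] - hare[1]) % n
        return da * pow(db, -1, n) % n


if __name__ == "__main__":
    p, g, n = 1019, 4, 509  # toy parameters chosen for this sketch
    h = pow(g, 123, p)
    x = pollard_rho_dlog(g, h, p, n)
    print(pow(g, x, p) == h)
```

In a distributed setting, many such walks run independently from different starting points and only rarely report values to a central server, which is consistent with the abstract's description of the computation as embarrassingly parallel with almost negligible communication cost.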