    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built so that data can be accessed quickly before they have to be fetched from the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and to significantly reduce memory access latency. We first present a case study at the application level that reconstructs memory-intensive programs by utilizing program-specific knowledge. This study identifies the problem of bit-reversals, a set of data reordering operations extensively used in scientific computing programs such as FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems. The access latency to the DRAM core has become increasingly long relative to CPU speed, making memory accesses an execution bottleneck. In order to reduce the frequency of DRAM core accesses and thereby shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance roles and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. 
Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications. Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM remains the most cost-effective technology for main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and applicable, and can be directly used in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.
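
The bit-reversal reordering named in the abstract above can be illustrated with a minimal Python sketch. It shows only the plain permutation, not the dissertation's padding or blocking methods; the function names `bit_reverse` and `bit_reversal_permutation` are hypothetical and chosen for this illustration.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        # Shift the lowest remaining bit of `index` onto the end of `result`.
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_permutation(data: list) -> list:
    """Reorder `data` (length must be a power of two) into bit-reversed order."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [data[bit_reverse(i, bits)] for i in range(n)]


print(bit_reversal_permutation(list(range(8))))
```

In this sketch, neighbouring output positions read from source positions that are half the array length apart (positions 0 and 1 read from 0 and 4 when the length is 8). Large power-of-two strides of this kind are the sort of access pattern that tends to map to the same cache sets, which is the conflict the abstract describes.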

    Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers

    The internal representation of numerical data and the speed with which they are manipulated to generate the desired result, through efficient utilisation of the central processing unit, memory, and communication links, are essential to all high performance scientific computations. Machine parameters, in particular, reveal the accuracy and error bounds of computation required for performance tuning of codes. This paper reports the diagnosis of machine parameters, the measurement of the computing power of several workstations and serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache- and register-blocking techniques result in their optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces cache inefficiency loss, which is known to be proportional to the number of processors. Of the Linux clusters ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from an application benchmark of a multi-block Euler code test run that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers with the added advantage of speed and a high degree of parallelism.
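
One of the machine parameters such a diagnosis typically reports is the machine epsilon of the floating-point format. The Python sketch below is an illustration under that assumption, not the paper's procedure, and the function name `machine_epsilon` is hypothetical. It finds epsilon by repeated halving and compares it with the value the interpreter reports.

```python
import sys


def machine_epsilon() -> float:
    """Return the smallest power of two eps with 1.0 + eps != 1.0."""
    eps = 1.0
    # Keep halving while adding half of eps still changes 1.0.
    while 1.0 + eps / 2.0 != 1.0:
        eps /= 2.0
    return eps


print(machine_epsilon(), sys.float_info.epsilon)
```

On a platform whose Python floats are IEEE 754 doubles, both printed values should agree, which gives a quick check on the relative rounding error bound of a single arithmetic operation.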

    Optimum Circuits for Bit Reversal

    Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms

    This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons as well as further approaches, such as data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and for some special parallel loops, which we call ‘repeated computation with a possibility of premature termination’. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms, for which our arithmetic provides a generic parallelisation approach. The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to avoid using two different languages (one for the implementation and one for the interface) for our implementation of computer algebra algorithms. Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation; we call it the ‘parallel penalty’. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations while using drastically fewer measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We have also used existing estimation methods. We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen’s matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution, which leads to a further fast multiplication algorithm. 
Specifically for our implementation of the Strassen algorithm, we have designed and implemented a divide and conquer skeleton based on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but we also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fraction, an approach known from the literature. We also performed execution time estimations of our divide and conquer programs. This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called ‘parallel repeated computation’, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the Rabin–Miller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation are our original contributions. The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. 
This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using Gauß elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms. In summary, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms.
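
The divide and conquer structure of the Karatsuba algorithm mentioned in the abstract above can be sketched as follows. This is a sequential Python illustration, not the thesis's Eden skeleton code; the function name `karatsuba` and the base-case threshold of 16 are assumptions made for this example.

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply two non-negative integers using three recursive products."""
    if x < 16 or y < 16:
        # Small operands: fall back to the built-in multiplication.
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    # Split each operand into a high and a low part of `half` bits.
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    low = karatsuba(x_lo, y_lo)
    high = karatsuba(x_hi, y_hi)
    # One extra product replaces the two cross products of the schoolbook method.
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi) - low - high
    return (high << (2 * half)) + (cross << half) + low


a, b = 12345678901234567890, 98765432109876543210
print(karatsuba(a, b) == a * b)
```

The three recursive calls do not depend on one another, which is the property a divide and conquer skeleton could exploit by evaluating them in parallel.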

    Delta rhythms as a substrate for holographic processing in sleep and wakefulness

    PhD Thesis. We initially considered the theoretical properties and benefits of so-called holographic processing in a specific type of computational problem implied by the theories of synaptic rescaling processes in the biological wake-sleep cycle. This raised two fundamental questions that we attempted to answer by an experimental in vitro electrophysiological approach. We developed a comprehensive experimental paradigm based on a pharmacological model of the wake-sleep-associated delta rhythm measured with a Utah micro-electrode array at the interface between primary and associational areas in the rodent neocortex. We first verified that our in vitro delta rhythm model possessed two key features found in both in vivo rodent and human studies of synaptic rescaling processes in sleep. The first property is that prior local synaptic potentiation in wake leads to increased local delta power in subsequent sleep. The second property is the reactivation in sleep of neural firing patterns observed prior to sleep. By reproducing these findings we confirmed that our model is arguably an adequate medium for further study of the putative sleep-related synaptic rescaling process. In addition, we found important differences between neural units that reactivated or deactivated during delta; these were differences in cell types based on unit spike shapes, in prior firing rates and in prior spike-train-to-local-field-potential coherence. Taken together, these results suggested a mechanistic chain of explanation of the two observed properties, and set the neurobiological framework for further, more computationally driven analysis. Using the above experimental and theoretical substrate, we developed a new method of analysis of micro-electrode array data. The method is a generalization to the electromagnetic case of a well-known technique for processing acoustic microphone array data. 
This allowed calculation of the instantaneous spatial energy flow and dissipation in the neocortical areas under the array, and of the spatial energy source density, in analogy to well-known current source density analysis. We then refocused our investigation on the two theoretical questions for which we hoped to obtain experimental answers. The first was whether the state of the neocortex during a delta rhythm could be described by ergodic statistics, which we determined by analyzing the spectral properties of energy dissipation as a signature of the state of the dynamical system. The second was a more explorative approach prompting an investigation of the spatiotemporal interactions across and along neocortical layers and areas during a delta rhythm, as implied by energy flow patterns. We found that the in vitro rodent neocortex does not conform to ergodic statistics during a pharmacologically driven delta or gamma rhythm. We also found a delta period locked pattern of energy flow across and along layers and areas, which doubled the processing cycle relative to the fundamental delta rhythm, tentatively suggesting a reciprocal, two-stage information processing hierarchy similar to a stochastic Helmholtz machine with a wake-sleep training algorithm. Further, the complex valued energy flow might suggest an improvement to the Helmholtz machine concept by generalizing the complex valued weights of the stochastic network to higher dimensional multi-vectors of a geometric algebra with a metric particularly suited for holographic processes. Finally, preliminary attempts were made to implement and characterize the above network dynamics in silico. We found that a qubit valued network does not allow fully holographic processes, but tentatively suggest that an ebit valued network may display two key properties of general holographic processing.

    Adaptive and secured resource management in distributed and Internet systems

    The effectiveness of computer system resource management has always been determined by two major factors: (1) workload demands and management objectives, and (2) updates in computer technology. These two factors change dynamically, and resource management systems must adapt to the changes in a timely manner. This dissertation attempts to address several important and related resource management issues. We first study memory system utilization in centralized servers by improving the memory performance of sorting algorithms, which provides a fundamental understanding of memory system organizations and their performance optimizations for data-intensive workloads. To reduce different types of cache misses, we restructure the mergesort and quicksort algorithms by integrating tiling, padding, and buffering techniques and by repartitioning the data set. Our study shows substantial performance improvements from our new methods. We have further extended the work to improve load sharing for utilizing global memory resources in distributed systems. Aiming at reducing the memory resource contention caused by page faults and I/O activities, we have developed and examined load sharing policies that consider effective usage of global memory in addition to CPU load balancing in both homogeneous and heterogeneous clusters. Extending our research from clusters to Internet systems, we have further investigated memory and storage utilization in Web caching systems. We have proposed several novel management schemes to restructure and decentralize the existing caching system by exploiting data locality at different levels of the global memory hierarchy and by effectively sharing data objects among the clients and their proxy caches. Our decentralized Web caching system design raises data integrity and communication anonymity issues, which are also security concerns for general peer-to-peer systems. 
We propose an integrity protocol to ensure data integrity, and several protocols to achieve mutual communication anonymity between an information requester and a provider. The potential impact and contributions of this dissertation are briefly stated as follows: (1) The two major research topics identified in this dissertation are fundamentally important for the growth and development of information technology, and will remain in demand for a long time. (2) Our proposed cache-effective sorting methods bridge a serious gap between the analytical complexity of algorithms and their execution complexity in practice, a gap caused by the increasingly deep memory hierarchy in computer systems. This approach can also be used to improve memory performance at different levels of the memory hierarchy, such as I/O and file systems. (3) Our load sharing principle of giving high priority to requests for data accesses in memory and I/O adapts in a timely manner to technology changes and responds effectively to the increasing demand of data-intensive applications. (4) Our proposed decentralized Web caching framework and its resource management schemes present a comprehensive case study to examine the P2P model. Our results and experiences can be used for related and further studies in distributed computing. (5) The proposed data integrity and communication anonymity protocols address the limits and weaknesses of existing ones, and lay a solid foundation for us to continue our work in this important area.
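
The tiling idea applied to mergesort in the abstract above can be sketched in Python. This is a simplified illustration under stated assumptions, not the dissertation's implementation: it omits padding, buffering and repartitioning, and the function name `tiled_mergesort` and the default `tile_size` are hypothetical. In a cache-aware version, the tile size would be chosen so that one tile fits in cache.

```python
import heapq


def tiled_mergesort(data: list, tile_size: int = 1024) -> list:
    """Sort `data` by sorting fixed-size tiles, then merging the sorted runs."""
    if tile_size < 1:
        raise ValueError("tile_size must be positive")
    # Phase 1: sort each tile independently; each tile is a small working set.
    runs = [sorted(data[start:start + tile_size])
            for start in range(0, len(data), tile_size)]
    # Phase 2: merge all sorted runs in a single multiway pass.
    return list(heapq.merge(*runs))


sample = [(i * 7919) % 1000 for i in range(5000)]
print(tiled_mergesort(sample, tile_size=64) == sorted(sample))
```

Python lists do not expose cache behaviour directly, so this sketch shows only the two-phase structure; any measured benefit would have to come from a lower-level implementation.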

    Multi-criteria decision-making in whole process design

    PhD Thesis. In recent years, the chemical and pharmaceutical industries have faced increased development times and costs with fewer novel chemicals being discovered. This has resulted in many companies focusing on innovative research and development as they consider this key to business success. In particular, a number of leading industrial organisations have adopted the principles of Whole Process Design (WPD). WPD considers the optimisation of the entire product development process, from raw materials to end product, rather than focusing on each individual unit operation. The complexity involved in the implementation of WPD requires rationalised decision-making, often with limited or uncertain information. This thesis assesses the most widely applied methods in Multi-Criteria Decision Analysis (MCDA) in conjunction with the results of two interviews and two questionnaires that identified the industrial requirements for decision-making during WPD. From the findings of this work, a novel decision-making methodology was proposed, the outcome of which allows a decision-maker to visually interpret their decision results with associated levels of uncertainty. To validate the proposed methodology, a software framework was developed that incorporates two other decision-making approaches, the Analytical Hierarchy Process (AHP) and ELimination Et Choix Traduisant la REalité trois (ELECTRE III). The framework was then applied to a number of industrial case studies to validate the application of the proposed methodology. Engineering and Physical Sciences Research Council (EPSRC) and Chemistry Innovatio

    Techniques of design optimisation for algorithms implemented in software

    The overarching objective of this thesis was to develop tools for parallelising, optimising, and implementing algorithms on parallel architectures, in particular General Purpose Graphics Processors (GPGPUs). Two projects were chosen from different application areas in which GPGPUs are used: a defence application involving image compression, and a modelling application in bioinformatics (computational immunology). Each project had its own specific objectives, as well as supporting the overall research goal. The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial implementation in hardware, for the lossless compression of hyperspectral images on board unmanned vehicles (UAVs), could be parallelised; whether GPGPUs could be used to implement that algorithm; and whether a software implementation with or without GPGPU acceleration could match the throughput of a dedicated hardware (FPGA) implementation. The dependencies within the algorithm were analysed, and the algorithm parallelised. The algorithm was implemented in software for GPGPU, and optimised. During the optimisation process, profiling revealed less than optimal device utilisation, but no further optimisations resulted in an improvement in speed. The design had hit a local maximum of performance. Analysis of the arithmetic intensity and data-flow exposed flaws in the standard optimisation metric of kernel occupancy used for GPU optimisation. Redesigning the implementation with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board implementation of the CCSDS lossless hyperspectral image compression algorithm, exceeding the performance of the hardware reference implementation, and providing sufficient throughput for the next generation of image sensor as well. 
The second project was carried out in collaboration with biologists at the University of Arizona and involved modelling a complex biological system: VDJ recombination involved in the formation of T-cell receptors (TCRs). Generation of immune receptors (T-cell receptors and antibodies) by VDJ recombination is an enormously complex process, which can theoretically synthesize more than 10^18 variants. Originally thought to be a random process, the underlying mechanisms clearly have a non-random nature that preferentially creates a small subset of immune receptors in many individuals. Understanding this bias is a longstanding problem in the field of immunology. Modelling the process of VDJ recombination to determine the number of ways each immune receptor can be synthesized, previously thought to be untenable, is a key first step in determining how this special population is made. The computational tools developed in this thesis have allowed immunologists for the first time to comprehensively test and invalidate a longstanding theory (convergent recombination) for how this special population is created, while generating the data needed to develop novel hypotheses.

    Representation of statistical sound properties in human auditory cortex

    The work carried out in this doctoral thesis investigated the representation of statistical sound properties in human auditory cortex. It addressed four key aspects in auditory neuroscience: the representation of different analysis time windows in auditory cortex; mechanisms for the analysis and segregation of auditory objects; information-theoretic constraints on pitch sequence processing; and the analysis of local and global pitch patterns. The majority of the studies employed a parametric design in which the statistical properties of a single acoustic parameter were altered along a continuum, while keeping other sound properties fixed. The thesis is divided into four parts. Part I (Chapter 1) examines principles of anatomical and functional organisation that constrain the problems addressed. Part II (Chapter 2) introduces approaches to digital stimulus design, principles of functional magnetic resonance imaging (fMRI), and the analysis of fMRI data. Part III (Chapters 3-6) reports five experimental studies. Study 1 controlled the spectrotemporal correlation in complex acoustic spectra and showed that activity in auditory association cortex increases as a function of spectrotemporal correlation. Study 2 demonstrated a functional hierarchy of the representation of auditory object boundaries and object salience. Studies 3 and 4 investigated cortical mechanisms for encoding entropy in pitch sequences and showed that the planum temporale acts as a computational hub, requiring more computational resources for sequences with high entropy than for those with high redundancy. Study 5 provided evidence for a hierarchical organisation of local and global pitch pattern processing in neurologically normal participants. Finally, Part IV (Chapter 7) concludes with a general discussion of the results and future perspectives.
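
The contrast between high-entropy and highly redundant pitch sequences in the abstract above can be made concrete with a small Python sketch. It computes the zeroth-order Shannon entropy of a sequence's empirical pitch distribution, which is an assumption made for illustration; the thesis may define sequence entropy differently. The function name `sequence_entropy` and the use of MIDI note numbers as pitch labels are likewise hypothetical.

```python
import math
from collections import Counter


def sequence_entropy(pitches) -> float:
    """Shannon entropy, in bits per symbol, of the empirical pitch distribution."""
    counts = Counter(pitches)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence must not be empty")
    # H = -sum(p * log2(p)) over the observed pitch values.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(sequence_entropy([60, 60, 60, 60, 60, 60, 60, 60]))  # fully redundant
print(sequence_entropy([60, 62, 64, 65]))                  # four equiprobable pitches
```

A sequence that repeats a single pitch scores zero, while a sequence drawing evenly on four pitches scores two bits per symbol. This measure ignores the order of the notes, so it captures only one aspect of redundancy.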