22 research outputs found

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science

    An O(nh) algorithm for dual-server coordinated en-route caching in tree networks

    Dual-server coordinated en-route caching is important because it already exhibits the essential features of general multi-server en-route caching. In this paper, multi-server coordinated en-route caching is formulated as an optimization problem of minimizing total access cost, comprising the transmission cost of all access demands and the caching cost of all caches. We first discuss an algorithm for single-server en-route caching in tree networks and then show that it is a special case of an algorithm for dual-server en-route caching in tree networks whose time complexity is O(nh).
    Shihong Xu, Hong She
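The single-server case admits a natural tree dynamic program: at each node, either place a cache (paying its caching cost and resetting the distance its children see) or forward demands upstream. The sketch below is a plausible reconstruction of that special case, not the paper's algorithm; the cost model (unit transmission cost per hop per demand) and data layout are illustrative assumptions.

```python
from functools import lru_cache

def min_total_cost(children, demand, cache_cost, root):
    """DP for single-server en-route caching on a tree.

    children[v]  : child nodes of v (root = the server)
    demand[v]    : access demands issued at node v
    cache_cost[v]: cost of placing a copy of the object at v

    cost(v, d) = minimum cost of v's subtree when the nearest
    upstream copy of the object lies d hops above v.
    """
    @lru_cache(maxsize=None)
    def cost(v, d):
        # Option 1: cache at v -- pay the caching cost; local demands
        # are served on the spot, children see a copy one hop away.
        place = cache_cost[v] + sum(cost(c, 1) for c in children[v])
        # Option 2: no cache at v -- each local demand travels d hops,
        # children see the nearest copy d + 1 hops away.
        skip = demand[v] * d + sum(cost(c, d + 1) for c in children[v])
        return min(place, skip)

    # The server itself holds the object: its children are one hop away.
    return sum(cost(c, 1) for c in children[root])
```

With d bounded by the tree height h, there are O(nh) states, which is consistent with the complexity stated in the title.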

    Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs

    We present in this paper several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping on manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and well-known techniques such as tiling and blocking are exploited to optimize the use of memory. We evaluate these proposals and compare a new Fermi Tesla C2050 against an Intel Core 2 Quad Q6700. The CUDA version achieves the best results, improving CPU execution times by factors ranging from 5.3x to 7.4x for different image sizes, and running up to 81 times faster when communications are neglected. Meanwhile, OpenCL obtains solid gains, ranging from 2x factors on small frame sizes to 3x factors on larger ones.
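The 3D-FWT is separable: a 1D wavelet step is applied along each of the three axes of the volume in turn, which is what makes the tiling, blocking, and GPU mappings above natural. As a minimal sketch of the 1D building block (using the Haar kernel for brevity; the paper's filter may differ):

```python
def haar_step(signal):
    """One level of the 1D Haar wavelet transform.

    Produces the low-pass averages followed by the high-pass
    differences; a separable 3D transform applies this step along
    the x, y and z axes of the volume in turn.
    """
    assert len(signal) % 2 == 0, "signal length must be even"
    half = len(signal) // 2
    avg = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    dif = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return avg + dif
```

For example, `haar_step([4, 2, 6, 8])` yields the averages `[3, 7]` followed by the differences `[1, -1]`.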

    Design and performance analysis of a fast 4-way set associative cache controller using Tree Pseudo Least Recently Used algorithm

    In the realm of modern computing, cache memory serves as an essential intermediary, mitigating the speed disparity between rapid processors and slower main memory. Central to this study is the development of an innovative cache controller for a 4-way set associative cache, meticulously crafted using VHDL and structured as a Finite State Machine. This controller efficiently oversees a cache of 256 bytes, with each block encompassing 128 bits or 16 bytes, organized into four sets containing four lines each. A key feature of this design is the incorporation of the Tree Pseudo Least Recently Used (PLRU) algorithm for cache replacement, a strategic choice aimed at optimizing cache performance. The effectiveness of this controller was rigorously evaluated using ModelSim, which generated a comprehensive timing diagram to validate the design's functionality, especially when integrated with a segmented main memory of four 1KB banks. The results from this evaluation were promising, showcasing precise logic outputs within the timing diagram. Operational efficiency was evidenced by the controller's swift processing speeds: read hits were completed in a mere three cycles, read misses in five and a half cycles, and both write hits and misses in three and a half cycles. These findings highlight the controller's capability to enhance cache memory efficiency, striking a balance between the complexities of set-associative mapping and the need for optimized performance in contemporary computing systems. This study not only demonstrates the potential of the proposed cache controller design in bridging the processor-memory speed gap but also contributes significantly to the field of cache memory management by offering a viable solution to the challenges posed by traditional cache configurations.
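Tree-PLRU for a 4-way set needs only three bits per set (versus the ordering state full LRU would require), which is why it is a popular hardware choice. A minimal behavioural sketch follows; the bit convention is an illustrative assumption, and the controller's actual VHDL encoding may differ.

```python
class TreePLRU4:
    """Tree-PLRU replacement state for one 4-way set.

    Three bits form a binary tree over ways {0, 1, 2, 3}:
      b0 picks the half holding the PLRU victim (0 -> ways 0/1),
      b1 picks within {0, 1}, b2 picks within {2, 3}.
    """
    def __init__(self):
        self.b0 = self.b1 = self.b2 = 0  # all pointers start left

    def victim(self):
        # Follow the pointers to the pseudo-least-recently-used way.
        if self.b0 == 0:
            return 0 if self.b1 == 0 else 1
        return 2 if self.b2 == 0 else 3

    def touch(self, way):
        # On a hit or fill, flip the bits on the root-to-leaf path
        # so they point AWAY from the way just accessed.
        if way < 2:
            self.b0 = 1           # victim now lives in the right half
            self.b1 = 1 - way     # point at the sibling of `way`
        else:
            self.b0 = 0
            self.b2 = 1 - (way - 2)
```

Each access updates only the bits on one root-to-leaf path, so both victim selection and the update are constant-time operations, which is what keeps hit handling fast in hardware.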

    Cache Performance Analysis and Optimization


    Predicting the cache miss ratio of loop-nested array references

    The time a program takes to execute can be massively affected by the efficiency with which it utilizes cache memory. Moreover, the cache-miss behavior of a program can be highly unpredictable, in that small changes to input parameters can cause large changes in the number of misses. In this paper we present novel analytical models of the cache behavior of programs consisting mainly of array operations inside nested loops, for direct-mapped caches. The models are used to predict the miss ratios of three example loop nests; the results are shown to be largely within ten percent of simulated values. A significant advantage is that the calculation time is proportional to the number of array references in the program, typically several orders of magnitude faster than traditional cache simulation methods.
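For intuition, the simulation baseline such models are compared against can be sketched in a few lines; the line size and cache geometry here are illustrative assumptions, not the paper's. A sequential sweep over an array of 8-byte elements with 64-byte lines incurs one compulsory miss per line, so the analytical prediction for its miss ratio is simply 8/64:

```python
def simulate_direct_mapped(addrs, line_size=64, n_lines=256):
    """Count the miss ratio of a direct-mapped cache on a byte-address
    trace. Each cache index holds a single tag; a mismatch is a miss.
    """
    lines = [None] * n_lines
    misses = 0
    for a in addrs:
        tag, idx = divmod(a // line_size, n_lines)
        if lines[idx] != tag:
            lines[idx] = tag
            misses += 1
    return misses / len(addrs)

# Sequential sweep over 4096 elements of 8 bytes each: every
# 64-byte line is touched 8 times, so the miss ratio is 8/64.
trace = [8 * i for i in range(4096)]
```

Here the analytical prediction (element size divided by line size) matches the simulation exactly; the paper's models handle the far harder cases of reuse and conflict misses across nested loops.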

    Shrinking VOD Traffic via Rényi-Entropic Optimal Transport

    In response to the exponential surge in Internet Video on Demand (VOD) traffic, numerous research endeavors have concentrated on optimizing and enhancing infrastructure efficiency. In contrast, this paper explores whether users’ demand patterns can be shaped to reduce the pressure on infrastructure. Our main idea is to design a mechanism that alters the distribution of user requests to another distribution which is much more cache-efficient, but still remains ‘close enough’ (in the sense of cost) to fulfil each individual user’s preference. To quantify the cache footprint of VOD traffic, we propose a novel application of Rényi entropy as its proxy, capturing the ‘richness’ (the number of distinct videos or cache size) and the ‘evenness’ (the relative popularity of video accesses) of the on-demand video distribution. We then demonstrate how to decrease this metric by formulating a problem drawing on the mathematical theory of optimal transport (OT). Additionally, we establish a key equivalence theorem: minimizing Rényi entropy corresponds to maximizing soft cache hit ratio (SCHR) — a variant of cache hit ratio allowing similarity-based video substitutions. Evaluation on a real-world, city-scale video viewing dataset reveals a remarkable 83% reduction in cache size (associated with VOD caching traffic). Crucially, in alignment with the above-mentioned equivalence theorem, our approach yields a significant uplift to SCHR, achieving close to 100%.
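The Rényi entropy of order α is H_α(p) = (1/(1−α)) · log Σᵢ pᵢ^α for α ≠ 1, recovering Shannon entropy as α → 1. A minimal sketch of the proxy, with the two example distributions as illustrative assumptions rather than the paper's data:

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1).

    Lower values indicate a more cache-friendly demand distribution:
    fewer distinct videos ('richness') and/or more skew ('evenness').
    """
    assert alpha > 0 and alpha != 1
    return math.log(sum(pi ** alpha for pi in p)) / (1 - alpha)

uniform = [0.25] * 4             # every video equally popular
skewed = [0.7, 0.1, 0.1, 0.1]    # demand shaped toward one video
```

Shaping demand from `uniform` toward `skewed` lowers H₂ from log 4 ≈ 1.386 to ≈ 0.654, i.e. a smaller effective cache footprint for the same catalogue size.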

    Parallel computation of the hyperbolic QR factorization

    In this thesis, a way of computing the hyperbolic (J-unitary) QR factorization is presented. The theory is set out first, giving two ways of reducing a matrix G ∈ C^{m×n}, m ≥ n, to block upper triangular form. One way reduces a single column using a J-unitary Householder-like reflector; the necessary and sufficient conditions for the existence of such reflectors are derived. The other way reduces two columns using Givens rotations. That chapter defines the ‘proper’ form, shows how to transform matrices into proper form, and explains how those proper forms are fully reduced by J-unitary matrices of smaller dimensions. Furthermore, the indefinite QR factorization is connected to the Hermitian indefinite factorization: we show how the two factorizations are related and derive the optimal pivoting strategy for the Hermitian indefinite factorization, namely the one with the smallest pivot growth regardless of whether one or two columns are chosen as pivotal. The same pivoting strategy is then applied to the QR factorization. Finally, a sequential algorithm for reducing the matrix G to block upper triangular form is presented, together with the segments of it that were parallelised. The memory architecture of the machine was taken into account while optimising the code, as was the behaviour of the OpenMP and MKL libraries used for parallelisation. Tests on randomly generated matrices were performed on an Intel Xeon Phi 7210, whose special memory architecture was also taken into account.
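The elementary step behind the Givens-based reduction is a 2×2 hyperbolic rotation: for J = diag(1, −1), H = [[c, s], [s, c]] with c² − s² = 1 satisfies HᵀJH = J, and it can zero the second entry of a vector (a, b) whenever |a| > |b| (the ‘proper’ case). A numerical sketch, not the thesis's implementation:

```python
import math

def hyperbolic_rotation(a, b):
    """Return (c, s) = (cosh t, sinh t) such that the J-unitary matrix
    H = [[c, s], [s, c]], J = diag(1, -1), maps (a, b) to (*, 0).

    Requires |a| > |b|: otherwise 1 - (b/a)^2 <= 0 and no such
    hyperbolic rotation exists.
    """
    assert abs(a) > abs(b), "hyperbolic rotation requires |a| > |b|"
    t = -b / a                          # tanh of the rotation angle
    c = 1.0 / math.sqrt(1.0 - t * t)    # cosh
    s = t * c                           # sinh
    return c, s

def apply_rotation(c, s, a, b):
    """Apply H = [[c, s], [s, c]] to the vector (a, b)."""
    return c * a + s * b, s * a + c * b
```

The rotation preserves the J-norm a² − b², which is exactly why the reduction needs the proper form: when |a| ≤ |b| no such rotation exists.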

    Publications and Talks in 2007 by the Members of the Faculty of Informatics


    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, the performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization is built for fast data access before the data must be fetched from the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality is automatically exploited: effective use of the memory hierarchy depends mainly on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and significantly reduce memory access latency.

    We first present a case study at the application level that restructures memory-intensive programs by utilizing program-specific knowledge. The study identifies the problem of bit-reversals, a set of data-reordering operations used extensively in scientific computing programs such as the FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program and reduce those conflicts; our methods outperform existing ones on both uniprocessor and multiprocessor systems.

    The access latency of the DRAM core has become increasingly long relative to CPU speed, making memory accesses an execution bottleneck. To reduce the frequency of DRAM core accesses and thereby shorten the overall memory access latency, we conduct three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme that reduces or even eliminates row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme that reorders the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we evaluate the design of cached DRAM and its organization alternatives for ILP processors, and then propose a new memory hierarchy that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-chip L3 cache for memory-intensive applications.

    Memory access latency has become a major performance bottleneck for memory-intensive applications, and as long as DRAM remains the most cost-effective technology for main memory, the problem will persist. The studies in this dissertation address this issue with software and hardware schemes that are effective and applicable, and can be used directly in real-world memory system designs and implementations. They also provide guidance for application programmers in understanding memory performance implications, and for system architects in optimizing memory hierarchies.
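The interleaving idea in the first study can be illustrated with permutation-based page interleaving, a known scheme in this line of work: XOR low-order row-address bits into the bank index, so that addresses that conflict in the cache (same bank, different row) are spread across distinct banks. The field widths below are illustrative assumptions, not the dissertation's exact address layout.

```python
def bank_index(addr, page_shift=12, n_banks=8, permute=True):
    """Map a physical byte address to a DRAM bank.

    Conventional page interleaving takes the bank index from the bits
    just above the page offset. The permutation-based variant XORs in
    the low-order row bits, so two addresses with the same conventional
    bank index but different rows land in different banks, avoiding a
    row buffer conflict.
    """
    mask = n_banks - 1
    bank = (addr >> page_shift) & mask          # conventional index
    if permute:
        row = addr >> (page_shift + mask.bit_length())
        bank ^= row & mask                      # fold row bits in
    return bank
```

For example, two addresses 32 KB apart (same conventional bank, adjacent rows with these widths) collide in the same row buffer under plain interleaving but map to different banks once permuted.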