2 research outputs found

    Dynamic data memory partitioning for access region caches

    For wide-issue processors, the data cache needs to be heavily multi-ported with extremely wide data paths. A recent multi-ported cache design divides the memory stream into multiple independent sub-streams with the help of a prediction mechanism before the instructions enter the reservation stations. The partitioned memory-reference instructions are then fed into separate memory pipelines, each connected to a small data cache called an access region cache (ARC). The selection function that maps memory references to the ARCs can affect the data memory bandwidth, since the conflicts and load balance at each ARC may differ. In this thesis, we study various static and dynamic memory partitioning methods to see the effects of distributing memory references among the ARCs by exposing the memory traffic of those designs. Six different approaches to distributing memory references, including two randomization methods and two dynamic methods, are considered. Their potential effects on memory performance with ARCs are measured and compared with an existing multi-porting solution as well as an ideal multi-ported data cache. This study concludes that scattering access conflicts dynamically, i.e., redirecting conflicting references to different ARCs at each cycle, can increase the memory bandwidth. However, increasing data bandwidth alone does not always result in performance improvement: keeping the cache miss rate low is as important as sufficient memory bandwidth for achieving higher performance in wide-issue processors.
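
    A minimal sketch of what static versus dynamic ARC selection might look like; the ARC count, cache-line size, and least-loaded redirect policy below are illustrative assumptions, not the selection functions evaluated in the thesis:

    # Illustrative sketch of static vs. dynamic ARC selection.
    # All names and parameters here are hypothetical.
    NUM_ARCS = 4
    LINE_BITS = 6  # assume 64-byte cache lines

    def static_select(addr: int) -> int:
        """Static mapping: interleave cache lines across ARCs by address bits."""
        return (addr >> LINE_BITS) % NUM_ARCS

    def dynamic_select(cycle_refs: list[int]) -> list[int]:
        """Dynamic mapping: a reference that conflicts on an ARC within a
        cycle is redirected to the currently least-loaded ARC."""
        load = [0] * NUM_ARCS
        arcs = []
        for addr in cycle_refs:
            arc = static_select(addr)
            if load[arc] > 0:  # conflict this cycle: redirect
                arc = min(range(NUM_ARCS), key=load.__getitem__)
            load[arc] += 1
            arcs.append(arc)
        return arcs

    # Four loads issued in one cycle; two collide under the static mapping,
    # and the dynamic policy spreads them across all four ARCs.
    print(dynamic_select([0x1000, 0x1040, 0x1000, 0x2040]))  # [0, 1, 2, 3]

    Note that redirecting a reference away from its home ARC raises bandwidth but may leave the data resident in a different ARC than the one servicing the access, which is one way to see the miss-rate caveat in the conclusion above.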

    A study of the scale-invariant feature transform on a parallel pipeline

    In this thesis we study the execution of the Scale-Invariant Feature Transform (SIFT) algorithm on a pipelined computational platform. The SIFT algorithm is one of the most widely used methods for image feature extraction. We develop a tile-based template for running SIFT that facilitates the analysis while abstracting away lower-level details. We formalize the computational pipeline and the time to execute any algorithm on it in terms of the relative times taken by the pipeline stages. In the context of the SIFT algorithm, this reduces the execution time to that of running the entire image through the bottleneck stage plus the time to run either the first or the last tile through the remaining stages. Through an experimental study of the SIFT algorithm on a broad collection of test images, we determined image feature fraction values that relate the sizes of the image extracts as the computation proceeds through the stages of the SIFT algorithm. We show that for a single-chip uniprocessor pipeline, the computational stage is the bottleneck. Specifically, we show that for an $N \times N$ image with $n \times n$ tiles the overall time complexity is $\Theta\!\left(\frac{(n+x)^2}{p_i}\,\Gamma_0 + \alpha\beta N^2 x^2\,\Gamma_1 + \frac{(\alpha\beta+\gamma)\,n^2\log x}{p_o}\,\Gamma_2\right)$; here $x$ is the neighborhood of the tile, $p_i$ and $p_o$ are the numbers of input and output pins of the chip, $\alpha, \beta, \gamma$ are the feature fractions, and $\Gamma_0, \Gamma_1, \Gamma_2$ are the input, compute, and output clocks. The three terms in the expression represent the time complexities of the input, compute, and output stages. The input and output stages can be slowed down substantially without appreciable degradation of the overall performance; this slowdown can be traded off for lower power and higher signal quality. For multicore chips, we show that for an $N \times N$ image on a $P$-core chip, the overall time complexity to process the image is $\Theta\!\left(\frac{N^2}{p_i}\,\Gamma_0 + \frac{n^2 w^2 + \alpha\beta n^2 x^2}{P}\,\Gamma_1 + \frac{(\alpha\beta+\gamma)\,n^2\log x}{p_o}\,\Gamma_2\right)$; in addition to the quantities described earlier, $w$ is the window size used for the Gaussian blurring. Overall, we establish that without improvements in the input bandwidth, the power of multicore processing cannot be used efficiently for SIFT.
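
    The bottleneck reduction described above can be captured in a few lines. The following sketch uses hypothetical per-tile stage times (not the thesis's measured values) to show why slowing the non-bottleneck input and output stages barely affects the total:

    # Sketch of the pipeline timing model formalized above: every tile passes
    # through the bottleneck stage, while each remaining stage adds only one
    # tile's worth of latency (the first tile filling the pipeline and the
    # last tile draining it).
    def pipeline_time(per_tile_stage_times: list[float], num_tiles: int) -> float:
        bottleneck = max(per_tile_stage_times)
        return num_tiles * bottleneck + sum(per_tile_stage_times) - bottleneck

    # A 1024 x 1024 image in 64 x 64 tiles; input, compute, output per-tile
    # times in arbitrary units, with compute as the bottleneck.
    tiles = (1024 // 64) ** 2  # 256 tiles
    print(pipeline_time([1.0, 8.0, 1.0], tiles))  # 2050.0
    print(pipeline_time([2.0, 8.0, 2.0], tiles))  # 2052.0: doubling I/O time barely matters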