31 research outputs found

    We evaluate our software approach, and then compare

    No full text
    In this paper we address the important problem of instruction fetch for future wide issue superscalar processors. Our approach focuses on understanding the interaction between software and hardware techniques targeting an increase in the instruction fetch bandwidth. That is the objective, for instance, of the Hardware Trace Cache (HTC). We design a profile based code reordering technique which targets a maximization of the sequentiality of instructions, while still trying to minimize instruction cache misses. We call our software approach, Software Trace Cache (STC)

    The effect of code reordering on branch prediction

    Branch prediction accuracy is a very important factor for superscalar processor performance. The ability to predict the outcome of a branch allows the processor to effectively use a large instruction window, and extract a larger amount of Instruction Level Parallelism (ILP). In this paper we will examine the effect of code layout optimizations on branch prediction accuracy and final processor performance. These code reordering techniques align branches so that they tend to be not taken, achieving better instruction cache performance and increasing the fetch bandwidth. Here we focus on how these optimizations affect both static and dynamic branch prediction. Code reordering mainly increases the number of not taken branches, which benefits simple static predictors, whic

    Branch Prediction Using Profile Data

    No full text
    Branch prediction accuracy is a very important factor for superscalar processor performance. It is the ability to predict the outcome of a branch which allows the processor to effectively use a large instruction window, and extract a larger amount of ILP

    Generating time-varying road network data using sparse trajectories

    While research on time-varying graphs has attracted recent attention, the research community has limited or no access to real datasets to develop effective algorithms and systems. Using noisy and sparse GPS traces from vehicles, we develop a time-varying road network data set where edge weights differ over time. We present our methodology and share this dataset, along with a graph manipulation tool. We estimate the traffic conditions using the sparse GPS data available by characterizing the sparsity issues and assessing the properties of the travel sequence data in the frequency domain. We develop interpolation methods to fill in the sparse data and obtain a complete graph dataset with realistic time-varying edge values. We evaluate the performance of time-varying and static shortest path solutions over the generated dynamic road network. The shortest paths using the dynamic graph produce very different results from the static version. We provide an independent Java API and a graph database to analyze and manipulate the generated time-varying graph data easily, without requiring any knowledge of the internals of the graph database system. We expect our solution to support researchers in pursuing problems of time-varying graphs in terms of theoretical, algorithmic, and systems aspects. The data and Java API are available at: http://elif.eser.bilkent.edu.tr/roadnetwork
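To illustrate why shortest paths on a time-varying graph can differ from the static case, here is a minimal Python sketch of a time-dependent Dijkstra search. It is not the authors' Java API or method: the function name `td_shortest_path`, the per-slot weight lists, and the `slot_len` parameter are all assumptions made for illustration, and the search is only guaranteed correct if the network is FIFO (departing later never lets you arrive earlier).

```python
import heapq


def td_shortest_path(graph, source, target, depart, slot_len=3600):
    """Earliest-arrival search on a graph with time-varying edge weights.

    graph[u] is a list of (v, weights) pairs, where weights[k] is the
    travel time of edge u->v when it is entered during time slot k
    (hypothetical layout; slots wrap around cyclically).
    Returns the travel time from source to target, or None if unreachable.
    """
    best = {source: depart}          # earliest known arrival time per node
    heap = [(depart, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == target:
            return t - depart
        if t > best.get(u, float("inf")):
            continue                 # stale queue entry
        for v, weights in graph.get(u, []):
            slot = int(t // slot_len) % len(weights)
            arrive = t + weights[slot]
            if arrive < best.get(v, float("inf")):
                best[v] = arrive
                heapq.heappush(heap, (arrive, v))
    return None
```

With a direct edge that is fast in slot 0 but congested in slot 1, the best route changes with the departure time, which a single static weight per edge cannot express.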

    Instruction Fetch Architectures and Code Layout Optimizations

    No full text
    The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors, to the more aggressive wide issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms. Keywords—Branch prediction, code layout, instruction fetch, trace cache.

    The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors

    No full text
    Although cache-coherent shared-memory multiprocessors are often used to run commercial workloads, little work has been done to characterize how well these machines support such workloads. In particular, we do not have much insight into the demands of commercial workloads on the memory subsystem of these machines. In this paper, we analyze in detail the memory access patterns of several queries that are representative of Decision Support System (DSS) databases. Our analysis shows that the memory use of queries differs largely depending on how the queries access the database data, namely via indices or by sequentially scanning the records. The former queries, which we call Index queries, suffer most of their shared-data misses on indices and on lock-related metadata structures. The latter queries, which we call Sequential queries, suffer most of their shared-data misses on the database records as they are scanned. An analysis of the data locality in the queries shows that both Index and ..

    Solution of Strictly Diagonal Dominant Tridiagonal Systems on Vector Computers

    No full text
    In this paper we propose the Overlapped Partitions Method (OPM) which is a new parallel solver for strictly diagonal dominant banded systems of equations. OPM is studied here for the case of tridiagonal systems of equations. OPM is compared with early terminated versions of the Cyclic Reduction and the Tricyclic Reduction methods on the vector computer Convex C-3480. Cyclic Reduction is a tridiagonal solver broadly used on vector and parallel computers. Tricyclic Reduction is a fairly new tridiagonal solver very well adapted to the architecture of vector computers. For the comparison of the three methods we build models of the algorithms and use these models to compare their execution times on the Convex C-3480. In the paper we also propose a new criterion for the early termination of TR. Key words. tridiagonal systems, vector processors, parallel numerical algorithms. AMS (MOS) subject classifications. 65F05, 65W05. Abbreviated title: Tridiagonal systems on vector computers. 1. In..
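For context on the problem these solvers target, the following Python sketch shows the standard sequential Thomas algorithm for a tridiagonal system. It is not OPM, Cyclic Reduction, or Tricyclic Reduction, which restructure the computation for vector and parallel hardware; it is only the serial baseline. The function name and the argument layout (three diagonals of equal length, with `lower[0]` and `upper[-1]` ignored) are assumptions for illustration. Strict diagonal dominance is what makes elimination without pivoting safe here, since no pivot can become zero.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    Row i of the system reads
        lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i],
    where lower[0] and upper[-1] are ignored. All inputs have length n.
    Assumes the matrix is strictly diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Both loops carry a dependence from one iteration to the next, which is why this form vectorizes poorly and why reduction-based or partitioned methods are studied for vector machines.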

    Bounds for the Error of Some Parallel Bidiagonal Solvers for Strictly Diagonal Dominant Systems

    No full text
    In this paper, the numerical aspects of some methods for the solution of bidiagonal systems are analyzed. We suppose that the systems are strictly diagonal dominant. The methods analyzed are R-Cyclic Reduction, the Divide and Conquer algorithm and the Overlapped Partitions Method. In order to give completeness to the paper, we also describe the methods analyzed here. For the case of R-Cyclic Reduction and the Divide and Conquer algorithm, a unified early termination criterion is given. For the case of the Overlapped Partitions Method, a criterion for the amount of overlapping is proposed. Key words: bidiagonal systems solvers, strict diagonal dominance, parallel numerical algorithms. 1 Introduction The solution of bidiagonal systems of equations appears in different applications. For instance, when several tridiagonal systems have to be solved repeated times with the same matrix. In this case, it is convenient to find an LU decomposition of the matrix. Two bidiagonal systems have to..

    Vectorized Algorithms for Natural Cubic Spline and B-Spline Curve Fitting

    No full text
    In this paper we deal with the solution of the almost Toeplitz tridiagonal systems that arise from the problem of curve fitting by Natural Cubic Splines and B-Splines. We propose the TJ decomposition that gives rise to a method which is more accurate and faster than other previously proposed methods as we prove along the work. For the solution of the recurrences that arise from the decomposition, we propose a specialization of the Overlapped Partitions Method (OPM). We show that OPM compares favourably in the context of the problem to the classic Divide and Conquer and R-Cyclic Reduction on the Convex C-3480 supercomputer