3 research outputs found

    Synchronization-Point Driven Resource Management in Chip Multiprocessors.

    Get PDF
    With the proliferation of Chip Multiprocessors (CMPs), shared memory multi-threaded programs are expanding fast in every application domain. These programs exhibit execution characteristics that go beyond those observed in single-threaded programs, mainly due to data sharing and synchronization. To ensure that next generation CMPs will perform well on such anticipated workloads, it is vital to understand how these programs and architectures interact, and exploit the unique opportunities presented. This thesis examines the time-varying execution characteristics of the shared memory workloads in conjunction to the synchronization points that exist in the programs. The main hypothesis is that the type, the position, and the repetitive execution of synchronization constructs can be exploited to unfold important execution phases and enable new optimization opportunities. The research provides a simple application-driven approach for predicting the program behavior and effectively driving dynamic performance optimization and resource management actions in future CMPs. In the first part of this thesis, I show how synchronization points relate to various program-wide periodic behaviors. Based on the observations, I develop a framework where user-level synchronization primitives are exposed to the hardware and monitored to detect program phases and guide dynamic adaptation. Through workload-driven evaluation, I demonstrate the effectiveness of the framework in improving the performance/power in on-chip interconnects. The second part of the thesis explores in depth the inter-thread communication behaviors. I show that although synchronization points under the shared memory model do not expose any communication details, they indicate well the points where coherence communication patterns change or repeat. By leveraging this property, I design a synchronization-point-based coherence predictor that uncovers communication patterns with high accuracy, while consuming significantly less hardware resources compared to existing predictors. In the last part, I investigate the underlying reasons causing threads to wait in synchronization points, wasting resources. I show that these reasons can vary even across different programs phases, and existing critical-path predictors can render ineffective under certain conditions. I then present a new scheme that improves predictability by incorporating history information from previous points. The new design is robust and can amortize the run-time imbalances to improve the system's performance and/or energy

    Reconnaissance d'opérations d'algèbre linéaire dans un programme polyédrique

    Get PDF
    Writing a code which uses an architecture at its full capability has become an increasingly difficult problem over the last years. For some key operations, a dedicated accelerator or a finely tuned implementation exists and delivers the best performance. Thus, when compiling a code, identifying these operations and issuing calls to their high-performance implementation is attractive. In this dissertation, we focus on the problem of detection of these operations. We propose a framework which detects linear algebra subcomputations within a polyhedral program. The main idea of this framework is to partition the computation in order to isolate different subcomputations in a regular manner, then we consider each portion of the computation and try to recognize it as a combination of linear algebra operations.We perform the partitioning of the computation by using a program transformation called monoparametric tiling. This transformation partitions the computation into blocks, whose shape is some homothetic scaling of a fixed-size partitioning. We show that the tiled program remains polyhedral while allowing a limited amount of parametrization: a single size parameter. This is an improvement compared to the previous work on tiling, that forced us to choose between these two properties.Then, in order to recognize computations, we introduce a template recognition algorithm. This template recognition algorithm is built on a state-of-the-art program equivalence algorithm. We also propose several extensions in order to manage some semantic properties.Finally, we combine these two previous contributions into a framework which detects linear algebra subcomputations. A part of this framework is a library of template, based on the BLAS specification. We demonstrate our framework on several applications.Durant ces dernières années, Il est de plus en plus compliqué d'écrire du code qui utilise une architecture au mieux de ses capacités. Certaines opérations clefs ont soit un accélérateur dédié, ou admettent une implémentation finement optimisée qui délivre les meilleurs performances. Ainsi, il est intéressant d'identifier ces opérations pendant la compilation d'un programme, et de faire appel à une implémentation optimisée.Nous nous intéressons dans cette thèse au problème de détection de ces opérations. Nous proposons un procédé qui détecte des sous-calculs correspondant à des opérations d'algèbre linéaire à l'intérieur de programmes polyédriques. L'idée principale de ce procédé est de découper le programme en sous-calculs isolés, et essayer de reconnaître chaque sous-calculs comme une combinaison d'opérateurs d'algèbre linéaire.Le découpage du calcul est effectué en utilisant une transformation de programme appelée tuilage monoparamétrique. Cette transformation partitionne le calcul en tuiles dont la forme est un agrandissement paramétrique d'une tuile de taille constante. Nous montrons que le programme tuilé reste polyédrique tout en permettant une paramétrisation limitée des tailles de tuile. Les travaux précédents sur le tuilage nous forçaient à choisir l'une de ces deux propriétés.Ensuite, afin d'identifier les opérateurs, nous introduisons un algorithme de reconnaissance de template, qui est une extension d'un algorithme d'équivalence de programme. Nous proposons plusieurs extensions afin de tenir compte des propriétés sémantiques communément rencontrées en algèbre linéaire.Enfin, nous combinons les deux contributions précédentes en un procédé qui détecte les sous-calculs correspondant à des opérateurs d'algèbre linéaire. Une de ses composantes est une librairie de template, inspirée de la spécification BLAS. Nous démontrons l'efficacité de notre procédé sur plusieurs applications

    Using Tracing To Enhance Data Cache Performance in CPUs: The creation of a Trace-Assisted Cache to increase cache hits and decrease runtime

    Get PDF
    The processor-memory gap is widening every year with no prospect of reprieve. More and more latency is being added to program runtimes as memory cannot satisfy the demands of CPUs quickly enough. In the past, this has been alleviated through caches of increasing complexity or techniques like prefetching, to give the illusion of faster memory. However, these techniques have drawbacks because they are reactive or rely on incomplete information. In general, this leads to large amounts of latency in programs due to processor stalls. It is our contention that through tracing a program's data accesses and feeding this information back to the cache, overall program runtime can be reduced. This is achieved through a new piece of hardware called a Trace-Assisted Cache (TAC). This uses traces to gain foreknowledge of the memory requests the processor is likely to make, allowing them to be actioned before the processor requests the data, overlapping memory and computation instructions. Comparing the TAC against a standard CPU without a cache, we see improvements in runtimes of up to 65%. However, we see degraded performance of around 8% on average when compared to Set-Associative and Direct-Mapped caches. This is because improvements are swamped by high overheads and synchronisation times between components. We also see that benchmarks that exhibit several qualities: a balance of computation and memory instructions and keeping data well spread out in memory fare better using TAC than other benchmarks on the same hardware. Overall this demonstrates that whilst there is potential to reduce runtime via increasing the agency of the cache through Trace Assistance, it requires a highly efficient implementation to be competitive otherwise any potential gains are negated by the increase in overheads
    corecore