Empowering a helper cluster through data-width aware instruction selection policies
Narrow values, which can be represented with fewer bits than the full machine width, occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. These attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. helper clusters), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. As our main focus, we then propose various ideas for selecting suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms that also consider dependency, inter-cluster communication and load imbalance. Using these techniques, the performance of a wide range of workloads is substantially increased; the helper cluster achieves an average speedup of 11% across 412 applications. When focusing on integer applications, the speedup is as high as 22% on average.
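The abstract names the three steering criteria (dependency, inter-cluster communication, load imbalance) without detailing the algorithm. Below is a minimal sketch of what a data-width aware steering heuristic of this flavour could look like; the class names, width predictor, and thresholds are illustrative assumptions, not the paper's actual policy.

```python
# A minimal sketch of data-width aware instruction steering between a 32-bit
# main cluster and an 8-bit helper cluster. All names and thresholds are
# hypothetical; this is not the paper's algorithm.
from dataclasses import dataclass, field

HELPER_WIDTH = 8        # helper cluster handles values of at most 8 bits
IMBALANCE_LIMIT = 16    # max tolerated backlog difference before falling back

@dataclass
class Cluster:
    name: str
    pending: int = 0    # instructions waiting in this cluster's queues

@dataclass
class Instr:
    pred_width: int                      # predicted result width in bits
    producer_clusters: list = field(default_factory=list)

def steer(instr: Instr, main: Cluster, helper: Cluster) -> Cluster:
    """Steer narrow instructions to the helper cluster unless doing so would
    add inter-cluster communication or worsen load imbalance."""
    narrow = instr.pred_width <= HELPER_WIDTH
    in_helper = sum(1 for c in instr.producer_clusters if c is helper)
    in_main = len(instr.producer_clusters) - in_helper
    overloaded = helper.pending - main.pending > IMBALANCE_LIMIT
    if narrow and not overloaded and in_helper >= in_main:
        helper.pending += 1
        return helper
    main.pending += 1
    return main

# Example: a narrow add whose only producer already lives in the helper cluster.
main, helper = Cluster("main32"), Cluster("helper8")
print(steer(Instr(pred_width=8, producer_clusters=[helper]), main, helper).name)
```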
Elements of latent learning in a maze environment
A general-purpose learning program is described which demonstrates a latent learning ability by operating at two separate goal pursuit levels. At one level are the constant, implicit goals associated with the system's memory management mechanisms. At the higher level are the dynamic, explicit behavioral goals which the implicit goals enable by manipulating memory representations to conform to the external surroundings. The program is shown to negotiate a simulated maze environment by the step-wise refinement of its latently learned experiences.
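To make the two-level idea concrete, here is a minimal sketch of latent learning in a maze: structure is recorded during goal-free exploration (the implicit, memory-oriented level) and only exploited once an explicit goal appears. The toy maze and the BFS planner are assumptions for illustration, not the described program.

```python
# A minimal sketch of latent learning: record maze connectivity with no reward,
# then reuse that stored map as soon as an explicit goal is introduced.
from collections import deque

learned_map = {}  # adjacency recorded during goal-free exploration

def explore(a: str, b: str) -> None:
    """Record that cells a and b are connected, with no explicit goal present."""
    learned_map.setdefault(a, set()).add(b)
    learned_map.setdefault(b, set()).add(a)

def plan(start: str, goal: str) -> list:
    """When a goal appears, reuse the latently learned map (breadth-first search)."""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in learned_map.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return []

# Goal-free wandering, then a goal is placed at cell D.
for edge in [("A", "B"), ("B", "C"), ("B", "D")]:
    explore(*edge)
print(plan("A", "D"))   # -> ['A', 'B', 'D']
```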
Fetch unit design for scalable simultaneous multithreading (ScSMT)
Continuous IC process enhancements make it possible to integrate on a single chip the resources required for simultaneously executing multiple control flows or threads, exploiting different levels of thread-level parallelism: application-, function-, and loop-level. Scalable simultaneous multithreading combines static and dynamic mechanisms to assemble a complexity-effective design that provides high instructions-per-cycle rates without sacrificing cycle time or single-thread performance. This paper addresses the design of the fetch unit for a high-performance, scalable, simultaneous multithreaded processor. We present the detailed microarchitecture of a clustered and reconfigurable fetch unit based on an existing single-thread fetch unit. In order to minimize the occurrence of fetch hazards, the fetch unit dynamically adapts to the available thread-level parallelism and to the fetch characteristics of the active threads, working as a single shared unit or as two separate clusters. It combines static and dynamic methods in a complexity-efficient way. The design is supported by a simulation-based analysis of different instruction cache and branch target buffer configurations in the context of a multithreaded execution workload. Average reductions in the miss rates between 30% and 60% and peak reductions greater than 200% are obtained.
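The abstract says the fetch unit switches between one shared unit and two clusters depending on the available thread-level parallelism and the threads' fetch behaviour. A minimal sketch of such an interval-based mode decision is given below; the inputs, threshold, and two-way split are assumptions, not the paper's exact mechanism.

```python
# A minimal, hypothetical policy for reconfiguring the fetch unit each interval.
def fetch_mode(active_threads: int, avg_fetch_block: float) -> str:
    """Return 'shared' or 'split' for the next interval.

    With little thread-level parallelism (or long fetch blocks), one wide
    shared unit avoids fragmenting fetch bandwidth; with several active
    threads fetching short blocks, two clusters reduce fetch hazards."""
    if active_threads <= 1:
        return "shared"
    if avg_fetch_block >= 8.0:      # long basic blocks favour a single wide unit
        return "shared"
    return "split"

# Example: four active threads fetching short blocks -> run as two clusters.
print(fetch_mode(active_threads=4, avg_fetch_block=5.2))
```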
Instruction history management for high-performance microprocessors
History-driven dynamic optimization is an important factor in improving instruction throughput in future high-performance microprocessors. History-based techniques have the ability to improve instruction-level parallelism by breaking program dependencies, eliminating long-latency microarchitecture operations, and improving prioritization within the microarchitecture. However, a combination of factors, such as wider issue widths, smaller transistors, larger die area, and increasing clock frequency, has led to microprocessors that are sensitive to both wire delays and energy consumption. In this environment, the global structures and long-distance communications that characterize current history data management are limiting instruction throughput.
This dissertation proposes the ScatterFlow Framework for Instruction History Management. Execution history management tasks, such as history data storage, access, distribution, collection, and modification, are partitioned and dispersed throughout the instruction execution pipeline. History data packets are then associated with active instructions and flow with the instructions as they execute, encountering the history management tasks along the way. Between dynamic instances of the instructions, the history data packets reside in trace-based history storage that is synchronized with the instruction trace cache. Compared to traditional history data management, this ScatterFlow method improves instruction coverage, increases history data access bandwidth, shortens communication distances, improves history data accuracy in many cases, and decreases the effective history data access time.
A comparison of general history management effectiveness between the ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow Framework provides superior history maturity and instruction coverage. The unique properties that arise due to trace-based history storage and partitioned history management are analyzed, and novel design enhancements are presented to increase the usefulness of instruction history data within the ScatterFlow Framework.
To demonstrate the potential of the proposed framework, specific dynamic optimization techniques are implemented using the ScatterFlow Framework. These illustrative examples combine the history capture advantages with the access latency improvements while exhibiting desirable dynamic energy consumption properties. Compared to a traditional table-based predictor, performing ScatterFlow value prediction improves execution time and reduces dynamic energy consumption. In other detailed examples, ScatterFlow-enabled cluster assignment demonstrates improved execution time over previous cluster assignment schemes, and ScatterFlow instruction-level profiling detects more useful execution traits than traditional fixed-size and infinite-size hardware tables.
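The central idea is that history packets travel with instructions and are parked in trace-synchronized storage between dynamic instances. The sketch below illustrates that life cycle (dispatch, execute, commit) for a value-prediction-style packet; the stage functions, packet fields, and trace-keyed dictionary are hypothetical stand-ins, not the dissertation's actual framework.

```python
# A minimal sketch of a history packet flowing with an instruction through the
# pipeline and returning to trace-based storage at commit. Names are illustrative.
from dataclasses import dataclass

@dataclass
class HistoryPacket:
    last_value: int = 0         # e.g. for value prediction
    confidence: int = 0
    preferred_cluster: int = 0  # e.g. for cluster assignment

# Trace-based history storage: one packet per (trace, slot), kept alongside
# the corresponding trace cache line.
trace_history = {}

def dispatch(trace_id: int, slot: int) -> HistoryPacket:
    """On dispatch, attach the stored packet (or a fresh one) to the instruction."""
    return trace_history.setdefault((trace_id, slot), HistoryPacket())

def execute(pkt: HistoryPacket, result: int) -> None:
    """At execute, update the packet in place as it travels with the instruction."""
    pkt.confidence = pkt.confidence + 1 if result == pkt.last_value else 0
    pkt.last_value = result

def commit(trace_id: int, slot: int, pkt: HistoryPacket) -> None:
    """At commit, the packet is written back next to its trace cache line."""
    trace_history[(trace_id, slot)] = pkt

# Example: one dynamic instance of instruction slot 3 in trace 42.
pkt = dispatch(42, 3)
execute(pkt, result=7)
commit(42, 3, pkt)
print(trace_history[(42, 3)])
```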
A Survey of Techniques for Architecting TLBs
A “translation lookaside buffer” (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers.
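For readers less familiar with why TLB misses are so costly, here is a minimal software model of the hit/miss behaviour the survey is concerned with: a small fully associative TLB with LRU replacement backed by a page-table lookup on a miss. The sizes and the flat dictionary standing in for the page-table walk are assumptions for illustration only.

```python
# A toy TLB model: hits are cheap, misses trigger a (much slower) page-table walk.
from collections import OrderedDict

PAGE_SHIFT = 12            # 4 KiB pages
TLB_ENTRIES = 64

page_table = {0x1000: 0x8000, 0x1001: 0x9000}   # VPN -> PFN (toy mapping)
tlb = OrderedDict()                               # VPN -> PFN, LRU order

def translate(vaddr: int) -> int:
    """Translate a virtual address, filling the TLB on a miss."""
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)
    if vpn in tlb:                     # TLB hit: the frequent, fast case
        tlb.move_to_end(vpn)           # refresh LRU position
        pfn = tlb[vpn]
    else:                              # TLB miss: costly page-table walk
        pfn = page_table[vpn]
        if len(tlb) >= TLB_ENTRIES:
            tlb.popitem(last=False)    # evict the least recently used entry
        tlb[vpn] = pfn
    return (pfn << PAGE_SHIFT) | offset

print(hex(translate(0x1000123)))       # VPN 0x1000 -> PFN 0x8000, offset 0x123
```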
Design of a distributed memory unit for clustered microarchitectures
Power constraints led to the end of the exponential growth in single-processor performance that characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed the performance growth to continue so far. Yet, Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small growth in single-processor performance can justify the use of significant resources.
Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters' components. Because the clusters together process a single instruction stream, communications between clusters are necessary and introduce an additional cost.
This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention.
The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that result from the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues. Deadlocks can result from certain queue allocations because entries are allocated out of order rather than in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering, such as Alpha, PowerPC or ARMv7, can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
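As a concrete illustration of the first proposal, here is a minimal sketch of a last-outcome cache-bank predictor, one of the simplest predictor styles one might compare; the table size, line-interleaved bank mapping, and PC indexing are assumptions for illustration, not the thesis's designs.

```python
# A toy last-outcome cache-bank predictor: remember the bank each static
# load/store last touched and predict it again next time.
N_BANKS = 4
TABLE_SIZE = 1024

last_bank = [0] * TABLE_SIZE   # per-PC record of the last accessed bank

def predict_bank(pc: int) -> int:
    """Predict the cache bank a memory instruction will access, so that
    consumers of its data can be mapped near the data origin."""
    return last_bank[pc % TABLE_SIZE]

def update_bank(pc: int, addr: int, line_size: int = 64) -> int:
    """Once the address is known, record the true bank (line-interleaved)."""
    bank = (addr // line_size) % N_BANKS
    last_bank[pc % TABLE_SIZE] = bank
    return bank

# Example: a load at PC 0x400 that repeatedly touches addresses in bank 2.
update_bank(0x400, 0x1280)     # 0x1280 // 64 = 74, and 74 % 4 = 2
print(predict_bank(0x400))     # -> 2
```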
Dynamically managing the communication-parallelism trade-off in future clustered processors
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), inter-cluster communication increases as data values get spread across a wider area. As a result of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication-parallelism trade-off inherent in the communication-bound processors of the future.
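The abstract describes an interval-based algorithm that picks how many clusters to enable from metrics gathered at run time. The sketch below shows the general shape of such a selection step, assuming the hardware briefly samples each configuration at the start of a phase; the configuration list, sampling scheme, and IPC metric are illustrative assumptions, not the paper's algorithm.

```python
# A minimal sketch of interval-based reconfiguration: keep the cluster count
# whose sampled IPC was best for the current program phase.
CLUSTER_CONFIGS = [2, 4, 8, 16]

def pick_config(sampled_ipc: dict) -> int:
    """Choose the cluster count with the highest IPC observed while briefly
    sampling each configuration; fewer clusters win when communication
    costs outweigh the extra parallelism."""
    return max(CLUSTER_CONFIGS, key=lambda n: sampled_ipc.get(n, 0.0))

# Example: for this phase, inter-cluster communication makes 16 clusters
# slower than 8, so the algorithm settles on 8.
samples = {2: 1.1, 4: 1.6, 8: 1.9, 16: 1.7}
print(pick_config(samples))   # -> 8
```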
“My bitterness is deeper than the ocean”: understanding internalized stigma from the perspectives of persons with schizophrenia and their family caregivers.
Background: It is estimated that 8 million Chinese adults have a diagnosis of schizophrenia. Stigma associated with mental illness, which is pervasive in the Chinese cultural context, impacts both persons with schizophrenia and their family caregivers. However, a review of the literature found a dearth of research exploring internalized stigma from the perspectives of both patients and their caregivers.
Methods: We integrated data from standardized scales and narratives from semi-structured interviews obtained from eight family dyads. Interview narratives about stigma were analyzed using directed content analysis and compared with responses from the Chinese versions of the Internalized Stigma of Mental Illness Scale and the Affiliated Stigma Scale. Scores from the two scales and the number of text fragments were compared to assess the consistency of responses across the two methods. Profiles from three family dyads were analyzed to highlight the interactive aspect of stigma in a dyadic relationship.
Results: Our analyses suggested that persons with schizophrenia and their caregivers both internalized negative valuation from their social networks and reduced their engagement in the community. Participants with schizophrenia expressed a sense of shame and inferiority, spoke about being a burden to their family, and expressed self-disappointment as a result of having a psychiatric diagnosis. Caregivers expressed high levels of emotional distress because of the mental illness in the family. Family dyads varied in the extent to which internalized stigma was experienced by patients and caregivers.
Conclusions: Family plays a central role in caring for persons with mental illness in China. Given the increasingly community-based nature of mental health services delivery, understanding internalized stigma as a family unit is important to guide the development of culturally informed treatments. This pilot study provides a method that can be used to collect data that take into consideration the cultural nuances of Chinese societies.