Empowering a helper cluster through data-width aware instruction selection policies
Narrow values, which can be represented with fewer bits than the full machine width, occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. These attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. helper clusters), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. As our main focus, we then propose various ideas for selecting suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms that also consider dependency, inter-cluster communication and load imbalance. Using these techniques, the performance of a wide range of workloads is substantially increased; the helper cluster achieves an average speedup of 11% across 412 applications. When focusing on integer applications, the speedup is as high as 22% on average.
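The abstract names the three steering criteria (dependency, inter-cluster communication, load imbalance) without detailing the algorithm. Below is a minimal sketch of what a data-width aware steering heuristic of this flavour could look like; the class names, width predictor, and thresholds are illustrative assumptions, not the paper's actual policy.

```python
# A minimal sketch of data-width aware instruction steering between a 32-bit
# main cluster and an 8-bit helper cluster. All names and thresholds are
# hypothetical; this is not the paper's algorithm.
from dataclasses import dataclass, field

HELPER_WIDTH = 8        # helper cluster handles values of at most 8 bits
IMBALANCE_LIMIT = 16    # max tolerated backlog difference before falling back

@dataclass
class Cluster:
    name: str
    pending: int = 0    # instructions waiting in this cluster's queues

@dataclass
class Instr:
    pred_width: int                      # predicted result width in bits
    producer_clusters: list = field(default_factory=list)

def steer(instr: Instr, main: Cluster, helper: Cluster) -> Cluster:
    """Steer narrow instructions to the helper cluster unless doing so would
    add inter-cluster communication or worsen load imbalance."""
    narrow = instr.pred_width <= HELPER_WIDTH
    in_helper = sum(1 for c in instr.producer_clusters if c is helper)
    in_main = len(instr.producer_clusters) - in_helper
    overloaded = helper.pending - main.pending > IMBALANCE_LIMIT
    if narrow and not overloaded and in_helper >= in_main:
        helper.pending += 1
        return helper
    main.pending += 1
    return main

# Example: a narrow add whose only producer already lives in the helper cluster.
main, helper = Cluster("main32"), Cluster("helper8")
print(steer(Instr(pred_width=8, producer_clusters=[helper]), main, helper).name)
```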
Elements of latent learning in a maze environment
A general-purpose learning program is described which demonstrates a latent learning ability by operating at two separate goal pursuit levels. At one level are the constant, implicit goals associated with the system's memory management mechanisms. At the higher level are the dynamic, explicit behavioral goals which the implicit goals enable by manipulating memory representations to conform to the external surroundings. The program is shown to negotiate a simulated maze environment by the step-wise refinement of its latently learned experiences.
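To make the two-level idea concrete, here is a minimal sketch of latent learning in a maze: structure is recorded during goal-free exploration (the implicit, memory-oriented level) and only exploited once an explicit goal appears. The toy maze and the BFS planner are assumptions for illustration, not the described program.

```python
# A minimal sketch of latent learning: record maze connectivity with no reward,
# then reuse that stored map as soon as an explicit goal is introduced.
from collections import deque

learned_map = {}  # adjacency recorded during goal-free exploration

def explore(a: str, b: str) -> None:
    """Record that cells a and b are connected, with no explicit goal present."""
    learned_map.setdefault(a, set()).add(b)
    learned_map.setdefault(b, set()).add(a)

def plan(start: str, goal: str) -> list:
    """When a goal appears, reuse the latently learned map (breadth-first search)."""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in learned_map.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return []

# Goal-free wandering, then a goal is placed at cell D.
for edge in [("A", "B"), ("B", "C"), ("B", "D")]:
    explore(*edge)
print(plan("A", "D"))   # -> ['A', 'B', 'D']
```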
Fetch unit design for scalable simultaneous multithreading (ScSMT)
Continuous IC process enhancements make it possible to integrate on a single chip the resources required for simultaneously executing multiple control flows or threads, exploiting different levels of thread-level parallelism: application-, function-, and loop-level. Scalable simultaneous multithreading combines static and dynamic mechanisms to assemble a complexity-effective design that provides high instructions-per-cycle rates without sacrificing cycle time or single-thread performance. This paper addresses the design of the fetch unit for a high-performance, scalable, simultaneous multithreaded processor. We present the detailed microarchitecture of a clustered and reconfigurable fetch unit based on an existing single-thread fetch unit. In order to minimize the occurrence of fetch hazards, the fetch unit dynamically adapts to the available thread-level parallelism and to the fetch characteristics of the active threads, working as a single shared unit or as two separate clusters. It combines static and dynamic methods in a complexity-efficient way. The design is supported by a simulation-based analysis of different instruction cache and branch target buffer configurations in the context of a multithreaded execution workload. Average reductions in the miss rates between 30% and 60% and peak reductions greater than 200% are obtained.
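The abstract says the fetch unit switches between one shared unit and two clusters depending on the available thread-level parallelism and the threads' fetch behaviour. A minimal sketch of such an interval-based mode decision is given below; the inputs, threshold, and two-way split are assumptions, not the paper's exact mechanism.

```python
# A minimal, hypothetical policy for reconfiguring the fetch unit each interval.
def fetch_mode(active_threads: int, avg_fetch_block: float) -> str:
    """Return 'shared' or 'split' for the next interval.

    With little thread-level parallelism (or long fetch blocks), one wide
    shared unit avoids fragmenting fetch bandwidth; with several active
    threads fetching short blocks, two clusters reduce fetch hazards."""
    if active_threads <= 1:
        return "shared"
    if avg_fetch_block >= 8.0:      # long basic blocks favour a single wide unit
        return "shared"
    return "split"

# Example: four active threads fetching short blocks -> run as two clusters.
print(fetch_mode(active_threads=4, avg_fetch_block=5.2))
```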
Instruction history management for high-performance microprocessors
History-driven dynamic optimization is an important factor in improving instruction throughput in future high-performance microprocessors. History-based techniques have the ability to improve instruction-level parallelism by breaking program dependencies, eliminating long-latency microarchitecture operations, and improving prioritization within the microarchitecture. However, a combination of factors, such as wider issue widths, smaller transistors, larger die area, and increasing clock frequency, has led to microprocessors that are sensitive to both wire delays and energy consumption. In this environment, the global structures and long-distance communications that characterize current history data management are limiting instruction throughput.
This dissertation proposes the ScatterFlow Framework for Instruction History Management. Execution history management tasks, such as history data storage, access, distribution, collection, and modification, are partitioned and dispersed throughout the instruction execution pipeline. History data packets are then associated with active instructions and flow with the instructions as they execute, encountering the history management tasks along the way. Between dynamic instances of the instructions, the history data packets reside in trace-based history storage that is synchronized with the instruction trace cache. Compared to traditional history data management, this ScatterFlow method improves instruction coverage, increases history data access bandwidth, shortens communication distances, improves history data accuracy in many cases, and decreases the effective history data access time.
A comparison of general history management effectiveness between the ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow Framework provides superior history maturity and instruction coverage. The unique properties that arise due to trace-based history storage and partitioned history management are analyzed, and novel design enhancements are presented to increase the usefulness of instruction history data within the ScatterFlow Framework.
To demonstrate the potential of the proposed framework, specific dynamic optimization techniques are implemented using the ScatterFlow Framework. These illustrative examples combine the history capture advantages with the access latency improvements while exhibiting desirable dynamic energy consumption properties. Compared to a traditional table-based predictor, performing ScatterFlow value prediction improves execution time and reduces dynamic energy consumption. In other detailed examples, ScatterFlow-enabled cluster assignment demonstrates improved execution time over previous cluster assignment schemes, and ScatterFlow instruction-level profiling detects more useful execution traits than traditional fixed-size and infinite-size hardware tables.
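The central idea is that history packets travel with instructions and are parked in trace-synchronized storage between dynamic instances. The sketch below illustrates that life cycle (dispatch, execute, commit) for a value-prediction-style packet; the stage functions, packet fields, and trace-keyed dictionary are hypothetical stand-ins, not the dissertation's actual framework.

```python
# A minimal sketch of a history packet flowing with an instruction through the
# pipeline and returning to trace-based storage at commit. Names are illustrative.
from dataclasses import dataclass

@dataclass
class HistoryPacket:
    last_value: int = 0         # e.g. for value prediction
    confidence: int = 0
    preferred_cluster: int = 0  # e.g. for cluster assignment

# Trace-based history storage: one packet per (trace, slot), kept alongside
# the corresponding trace cache line.
trace_history = {}

def dispatch(trace_id: int, slot: int) -> HistoryPacket:
    """On dispatch, attach the stored packet (or a fresh one) to the instruction."""
    return trace_history.setdefault((trace_id, slot), HistoryPacket())

def execute(pkt: HistoryPacket, result: int) -> None:
    """At execute, update the packet in place as it travels with the instruction."""
    pkt.confidence = pkt.confidence + 1 if result == pkt.last_value else 0
    pkt.last_value = result

def commit(trace_id: int, slot: int, pkt: HistoryPacket) -> None:
    """At commit, the packet is written back next to its trace cache line."""
    trace_history[(trace_id, slot)] = pkt

# Example: one dynamic instance of instruction slot 3 in trace 42.
pkt = dispatch(42, 3)
execute(pkt, result=7)
commit(42, 3, pkt)
print(trace_history[(42, 3)])
```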
A Survey of Techniques for Architecting TLBs
A “translation lookaside buffer” (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers.
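For readers less familiar with why TLB misses are so costly, here is a minimal software model of the hit/miss behaviour the survey is concerned with: a small fully associative TLB with LRU replacement backed by a page-table lookup on a miss. The sizes and the flat dictionary standing in for the page-table walk are assumptions for illustration only.

```python
# A toy TLB model: hits are cheap, misses trigger a (much slower) page-table walk.
from collections import OrderedDict

PAGE_SHIFT = 12            # 4 KiB pages
TLB_ENTRIES = 64

page_table = {0x1000: 0x8000, 0x1001: 0x9000}   # VPN -> PFN (toy mapping)
tlb = OrderedDict()                               # VPN -> PFN, LRU order

def translate(vaddr: int) -> int:
    """Translate a virtual address, filling the TLB on a miss."""
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)
    if vpn in tlb:                     # TLB hit: the frequent, fast case
        tlb.move_to_end(vpn)           # refresh LRU position
        pfn = tlb[vpn]
    else:                              # TLB miss: costly page-table walk
        pfn = page_table[vpn]
        if len(tlb) >= TLB_ENTRIES:
            tlb.popitem(last=False)    # evict the least recently used entry
        tlb[vpn] = pfn
    return (pfn << PAGE_SHIFT) | offset

print(hex(translate(0x1000123)))       # VPN 0x1000 -> PFN 0x8000, offset 0x123
```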
Design of a distributed memory unit for clustered microarchitectures
Power constraints led to the end of the exponential growth in single-processor performance that characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed the performance growth to continue so far. Yet, Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small growth in single-processor performance can justify the use of significant resources.
Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters' components. Because the clusters together process a single instruction stream, communications between clusters are necessary and introduce an additional cost.
This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention.
The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that result from the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues. Deadlocks can result from certain queue allocations because entries are allocated out of order rather than in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering, such as Alpha, PowerPC or ARMv7, can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
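As a concrete illustration of the first proposal, here is a minimal sketch of a last-outcome cache-bank predictor, one of the simplest predictor styles one might compare; the table size, line-interleaved bank mapping, and PC indexing are assumptions for illustration, not the thesis's designs.

```python
# A toy last-outcome cache-bank predictor: remember the bank each static
# load/store last touched and predict it again next time.
N_BANKS = 4
TABLE_SIZE = 1024

last_bank = [0] * TABLE_SIZE   # per-PC record of the last accessed bank

def predict_bank(pc: int) -> int:
    """Predict the cache bank a memory instruction will access, so that
    consumers of its data can be mapped near the data origin."""
    return last_bank[pc % TABLE_SIZE]

def update_bank(pc: int, addr: int, line_size: int = 64) -> int:
    """Once the address is known, record the true bank (line-interleaved)."""
    bank = (addr // line_size) % N_BANKS
    last_bank[pc % TABLE_SIZE] = bank
    return bank

# Example: a load at PC 0x400 that repeatedly touches addresses in bank 2.
update_bank(0x400, 0x1280)     # 0x1280 // 64 = 74, and 74 % 4 = 2
print(predict_bank(0x400))     # -> 2
```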
Dynamically managing the communication-parallelism trade-off in future clustered processors
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), inter-cluster communication increases as data values get spread across a wider area. As a result of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication-parallelism trade-off inherent in the communication-bound processors of the future.
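The abstract describes an interval-based algorithm that picks how many clusters to enable from metrics gathered at run time. The sketch below shows the general shape of such a selection step, assuming the hardware briefly samples each configuration at the start of a phase; the configuration list, sampling scheme, and IPC metric are illustrative assumptions, not the paper's algorithm.

```python
# A minimal sketch of interval-based reconfiguration: keep the cluster count
# whose sampled IPC was best for the current program phase.
CLUSTER_CONFIGS = [2, 4, 8, 16]

def pick_config(sampled_ipc: dict) -> int:
    """Choose the cluster count with the highest IPC observed while briefly
    sampling each configuration; fewer clusters win when communication
    costs outweigh the extra parallelism."""
    return max(CLUSTER_CONFIGS, key=lambda n: sampled_ipc.get(n, 0.0))

# Example: for this phase, inter-cluster communication makes 16 clusters
# slower than 8, so the algorithm settles on 8.
samples = {2: 1.1, 4: 1.6, 8: 1.9, 16: 1.7}
print(pick_config(samples))   # -> 8
```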
“My bitterness is deeper than the ocean”: understanding internalized stigma from the perspectives of persons with schizophrenia and their family caregivers.
Background: It is estimated that 8 million Chinese adults have a diagnosis of schizophrenia. Stigma associated with mental illness, which is pervasive in the Chinese cultural context, impacts both persons with schizophrenia and their family caregivers. However, a review of the literature found a dearth of research exploring internalized stigma from the perspectives of both patients and their caregivers.
Methods: We integrated data from standardized scales and narratives from semi-structured interviews obtained from eight family dyads. Interview narratives about stigma were analyzed using directed content analysis and compared with responses from the Chinese versions of the Internalized Stigma of Mental Illness Scale and the Affiliated Stigma Scale. Scores from the two scales and the number of text fragments were compared to assess the consistency of responses across the two methods. Profiles from three family dyads were analyzed to highlight the interactive aspect of stigma in a dyadic relationship.
Results: Our analyses suggested that persons with schizophrenia and their caregivers both internalized negative valuation from their social networks and reduced their engagement in the community. Participants with schizophrenia expressed a sense of shame and inferiority, spoke about being a burden to their family, and expressed self-disappointment as a result of having a psychiatric diagnosis. Caregivers expressed high levels of emotional distress because of the mental illness in the family. Family dyads varied in the extent to which internalized stigma was experienced by patients and caregivers.
Conclusions: Family plays a central role in caring for persons with mental illness in China. Given the increasingly community-based nature of mental health services delivery, understanding internalized stigma as a family unit is important to guide the development of culturally informed treatments. This pilot study provides a method that can be used to collect data that take into consideration the cultural nuances of Chinese societies.