89 research outputs found

    Profiling I/O interrupts in modern architectures

    As applications grow increasingly communication-oriented, interrupt performance quickly becomes a crucial component of high performance I/O system design. At the same time, accurately measuring interrupt handler performance is difficult with the traditional simulation, instrumentation, or statistical sampling approaches. One of the most important components of interrupt performance is cache behavior. This paper presents a portable method for measuring the cache effects of I/O interrupt handling using native hardware performance counters. To provide a portability stress test, the method is demonstrated on two commercial platforms with different architectures, the SGI Origin 200 and the Sun Ultra-1. This case study uses the methodology to measure the overhead of the two most common forms of interrupt traffic: disk and network interrupts. The study demonstrates that the method works well and is reasonably robust. In addition, the results show that disk interrupts behave similarly on both platforms, while differences in OS organization cause network interrupts to behave very differently. Furthermore, network interrupts exhibit significantly larger cache footprints.
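    The paper's measurements use the native counters of the IRIX and Solaris platforms above; as a hedged illustration of the general idea only, the sketch below counts cache misses around a region of interest using the Linux perf_event interface, which is an assumption for the sketch and not the interface used in the paper.

```c
/* Illustrative only: count cache misses around a code region with Linux
 * perf_event. The paper used native counters on SGI/Sun hardware; this
 * interface is assumed purely for demonstration. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;
    pe.disabled = 1;        /* start stopped; enable around the region */
    pe.exclude_kernel = 0;  /* keep kernel events so interrupt handlers count */
    return (int)syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
}

int main(void)
{
    int fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: e.g. issue I/O that triggers interrupts ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) misses = 0;
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```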

    Hierarchical clustered register file organization for VLIW processors

    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first-level register file. The local register files are connected to a global second-level register file, which provides access to memory. All intercluster communications are done through the second-level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, communication insertion, register allocation and spill code insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades off performance and technology requirements.
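    As a rough, hedged sketch of the organization described (not the paper's actual design, sizes, or names), one can model clusters with local first-level register files and a shared second-level file through which all inter-cluster copies and memory traffic flow; everything below is made up for illustration.

```c
/* Toy model of a hierarchical clustered register file: each cluster owns a
 * small local RF; a shared global RF provides memory access and is the only
 * path for inter-cluster communication. Sizes and names are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define N_CLUSTERS   2
#define LOCAL_REGS   16   /* first-level, per cluster */
#define GLOBAL_REGS  64   /* second-level, shared     */

typedef struct {
    int64_t local[LOCAL_REGS];
} Cluster;

typedef struct {
    Cluster  cluster[N_CLUSTERS];
    int64_t  global[GLOBAL_REGS];
} RegFileHierarchy;

/* Inter-cluster move: an explicit copy through the second-level file,
 * mirroring the communication operations the scheduler must insert. */
static void move_between_clusters(RegFileHierarchy *rf,
                                  int src_cluster, int src_reg,
                                  int dst_cluster, int dst_reg,
                                  int via_global_reg)
{
    rf->global[via_global_reg] = rf->cluster[src_cluster].local[src_reg];
    rf->cluster[dst_cluster].local[dst_reg] = rf->global[via_global_reg];
}

int main(void)
{
    RegFileHierarchy rf = {0};
    rf.cluster[0].local[3] = 42;          /* value produced in cluster 0 */
    move_between_clusters(&rf, 0, 3, 1, 5, 7);
    printf("cluster1.r5 = %lld\n", (long long)rf.cluster[1].local[5]);
    return 0;
}
```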

    Scalable Parallel Computers for Real-Time Signal Processing

    We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Cray T3D at the Cray Eagan Center, and the Cray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introduced.
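    Of the programming models surveyed, message passing is the one most directly tied to the communication overheads discussed; the fragment below is a minimal, hedged illustration of that model using MPI (the survey discusses such communication libraries in general, not this specific code).

```c
/* Minimal message-passing illustration (MPI): rank 0 sends a block of
 * samples to rank 1, the kind of explicit communication whose overhead
 * the survey characterizes. Compile with mpicc, run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double samples[1024] = {0};

    if (rank == 0) {
        samples[0] = 3.14;                      /* pretend signal data */
        MPI_Send(samples, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(samples, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received, first sample = %f\n", samples[0]);
    }

    MPI_Finalize();
    return 0;
}
```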

    Widening resources: a cost-effective technique for aggressive ILP architectures

    The inherent instruction-level parallelism (ILP) of current applications (especially those based on floating-point computations) has driven hardware designers and compiler writers to investigate aggressive techniques for exploiting program parallelism at the lowest level. To execute more operations per cycle, many processors are designed with growing degrees of resource replication (buses and functional units). However, the high cost of this technique in terms of area and cycle time precludes the use of high degrees of replication. An alternative to resource replication is resource widening, which has also been used in some recent designs, in which the width of the resources is increased. In this paper we evaluate a broad set of design alternatives that combine both replication and widening. For each alternative we estimate the ILP limits (including the impact of spill code for several register file configurations) and the cost of the register file in terms of area and access time. We also perform a technological projection for the next 10 years in order to foresee the possible implementable alternatives. From this study we conclude that, when cost is taken into account, the best performance is obtained by combining certain degrees of replication and widening in the hardware resources. The results have been obtained from a large number of inner loops from numerical programs scheduled for VLIW architectures.
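    To make the replication/widening distinction concrete, the hedged sketch below contrasts the two for a simple element-wise add: replication issues independent scalar operations to duplicated units, while widening packs adjacent elements into one wider operand handled by a single wide unit. The types and the width of two are illustrative assumptions, not the configurations evaluated in the paper.

```c
/* Illustration of replication vs. widening for c[i] = a[i] + b[i].
 * Replication: unroll and feed two independent scalar adds to two FUs.
 * Widening:    pack two adjacent elements and issue one 2-wide add.
 * The width of 2 is an arbitrary choice for the sketch. */
#include <stdio.h>

typedef struct { double lo, hi; } wide2;   /* a 2-wide "register" */

static wide2 wide_add(wide2 a, wide2 b)    /* one widened operation */
{
    return (wide2){ a.lo + b.lo, a.hi + b.hi };
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

    /* Replication-style: two independent scalar adds per iteration,
     * each mapped to a separate (replicated) functional unit. */
    for (int i = 0; i < 4; i += 2) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }

    /* Widening-style: one wide operation per iteration over packed data. */
    for (int i = 0; i < 4; i += 2) {
        wide2 r = wide_add((wide2){a[i], a[i + 1]}, (wide2){b[i], b[i + 1]});
        c[i] = r.lo; c[i + 1] = r.hi;
    }

    printf("c[3] = %f\n", c[3]);
    return 0;
}
```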

    Speculative dynamic vectorization

    Traditional vector architectures have been shown to be very effective for regular codes where the compiler can detect data-level parallelism. However, this SIMD parallelism is also present in irregular or pointer-rich codes, where the compiler is far more limited in discovering it. In this paper we propose a microarchitecture extension to exploit SIMD parallelism in a speculative way. The idea is to predict, based on history information, when certain operations are likely to be vectorizable. In that case, these scalar instructions are executed in vector mode. These vector instructions operate on several elements (vector operands) that are anticipated to be their input operands and produce a number of outputs that are stored in a vector register in order to be used by further instructions. Verification of the correctness of the applied vectorization eventually changes the status of a given vector element from speculative to non-speculative, or alternatively generates a recovery action in case of misspeculation. The proposed microarchitecture extension applied to a 4-way issue superscalar processor with one wide bus is 19% faster than the same processor with 4 scalar buses to the L1 data cache. This speedup is mainly due to 1) the reduction in the number of memory accesses, 15% for SpecInt and 20% for SpecFP, 2) the transformation of scalar arithmetic instructions into their vector counterparts, 28% for SpecInt and 23% for SpecFP, and 3) the exploitation of control independence for mispredicted branches.
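    The microarchitecture itself cannot be reproduced in software, but the predict-then-verify idea can be hinted at with a toy model: a table indexed by a (hypothetical) instruction address learns the stride of a load's addresses; once confident, the next few addresses are assumed speculatively, and a later verification step would confirm them or trigger recovery. Table size, confidence threshold, and vector length below are assumptions for illustration.

```c
/* Toy sketch of the predict/verify idea behind speculative vectorization:
 * a small table, indexed by a load's PC, learns its address stride. Once
 * confident, the next VLEN addresses are assumed speculatively; real
 * hardware would verify them and squash on a mismatch. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define TABLE_SIZE 64
#define VLEN       4
#define CONF_MAX   3

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;     /* saturating counter */
} Entry;

static Entry table[TABLE_SIZE];

/* Returns true and fills pred[] when the entry is confident enough to
 * speculate on the next VLEN element addresses. */
static bool predict(uint64_t pc, uint64_t addr, uint64_t pred[VLEN])
{
    Entry *e = &table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride && e->confidence < CONF_MAX) e->confidence++;
    else if (stride != e->stride) { e->stride = stride; e->confidence = 0; }
    e->last_addr = addr;

    if (e->confidence < CONF_MAX) return false;
    for (int i = 0; i < VLEN; i++)
        pred[i] = addr + (uint64_t)((i + 1) * e->stride);
    return true;
}

int main(void)
{
    uint64_t pred[VLEN];
    /* A load at "PC" 0x400 walking an array with stride 8. */
    for (uint64_t a = 0x1000; a < 0x1100; a += 8) {
        if (predict(0x400, a, pred)) {
            /* Speculative vector execution would consume pred[0..VLEN-1];
             * mismatching actual addresses would later force recovery. */
            printf("speculating from %#llx\n", (unsigned long long)a);
            break;
        }
    }
    return 0;
}
```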

    Neue Prozessor-Architekturen für Desktop-PC

    The report gives an overview of the state of development and development trends of PC processors. Technical principles such as caches, pipelining, superscalar architecture, and out-of-order execution are covered and illustrated with examples. The application of these principles in the processor families of leading manufacturers is shown.

    Data prefetching using hardware register value predictable table.

    by Chin-Ming Cheung. Thesis (M.Phil.) -- Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 95-97).
    Contents:
    Chapter 1 -- Introduction: Overview; Objective; Organization of the dissertation
    Chapter 2 -- Related Works: Previous Cache Works; Data Prefetching Techniques (Hardware vs. Software Assisted; Non-selective vs. Highly Selective; Summary on Previous Data Prefetching Schemes)
    Chapter 3 -- Program Data Mapping: Regular and Irregular Data Access; Propagation of Data Access Regularity (in the High-Level Program, in Machine Code, in the Memory Address Sequence); Implication
    Chapter 4 -- Register Value Prediction Table (RVPT): Predictability of Register Values; Register Value Prediction Table; Control Scheme of the RVPT (Details of the RVPT Mechanism; Explanation of the Register Prediction Mechanism); Examples of the RVPT (Linear Array; Linked List)
    Chapter 5 -- Program Register Dependency: Register Dependency; Generalized Concept of Register (Cyclic Dependent Register, CDR; Acyclic Dependent Register, ADR); Program Register Overview
    Chapter 6 -- Generalized RVPT Model: Level N RVPT Model (Identification, Recording and Prediction of Level N CDRs); Level 2 Register Value Prediction Table (Structure; Identification of Level 2 CDRs; Control Scheme; Index Array Example)
    Chapter 7 -- Performance Evaluation: Evaluation Methodology (Trace-Driven Simulation; Architectural Method; Benchmarks and Metrics); General Results (Constant-Stride/Regular and Non-constant-Stride/Irregular Data Access Applications); Effect of Design Variations (Cache Size; Block Size; Set Associativity); Summary
    Chapter 8 -- Conclusion and Future Research
    Bibliography; Appendices A-F (MCPI and MCPI reduction percentage vs. cache size, block size, and set associativity)
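    As a hedged illustration of the kind of mechanism the thesis studies (not its exact RVPT design), the sketch below keeps one entry per architectural register holding the last observed value and stride; when an entry looks stable, the predicted next value is treated as an address and a prefetch is issued. Names, table size, and the prefetch stand-in are assumptions.

```c
/* Toy register-value-prediction table for prefetching: one entry per
 * architectural register records the last value and stride observed when
 * the register supplied a load address; a stable stride yields a prediction
 * of the next address, which is prefetched. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define NUM_REGS 32

typedef struct {
    uint64_t last_value;
    int64_t  stride;
    bool     stable;        /* stride repeated at least once */
} RvptEntry;

static RvptEntry rvpt[NUM_REGS];

/* Stand-in for a hardware prefetch; a real study would drive a cache model. */
static void prefetch(uint64_t addr)
{
    printf("prefetch %#llx\n", (unsigned long long)addr);
}

/* Called whenever register `reg` supplies a load address `value`. */
static void rvpt_update(int reg, uint64_t value)
{
    RvptEntry *e = &rvpt[reg];
    int64_t stride = (int64_t)(value - e->last_value);

    e->stable = (stride == e->stride);
    e->stride = stride;
    e->last_value = value;

    if (e->stable)          /* predict the register's next value and prefetch */
        prefetch(value + (uint64_t)e->stride);
}

int main(void)
{
    /* Register 5 walks a linear array with an 8-byte stride. */
    for (uint64_t addr = 0x2000; addr < 0x2040; addr += 8)
        rvpt_update(5, addr);
    return 0;
}
```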