32 research outputs found
Appearances of the Birthday Paradox in High Performance Computing
We give an elementary statistical analysis of two High Performance Computing issues, processor cache mapping and network port mapping. In both cases we find that, as in the birthday paradox, random assignment leads to more frequent coincidences than one would expect a priori. Since these coincidences correspond to contention for limited resources, the phenomenon has important consequences for performance. Texas Advanced Computing Center (TACC)
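The birthday-paradox effect referred to above can be made concrete with the standard collision formula. This is an illustrative sketch (not taken from the paper): the probability that at least two of n randomly assigned items land in the same one of m slots, where the slots stand in for cache sets or network ports.

```python
# Illustrative (not from the paper): birthday-paradox probability that at
# least two of n randomly assigned items share one of m slots (e.g., cache
# sets or network ports): P = 1 - prod_{i=0}^{n-1} (m - i)/m.
def collision_probability(n: int, m: int) -> float:
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision

# Classic sanity check: 23 people, 365 birthdays -> just over 50%.
print(round(collision_probability(23, 365), 3))  # → 0.507
```

The surprise is how small n can be relative to m: with n on the order of sqrt(m), a collision (i.e., contention for one resource) is already more likely than not.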
Memory Bandwidth and System Balance in HPC Systems
Presentation slides from an Invited Talk at the Supercomputing 2016 (sc16.supercomputing.org) conference. The “Attack of the Killer Micros” began approximately 25 years ago, as microprocessor-based systems began to compete with supercomputers (in some application areas). It became clear that peak arithmetic rate was not an adequate measure of system performance for many applications, so in 1991 Dr. McCalpin introduced the STREAM Benchmark to estimate “sustained memory bandwidth” as an alternative performance metric.
STREAM apparently embodied a good compromise between generality and ease of use and quickly became the “de facto” standard for measuring and reporting sustained memory bandwidth in High Performance Computing systems.
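The kernel at the heart of STREAM's headline number is the "triad" operation. The following is a minimal Python sketch of that idea only; the real STREAM benchmark is written in C/Fortran, and pure-Python timings are dominated by interpreter overhead and are not comparable to published STREAM results.

```python
import time

# Minimal sketch of STREAM's "triad" kernel, a[i] = b[i] + s*c[i].
# NOT the actual benchmark: STREAM is C/Fortran, and Python timings here
# measure interpreter overhead, not sustainable memory bandwidth.
N = 1_000_000                      # elements per array
s = 3.0
b = [1.0] * N
c = [2.0] * N

t0 = time.perf_counter()
a = [bi + s * ci for bi, ci in zip(b, c)]
elapsed = time.perf_counter() - t0

# Triad moves three arrays of 8-byte doubles: read b, read c, write a.
mbytes = 3 * 8 * N / 1e6
print(f"Triad: {mbytes / elapsed:.1f} MB/s (illustrative only)")
```

The "bytes moved" accounting (three streams of 8-byte elements) is the convention STREAM uses to turn a timed loop into a bandwidth figure.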
Since the initial “attack”, Moore’s Law and Dennard Scaling have led to astounding increases in the computational capabilities of microprocessors. The technology behind memory subsystems has not experienced comparable performance improvements, causing sustained memory bandwidth to fall behind.
This talk reviews the history of the changing balances between computation, memory latency, and memory bandwidth in deployed HPC systems, then discusses how the underlying technology changes led to these market shifts. Key metrics are the exponentially increasing relative performance cost of memory accesses and the massive increases in concurrency that are required to obtain increased memory throughput.
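The link between bandwidth and required concurrency mentioned above is Little's Law: concurrency equals bandwidth times latency. A small worked example, using hypothetical round numbers rather than figures from the talk:

```python
# Little's Law for memory: concurrency (bytes in flight) = bandwidth x latency.
# The numbers below are hypothetical round figures, not measurements from the talk.
bandwidth_bytes_per_s = 100e9      # 100 GB/s sustained
latency_s = 80e-9                  # 80 ns average memory latency
line_bytes = 64                    # cache line size

bytes_in_flight = bandwidth_bytes_per_s * latency_s
lines_in_flight = bytes_in_flight / line_bytes
print(f"{bytes_in_flight:.0f} bytes = {lines_in_flight:.0f} cache lines in flight")
```

Under these assumptions, sustaining 100 GB/s requires 125 cache-line transfers in flight at all times; a single core limited to, say, 10 outstanding line misses could sustain only 10 × 64 B / 80 ns = 8 GB/s, which is why many cores and aggressive prefetch engines are needed to approach the channel limit.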
New technologies (such as stacked DRAM) allow more pin bandwidth per package, but do not address the architectural issues that make high memory bandwidth expensive to support. Potential disruptive technologies include near-memory-processing and application-specific system implementations, but all foreseeable approaches fail to provide software compatibility with current architectures.
Due to the absence of practical alternatives, in the near term we can expect systems to become increasingly complex and unbalanced, with constant or slightly increasing per-node prices. These systems will deliver the best rate of performance improvement for workloads with increasingly high compute intensity and increasing available concurrency. National Science Foundation Award 1663578. Texas Advanced Computing Center (TACC)
Trends in System Cost and Performance Balances and Implications for the Future of HPC
Presentation slides from an Invited Talk at Co-HPC'15: 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing (held in conjunction with the Supercomputing conference SC15, sc15.supercomputing.org). For the last decade, HPC systems have been dominated by clusters of two-socket commodity x86 servers, typically equipped with a non-commodity high-performance interconnect. Trends in lifecycle costs and prices, hardware technology, several measures of CPU and memory performance, and application performance characteristics are presented from several non-traditional perspectives. The evolution of the various "balances" of these systems over time is discussed, both in the context of the interaction of application performance with the changing hardware and in the context of the broader economic environment. Several serious obstacles to maintaining previous performance growth rates are identified and discussed, and it is argued that these are better viewed as architectural and market issues rather than as fundamental technology issues. Overcoming these obstacles will require a fundamentally different approach to hardware architecture and programming languages, as well as to system configuration, deployment, and allocation strategies. National Science Foundation Award 1663578. Texas Advanced Computing Center (TACC)
Observations on Core Numbering and "Core ID's" in Intel Processors
This report describes and analyzes the patterns of logical processor distribution (across sockets) and the patterns of the "core ID" numbers provided by the hardware in recent and current Intel-processor-based systems at the Texas Advanced Computing Center. National Science Foundation awards 1663578 and 1854828. Texas Advanced Computing Center (TACC)
Mapping Core and L3 Slice Numbering to Die Locations in Intel Xeon Scalable Processors
A methodology for mapping from user-visible core and L3 slice numbers to locations on the processor die is presented, along with results obtained from systems with Intel Xeon Scalable Processors (“Skylake Xeon” and “Cascade Lake Xeon”) at the Texas Advanced Computing Center. The current methodology is based on the data traffic counters in the 2-D mesh on-chip network, with the measurements revealing unexpected and counterintuitive transformations of the meanings of “left” and “right” in different regions of the chip. Results show that the numbering of L3 slices is consistent across processor models, while the numbering of cores displays a small number of different patterns, depending on processor model and system vendor. National Science Foundation awards 1663578 and 1854828. Texas Advanced Computing Center (TACC)
Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors
Intel's second-generation Xeon Phi (Knights Landing) and Xeon Scalable Processor ("Skylake Xeon") are both based on a new 2-D mesh architecture with significant changes to the cache coherence protocol. This talk will review some of the most important new features of the coherence protocol (such as "snoop filters", "memory directories", and non-inclusive L3 caches) from a performance analysis perspective. For both of these processor families, the mapping from user-visible information (such as core numbers) to spatial location on the mesh is both undocumented and obscured by low-level renumbering. A methodology is presented that uses microbenchmarks and performance counters to invert this renumbering. This allows the display of spatially relevant performance counter data (such as mesh traffic) in a topologically accurate two-dimensional view. Applying these visualizations to simple benchmark results provides immediate intuitive insights into the flow of data in these systems, and reveals ways in which the new cache coherence protocols modify these flows. National Science Foundation Awards 1663578 and 1854828. Texas Advanced Computing Center (TACC)
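The inversion step described above can be sketched conceptually. The data below is synthetic and the mesh shape hypothetical; the real methodology reads per-tile mesh traffic counters while a pinned microbenchmark drives traffic through one core at a time, so that each logical core's "hottest" tile reveals its physical location.

```python
# Conceptual sketch of inverting the core renumbering, using SYNTHETIC data.
# In the real methodology, traffic[logical][tile] would come from per-tile
# mesh performance counters sampled while a pinned microbenchmark runs on
# one logical core; here the matrix is invented for illustration.
MESH_COLS = 4  # hypothetical 4-wide mesh

traffic = {                       # logical core id -> per-tile event counts
    0: [10, 900, 12, 11, 9, 13, 8, 10],
    1: [11, 9, 12, 10, 950, 13, 9, 10],
    2: [880, 10, 11, 12, 9, 10, 13, 9],
}

def locate(counts):
    tile = counts.index(max(counts))              # hottest tile = home tile
    return (tile % MESH_COLS, tile // MESH_COLS)  # (x, y) mesh coordinates

core_to_xy = {core: locate(c) for core, c in traffic.items()}
print(core_to_xy)  # → {0: (1, 0), 1: (0, 1), 2: (0, 0)}
```

Once this logical-to-(x, y) map is built, any per-tile counter data can be redrawn as a topologically accurate 2-D heat map of the die.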
Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 ("Knights Landing", "KNL") Processors
A methodology for mapping from user-visible core, CHA, and memory controller numbers to locations on the processor die is presented, along with results obtained from systems with Intel Xeon Phi x200 (“Knights Landing”, “KNL”) processors at the Texas Advanced Computing Center. The current methodology is based on the data traffic counters in the 2-D mesh on-chip network, with the measurements revealing unexpected and counterintuitive transformations of the meanings of “left”, “right”, “up”, and “down” in different regions of the chip. For the systems tested, all CHAs were active and had the same mapping of CHA number to physical location on the die. In contrast to our observations with Xeon Scalable Processors, the x2APIC IDs of the cores in Xeon Phi x200 are not mapped independently of the CHAs: the x2APIC ID of any enabled core contains the CHA number in bits [8:3]. Disabled cores are identified by x2APIC values not seen in any active core. In all cases tested, Logical Processor numbers were assigned to the active physical cores using a simple monotonic mapping. National Science Foundation Awards 1663578 and 1854828. Texas Advanced Computing Center (TACC)
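The bit-field relationship stated in the abstract (CHA number in bits [8:3] of the x2APIC ID) is a simple shift-and-mask. The interpretation of the low three bits below is an assumption for illustration; the report only specifies bits [8:3].

```python
# From the report: on Xeon Phi x200, the x2APIC ID of any enabled core
# carries its CHA number in bits [8:3].
def cha_from_x2apic(x2apic_id: int) -> int:
    return (x2apic_id >> 3) & 0x3F   # extract bits [8:3]

# ASSUMPTION (not stated in the report): the low bits [2:0] distinguish
# the cores and SMT threads sharing that CHA's tile.
def low_bits_from_x2apic(x2apic_id: int) -> int:
    return x2apic_id & 0x7           # bits [2:0]

# Hypothetical example: a core on CHA 37 with low bits 2.
x2apic = (37 << 3) | 2
print(cha_from_x2apic(x2apic), low_bits_from_x2apic(x2apic))  # → 37 2
```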
Address Hashing in Intel Processors
To implement a distributed shared last-level cache, addresses must be distributed across the set of cache "slices" in a way that maintains an acceptable degree of uniformity for many common access patterns. This presentation reviews the properties of the address hashes used in Intel Xeon Phi x200 and Intel Xeon Scalable Processors as determined by microbenchmark experimentation. Several cases of conflicts are discussed, along with possible workarounds. National Science Foundation Awards 1663578 and 1854828. Texas Advanced Computing Center (TACC)
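Hashes of this kind are typically built from XOR reductions (parities) of selected physical-address bits, which is the family of functions microbenchmark experiments can recover. The sketch below shows the shape of such a hash; the mask constants are hypothetical placeholders, not Intel's actual hash.

```python
# Sketch of the XOR/parity family of slice hashes: each bit of the slice
# number is the parity of (address AND per-bit mask). The masks below are
# HYPOTHETICAL placeholders for illustration, NOT Intel's actual constants.
HYPOTHETICAL_MASKS = [0x2B5B4C40, 0x4EAE8280, 0x1D5F0500]  # 3 bits -> 8 slices

def slice_number(addr: int, masks=HYPOTHETICAL_MASKS) -> int:
    result = 0
    for bit, mask in enumerate(masks):
        parity = bin(addr & mask).count("1") & 1  # parity of selected bits
        result |= parity << bit
    return result

# Uniformity matters: an access stride whose varying bits all fall in the
# masks' common zero positions would map every access to one slice (conflict).
print([slice_number(a * 64) for a in range(8)])
```

Microbenchmarking recovers the masks by observing, for controlled sets of addresses, which slice's traffic counters increment; strides that defeat the hash produce the slice-conflict cases the presentation discusses.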