20 research outputs found

    Runtime-guided management of stacked DRAM memories in task parallel programs

    Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited capacity is insufficient for modern HPC systems. For this reason, stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization that parallelizes the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks and can exploit this information to avoid unprofitable copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% over the state-of-the-art library for managing the stacked DRAM and 29% over a stacked DRAM architected as a hardware cache. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme (grant agreement 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Peer Reviewed. Postprint (author's final draft)
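    The runtime policy the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the capacity constant, the reuse threshold, and the chunked-copy scheme are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed stacked-DRAM capacity (16 GiB, as on Knights Landing's MCDRAM).
FAST_CAPACITY = 16 * 2**30

def worth_copying(size, expected_reuses, fast_used):
    """Reuse-aware policy: skip the copy when the data does not fit in the
    fast tier, or will not be reused enough to amortize the copy cost."""
    return fast_used + size <= FAST_CAPACITY and expected_reuses >= 2

def parallel_copy(src, dst, workers=4):
    """Split a copy into chunks and issue them on helper threads,
    mirroring the idea of overlapping copies with idle or extra threads."""
    n = len(src)
    chunk = (n + workers - 1) // workers

    def copy_range(lo):
        dst[lo:lo + chunk] = src[lo:lo + chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_range, range(0, n, chunk)))
    return dst
```

    In the paper's setting the copies move task data between off-chip memory and stacked DRAM; here plain byte buffers stand in for both tiers.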

    Improving throughput of simultaneous multithreaded (SMT) processors using shareable resource signatures and hardware thread priorities

    In this dissertation we present a methodology for predicting the best priority pair for a given co-schedule of two application threads. Our approach exploits resource-utilization information collected during an application thread's execution in single-threaded mode. This information provides insights about the availability of resources that are shared by threads concurrently executed in simultaneous multithreading (SMT) mode, for use by another co-scheduled application thread. The main contributions of this dissertation are: (1) Demonstration of the efficacy of using non-default hardware thread priority pairs to improve SMT core throughput: using a POWER5 simulator, we show that equal (default) priorities are not the best for 82% of the 263 application trace-pairs studied. (2) The concept of a Shareable Resource Signature: this signature characterizes an application's utilization of critical shareable SMT core resources during a specified execution time interval when executed in single-threaded mode. (3) A best-priority-pair prediction methodology: given the shareable resource signatures of an application-thread pair, we present a methodology to predict the best priority pair for that pair when co-scheduled to run in SMT mode. (4) An implementation and validation of the methodology for the IBM POWER5 processor, which shows the following: (a) 17 of 10,000 possible signatures are sufficient to characterize 95.6% of the execution times of a set of applications consisting of 20 SPEC CPU2006 benchmarks (1 data input), three NAS NPB benchmarks (3 data inputs), and 10 PETSc KSP solvers (12 data inputs). The cgs and lsqr PETSc KSP solvers have signatures that are independent of input data, and one of the three NAS NPB benchmarks (bt-mz) has a signature that is independent of the input data.
(b) For 21 co-schedules of applications, each with a signature that characterizes 95% of its execution time, our validation study shows the following: (i) Predicted best priorities yield higher throughput than default priorities for all but one of the 21 co-schedules. Initial results showed that two co-schedules, (462.libquantum, 437.leslie3d) and (bt-mz.A, lu-mz.A), experience a throughput loss of 7.46% and 20.05%, respectively, at predicted priorities, as compared to that achieved at default priorities. Further investigation shows that for the co-schedule (bt-mz.A, lu-mz.A), mapping and executing the co-schedule with the predicted best priorities on hardware threads (5, 4), instead of (4, 5), results in 3.56% higher throughput than default priorities; this is in contrast to the 20.05% throughput loss experienced when executed on hardware threads (4, 5). Although we have not verified it, one possible reason is that the processor core favors one hardware thread over the other. Re-executing the co-schedule (462.libquantum, 437.leslie3d) on hardware threads (5, 4), instead of (4, 5), still results in predicted priorities yielding lower throughput than the default priorities. Thus, we claim that predicted best priorities yield equal or higher throughput than default priorities for 20 of the 21 co-schedules studied, and for the outlier the throughput loss is 7.46%. (ii) Using non-default priorities improves throughput. The default priority pair yields the best throughput for only six of the 21 co-schedules. For the remaining 15, the default priority pair yields throughput that is between 0.74% and 14.10% lower than that achieved with the best priority pair. (iii) Using the predicted best priority pair, rather than default priorities, improves throughput or at least matches the throughput achieved with default priorities. For 11 of the 21 co-schedules, both the default and predicted priorities yield equal throughput.
For nine of the 21, predicted priorities yield throughput that is between 0.59% and 16.42% higher than that achieved with default priorities. For two of these nine co-schedules the predicted priority pair yields a throughput improvement of less than 5%; for three, the improvement is between 5% and 10%; and for the other four, the improvement is greater than 10%. (iv) Using predicted best priority pairs appears to be most applicable to floating-point-intensive applications: for eight co-schedules comprising applications whose utilization of the floating-point unit exceeds that of the fixed-point unit by 10% or more, the predicted priority pairs yield a throughput improvement between 3.56% and 16.42% over the default priorities. This result indicates that the methodology for predicting best priority pairs is most applicable to applications for which floating-point unit utilization dominates that of the fixed-point unit by at least 10%. (Abstract shortened by UMI.)
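    At prediction time, the methodology above reduces to a table lookup keyed by the two threads' signatures. The sketch below illustrates only that lookup structure; the signature encoding, the resource names, and the table entries are invented for the example and are not taken from the dissertation.

```python
# Illustrative signatures: the shared SMT resources a thread heavily uses
# when profiled alone in single-threaded mode (resource names are made up).
SIG = {
    "462.libquantum": frozenset({"load_store"}),
    "437.leslie3d": frozenset({"float", "load_store"}),
}

DEFAULT_PRIORITIES = (4, 4)  # equal (default) hardware thread priorities

def predict_priority_pair(app_a, app_b, best_pair_table):
    """Map a co-schedule's signature pair to its best priority pair,
    falling back to default priorities when the pair is unknown."""
    key = (SIG.get(app_a), SIG.get(app_b))
    return best_pair_table.get(key, DEFAULT_PRIORITIES)
```

    Falling back to the default pair for uncharacterized co-schedules matches the dissertation's observation that the prediction should never do worse than defaults when no signature information is available.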

    PIR: PMaC’s Idiom Recognizer

    Abstract—The speed of the memory subsystem often constrains the performance of large-scale parallel applications. Experts tune such applications to use hierarchical memory subsystems efficiently. Hardware accelerators, such as GPUs, can potentially improve memory performance beyond the capabilities of traditional hierarchical systems. However, the addition of such specialized hardware complicates code porting and tuning. During porting and tuning, expert application engineers manually browse source code and identify memory access patterns that are candidates for optimization and tuning. HPC applications typically contain thousands to hundreds of thousands of lines of code, creating a labor-intensive challenge for the expert. PIR, PMaC's Static Idiom Recognizer, automates the pattern recognition process: it recognizes specified patterns and tags the source code where they appear using static analysis. This paper describes the PIR implementation and defines a subset of idioms commonly found in HPC applications. We examine the effectiveness of the tool, demonstrating 95% identification accuracy, and present the results of using PIR on two HPC applications. Keywords: automation; performance; static analysis; tuning
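    A toy version of the tagging step can be written as a line scanner over C-like source text. PIR itself performs static compiler analysis, not regex matching, and the two "idioms" below (stream and gather) are common examples of memory access patterns, not necessarily PIR's actual idiom set; the sketch only illustrates the recognize-and-tag idea.

```python
import re

# Illustrative idiom patterns over C-like source lines (assumed names).
IDIOMS = {
    "stream": re.compile(r"\w+\[i\]\s*=\s*\w+\[i\]"),  # e.g. c[i] = a[i] + b[i]
    "gather": re.compile(r"\w+\[\w+\[i\]\]"),          # e.g. a[idx[i]]
}

def tag_idioms(lines):
    """Return (line_number, idiom_name) tags for every line matching
    one of the registered patterns."""
    tags = []
    for no, line in enumerate(lines, 1):
        for name, pattern in IDIOMS.items():
            if pattern.search(line):
                tags.append((no, name))
    return tags
```

    Tagging by line number mirrors how such a tool points the expert at candidate sites for optimization instead of making them browse the whole code base.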

    Safe use of recombinant activated factor VIIa for recalcitrant postoperative haemorrhage in cardiac surgery

    The aim of this case series is to review the effect of recombinant activated factor VIIa (rFVIIa) on refractory haemorrhage, despite aggressive treatment with conventional blood products and medications at our institution. All patients undergoing cardiac surgery who received rFVIIa as rescue therapy for persistent uncontrollable haemorrhage were studied. We examined coagulation immediately before and after rFVIIa was given: international normalized ratio (INR), activated partial thromboplastin time (APTT), fibrinogen and platelet levels, in addition to the use of red cell and non-red cell blood products, morbidity and mortality. Thirty patients (0.6%) received 31 doses of rFVIIa for bleeding refractory to conventional treatment. Twenty received rFVIIa in theatre after primary surgery, three after re-exploration and eight in the intensive care unit (ICU). Hospital mortality was 6.5% (2/30) and there were no documented thromboembolic phenomena. There was a significant reduction in red blood cell and product transfusion before and after rFVIIa administration (P < 0.001). There was significant correction of coagulation parameters after rFVIIa. Recombinant FVIIa appears to be safe, is effective in reducing red blood cell and product transfusion requirements, and may impact on early and late outcomes in this small, complex subgroup of patients.

    Cross sectional analysis of mandibular anthropometric points using CBCT to derive biometric measurements for a safer approach to mandible osteotomies

    Purpose: This study aims to derive a series of biometric measurements using cone-beam computed tomography (CBCT) from a cross-sectional group of the population to help the surgeon accurately locate the mandibular foramen and the mental foramen during mandibular osteotomies. Methods: CBCT images of 800 subjects were evaluated. Various measurements were noted and compared between the two sides of the mandible within and between the sexes. Results: Statistically significant values were noted between the right and left sides for Line X to Point A in female subjects, for Line Z and Line B only in male subjects, and for Line X’ in both male and female subjects. However, Line Y was found to be significant when comparing both sides in both males and females and also on correlation between the genders. Conclusion: Although the identification of the mandibular lingula and anatomical landmarks is an important step during mandibular osteotomies, the position on one side cannot be blindly extrapolated to the contralateral side. Preoperative CBCT is a useful tool to derive measurements which, when transferred clinically during surgery, give an accurate and safe approach for localisation of the lingula, thus reducing the incidence of postoperative neurologic morbidities.

    Long-term patency of 1108 radial arterial-coronary angiograms over 10 years

    Background. To avoid late vein graft atheroma and failure, we have used arterial grafts extensively in coronary operations. The radial artery (RA) is the conduit of second choice. This study determined the long-term patency of the RA as a coronary graft. Methods. Two independent observers evaluated 1108 consecutive postoperative RA conduit angiograms performed between January 1997 and June 2007 for cardiac symptoms. Mean time to postoperative angiography was 48.3 months (range, 1 to 132 months). An RA graft was considered failed (nonpatent) if there was stenosis exceeding 60%, string sign, or occlusion. Patency was determined over time, by coronary territory grafted, and by the degree of native coronary artery stenosis (NCAS). Results. At a mean of 48.3 months, 982 of the 1108 RA grafts (89%) were patent. RA patencies by territory were: left anterior descending, 96% (24 of 25); diagonal/intermediate, 90% (121 of 135); circumflex marginal, 89% (499 of 561); right coronary, 83% (38 of 46); posterior descending, 89% (253 of 286); and left ventricular branch/posterolateral, 86% (47 of 55). Patency was 87.5% (56 of 64) for NCAS of less than 60% compared with 89% (926 of 1044; p = 0.89) for NCAS exceeding 60%. Of 318 RAs in place more than 5 years, 294 (92.5%) were patent, and of 107 RAs in place for more than 7 years, 99 (92.5%) were patent. Patency was consistent through each year of the decade. Mechanisms of failure did not involve development of atherosclerosis. Patent RA grafts were smooth, with no angiographic evidence of atheroma. Conclusions. Late patencies of RA grafts are excellent and justify continuing use of the RA in coronary operations.