123 research outputs found

    HeteroCore GPU to exploit TLP-resource diversity

    Intra-cluster coalescing to reduce GPU NoC pressure

    GPUs continue to increase the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Issuing redundant requests to access the same memory location wastes valuable NoC bandwidth: we find on average 19.4% (and up to 48%) of the requests to be redundant. To reduce redundant NoC traffic, we propose intra-cluster coalescing (ICC) to merge memory requests from different SMs in a cluster. Our evaluation results show that ICC achieves an average performance improvement of 9.7% (and up to 33%) over a conventional design.
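
The idea of merging requests from different SMs that target the same memory location can be illustrated with a small software sketch. This is not the hardware design from the paper: the 128-byte line size, the `(sm_id, address)` request format, and the function name are assumptions made for illustration, and all requests are treated as being in flight at the same time.

```python
from collections import OrderedDict

LINE_BYTES = 128  # assumed cache-line size, not taken from the abstract


def coalesce_cluster_requests(requests, line_bytes=LINE_BYTES):
    """Merge requests from SMs in one cluster that hit the same cache line.

    requests: iterable of (sm_id, byte_address) pairs (hypothetical format).
    Returns (merged, redundant_fraction); merged maps each cache-line base
    address to the list of SMs waiting on that line.
    """
    merged = OrderedDict()
    total = 0
    for sm_id, addr in requests:
        total += 1
        line = addr - (addr % line_bytes)  # align down to the line base
        merged.setdefault(line, []).append(sm_id)
    # Every request beyond the first per line would be redundant NoC traffic.
    redundant = total - len(merged)
    fraction = redundant / total if total else 0.0
    return merged, fraction


requests = [(0, 0), (1, 64), (2, 256), (0, 300), (3, 1024)]
merged, fraction = coalesce_cluster_requests(requests)
print(dict(merged), fraction)
```

In this toy trace, five requests collapse into three line fetches, so two of the five would not need to cross the shared network port.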

    Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure

    GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth: we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove redundant NoC traffic, we propose distributed-block scheduling, intra-cluster coalescing (ICC) and the coalesced cache (CC) to coalesce L1 cache misses within and across SMs in a cluster, respectively. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent) while at the same time reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling.

    CD-Xbar: a converge-diverge crossbar network for high-performance GPUs

    Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. Although a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully-connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility and improved performance per Watt (by 17.1 percent) over state-of-the-art GPU NoCs, which are highly customized and non-scalable.
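
As a rough illustration of how round-robin routing spreads traffic over the converged ports of one local crossbar, consider the sketch below. The abstract does not give the port count, the packet format, or the arbitration details, so `num_converged_ports`, the integer packets, and the function name are all hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin_router(num_converged_ports):
    """Return a routing function that assigns packets to ports in turn.

    num_converged_ports is a hypothetical parameter; the real CD-Xbar
    configuration is not stated in the abstract.
    """
    ports = cycle(range(num_converged_ports))

    def route(_packet):
        # The packet contents are ignored: only arrival order matters.
        return next(ports)

    return route


route = round_robin_router(2)
assignments = [route(packet) for packet in range(6)]
load = Counter(assignments)
print(assignments, dict(load))
```

Because the choice ignores the destination, each converged port receives the same number of packets regardless of which LLC slice a packet is headed for, which is the balancing effect the abstract attributes to round-robin routing.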

    Adaptive memory-side last-level GPU caching

    Emerging GPU applications exhibit increasingly high computation demands, which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for workloads with high degrees of data sharing across SMs, a private LLC leads to a significant performance advantage because of increased bandwidth to replicated cache lines across different LLC slices. In this paper, we propose adaptive memory-side last-level GPU caching to boost performance for sharing-intensive workloads that need high bandwidth to read-only shared data. Adaptive caching leverages a lightweight performance model that balances increased LLC bandwidth against increased miss rate under private caching. In addition to improving performance for sharing-intensive workloads, adaptive caching also saves energy in a (co-designed) hierarchical two-stage crossbar NoC by power-gating and bypassing the second stage if the LLC is configured as a private cache. Our experimental results using 17 GPU workloads show that adaptive caching improves performance by 28.1% on average (up to 38.1%) compared to a shared LLC for sharing-intensive workloads. In addition, adaptive caching reduces NoC energy by 26.6% on average (up to 29.7%) and total system energy by 6.1% on average (up to 27.2%) when configured as a private cache. Finally, we demonstrate through a GPU NoC design space exploration that a hierarchical two-stage crossbar is both more power- and area-efficient than full and concentrated crossbars with the same bisection bandwidth, thus providing a low-cost cooperative solution to exploit workload sharing behavior in memory-side last-level caches.

    Synergetic Effect of Plasmonic Gold Nanorods and MgO for Perovskite Solar Cells

    We report new structured perovskite solar cells (PSCs) using solution-processed TiO2/Au nanorods/MgO composite electron transport layers (ETLs). The proposed method is facile, convenient, and effective. Briefly, Au nanorods (NRs) were prepared and introduced into mesoporous TiO2 ETLs. Then, thin MgO overlayers were grown on the Au NR-modified ETLs by wet spinning and pyrolysis of the magnesium salt. By simultaneous use of Au NRs and MgO, the power conversion efficiency (PCE) of the PSC device increases from 14.7% to 17.4%, a relative enhancement of over 18.3% compared with the reference device without modification. Due to longitudinal plasmon resonances (LPRs) of gold nanorods, the embedded Au NRs exhibit the ability to significantly enhance the near-field and far-field (plasmonic scattering), increase the optical path length of incident photons in the device, and as a consequence, notably improve the external quantum efficiency (EQE) at wavelengths above 600 nm and the PCE of the PSCs. Meanwhile, the thin MgO overlayer also contributes to enhanced performance by reducing charge recombination in the solar cell. Theoretical calculations were carried out to elucidate the PV performance enhancement mechanisms.
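
The quoted enhancement is a relative gain over the reference device, not the difference in percentage points (which is 2.7). A minimal check of that arithmetic, using only the two PCE values from the abstract (the function and variable names are illustrative):

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


pce_reference = 14.7  # PCE of the unmodified reference device, in percent
pce_modified = 17.4   # PCE with Au nanorods and MgO, in percent
gain = relative_gain_percent(pce_reference, pce_modified)
print(f"{gain:.1f}%")  # prints 18.4%
```

The result, about 18.4%, is consistent with the abstract's "over 18.3%".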

    MLEE: A method for extracting object-level medical knowledge graph entities from Chinese clinical records

    As a typical knowledge-intensive industry, the medical field uses knowledge graph technology to construct causal inference calculations, such as “symptom-disease”, “laboratory examination/imaging examination-disease”, and “disease-treatment method”. The continuous expansion of large electronic clinical records provides an opportunity to learn medical knowledge by machine learning. In this process, how to extract entities with a medical logic structure and how to make entity extraction more consistent with the logic of the text content in electronic clinical records are two key issues in building a high-quality medical knowledge graph. In this work, we describe a method for extracting medical entities from real Chinese electronic clinical records. We define a computational architecture named MLEE to extract object-level entities with “object-attribute” dependencies. We conducted experiments based on randomly selected electronic clinical records of 1,000 patients from Shengjing Hospital of China Medical University to verify the effectiveness of the method.

    Visualization of ultrasonic wave field by stroboscopic polarization selective imaging

    A stroboscopic method based on polarization selective imaging is proposed for dynamic visualization of ultrasonic waves propagating in a transparent medium. Multiple independent polarization parametric images were obtained, which enabled quantitative evaluation of the distribution of the ultrasonic pressure in quartz. In addition to the optical phase difference δ detected in conventional photo-elastic techniques, the azimuthal angle φ and the Stokes parameter S2 of the polarized light are found to be highly sensitive to the wave-induced refractive index distribution, opening a new window on ultrasonic field visualization.

    Nadir CA-125 level as prognosis indicator of high-grade serous ovarian cancer

    PURPOSE: The capacity of nadir CA-125 levels to predict the prognosis of epithelial ovarian cancer remains controversial. This study aimed to explore whether the nadir CA-125 serum levels could predict the durations of overall survival (OS) and progression free survival (PFS) in patients with high-grade serous ovarian cancer (HG-SOC) from the USA and PRC. MATERIALS AND METHODS: A total of 616 HG-SOC patients from the MD Anderson Cancer Center (MDACC, USA) between 1990 and 2011 were retrospectively analyzed. The results of 262 cases from the Jiangsu Institute of Cancer Research (JICR, PRC) between 1992 and 2011 were used to validate the MDACC data. The CA-125 immunohistochemistry assay was performed on 280 tissue specimens. The Cox proportional hazards model and the log-rank test were used to assess the associations between the clinicopathological characteristics and duration of survival. RESULTS: The nadir CA-125 level was an independent predictor of OS and PFS (p < 0.01 for both) in the MDACC patients. Lower nadir CA-125 levels (≤10 U/mL) were associated with longer OS and PFS (median: 61.2 and 16.8 months with 95% CI: 52.0–72.4 and 14.0–19.6 months, respectively) than their counterparts (median: 49.2 and 10.5 months with 95% CI: 41.7–56.7 and 6.9–14.1 months, respectively). The nadir CA-125 levels in JICR patients were similarly independent when predicting the OS and PFS (p < 0.01 for both). Nadir CA-125 levels less than or equal to 10 U/mL were associated with longer OS and PFS (median: 59.9 and 15.5 months with 95% CI: 49.7–70.1 and 10.6–20.4 months, respectively), as compared with those more than 10 U/mL (median: 42.0 and 9.0 months with 95% CI: 34.4–49.7 and 6.6–11.2 months, respectively). Baseline serum CA-125 levels, but not the CA-125 expression in tissues, were associated with the OS and PFS of HG-SOC patients in the MDACC and JICR groups. However, these values were not independent. Nadir CA-125 levels were not associated with the tumor burden based on second-look surgery (p = 0.09). Patients who achieved a pathologic complete response had longer OS and PFS (median: 73.7 and 20.7 months with 95% CI: 63.7–83.7 and 9.5–31.9 months, respectively) than those with residual tumors (median: 34.6 and 10.6 months with 95% CI: 6.9–62.3 and 4.9–16.3 months, respectively). CONCLUSIONS: The nadir CA-125 level was an independent predictor of OS and PFS in HG-SOC patients. Further prospective studies are required to clinically optimize the chances for a complete clinical response of HG-SOC cases with higher CA-125 levels (>10 U/mL) at the end of primary treatment.

    KIR3DL1-Negative CD8 T Cells and KIR3DL1-Negative Natural Killer Cells Contribute to the Advantageous Control of Early Human Immunodeficiency Virus Type 1 Infection in HLA-B Bw4 Homozygous Individuals

    Bw4 homozygosity in human leukocyte antigen class B alleles has been associated with a delayed acquired immunodeficiency syndrome (AIDS) development and better control of human immunodeficiency virus type 1 (HIV-1) viral load (VL) than Bw6 homozygosity. Efficient CD8 T cell and natural killer (NK) cell functions have been described to restrain HIV-1 replication. However, the role of KIR3DL1 expression on these cells was not assessed in Bw4-homozygous participants infected with HIV-1 CRF01_A/E subtype, currently the most prevalent subtype in China. Here, we found that the frequency of KIR3DL1-expressing CD8 T cells of individuals homozygous for Bw6 [1.53% (0–4.56%)] was associated with a higher VL set point (Spearman rs = 0.59, P = 0.019), but this frequency of KIR3DL1+CD8+ T cells [1.37% (0.04–6.14%)] was inversely correlated with CD4 T-cell count in individuals homozygous for Bw4 (rs = −0.59, P = 0.011). Moreover, CD69 and Ki67 were more frequently expressed in KIR3DL1−CD8+ T cells in individuals homozygous for Bw4 than Bw6 (P = 0.046 for CD69; P = 0.044 for Ki67), although these molecules were less frequently expressed in KIR3DL1+CD8+ T cells than in KIR3DL1−CD8+ T cells in both groups (all P < 0.05). KIR3DL1−CD8+ T cells have stronger p24-specific CD8+ T-cell responses secreting IFN-γ and CD107a than KIR3DL1+CD8+ T cells in both groups (all P < 0.05). Thus, KIR3DL1 expression on CD8 T cells was associated with the loss of multiple functions. Interestingly, CD69+NK cells lacking KIR3DL1 expression were inversely correlated with HIV-1 VL set point in Bw4-homozygous individuals (rs = −0.52, P = 0.035). Therefore, KIR3DL1−CD8+ T cells with strong early activation and proliferation may, together with KIR3DL1−CD69+NK cells, play a protective role during acute/early HIV infection in individuals homozygous for Bw4. These findings highlight the superior functions of KIR3DL1−CD8+ T cells and KIR3DL1−CD69+NK cells as a potential factor contributing to delayed disease progression in the early stages of HIV-1 infection.