
    Reuse Distance Analysis for Large-Scale Chip Multiprocessors

    Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program's detailed memory behavior. Concurrent Reuse Distance (CRD) and Private-stack Reuse Distance (PRD) measure RD across thread-interleaved memory reference streams, addressing shared and private caches, respectively. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. However, such instability is minimal when all threads exhibit similar data-locality patterns. For loop-based parallel programs, interleaving threads are symmetric. CRD and PRD profiles are stable across cache size scaling, and exhibit predictable coherent movement across core count scaling. Hence, multicore RD analysis can provide accurate analysis for different processor configurations. Due to the prevalence of parallel loops, RD analysis will be valuable to multicore designers. This dissertation uses RD analysis to analyze multicore cache performance for loop-based parallel programs. First, we study the impacts of core count scaling and problem size scaling on CRD and PRD profiles. Two application parameters with architectural implications are identified: Ccore and Cshare. Core count scaling only impacts cache performance significantly below Ccore in shared caches, and Cshare is the capacity at which shared caches begin to outperform private caches in terms of data locality. Then, we develop techniques, in particular employing reference groups, to predict the coherent movement of CRD and PRD profiles due to scaling, and achieve accuracy of 80%-96%. After comparing our prediction techniques against profile sampling, we find that the prediction achieves higher speedup and accuracy, especially when the design space is large. Moreover, we evaluate the accuracy of using CRD and PRD profile predictions to estimate multicore cache performance, especially MPKI. When combined with the existing problem scaling prediction, our techniques can predict shared LLC (private L2 cache) MPKI to within 12% (14%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles. Lastly, we propose a new framework based on RD analysis to optimize multicore cache hierarchies. Our study not only reveals several new insights, but it also demonstrates that RD analysis can help computer architects improve multicore designs.

    Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance

    Performance on multicore processors is determined largely by on-chip cache. Computer architects have conducted numerous studies in the past that vary core count and cache capacity as well as problem size to understand the impact on cache behavior. These studies are very costly due to the combinatorial design spaces they must explore. Reuse distance (RD) analysis can help architects explore multicore cache performance more efficiently. One problem, however, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, undermining RD analysis benefits. But for parallel programs with symmetric threads, CRD profiles vary with architecture tractably: they change only slightly with cache capacity scaling, and shift predictably to larger CRD values with core count scaling. This enables analysis of a large number of multicore configurations from a small set of measured CRD profiles. This paper investigates using RD analysis to efficiently analyze multicore cache performance for parallel programs, making several contributions. First, we characterize how CRD profiles change with core count and cache capacity. One of our findings is that core count scaling degrades locality, but the degradation only impacts last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 128MB if problem size scales by 64x. Second, we apply reference groups to predict CRD profiles across core count scaling, and evaluate prediction accuracy. Finally, we use CRD profiles to analyze multicore cache performance. We find that predicted CRD profiles can estimate LLC MPKI within 76% of simulation for configurations without pathologic cache conflicts in 1/1200th the time to perform simulation of the full design space.
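
    As a rough illustration of the idea in this abstract, the sketch below computes reuse distances (LRU stack distances) over a single thread's reference stream and over an interleaved stream of two symmetric threads. It is a minimal sketch, not the authors' tooling: the function names, the round-robin interleaving, the toy traces, and the fully associative LRU cache model are all assumptions made for the example.

```python
def reuse_distances(trace):
    """LRU stack distance of each reference; None marks a first touch (cold miss)."""
    stack = []          # LRU stack, most recently used block last
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Number of distinct blocks touched since the previous reference to `block`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def interleave(per_thread):
    """Assumed round-robin interleaving of equal-length per-thread traces."""
    return [ref for step in zip(*per_thread) for ref in step]


def misses(distances, capacity_blocks):
    """Misses in a fully associative LRU cache holding `capacity_blocks` blocks."""
    return sum(1 for d in distances if d is None or d >= capacity_blocks)


# Two symmetric threads, each looping over its own two blocks (hypothetical traces).
t0 = ["a", "b", "a", "b"]
t1 = ["c", "d", "c", "d"]

alone = reuse_distances(t0)                   # single-thread reuse distances
crd = reuse_distances(interleave([t0, t1]))   # distances over the interleaved stream
```

    In this toy case every reuse has distance 1 when a thread runs alone and distance 3 in the interleaved stream, so a shared cache of 4 blocks sees only the 4 cold misses while a cache of 2 blocks misses on all 8 references. Real profiles would be histograms over much longer traces, and the list-based stack here is quadratic in the worst case.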

    Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis

    Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.

    Scaling Single-Program Performance on Large-Scale Chip Multiprocessors

    Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4-8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores, so-called large-scale chip multiprocessors (LCMPs), will become a reality after only 2 or 3 generations. Unfortunately, simply scaling the number of on-chip cores alone will not guarantee improved performance. In addition, effectively utilizing all of the cores is also necessary. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance. This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1-256 cores and 4-128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip communication. In particular, we find off-chip bandwidth increases linearly with core count, but the rate of increase reduces dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show that for the range of 1-256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory intensive programs and LCMPs with aggressive cores.

    Parallelization of the SSCA#3 Benchmark on the RAW Processor

    The MIT Raw machine provides a point-to-point interconnection network for transferring register values between tiles. The programmer must schedule the network communication for each tile manually and guarantee its correctness. It is not easy to parallelize benchmarks by hand for all possible tile configurations on the Raw processor. To overcome this problem, we develop a communication library and a switch code generator to create the switch code for each tile automatically. We implement our techniques for the SSCA#3 (SAR Sensor Processing, Knowledge Formation) benchmark, and evaluate the parallelism on a physical Raw processor. The experimental results show the SSCA#3 benchmark has dense matrix operations with abundant parallelism. Using 16 tiles, the 'SAR image formation' procedure achieves a speedup of 13.86, and the speedup of the 'object detection' procedure is 9.98.

    Potassium {4-[(3S,6S,9S)-3,6-dibenzyl-9-isopropyl-4,7,10-trioxo-11-oxa-2,5,8-triazadodecyl]phenyl}trifluoroborate

    The reported compound 4 was synthesized and fully characterized by 1H NMR, 13C NMR, 11B NMR, 19F NMR, and high resolution mass spectrometry.

    The effects of solvent extraction on nanoporosity of marine-continental coal and mudstone

    Coal and organic-rich mudstone develop massive nanopores, which control the storage of adsorbed and free gas, as well as fluid flows. Generation and retention of bitumen and hydrocarbons in oil-window reservoirs add more uncertainty to the nanoporosity. Solvent extraction is a traditional way to regain unobstructed pore networks but may cause additional effects due to interactions with rocks, such as solvent adsorbing on clay surfaces or absorbing in kerogens. Selected marine-continental coal and mudstone in the Eastern Ordos Basin were studied to investigate how pore structures are affected by these in-situ-sorptive compounds (namely residual bitumen and hydrocarbons) and altered by solvent extractions. Solvent extraction was performed to obtain bitumen-free subsamples. Organic petrology, bulk geochemical analyses and gas chromatography were used to characterize the samples and the extracts. Low-pressure argon and carbon dioxide adsorptions were utilized to characterize the nanopore structures of the samples before and after extraction. The samples, both coal and mudstone, are in the oil window, with vitrinite reflectance ranging from 0.807 to 1.135%. The coals are strongly affected by marine organic input, except for the sample C-4; the mudstones are sourced by either marine or terrestrial organic input, or their mixture. As for the coals affected by marine organic input, the occupation or blocking of pores <10 nm by residual bitumen and hydrocarbons weakens with thermal maturation. Bitumen derived from terrestrial organic matter mainly affects small pores, since coal asphaltene molecules are much smaller than petroleum asphaltene molecules. The mudstone M-2 with high extract production showed an increase of nanopores after extraction, due to the exposure of the filled or blocked pores. However, most transitional mudstones saw decreases in pores because pore shrinkage, caused by solvents adsorbing on and swelling clay minerals (mainly kaolinite and illite/smectite mixed layers), counteracts the released pore spaces. Solvent extractions on the coals significantly increased the micropores <0.6 nm, since the heat of sorption of alkanes reaches the peak in the pores within 0.4–0.5 nm. By contrast, solvent extractions on the mudstones decreased the micropores ∼0.35 nm, which is perhaps caused by evaporative drying of solvent displacing residual water in clay.

    The antagonism between MCT-1 and p53 affects the tumorigenic outcomes

    Background: The MCT-1 oncoprotein accelerates p53 protein degradation via a proteasome pathway. Synergistic promotion of xenograft tumorigenicity has been demonstrated under p53 loss alongside MCT-1 overexpression. However, the molecular regulation between MCT-1 and p53 in tumor development remains ambiguous. We speculate that MCT-1 may counteract p53 through diverse mechanisms that determine the tumorigenic outcomes. Results: MCT-1 has now been identified as a novel target gene of p53 transcriptional regulation. The MCT-1 promoter region contains response elements reactive with wild-type p53 but not mutant p53. Functional p53 suppresses MCT-1 promoter activity and MCT-1 mRNA stability. In a negative feedback regulation, constitutively expressed MCT-1 decreases p53 promoter function and p53 mRNA stability. The apoptotic events are also significantly prevented by oncogenic MCT-1 in a p53-dependent or a p53-independent fashion, according to the genotoxic mechanism. Moreover, oncogenic MCT-1 promotes tumorigenicity in mouse xenografts of p53-null and p53-positive lung cancer cells. Consistent with tumor growth being irrepressible by p53 reactivation in vivo, the inhibitors of p53 (MDM2, Pirh2, and Cop1) are constantly stimulated by the MCT-1 oncoprotein. Conclusions: The oppositions between MCT-1 and p53 are confirmed for the first time at multistage processes that include transcription control, mRNA metabolism, and protein expression. MCT-1 oncogenicity can overcome p53 function, which persistently advances the tumor development.

    Gas emissions in Planck cold dust clumps: A Survey of the J=1-0 Transitions of $^{12}$CO, $^{13}$CO, and C$^{18}$O

    A survey toward 674 Planck cold clumps of the Early Cold Core Catalogue (ECC) in the J=1-0 transitions of $^{12}$CO, $^{13}$CO and C$^{18}$O has been carried out using the PMO 13.7 m telescope. 673 clumps were detected with the $^{12}$CO and $^{13}$CO, and 68% of the samples have C$^{18}$O emission. Additional velocity components were also identified. A close consistency of the three line peak velocities was revealed for the first time. Kinematic distances are given out for all the velocity components and half of the clumps are located within 0.5 and 1.5 kpc. Excitation temperatures range from 4 to 27 K, slightly larger than those of $T_d$. Line width analysis shows that the majority of ECC clumps are low mass clumps. Column densities $N_{H_{2}}$ span from $10^{20}$ to $4.5\times10^{22}$ cm$^{-2}$ with an average value of $(4.4\pm3.6)\times10^{21}$ cm$^{-2}$. $N_{H_{2}}$ cumulative fraction distribution deviates from the lognormal distribution, which is attributed to optical depth. The average abundance ratio of the $^{13}$CO to C$^{18}$O in these clumps is $7.0\pm3.8$, higher than the terrestrial value. Dust and gas are well coupled in 95% of the clumps. Blue profile, red profile and line asymmetry in total was found in less than 10% of the clumps, generally indicating star formation is not developed yet. Ten clumps were mapped. Twelve velocity components and 22 cores were obtained. Their morphologies include extended diffuse, dense isolated, cometary and filament, of which the last is the majority. 20 cores are starless. Only 7 cores seem to be in gravitationally bound state. Planck cold clumps are the most quiescent among the samples of weak-red IRAS, infrared dark clouds, UC H II region candidates, EGOs and methanol maser sources, suggesting that Planck cold clumps have expanded the horizon of cold Astronomy. Comment: Accepted to Ap