Reuse Distance Analysis for Large-Scale Chip Multiprocessors
Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program's detailed memory behavior. Concurrent Reuse Distance (CRD) and Private-stack Reuse Distance (PRD) measure RD across thread-interleaved memory reference streams, addressing shared and private caches, respectively. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. However, such instability is minimal when all threads exhibit similar data-locality patterns. For loop-based parallel programs, interleaving threads are symmetric. CRD and PRD profiles are stable across cache size scaling, and exhibit predictable coherent movement across core count scaling. Hence, multicore RD analysis can provide accurate analysis for different processor configurations. Due to the prevalence of parallel loops, RD analysis will be valuable to multicore designers.
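The difference between CRD and PRD can be made concrete with a small sketch. It assumes a trace of (thread_id, address, is_write) tuples already interleaved in execution order, fully associative LRU stacks, and a simplified write-invalidate rule for PRD in which a write removes the block from every other thread's private stack. The function names are illustrative and are not taken from the dissertation.

```python
from collections import defaultdict
from math import inf


def _stack_distance(stack, addr):
    """Look up `addr` in an LRU stack (most recently used at the end),
    move it to the top, and return its reuse distance (inf on a cold or
    invalidated miss)."""
    if addr in stack:
        pos = stack.index(addr)
        dist = len(stack) - 1 - pos  # distinct blocks touched since last use
        stack.pop(pos)
    else:
        dist = inf
    stack.append(addr)
    return dist


def crd(trace):
    """Concurrent reuse distance: one LRU stack shared by all threads."""
    shared = []
    return [_stack_distance(shared, addr) for _, addr, _ in trace]


def prd(trace):
    """Private-stack reuse distance: one LRU stack per thread, kept
    coherent by a simplified write-invalidate rule."""
    stacks = defaultdict(list)
    dists = []
    for tid, addr, is_write in trace:
        dists.append(_stack_distance(stacks[tid], addr))
        if is_write:
            # A write removes the block from every other thread's stack.
            for other, stack in stacks.items():
                if other != tid and addr in stack:
                    stack.remove(addr)
    return dists


trace = [(0, "a", False), (1, "b", False), (0, "a", False),
         (1, "a", True), (0, "a", False)]
print(crd(trace))  # [inf, inf, 1, 0, 0]
print(prd(trace))  # [inf, inf, 0, inf, inf]
```

Thread 0's final read hits in the shared stack (CRD 0) but misses in its private stack because thread 1's write invalidated its copy. The list-based stack costs O(n) per reference, which is fine for illustration; profiling tools for real traces typically use tree-based structures instead.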
This dissertation uses RD analysis to analyze multicore cache performance for loop-based parallel programs. First, we study the impacts of core count scaling and problem size scaling on CRD and PRD profiles. Two application parameters with architectural implications are identified: Ccore and Cshare. Core count scaling only impacts cache performance significantly below Ccore in shared caches, and Cshare is the capacity at which shared caches begin to outperform private caches in terms of data locality. Then, we develop techniques, in particular employing reference groups, to predict the coherent movement of CRD and PRD profiles due to scaling, and achieve accuracy of 80%-96%. After comparing our prediction techniques against profile sampling, we find that the prediction achieves higher speedup and accuracy, especially when the design space is large. Moreover, we evaluate the accuracy of using CRD and PRD profile predictions to estimate multicore cache performance, especially MPKI. When combined with the existing problem scaling prediction, our techniques can predict shared LLC (private L2 cache) MPKI to within 12% (14%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles. Lastly, we propose a new framework based on RD analysis to optimize multicore cache hierarchies. Our study not only reveals several new insights, but it also demonstrates that RD analysis can help computer architects improve multicore designs.
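Estimating MPKI from a profile is what lets one measured profile stand in for many simulations. The sketch below assumes a fully associative LRU cache whose capacity is counted in blocks, so a reference hits exactly when its reuse distance is smaller than the capacity. Set-associativity conflicts and the dissertation's prediction techniques are not modeled, and the profile numbers are made up.

```python
from collections import Counter
from math import inf


def misses(profile, capacity_blocks):
    """Misses in a fully associative LRU cache holding `capacity_blocks`
    blocks: every reference whose reuse distance is >= the capacity
    (cold references are stored under distance inf)."""
    return sum(count for dist, count in profile.items() if dist >= capacity_blocks)


def mpki(profile, capacity_blocks, instructions):
    """Misses per thousand instructions for one cache capacity."""
    return 1000.0 * misses(profile, capacity_blocks) / instructions


# Hypothetical profile: reuse distance -> number of references.
profile = Counter({0: 500, 3: 200, 10: 200, inf: 100})
sweep = {c: mpki(profile, c, instructions=10_000) for c in (1, 4, 16)}
print(sweep)  # {1: 50.0, 4: 30.0, 16: 10.0}
```

Applied to a CRD profile, the capacity is that of a shared cache. Applied to a PRD profile, it is the capacity of each private cache.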
Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance
Performance on multicore processors is determined largely by on-chip
cache. Computer architects have conducted numerous studies in the past
that vary core count and cache capacity as well as problem size to
understand the impact on cache behavior. These studies are very costly due
to the combinatorial design spaces they must explore.
Reuse distance (RD) analysis can help architects explore multicore cache
performance more efficiently. One problem, however, is that multicore RD
analysis requires measuring concurrent reuse distance (CRD) profiles
across thread-interleaved memory reference streams. Sensitivity to
memory interleaving makes CRD profiles architecture dependent,
undermining RD analysis benefits. But for parallel programs with
symmetric threads, CRD profiles vary with architecture tractably: they
change only slightly with cache capacity scaling, and shift predictably
to larger CRD values with core count scaling. This enables analysis of a
large number of multicore configurations from a small set of measured
CRD profiles.
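A toy model shows why symmetric threads make CRD shift predictably with core count. Assume that each of P threads sweeps its own private array of N blocks twice and that threads interleave strictly round-robin with no sharing. Under those assumptions, every warm reference has an intra-thread reuse distance of N - 1 but a CRD of P*N - 1. Real programs share data and interleave irregularly, which is why the paper predicts profiles with reference groups rather than with a fixed multiplier like this one.

```python
from math import inf


def reuse_distances(trace):
    """LRU stack distance of each reference (inf for first touches)."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            dists.append(inf)
        stack.append(addr)
    return dists


def symmetric_trace(num_threads, blocks_per_thread, passes=2):
    """Round-robin interleaving of threads that each sweep a private array.
    Addresses are (thread, block) pairs, so threads never share data."""
    trace = []
    for step in range(passes * blocks_per_thread):
        for t in range(num_threads):
            trace.append((t, step % blocks_per_thread))
    return trace


def warm_crd_values(num_threads, blocks_per_thread):
    """Distinct CRD values of all non-cold references."""
    dists = reuse_distances(symmetric_trace(num_threads, blocks_per_thread))
    return sorted({d for d in dists if d != inf})


for p in (1, 2, 4):
    print(p, warm_crd_values(p, 8))  # prints 1 [7], then 2 [15], then 4 [31]
```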
This paper investigates using RD analysis to efficiently analyze
multicore cache performance for parallel programs, making several
contributions. First, we characterize how CRD profiles change with core
count and cache capacity. One of our findings is that core count scaling
degrades locality, but the degradation only impacts last-level caches
(LLCs) below 16MB for our benchmarks and problem sizes, increasing to
128MB if problem size scales by 64x. Second, we apply reference groups
to predict CRD profiles across core count scaling, and evaluate
prediction accuracy. Finally, we use CRD profiles to analyze multicore
cache performance. We find predicted CRD profiles can estimate LLC MPKI
within 76% of simulation for configurations without pathologic cache
conflicts in 1/1200th of the time needed to simulate the full design
space.
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.
Scaling Single-Program Performance on Large-Scale Chip Multiprocessors
Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4-8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores, so-called large-scale chip multiprocessors (LCMPs), will become a reality after only 2 or 3 generations.
Unfortunately, simply scaling the number of on-chip cores will not by itself guarantee improved performance; the cores must also be utilized effectively. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance.
This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1-256 cores and 4-128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip communication. In particular, we find that off-chip bandwidth increases linearly with core count, but the rate of increase drops dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show that, for the range 1-256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory-intensive programs and LCMPs with aggressive cores.
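The miss-rate threshold in the last finding can be framed as back-of-the-envelope arithmetic. All parameters below (clock rate, L2 accesses per core per cycle, line size, pin bandwidth) are hypothetical placeholders rather than values from the paper. At a fixed miss rate, off-chip demand grows linearly with core count, and the application becomes off-chip limited once that demand exceeds the available bandwidth.

```python
def offchip_demand_bytes_per_s(cores, freq_hz, l2_accesses_per_cycle,
                               l2_miss_rate, line_bytes):
    """Off-chip traffic if every L2 miss fetches one cache line
    (writebacks and prefetching are ignored in this simple model)."""
    return cores * freq_hz * l2_accesses_per_cycle * l2_miss_rate * line_bytes


def threshold_miss_rate(cores, freq_hz, l2_accesses_per_cycle,
                        line_bytes, peak_bytes_per_s):
    """L2 miss rate at which demand equals the available off-chip bandwidth."""
    return peak_bytes_per_s / (cores * freq_hz * l2_accesses_per_cycle * line_bytes)


# Hypothetical 256-core chip: 1 GHz cores, 0.05 L2 accesses/cycle/core,
# 64-byte lines, 100 GB/s of off-chip bandwidth.
t = threshold_miss_rate(256, 1e9, 0.05, 64, 1e11)
print(round(t, 4))  # 0.1221
```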
Parallelization of the SSCA#3 Benchmark on the RAW Processor
The MIT Raw machine provides a point-to-point interconnection network for transferring register values between tiles. The programmer schedules the network communication for each tile by hand and must guarantee its correctness. It is not easy to parallelize benchmarks by hand for all possible tile configurations on the Raw processor. To overcome this problem, we develop a communication library and a switch code generator that create the switch code for each tile automatically. We implement our techniques for the SSCA#3 (SAR Sensor Processing, Knowledge Formation) benchmark, and evaluate the parallelism on a physical Raw processor. The experimental results show that the SSCA#3 benchmark has dense matrix operations with abundant parallelism. Using 16 tiles, the 'SAR image formation' procedure achieves a speedup of 13.86, and the 'object detection' procedure achieves a speedup of 9.98.
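The abstract does not describe how the switch code generator works, so the following is only a generic illustration of the idea. Given point-to-point transfers on a mesh of width W whose tiles are numbered row-major, the sketch computes dimension-ordered (X, then Y) routes. It then collects, for each tile, the sequence of output directions that tile's switch must take. The direction names and the 'PROC' delivery marker are hypothetical. The code does not emit actual Raw switch instructions, and it ignores the cross-tile ordering needed to avoid deadlock.

```python
from collections import defaultdict


def xy_route(src, dst, width):
    """Return [(tile, direction), ...] for X-then-Y routing on a mesh.
    Tiles are numbered row-major; 'S' means toward higher row numbers."""
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    hops = []
    while x != dx:  # travel along the row first
        step = 1 if dx > x else -1
        hops.append((y * width + x, "E" if step == 1 else "W"))
        x += step
    while y != dy:  # then along the column
        step = 1 if dy > y else -1
        hops.append((y * width + x, "S" if step == 1 else "N"))
        y += step
    hops.append((dst, "PROC"))  # deliver to the destination tile's processor
    return hops


def switch_programs(transfers, width):
    """Map each tile to the ordered routing steps its switch performs."""
    programs = defaultdict(list)
    for src, dst in transfers:
        for tile, direction in xy_route(src, dst, width):
            programs[tile].append(direction)
    return dict(programs)


# Two transfers on a 4x4 mesh: tile 0 -> tile 5 and tile 1 -> tile 2.
print(switch_programs([(0, 5), (1, 2)], 4))
# {0: ['E'], 1: ['S', 'E'], 5: ['PROC'], 2: ['PROC']}
```

Dimension-ordered routing is chosen here only because it is simple and deterministic. A real generator for a statically scheduled network must also fix the cycle order in which each switch services its routes.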
Potassium {4-[(3S,6S,9S)-3,6-dibenzyl-9-isopropyl-4,7,10-trioxo-11-oxa-2,5,8-triazadodecyl]phenyl}trifluoroborate
The reported compound 4 was synthesized and fully characterized by 1H NMR, 13C NMR, 11B NMR, 19F NMR, and high resolution mass spectrometry.
The effects of solvent extraction on nanoporosity of marine-continental coal and mudstone
Coal and organic-rich mudstone develop abundant nanopores, which control the storage of adsorbed and free gas, as well as fluid flow. The generation and retention of bitumen and hydrocarbons in oil-window reservoirs add further uncertainty to the nanoporosity. Solvent extraction is a traditional way to regain unobstructed pore networks, but it may cause additional effects due to interactions with the rocks, such as solvent adsorbing on clay surfaces or absorbing into kerogen. Selected marine-continental coal and mudstone samples from the eastern Ordos Basin were studied to investigate how pore structures are affected by these in situ sorptive compounds (namely residual bitumen and hydrocarbons) and altered by solvent extraction. Solvent extraction was performed to obtain bitumen-free subsamples. Organic petrology, bulk geochemical analyses, and gas chromatography were used to characterize the samples and the extracts. Low-pressure argon and carbon dioxide adsorption were used to characterize the nanopore structures of the samples before and after extraction. The samples, both coal and mudstone, are within the oil window, with vitrinite reflectance ranging from 0.807 to 1.135%. The coals are strongly affected by marine organic input, except for sample C-4; the mudstones are sourced by marine or terrestrial organic input, or a mixture of the two. For the coals affected by marine organic input, the occupation or blocking of pores <10 nm by residual bitumen and hydrocarbons weakens with thermal maturation. Bitumen derived from terrestrial organic matter mainly affects small pores, since coal asphaltene molecules are much smaller than petroleum asphaltene molecules. The mudstone M-2, with a high extract yield, showed an increase in nanopores after extraction due to the exposure of previously filled or blocked pores.
However, most transitional mudstones showed decreases in pores, because pore shrinkage caused by solvents adsorbing on and swelling clay minerals (mainly kaolinite and illite/smectite mixed layers) offsets the released pore space. Solvent extraction on the coals significantly increased the micropores <0.6 nm, since the heat of sorption of alkanes peaks in pores of 0.4–0.5 nm. By contrast, solvent extraction on the mudstones decreased the micropores of ∼0.35 nm, which is perhaps caused by evaporative drying of the solvent that displaced residual water in the clay.
Revealing Nanoscale Solid-Solid Interfacial Phenomena for Long-Life and High-Energy All-Solid-State Batteries.
Enabling long cyclability of high-voltage oxide cathodes is a persistent challenge for all-solid-state batteries, largely because of their poor interfacial stability against sulfide solid electrolytes. While protective oxide coating layers such as LiNbO3 (LNO) have been proposed, their precise working mechanisms are still not fully understood. Existing literature attributes reductions in interfacial impedance growth to the coating's ability to prevent interfacial reactions. However, the actual behavior is more complex: cathode interfacial reactions and electrolyte electrochemical decomposition occur simultaneously, making it difficult to decouple the two effects. Herein, we used various advanced characterization tools and first-principles calculations to probe the interfacial phenomena between the solid electrolyte Li6PS5Cl (LPSCl) and the high-voltage cathode LiNi0.85Co0.1Al0.05O2 (NCA). We separated the effects of the spontaneous reaction between LPSCl and NCA at the interface and quantified the intrinsic electrochemical decomposition of LPSCl during cell cycling. Both experimental and computational results demonstrated improved thermodynamic stability between NCA and LPSCl after incorporation of the LNO coating. Additionally, we revealed the in situ passivation effect of LPSCl electrochemical decomposition. Together, these two phenomena, both occurring during the first charge cycle, result in a stabilized interface, enabling long cyclability of all-solid-state batteries.
The antagonism between MCT-1 and p53 affects the tumorigenic outcomes
Background
MCT-1 oncoprotein accelerates p53 protein degradation via a proteasome pathway. Synergistic promotion of xenograft tumorigenicity has been demonstrated when p53 loss is accompanied by MCT-1 overexpression. However, the molecular regulation between MCT-1 and p53 in tumor development remains ambiguous. We speculate that MCT-1 may counteract p53 through diverse mechanisms that determine the tumorigenic outcomes.
Results
MCT-1 has now been identified as a novel target gene of p53 transcriptional regulation. The MCT-1 promoter region contains response elements reactive with wild-type p53 but not with mutant p53. Functional p53 suppresses MCT-1 promoter activity and MCT-1 mRNA stability. In a negative feedback loop, constitutively expressed MCT-1 decreases p53 promoter function and p53 mRNA stability. Apoptotic events are also significantly prevented by oncogenic MCT-1 in a p53-dependent or p53-independent fashion, depending on the genotoxic mechanism. Moreover, oncogenic MCT-1 promotes tumorigenicity in mouse xenografts of p53-null and p53-positive lung cancer cells. Consistent with tumor growth being irrepressible by p53 reactivation in vivo, the p53 inhibitors MDM2, Pirh2, and Cop1 are constantly stimulated by MCT-1 oncoprotein.
Conclusions
The opposition between MCT-1 and p53 is confirmed for the first time at multiple stages, including transcriptional control, mRNA metabolism, and protein expression. MCT-1 oncogenicity can overcome p53 function and thereby persistently advance tumor development.
Gas emissions in Planck cold dust clumps: A Survey of the J=1-0 Transitions of 12CO, 13CO, and C18O
A survey toward 674 Planck cold clumps of the Early Cold Core Catalogue (ECC)
in the J=1-0 transitions of 12CO, 13CO and C18O has been carried
out using the PMO 13.7 m telescope. 673 clumps were detected in 12CO
and 13CO, and 68% of the samples have C18O emission. Additional
velocity components were also identified. A close consistency of the three line
peak velocities was revealed for the first time. Kinematic distances are given
for all the velocity components, and half of the clumps are located between
0.5 and 1.5 kpc. Excitation temperatures range from 4 to 27 K, slightly larger
than those of . Line width analysis shows that the majority of ECC clumps
are low mass clumps. Column densities N span from 10 to
4.5 cm with an average value of
(4.4 ± 3.6) cm. The N cumulative fraction
distribution deviates from the lognormal distribution, which is attributed to
optical depth. The average abundance ratio of 13CO to C18O in
these clumps is 7.0 ± 3.8, higher than the terrestrial value. Dust and gas
are well coupled in 95% of the clumps. Blue profiles, red profiles and line
asymmetries in total were found in less than 10% of the clumps, generally
indicating that star formation has not yet developed. Ten clumps were mapped.
Twelve velocity components and 22 cores were obtained. Their morphologies
include extended diffuse, dense isolated, cometary and filamentary, of which the
last is the majority. 20 cores are starless. Only 7 cores seem to be in a
gravitationally bound state. Planck cold clumps are the most quiescent among the
samples of weak-red IRAS sources, infrared dark clouds, UC H II region
candidates, EGOs and methanol maser sources, suggesting that Planck cold clumps
have expanded the horizon of cold astronomy. Comment: Accepted to Ap