3,858 research outputs found

Reuse Distance Analysis for Large-Scale Chip Multiprocessors

Author: Wu Meng-Ju
Publication date: 01/01/2012

Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program's detailed memory behavior. Concurrent Reuse Distance (CRD) and Private-stack Reuse Distance (PRD) measure RD across thread-interleaved memory reference streams, addressing shared and private caches, respectively. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. However, such instability is minimal when all threads exhibit similar data-locality patterns. For loop-based parallel programs, interleaving threads are symmetric. CRD and PRD profiles are stable across cache size scaling, and exhibit predictable coherent movement across core count scaling. Hence, multicore RD analysis can provide accurate analysis for different processor configurations. Due to the prevalence of parallel loops, RD analysis will be valuable to multicore designers. This dissertation uses RD analysis to analyze multicore cache performance for loop-based parallel programs. First, we study the impacts of core count scaling and problem size scaling on CRD and PRD profiles. Two application parameters with architectural implications are identified: Ccore and Cshare. Core count scaling only impacts cache performance significantly below Ccore in shared caches, and Cshare is the capacity at which shared caches begin to outperform private caches in terms of data locality. Then, we develop techniques, in particular employing reference groups, to predict the coherent movement of CRD and PRD profiles due to scaling, and achieve accuracy of 80%-96%. After comparing our prediction techniques against profile sampling, we find that the prediction achieves higher speedup and accuracy, especially when the design space is large. Moreover, we evaluate the accuracy of using CRD and PRD profile predictions to estimate multicore cache performance, especially MPKI.
When combined with the existing problem scaling prediction, our techniques can predict shared LLC (private L2 cache) MPKI to within 12% (14%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles. Lastly, we propose a new framework based on RD analysis to optimize multicore cache hierarchies. Our study not only reveals several new insights, but it also demonstrates that RD analysis can help computer architects improve multicore designs.
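
As a rough illustration of the reuse distance concept this abstract relies on, the sketch below computes an RD profile for a single reference stream under the standard LRU stack definition, then counts the misses a fully associative LRU cache of a given capacity would incur. The function names and the toy trace are hypothetical and are not taken from the dissertation; real profilers use tree-based algorithms rather than this quadratic list scan.

```python
from collections import Counter

def reuse_distances(trace):
    """Return the LRU stack distance of each reference (None for a first touch)."""
    stack = []       # most recently used block sits at the end
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Number of distinct blocks referenced since the last use of `block`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # cold (compulsory) miss
        stack.append(block)
    return distances

def rd_profile(trace):
    """Histogram of reuse distances; the None key counts first touches."""
    return Counter(reuse_distances(trace))

def misses(trace, capacity):
    """Misses in a fully associative LRU cache holding `capacity` blocks."""
    return sum(1 for d in reuse_distances(trace) if d is None or d >= capacity)

# Hypothetical toy trace of cache-block identifiers.
trace = ["a", "b", "c", "a", "b", "d", "a"]
profile = rd_profile(trace)
```

Because a reference with distance d hits in any LRU cache of more than d blocks, one measured profile yields the miss count for every capacity, which is why a small number of profiles can cover a large design space.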

Digital Repository at the University of Maryland

Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance

Author: Wu Meng-Ju
Yeung Donald
Publication date: 05/10/2010

Performance on multicore processors is determined largely by on-chip cache. Computer architects have conducted numerous studies in the past that vary core count and cache capacity as well as problem size to understand impact on cache behavior. These studies are very costly due to the combinatorial design spaces they must explore. Reuse distance (RD) analysis can help architects explore multicore cache performance more efficiently. One problem, however, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, undermining RD analysis benefits. But for parallel programs with symmetric threads, CRD profiles vary with architecture tractably: they change only slightly with cache capacity scaling, and shift predictably to larger CRD values with core count scaling. This enables analysis of a large number of multicore configurations from a small set of measured CRD profiles. This paper investigates using RD analysis to efficiently analyze multicore cache performance for parallel programs, making several contributions. First, we characterize how CRD profiles change with core count and cache capacity. One of our findings is that core count scaling degrades locality, but the degradation only impacts last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 128MB if problem size scales by 64x. Second, we apply reference groups to predict CRD profiles across core count scaling, and evaluate prediction accuracy. Finally, we use CRD profiles to analyze multicore cache performance. We find predicted CRD profiles can estimate LLC MPKI within 76% of simulation for configurations without pathologic cache conflicts in 1/1200th the time to perform simulation of the full design space.
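
The shift to larger CRD values described in this abstract can be pictured with a small sketch. It assumes two symmetric threads that each loop over their own private data, a strict round-robin interleaving, and read-only references, so no coherence effects are modeled. The helper names and traces are hypothetical and do not come from the paper.

```python
def stack_distances(trace):
    """LRU stack distance of each reference (None for a first touch)."""
    stack, out = [], []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            out.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            out.append(None)
        stack.append(block)
    return out

def interleave(threads):
    """Round-robin interleaving of equal-length per-thread traces (an assumption)."""
    return [ref for step in zip(*threads) for ref in step]

# Two symmetric threads, each sweeping twice over its own two blocks.
t0 = ["x0", "x1", "x0", "x1"]
t1 = ["y0", "y1", "y0", "y1"]

per_thread = [stack_distances(t) for t in (t0, t1)]  # each thread in isolation
crd = stack_distances(interleave([t0, t1]))          # one shared LRU stack
```

In isolation every reuse has distance 1, while on the interleaved stream the same reuses have distance 3, because the other thread's blocks are inserted between a block and its next use.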

Digital Repository at the University of Maryland

Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis

Author: Donald Yeung
Meng-ju Wu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012

Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.

CiteSeerX

Crossref

Scaling Single-Program Performance on Large-Scale Chip Multiprocessors

Author: Wu Meng-Ju
Yeung Donald
Publication date: 25/11/2009

Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4-8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores, so-called large-scale chip multiprocessors (LCMPs), will become a reality after only 2 or 3 generations. Unfortunately, simply scaling the number of on-chip cores will not guarantee improved performance; effectively utilizing all of the cores is also necessary. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance. This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1-256 cores and 4-128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip communication. In particular, we find off-chip bandwidth increases linearly with core count, but the rate of increase reduces dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show that for the range 1-256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory-intensive programs and LCMPs with aggressive cores.

Digital Repository at the University of Maryland

Parallelization of the SSCA#3 Benchmark on the RAW Processor

Author: Wu Meng-Ju
Yeung Donald
Publication date: 06/11/2006

The MIT Raw machine provides a point-to-point interconnection network for transferring register values between tiles. The programmer must schedule the network communication for each tile manually and guarantee its correctness. It is not easy to parallelize benchmarks by hand for all possible tile configurations on the Raw processor. To overcome this problem, we develop a communication library and a switch code generator to create the switch code for each tile automatically. We implement our techniques for the SSCA#3 (SAR Sensor Processing, Knowledge Formation) benchmark, and evaluate the parallelism on a physical Raw processor. The experimental results show the SSCA#3 benchmark has dense matrix operations with abundant parallelism. Using 16 tiles, the 'SAR image formation' procedure achieves a speedup of 13.86, and the speedup of the 'object detection' procedure is 9.98.
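
To put the reported speedups in context, the short sketch below converts them to parallel efficiency, meaning speedup divided by the number of tiles. The speedups and tile count are the figures quoted in the abstract. The efficiency values are derived here and are not reported by the authors, and the function and variable names are hypothetical.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear speedup achieved on `workers` processing elements."""
    return speedup / workers

TILES = 16  # tile count quoted in the abstract
reported = {"SAR image formation": 13.86, "object detection": 9.98}

# Derived quantity, not a figure from the report.
efficiency = {name: parallel_efficiency(s, TILES) for name, s in reported.items()}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.2f} of linear speedup")
```

Under this definition the image formation step reaches roughly 87% of linear scaling and the object detection step roughly 62%.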

Digital Repository at the University of Maryland

Potassium {4-[(3S,6S,9S)-3,6-dibenzyl-9-isopropyl-4,7,10-trioxo-11-oxa-2,5,8-triazadodecyl]phenyl}trifluoroborate

Author: Chia-Chieh Fu
Chia-Hua Tsai
Chia-Hung Lin
Chih-Cheng Cai
Ching-Tien Hsieh
Fang-Rong Chang
Meng-Hsuan Lin
Meng-Ju Wu
Pin-Yi Liu
Po-Shen Pan
Ting-Ju Lin
Yang-Chang Wu
Publication venue: 'MDPI AG'
Publication date: 01/05/2014

The reported compound 4 was synthesized and fully characterized by 1H NMR, 13C NMR, 11B NMR, 19F NMR, and high resolution mass spectrometry.

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Tamkang University Institutional Repository

The effects of solvent extraction on nanoporosity of marine-continental coal and mudstone

Author: Cai Jianchao
Chen Wangang
Gao Yuan
Huang Cheng
Ju Yiwen
Meng Shangzhi
Qi Yu
Wu Jianguang
Zhu Hongjian
Publication venue: Elsevier
Publication date: 01/01/2019

Coal and organic-rich mudstone develop massive nanopores, which control the storage of adsorbed and free gas, as well as fluid flows. Generation and retention of bitumen and hydrocarbons in oil window reservoirs add more uncertainty to the nanoporosity. Solvent extraction is a traditional way to regain unobstructed pore networks but may cause additional effects due to interactions with rocks, such as solvent adsorbing on clay surfaces or absorbing in kerogens. Selected marine-continental coal and mudstone in Eastern Ordos Basin were studied to investigate how pore structures are affected by these in-situ-sorptive compounds (namely residual bitumen and hydrocarbons) and altered by solvent extractions. Solvent extraction was performed to obtain bitumen-free subsamples. Organic petrology, bulk geochemical analyses and gas chromatography were used to characterize the samples and the extracts. Low-pressure argon and carbon dioxide adsorptions were utilized to characterize the nanopore structures of the samples before and after extraction. The samples, both coal and mudstone, are in the oil window, with vitrinite reflectance ranging from 0.807 to 1.135%. The coals are strongly affected by marine organic input, except for the sample C-4; the mudstones are sourced by either marine or terrestrial organic input, or their mixture. As for the coals affected by marine organic input, the occupation or blocking of pores <10 nm by residual bitumen and hydrocarbons weakens with thermal maturation. Bitumen derived from terrestrial organic matter mainly affects small pores, since coal asphaltene molecules are much smaller than petroleum asphaltene molecules. The mudstone M-2 with high extract production showed an increase of nanopores after extraction, due to the exposure of the filled or blocked pores.
However, most transitional mudstones saw decreases of the pores because pore shrinkage caused by solvents adsorbing on and swelling clay minerals (mainly kaolinite and illite/smectite mixed layers) counteracts the released pore spaces. Solvent extractions on the coals significantly increased the micropores <0.6 nm, since the heat of sorption of alkanes reaches the peak in the pores within 0.4-0.5 nm. By contrast, solvent extractions on the mudstones decreased the micropores ~0.35 nm, which is perhaps caused by evaporative drying of solvent displacing residual water in clay.

Durham Research Online

Revealing Nanoscale Solid-Solid Interfacial Phenomena for Long-Life and High-Energy All-Solid-State Batteries.

Author: Banerjee Abhik
Cheng Ju-Hsiang
D'Souza Macwin Savio
Doux Jean-Marie
Ma Lu
Meng Ying Shirley
Nguyen Han
Ong Shyue Ping
Sterbinsky George E
Tan Darren HS
Tang Hanmei
Wang Xuefeng
Wu Erik A
Wu Tianpin
Wynn Thomas A
Zhang Minghao
Publication venue: eScholarship, University of California
Publication date: 01/11/2019

Enabling long cyclability of high-voltage oxide cathodes is a persistent challenge for all-solid-state batteries, largely because of their poor interfacial stabilities against sulfide solid electrolytes. While protective oxide coating layers such as LiNbO3 (LNO) have been proposed, their precise working mechanisms are still not fully understood. Existing literature attributes reductions in interfacial impedance growth to the coating's ability to prevent interfacial reactions. However, its true nature is more complex, with cathode interfacial reactions and electrolyte electrochemical decomposition occurring simultaneously, making it difficult to decouple each effect. Herein, we utilized various advanced characterization tools and first-principles calculations to probe the interfacial phenomenon between solid electrolyte Li6PS5Cl (LPSCl) and high-voltage cathode LiNi0.85Co0.1Al0.05O2 (NCA). We segregated the effects of spontaneous reaction between LPSCl and NCA at the interface and quantified the intrinsic electrochemical decomposition of LPSCl during cell cycling. Both experimental and computational results demonstrated improved thermodynamic stability between NCA and LPSCl after incorporation of the LNO coating. Additionally, we revealed the in situ passivation effect of LPSCl electrochemical decomposition. When combined, both these phenomena occurring at the first charge cycle result in a stabilized interface, enabling long cyclability of all-solid-state batteries.

eScholarship - University of California

The antagonism between MCT-1 and p53 affects the tumorigenic outcomes

Author: Chen Linyi
Choy ChikOn
Hsu Hsin-Ling
Kasiappan Ravi
Lin Tai-Du
Shih Hung-Ju
Wu Meng-Hsun
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The MCT-1 oncoprotein accelerates p53 protein degradation via a proteasome pathway. Synergistic promotion of xenograft tumorigenicity has been demonstrated under conditions of p53 loss alongside MCT-1 overexpression. However, the molecular regulation between MCT-1 and p53 in tumor development remains ambiguous. We speculate that MCT-1 may counteract p53 through diverse mechanisms that determine the tumorigenic outcomes. Results: MCT-1 has now been identified as a novel target gene of p53 transcriptional regulation. The MCT-1 promoter region contains response elements reactive with wild-type p53 but not mutant p53. Functional p53 suppresses MCT-1 promoter activity and MCT-1 mRNA stability. In a negative feedback regulation, constitutively expressed MCT-1 decreases p53 promoter function and p53 mRNA stability. Apoptotic events are also significantly prevented by oncogenic MCT-1 in a p53-dependent or a p53-independent fashion, according to the genotoxic mechanism. Moreover, oncogenic MCT-1 promotes tumorigenicity in mouse xenografts of p53-null and p53-positive lung cancer cells. Consistent with tumor growth that is irrepressible by p53 reactivation in vivo, the inhibitors of p53 (MDM2, Pirh2, and Cop1) are constantly stimulated by the MCT-1 oncoprotein. Conclusions: The opposition between MCT-1 and p53 is confirmed for the first time at multiple stages, including transcription control, mRNA metabolism, and protein expression. MCT-1 oncogenicity can overcome p53 function, persistently advancing tumor development.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

National Health Research Institutes

PubMed Central

Gas emissions in Planck cold dust clumps: A Survey of the J=1-0 Transitions of $^{12}$CO, $^{13}$CO, and C$^{18}$O

Author: Ade
Bergin
Beuther
Bing-Gang Ju
Cesaroni
Chen
Dame
Di Li
Dobashi
Du
Egan
Evans
Fanyi Meng
Goodman
Guilloteau
Gómez
Hartquist
Harvey
Heithausen
Hennebelle
Huard
Liu
Liu
Mardones
Molinari
Qin
Rathborne
Ridge
Sheng-Li Qin
Simon
Sridharan
Sridharan
Tie Liu
Ungerechts
Velusamy
Watson
Wu
Yamamoto
Yuefang Wu
Zhang
Publication venue: 'IOP Publishing'
Publication date: 01/01/2012

A survey toward 674 Planck cold clumps of the Early Cold Core Catalogue (ECC) in the J=1-0 transitions of $^{12}$CO, $^{13}$CO and C$^{18}$O has been carried out using the PMO 13.7 m telescope. 673 clumps were detected with the $^{12}$CO and $^{13}$CO, and 68% of the samples have C$^{18}$O emission. Additional velocity components were also identified. A close consistency of the three line peak velocities was revealed for the first time. Kinematic distances are given out for all the velocity components and half of the clumps are located within 0.5 and 1.5 kpc. Excitation temperatures range from 4 to 27 K, slightly larger than those of $T_d$. Line width analysis shows that the majority of ECC clumps are low mass clumps. Column densities $N_{H_{2}}$ span from $10^{20}$ to $4.5\times10^{22}$ cm$^{-2}$ with an average value of $(4.4\pm3.6)\times10^{21}$ cm$^{-2}$. $N_{H_{2}}$ cumulative fraction distribution deviates from the lognormal distribution, which is attributed to optical depth. The average abundance ratio of the $^{13}$CO to C$^{18}$O in these clumps is $7.0\pm3.8$, higher than the terrestrial value. Dust and gas are well coupled in 95% of the clumps. Blue profile, red profile and line asymmetry in total was found in less than 10% of the clumps, generally indicating star formation is not developed yet. Ten clumps were mapped. Twelve velocity components and 22 cores were obtained. Their morphologies include extended diffuse, dense isolated, cometary and filament, of which the last is the majority. 20 cores are starless. Only 7 cores seem to be in gravitationally bound state. Planck cold clumps are the most quiescent among the samples of weak-red IRAS, infrared dark clouds, UC H II region candidates, EGOs and methanol maser sources, suggesting that Planck cold clumps have expanded the horizon of cold Astronomy. Comment: Accepted to Ap

arXiv.org e-Print Archive

Crossref