A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically sophisticated, must be captured. To address those
challenges, we present a GPU-accelerated package for simulation of flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads to GPUs. Other advancements, such as
smart particle packing and no-slip boundary condition in complex pore
geometries, are also implemented for the construction and the simulation of the
realistic shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries.
Using Proportional-Integral-Differential approach for Dynamic Traffic Prediction in Wireless Network-on-Chip
The massive integration of cores in multi-core systems has enabled chip designers to design systems that meet the power-performance demands of applications. Wireless interconnection has emerged as an energy-efficient solution to the challenges of multi-hop communication over wireline paths in conventional Networks-on-Chip (NoCs). However, to realize the full benefits of this novel interconnect technology, a simple, fair, and efficient Medium Access Control (MAC) mechanism is needed to grant access to the on-chip wireless communication channel. Moreover, to adapt to the varying traffic demands of applications running in a multi-core environment, MAC mechanisms should dynamically adjust the transmission slots of the wireless interfaces (WIs). To ensure efficient utilization of the wireless medium in a Wireless NoC (WiNoC), in this work we present the design of a prediction model that is used by two dynamic MAC mechanisms to predict the traffic demand of the WIs and respond by adjusting their transmission slots. Through system-level simulations, we show that the traffic-aware MAC mechanisms are more energy efficient and can sustain higher data bandwidth in WiNoCs.
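The abstract above does not give the prediction equations, so the following is only a minimal sketch of how a proportional-integral-derivative style predictor could track per-WI traffic and drive slot allocation. The class `PIDPredictor`, the function `allocate_slots`, the gain values, and the proportional slot split are all illustrative assumptions, not the authors' design.

```python
from dataclasses import dataclass


@dataclass
class PIDPredictor:
    """Hypothetical PID-style traffic predictor for one wireless interface."""
    kp: float = 0.5    # proportional gain (assumed value)
    ki: float = 0.1    # integral gain (assumed value)
    kd: float = 0.05   # derivative gain (assumed value)
    prediction: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, observed: float) -> float:
        """Correct the running prediction using the latest observed demand."""
        error = observed - self.prediction
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.prediction += (self.kp * error
                            + self.ki * self.integral
                            + self.kd * derivative)
        # Traffic demand cannot be negative.
        self.prediction = max(self.prediction, 0.0)
        return self.prediction


def allocate_slots(predictions, total_slots):
    """Split total_slots among WIs in proportion to predicted demand."""
    n = len(predictions)
    total = sum(predictions)
    if total <= 0:
        weights = [1.0 / n] * n          # no demand predicted: share equally
    else:
        weights = [p / total for p in predictions]
    exact = [w * total_slots for w in weights]
    slots = [int(x) for x in exact]      # round down first
    leftover = total_slots - sum(slots)
    # Hand remaining slots to the WIs with the largest fractional shares.
    order = sorted(range(n), key=lambda i: exact[i] - slots[i], reverse=True)
    for i in order[:leftover]:
        slots[i] += 1
    return slots
```

In this sketch each WI would own one `PIDPredictor`, call `update` once per observation window with its measured traffic, and the MAC would then call `allocate_slots` on the collected predictions before the next frame.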
Leveraging heterogeneity in DRAM main memories to accelerate critical word access
The DRAM main memory system in modern servers is largely homogeneous. In recent years, DRAM manufacturers have produced chips with vastly differing latency and energy characteristics. This provides the opportunity to build a heterogeneous main memory system where different parts of the address space can yield different latencies and energy per access. The limited prior work in this area has explored smart placement of pages with high activity. In this paper, we propose a novel alternative to exploit DRAM heterogeneity. We observe that the critical word in a cache line can be easily recognized beforehand and placed in a low-latency region of the main memory. Other non-critical words of the cache line can be placed in a low-energy region. We design an architecture that has low complexity and that can accelerate the transfer of the critical word by tens of cycles. For our benchmark suite, we show an average performance improvement of 12.9% and an accompanying memory energy reduction of 15%.
Integrated Transversal Equalizers in High-Speed Fiber-Optic Systems
Intersymbol interference (ISI) caused by intermodal dispersion in multimode fibers is the major limiting factor in the achievable data rate or transmission distance in high-speed multimode fiber-optic links for local area network applications. Compared with optical-domain and other electrical-domain dispersion compensation methods, equalization with transversal filters based on distributed circuit techniques presents a cost-effective and low-power solution. The design of integrated distributed transversal equalizers is described in detail with a focus on delay lines and gain stages. A seven-tap distributed transversal equalizer prototype has been implemented in a commercial 0.18-µm SiGe BiCMOS process for 10-Gb/s multimode fiber-optic links. The equalizer reduces the ISI of a 10-Gb/s signal after 800 m of 50-µm multimode fiber from 5 to 1.38 dB, and improves the bit-error rate from about 10^-5 to less than 10^-12.
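A transversal equalizer is a tapped delay line whose output is a weighted sum of delayed copies of the input, y[n] = sum over k of c_k * x[n-k]. The discrete-time sketch below illustrates that structure with seven taps. The two-tap channel and the tap weights are made-up values chosen so the arithmetic is easy to follow; they are not taken from the paper, which describes an analog distributed circuit.

```python
def transversal_filter(samples, taps):
    """Tapped-delay-line filter: y[n] = sum_k taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:               # samples before the start count as zero
                acc += c * samples[n - k]
        out.append(acc)
    return out


# Hypothetical dispersive channel: each symbol leaks half its amplitude
# into the next symbol period (impulse response zero-padded to 8 samples).
channel_impulse = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Seven taps approximating the channel inverse: c_k = (-0.5) ** k.
taps = [(-0.5) ** k for k in range(7)]

# Combined channel + equalizer response: a clean pulse plus a small residual
# that appears only after the last tap.
equalized = transversal_filter(channel_impulse, taps)
```

With these assumed values the leakage of 0.5 into the neighboring symbol is cancelled, and the only remaining interference is a residual of 0.5 ** 7 in the final sample, which shows why adding taps shrinks the leftover ISI.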
Efficient Modeling of Random Sampling-Based LRU Cache
The Miss Ratio Curve (MRC) is an important metric and an effective tool for predicting and optimizing caching system performance. Since the Least Recently Used (LRU) replacement policy is the de facto policy for many existing caching systems, most previous studies on efficient MRC construction focus on the LRU replacement policy. Recently, the random sampling-based replacement mechanism, as opposed to replacement relying on the rigid LRU data structure, has gained popularity because it is lightweight and flexible. To approximate LRU, at replacement time the system randomly selects K objects and replaces the least recently used object among the sample. Redis implements this approximated LRU policy. We observe that there can be a significant miss ratio gap between exact LRU and random sampling-based LRU for different sampling sizes K; therefore, existing LRU MRC construction techniques cannot be directly applied to random sampling-based LRU caches without loss of accuracy.
In this thesis, we present a new probabilistic stack algorithm named KRR, which can be used to accurately model random sampling-based LRU caches with arbitrary sampling size K. We propose two efficient stack update algorithms which reduce the expected running time of KRR from O(NM) to O(N log^2 M) and O(N log M), respectively, where N is the workload length and M is the number of distinct objects. Our implementation generates accurate miss ratio curves for both fixed and variable block size caches. Furthermore, we adopt spatial sampling, which further reduces the running time of KRR by several orders of magnitude and thus enables practical, low-overhead online application of KRR.
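The replacement rule described above (sample K objects at random, evict the least recently used among them) can be simulated directly. The sketch below is a brute-force simulator of that policy, not the KRR stack algorithm; the names `SampledLRUCache` and `miss_ratio` are hypothetical, and sampling without replacement is an assumption.

```python
import random


class SampledLRUCache:
    """Cache that evicts the least recently used of K randomly sampled keys."""

    def __init__(self, capacity, k, seed=0):
        self.capacity = capacity
        self.k = k
        self.rng = random.Random(seed)   # seeded for reproducible runs
        self.last_access = {}            # key -> logical time of last access
        self.clock = 0

    def access(self, key):
        """Record an access; return True on a hit, False on a miss."""
        self.clock += 1
        hit = key in self.last_access
        if not hit and len(self.last_access) >= self.capacity:
            resident = list(self.last_access)
            # Assumption: sample without replacement, capped at the cache size.
            candidates = self.rng.sample(resident, min(self.k, len(resident)))
            victim = min(candidates, key=self.last_access.__getitem__)
            del self.last_access[victim]
        self.last_access[key] = self.clock
        return hit


def miss_ratio(trace, capacity, k, seed=0):
    """Fraction of accesses in trace that miss for the given cache size and K."""
    cache = SampledLRUCache(capacity, k, seed)
    misses = sum(not cache.access(key) for key in trace)
    return misses / len(trace)
```

Calling `miss_ratio` over a range of capacities gives one point of the miss ratio curve per simulation. When K is at least the cache capacity the sample covers every resident object, so the policy reduces to exact LRU, while K = 1 is purely random eviction. Running one simulation per cache size is the cost that stack algorithms such as KRR aim to avoid.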