49 research outputs found

    Proteus: Simulating the Performance of Distributed DNN Training

    Full text link
    DNN models are becoming increasingly larger to achieve unprecedented accuracy, and the accompanying increased computation and memory requirements necessitate the employment of massive clusters and elaborate parallelization strategies to accelerate DNN training. In order to better optimize performance and analyze cost, it is indispensable to model the training throughput of distributed DNN training. However, complex parallelization strategies and the resulting complex runtime behaviors make it challenging to construct an accurate performance model. In this paper, we present Proteus, the first standalone simulator to model the performance of complex parallelization strategies through simulated execution. Proteus first models complex parallelization strategies with a unified representation named Strategy Tree. Then, it compiles the strategy tree into a distributed execution graph and simulates the complex runtime behaviors, such as computation-communication (comp-comm) overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate Proteus across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves a 3.0% average prediction error and preserves the throughput ordering of various parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.
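
    As background on the kind of analytical cost model such a simulator refines, the sketch below estimates per-iteration time for a simple data-parallel step by overlapping gradient all-reduce with the backward pass. It is a minimal illustration in Python; the function names, the ring all-reduce traffic formula, and the hardware numbers are assumptions for the example and do not reproduce Proteus's Strategy Tree or HTAE machinery.

        # Illustrative data-parallel cost sketch (not Proteus itself).
        # Assumes perfect overlap of gradient all-reduce with the backward pass;
        # all hardware numbers below are made-up placeholders.

        def allreduce_seconds(param_bytes, n_workers, bw_gbytes_per_s):
            # Ring all-reduce moves roughly 2*(n-1)/n of the gradient bytes per worker.
            traffic = 2.0 * (n_workers - 1) / n_workers * param_bytes
            return traffic / (bw_gbytes_per_s * 1e9)

        def iteration_seconds(fwd_s, bwd_s, param_bytes, n_workers, bw_gbytes_per_s):
            comm_s = allreduce_seconds(param_bytes, n_workers, bw_gbytes_per_s)
            # Communication hides behind the backward pass; any excess spills past it.
            return fwd_s + max(bwd_s, comm_s)

        if __name__ == "__main__":
            t = iteration_seconds(fwd_s=0.030, bwd_s=0.060,
                                  param_bytes=1.4e9 * 2,   # assumed 1.4B fp16 parameters
                                  n_workers=8, bw_gbytes_per_s=300.0)
            print(f"estimated iteration time: {t * 1e3:.1f} ms")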

    YaSpMV: Yet another SpMV framework on GPUs

    No full text
    SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs). Copyright © 2014 ACM.
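
    To make the row-index compression concrete, here is a minimal CPU sketch, in Python, of the bit-flag idea behind a BCCOO-like layout: explicit COO row indices are replaced by one flag per entry marking whether the entry closes its row, and y = A*x is computed with a simple segmented sum. The function names are hypothetical, the sketch assumes entries sorted by row with no empty rows, and it omits the blocking, vertical slicing, auto-tuning, and GPU kernels of the actual framework.

        # Bit-flag COO sketch: flag == 0 marks the last entry of a row, so explicit
        # row indices need not be stored. Illustrative only; not the paper's exact format.

        def to_bit_flag_coo(rows, cols, vals):
            # Assumes `rows` is sorted and the matrix has no empty rows.
            n = len(rows)
            flags = [1 if i + 1 < n and rows[i + 1] == rows[i] else 0 for i in range(n)]
            first_row = rows[0] if rows else 0
            return flags, cols, vals, first_row

        def spmv_bit_flag(flags, cols, vals, first_row, x, n_rows):
            y = [0.0] * n_rows
            row, acc = first_row, 0.0
            for flag, col, val in zip(flags, cols, vals):
                acc += val * x[col]
                if flag == 0:            # row (segment) ends: flush the partial sum
                    y[row] = acc
                    row, acc = row + 1, 0.0
            return y

    The sequential per-row accumulation above corresponds to the segmented sum that the paper parallelizes on the GPU with its matrix-based segmented sum/scan.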

    Responses of Soil Organic Carbon Decomposition and Temperature Sensitivity to N and P Fertilization in Different Soil Aggregates in a Subtropical Forest

    No full text
    Soil organic carbon (SOC) decomposition, a key process controlling the carbon (C) loss from terrestrial soils to the atmosphere, varies with soil aggregate size and is influenced by increasing nitrogen (N) and phosphorus (P) inputs from anthropogenic activities. However, how increasing N and P affects SOC decomposition and its temperature sensitivity (Q10) in soil aggregates remains unclear. Thus, we collected soils from a subtropical Cunninghamia lanceolata forest receiving N and P addition for 8 years to explore the interactive effects of N and P fertilization on SOC decomposition and its Q10 in mega-aggregates (>2 mm, MeA), macroaggregates (0.25–2.0 mm, MaA), and microaggregates (<0.25 mm, MiA). Both SOC decomposition and its Q10 differed among aggregate sizes. Specifically, SOC decomposition in MiA was 49.2% and 26.0% higher than in MeA and MaA, respectively. Moreover, the averaged Q10 values were 2.29, 2.26, and 1.83 in MeA, MaA, and MiA, respectively. SOC decomposition significantly increased by 39.4% in MaA and 23.7% in MiA with N fertilization, but P fertilization had less impact. However, P fertilization increased Q10 by 46.7% in MeA and 46.6% in MaA. Furthermore, we found that P fertilization changed the influence of N fertilization on SOC decomposition in MaA and MiA but had no effect on the responses of Q10 to N fertilization. Overall, our findings suggest that SOC decomposition and Q10 differ among aggregates and that fertilization affects both. Our results highlight the importance of accounting for differences among aggregates in SOC decomposition, and in its response to climate warming and nutrient input, when predicting SOC dynamics and its feedback to environmental change in terrestrial ecosystems under climate warming scenarios.
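
    For reference, the temperature sensitivity reported here is conventionally derived from decomposition (respiration) rates R1 and R2 measured at two incubation temperatures T1 and T2 (in °C); the expression below is the standard Q10 definition, not anything specific to this study's incubation design:

        Q_{10} = \left( \frac{R_2}{R_1} \right)^{10 / (T_2 - T_1)}

    For example, a rate that doubles across a 10 °C warming step gives Q10 = 2, so the reported values of roughly 1.8–2.3 indicate a near-doubling of decomposition per 10 °C of warming.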

    The complete chloroplast genome sequence of Lithocarpus litseifolius (Hance) Chun 1837 (Fagaceae) and phylogenetic analysis

    No full text
    Lithocarpus litseifolius (Hance) Chun 1837 is an evergreen tree of Fagaceae that can be used as sweet tea, a natural sweetener, and a precious medicinal material. The complete chloroplast genome of L. litseifolius was sequenced and its phylogenetic relationships were analyzed in this study. The chloroplast genome of L. litseifolius has a circular structure with a length of 161,322 bp and contains a pair of inverted repeat regions (IRs, 25,897 bp each), a large single-copy region (LSC, 90,551 bp), and a small single-copy region (SSC, 18,977 bp). A total of 131 genes were identified, including 37 tRNA genes, 8 rRNA genes, and 86 protein-coding genes. Phylogenetic analysis of 23 Fagaceae species indicated that Lithocarpus is monophyletic with strong bootstrap support and that L. litseifolius is closely related to Lithocarpus polystachyus.
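
    As a quick consistency check on the quadripartite structure (a standard property of chloroplast genomes, not an additional result of this paper), the reported region lengths sum to the total genome size: 90,551 bp (LSC) + 18,977 bp (SSC) + 2 × 25,897 bp (IRs) = 161,322 bp.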

    Lightweight, robust hierarchically porous ceramics from cost-effective powders for dye removal

    No full text
    A facile and cost-effective approach to hierarchically porous ceramics is proposed for the first time, using coal fly ash as the main powder and secondary aluminum dross (SAD) as the curing and foaming agent. Owing to the hydrolysis of AlN and Al in the SAD, gas and a lamellar aluminum hydroxide sol attached to the surfaces of unreacted particles are produced while the pH and viscosity increase, so that foaming and coagulation casting take place simultaneously to fabricate the porous green body. Increasing the SAD content promotes the formation of flaky aluminum hydroxide sol and enhances the mechanical strength of the green body, which reaches a compressive strength of 1.03 MPa at a bulk density of 0.68 g/cm3. Porous mullite-based ceramics could be sintered at a relatively low temperature of 1200–1250 °C owing to the alkali metal oxides and amorphous SiO2 in the solid wastes, which also help to form micropores in the pore walls through partial sintering. The resulting porous mullite-based ceramics possess hierarchical pores, a bulk density of 1.00 g/cm3, an open porosity of 66.15%, a specific surface area of 2.24 m2/g, and a compressive strength of 9.14 MPa. Their removal rate of malachite green reaches 99.1% at a dye concentration of 100 mg/L and an adsorbent dosage of 100 g/L.
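
    For context, the reported removal figures imply only a modest uptake per gram of ceramic: assuming all removed dye is adsorbed, 99.1% removal of a 100 mg/L malachite green solution at an adsorbent dosage of 100 g/L corresponds to roughly (0.991 × 100 mg/L) / (100 g/L) ≈ 0.99 mg of dye per gram of adsorbent.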

    yaSpMV

    No full text

    An Efficient Leaching of Palladium from Spent Catalysts through Oxidation with Fe(III)

    Get PDF
    Reclamation of spent catalysts for the efficient recovery of palladium (Pd) is gaining growing attention due to its scarcity and high supply risk. Extracting Pd from spent catalysts through an efficient, economical, and green method remains a challenge. In this study, Fe3+ is utilized to leach Pd through oxidation under mild conditions. Before leaching, distillation was used to remove and recover the organics from the spent catalysts. The effects of HCl concentration, Fe3+ concentration, NaCl concentration, leaching time, and temperature on the leaching efficiency of Pd were investigated to determine the optimum leaching conditions. The results show that Pd extraction and the dissolution of Al2O3 increase with higher HCl concentration. The effect of NaCl on Pd leaching efficiency is significant at low acid concentration (2.0 mol/L HCl). The leaching efficiency of Pd was 99.5% under the following conditions: 2.0 mol/L HCl, 4.0 mol/L NaCl, and 0.67 mol/L Fe3+ at 80 °C for 90 min. The leaching kinetics fits best to the shrinking-core model with surface chemical reaction control, and the activation energy for the leaching of Pd was 47.6 kJ/mol. PdCl42− was selectively adsorbed by an anion exchange resin, and the filtrate, still containing adequate H+, Cl−, and Fe3+, was reused as the leaching agent. Pd leaching efficiency remained above 96% after five cycles. This study provides an efficient process for the recovery of Pd from spent catalysts.
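
    The kinetic statements correspond to the textbook shrinking-core expression for surface-chemical-reaction control and the usual Arrhenius treatment of the rate constant; with x the fraction of Pd leached, t the leaching time, and k the apparent rate constant, they read

        1 - (1 - x)^{1/3} = k t, \qquad k = A \exp\left( -\frac{E_a}{R T} \right)

    where the fitted activation energy from the abstract is E_a ≈ 47.6 kJ/mol. These are the standard model forms, not equations quoted verbatim from the paper.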

    gpuroofline: a model for guiding performance optimizations on gpus

    No full text
    Performance optimization on GPUs requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem. This paper presents GPURoofline, an empirical model for guiding optimizations on GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architectures implement high-performance GPU kernels. The model addresses this problem by exploring potential performance bottlenecks and evaluating whether specific optimization techniques bring any performance improvement. To demonstrate the usage of the model, we optimize four representative kernels with different computation densities, namely matrix transpose, Laplace transform, integral, and face detection, on both NVIDIA and AMD GPUs. Experimental results show that under the guidance of GPURoofline, these kernels achieve speedups of 3.74 to 14.8 times over their naïve implementations on both NVIDIA and AMD GPU platforms. © 2012 Springer-Verlag.
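
    For readers unfamiliar with roofline-style reasoning, the short Python sketch below evaluates the classic roofline bound, min(peak compute, operational intensity × peak memory bandwidth), which is the idea an empirical model like GPURoofline builds on; the peak numbers are placeholder assumptions for a generic GPU, not measurements from the paper.

        # Classic roofline bound (background only; not GPURoofline's empirical model).
        # Peak figures below are placeholders for a generic GPU.

        def roofline_gflops(op_intensity_flop_per_byte, peak_gflops, peak_gbytes_per_s):
            # Attainable performance is capped either by compute or by memory bandwidth.
            return min(peak_gflops, op_intensity_flop_per_byte * peak_gbytes_per_s)

        if __name__ == "__main__":
            for oi in (0.25, 1.0, 4.0, 16.0):      # FLOPs per byte of DRAM traffic
                bound = roofline_gflops(oi, peak_gflops=3000.0, peak_gbytes_per_s=190.0)
                limiter = "memory-bound" if bound < 3000.0 else "compute-bound"
                print(f"OI = {oi:5.2f} flop/B -> {bound:7.1f} GFLOP/s ({limiter})")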