55 research outputs found

    Proteus: Simulating the Performance of Distributed DNN Training

    Full text link
    DNN models are becoming increasingly larger to achieve unprecedented accuracy, and the accompanying increase in computation and memory requirements necessitates massive clusters and elaborate parallelization strategies to accelerate DNN training. To better optimize performance and analyze cost, it is indispensable to model the training throughput of distributed DNN training. However, complex parallelization strategies and the resulting complex runtime behaviors make it challenging to construct an accurate performance model. In this paper, we present Proteus, the first standalone simulator to model the performance of complex parallelization strategies through simulated execution. Proteus first models complex parallelization strategies with a unified representation named Strategy Tree. It then compiles the strategy tree into a distributed execution graph and simulates the complex runtime behaviors, computation-communication overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate Proteus across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves a 3.0% average prediction error and preserves the throughput ordering of various parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.
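    As a rough intuition for how a strategy tree composes parallel dimensions, the following minimal Python sketch models a strategy as a nested fan-out. Node fields, names, and the single-child chain structure are assumptions for illustration only; they are not the Strategy Tree representation defined by Proteus.

```python
# Hypothetical sketch of a strategy tree: each node splits the workload along
# one parallel dimension, and nesting composes the dimensions. Illustrative
# only; not the representation used by Proteus.
from dataclasses import dataclass
from typing import Optional


@dataclass
class StrategyNode:
    dim: str                                 # "data", "pipeline", or "tensor"
    degree: int                              # fan-out along this dimension
    child: Optional["StrategyNode"] = None   # nested parallelism inside each shard

    def device_count(self) -> int:
        """Number of devices implied by this (sub)strategy."""
        return self.degree * (self.child.device_count() if self.child else 1)

    def per_device_batch(self, global_batch: int) -> int:
        """Only data-parallel dimensions divide the global batch."""
        inner = self.child.per_device_batch(global_batch) if self.child else global_batch
        return inner // self.degree if self.dim == "data" else inner


# Example: 4-way data parallel, each replica a 2-stage pipeline, each stage
# using 2-way tensor parallelism -> 16 devices in total.
strategy = StrategyNode("data", 4, StrategyNode("pipeline", 2, StrategyNode("tensor", 2)))
print(strategy.device_count())            # 16
print(strategy.per_device_batch(1024))    # 256, split only by the data dimension
```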

    Regenerated woody plants influence soil microbial communities in a subtropical forest

    Get PDF
    10 pages, 4 figures, 3 tables, references. Supplementary data to this article can be found online at https://doi.org/10.1016/j.apsoil.2023.104890. Forests are critical for supporting multiple ecosystem services such as climate change mitigation. Microbial diversity in soil provides important functions to maintain and regenerate forest ecosystems, and yet a critical knowledge gap remains in identifying the linkage between attributes of regenerated woody plant (RWP) communities and the diversity patterns of soil microbial communities in subtropical plantations. Here, we investigated the changes in soil microbial communities and plant traits in a nine-hectare Chinese fir (Cunninghamia lanceolata; CF) plantation to assess how non-planted RWP communities regulate soil bacterial and fungal diversity, and further explore the potential mechanisms that structure their interaction. Our study revealed that soil bacterial richness was positively associated with RWP richness, whereas soil fungal richness was negatively associated with RWP basal area. Meanwhile, RWP richness was positively correlated with ectomycorrhizal (ECM) fungal richness but negatively correlated with the richness of both pathogenic and saprotrophic fungi, suggesting that the RWP-fungal richness relationship was trophic guild-specific. Soil microbial community beta diversity (i.e., dissimilarity in community composition) was strongly coupled with both RWP beta diversity and the heterogeneity of RWP basal area. Our study highlights the importance of community-level RWP plant attributes for the regulation of microbial biodiversity in plantation systems, which should be considered in forest management programs in the future. This work was funded by the National Key Research and Development Program of China (2021YFD2201301 and 2022YFF1303003), the National Natural Science Foundation of China (U22A20612), and the Key Project of Jiangxi Province Natural Science Foundation of China (20224ACB205003). Peer reviewed.

    The Advancement of 7XXX Series Aluminum Alloys for Aircraft Structures: A Review

    No full text
    7XXX series aluminum alloys (Al 7XXX alloys) are widely used in load-bearing components, such as aircraft frames, spars, and stringers, for their high specific strength, high specific stiffness, high toughness, and excellent processing and welding performance. Al 7XXX alloys are therefore among the most important structural materials in aviation. In this review, the development trends and main applications of Al 7XXX alloys for aircraft structures are introduced, and the existing problems are briefly discussed. The heat treatment processes for improving their properties are also compared and analyzed. Optimizing alloy composition and improving heat treatment processes are the most important measures for enhancing the comprehensive properties of Al 7XXX alloys; among the heat treatment steps, solid solution, quenching, and aging are the most significant. We introduce the effects of these three steps on the properties, and forecast the development direction of the properties, compositions, and heat treatments, as well as the solution to the corrosion prediction problem, for the next generation of Al 7XXX alloys for aircraft structures. The next generation of Al 7XXX alloys should offer higher strength, higher toughness, higher damage tolerance, higher hardenability, and better corrosion resistance. There is an urgent need to develop new heat treatment regimes, and a novel corrosion prediction model for Al 7XXX alloys should be constructed by characterizing the surface corrosion environments and selecting accurate and reliable electrochemical measurements.

    Harmless disposal and resource utilization for secondary aluminum dross: A review

    No full text
    Secondary aluminum dross (SAD) is the solid waste left after aluminum is extracted from primary aluminum dross; it contains approximately 40–60 wt% alumina, 10–30 wt% aluminum nitride (AlN), 5–15 wt% salts, and other components. The salts include sodium chloride, potassium chloride, and fluorine salts. SAD has dual attributes, as both a resource and a pollutant. Landfill disposal of SAD has the disadvantages of occupying land, wasting resources, high cost, and great environmental impact. Current SAD utilization methods are pyrometallurgy and hydrometallurgy. In pyrometallurgy, AlN is oxidized and the salts are evaporated at high temperature; after mixing, molding, and calcination, firebricks and ceramics can be manufactured from SAD. In hydrometallurgy, AlN is hydrolyzed and the salts are dissolved in water; after dissolving, filtration, precipitation, washing, and calcination, γ-Al2O3 can be prepared from SAD. Resource consumption and emissions from both utilization methods were assessed. A ton of magnesium aluminum titanate based ceramics produced by pyrometallurgy consumes 1043 kg of raw materials and releases 69 kg of waste gas, 4.17 t of waste water, and no solid waste. A ton of γ-Al2O3 produced by hydrometallurgy consumes 3389 kg of raw materials and releases 111 kg of waste gas, 12.98 t of waste water, and 267 kg of solid waste. Therefore, the resource consumption and emissions of SAD utilization by pyrometallurgy are lower than those by hydrometallurgy. Future work should focus on reducing the emission of the three wastes from pyrometallurgy. SAD can be utilized for glass ceramics by pyrometallurgy: AlN and salts can be transformed into alumina and glass phases at high temperature with no emission. The mechanisms that should be clarified are SAD composition adjustment to lower the glass ceramics' melting point, the transformation of AlN and salts into alumina and glass phases, respectively, and the nucleation and crystal growth of glass ceramics at high temperature.
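    The per-ton figures quoted above can be compared directly. The short Python sketch below only restates the reported numbers and prints the hydrometallurgy-to-pyrometallurgy ratios; it introduces no data beyond the abstract.

```python
# Reported per-ton resource consumption and emissions for the two SAD
# utilization routes, taken directly from the abstract (no new data).
routes = {
    "pyrometallurgy (MgAl-titanate ceramics)": {
        "raw_materials_kg": 1043, "waste_gas_kg": 69,
        "waste_water_t": 4.17, "solid_waste_kg": 0,
    },
    "hydrometallurgy (gamma-Al2O3)": {
        "raw_materials_kg": 3389, "waste_gas_kg": 111,
        "waste_water_t": 12.98, "solid_waste_kg": 267,
    },
}

for name, figures in routes.items():
    print(name, figures)

# Ratios (hydrometallurgy / pyrometallurgy) for each quantity, where defined.
pyro = routes["pyrometallurgy (MgAl-titanate ceramics)"]
hydro = routes["hydrometallurgy (gamma-Al2O3)"]
for key in pyro:
    if pyro[key]:
        print(f"{key}: hydrometallurgy is {hydro[key] / pyro[key]:.2f}x pyrometallurgy")
```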

    YaSpMV: Yet another SpMV framework on GPUs

    No full text
    SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs). Copyright © 2014 ACM.
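    The bandwidth saving in BCCOO comes from replacing per-nonzero row indices with bit flags. The sketch below shows one plausible bit-flag encoding and a serial reference SpMV over it in Python; the convention (flag 0 closes a row), the field layout, and the helper names are assumptions for illustration and omit the blocking, vertical slicing, and GPU segmented scan that YaSpMV actually builds on.

```python
# Hypothetical sketch of a bit-flag row encoding in the spirit of BCCOO
# (convention and layout are illustrative, not the exact YaSpMV format).
import numpy as np

def encode_bitflag_coo(rows, cols, vals):
    """Replace explicit row indices with one flag per nonzero.
    Convention used here: flag = 0 marks the last nonzero of its row."""
    order = np.lexsort((cols, rows))               # sort nonzeros by (row, col)
    rows, cols, vals = rows[order], cols[order], vals[order]
    flags = np.ones(len(vals), dtype=np.uint8)
    last_of_row = np.append(rows[1:] != rows[:-1], True)
    flags[last_of_row] = 0                         # close each row
    row_ids = rows[last_of_row]                    # row index of each closed row
    return flags, cols, vals, row_ids

def spmv_bitflag(flags, cols, vals, row_ids, x, n_rows):
    """Serial reference SpMV: accumulate until a 0 flag closes the row."""
    y = np.zeros(n_rows, dtype=vals.dtype)
    acc, out = 0.0, 0
    for f, c, v in zip(flags, cols, vals):
        acc += v * x[c]
        if f == 0:                                 # end of a row segment
            y[row_ids[out]] = acc
            acc, out = 0.0, out + 1
    return y
```

    Packed to one bit per nonzero in a real implementation, the flags replace a 32-bit row index per entry, which is where a format of this kind gets its memory-traffic reduction.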

    An insightful program performance tuning chain for GPU computing

    No full text
    It is challenging to optimize GPU kernels because this process requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem of performance optimization. This paper presents an insightful performance tuning chain for GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architectures implement high performance GPU kernels directly. We achieve this by providing performance information to identify GPU program performance bottlenecks and decide which optimization methods should be adopted, so as to facilitate the best match between algorithm features and underlying hardware characteristics. To demonstrate the usage of the tuning chain, we optimize three representative GPU kernels with different compute intensity, Matrix Transpose, Laplace Transform and Integral, on both NVIDIA and AMD GPUs. Experimental results demonstrate that under the guidance of our tuning chain, the performance of those kernels achieves 7.8 to 42.4 times speedup compared to their naïve implementations on both NVIDIA and AMD GPU platforms. © 2012 Springer-Verlag.
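    A tuning chain of this kind rests on a simple decision: determine whether a kernel is memory-bound or compute-bound before picking optimizations. The following Python sketch shows a generic roofline-style classification; the function, thresholds, and device numbers are illustrative assumptions, not figures or methodology taken from the paper.

```python
# Generic roofline-style bottleneck check: compare a kernel's arithmetic
# intensity (FLOPs per byte moved) against the device's balance point.
# All device numbers below are illustrative, not taken from the paper.

def classify_kernel(flops: float, bytes_moved: float,
                    peak_gflops: float, peak_bandwidth_gbs: float) -> str:
    """Return 'memory-bound' or 'compute-bound' under the roofline model."""
    intensity = flops / bytes_moved                # FLOPs per byte of traffic
    balance = peak_gflops / peak_bandwidth_gbs     # device balance point
    return "memory-bound" if intensity < balance else "compute-bound"


if __name__ == "__main__":
    # Example: a matrix transpose does almost no arithmetic per byte moved,
    # so it is firmly memory-bound; optimizations should target coalescing
    # and on-chip caching rather than instruction throughput.
    # Hypothetical device: 3000 GFLOP/s peak, 190 GB/s peak bandwidth.
    print(classify_kernel(flops=1.0, bytes_moved=8.0,
                          peak_gflops=3000.0, peak_bandwidth_gbs=190.0))
```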

    Parallelization and performance optimization on face detection algorithm with OpenCL: A case study

    No full text
    Face detection applications have an inherent real-time requirement. Although the Viola-Jones algorithm handles detection elegantly, today's ever larger high-quality images and videos bring new real-time challenges. Parallelizing the Viola-Jones algorithm with OpenCL makes it possible to achieve high performance across both AMD and NVIDIA GPU platforms without introducing new algorithms. This paper presents the bottleneck of this application and discusses how to optimize face detection step by step from a very naïve implementation. Techniques such as hiding CPU execution time, subtle usage of local memory as a high-speed scratchpad and manual cache, and variable granularity were used to improve the performance. These techniques result in a 4 to 13 times speedup, varying with the image size. Furthermore, these ideas may shed some light on how to parallelize applications efficiently with OpenCL. Taking face detection as an example, this paper also summarizes some universal advice on how to optimize OpenCL programs, trying to help other applications do better on GPUs. © 2012 Tsinghua University Press.
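    One of the listed techniques, hiding CPU execution time, amounts to overlapping host-side preparation of the next frame with device-side detection of the current one. The sketch below illustrates that pipelining idea generically in Python, with a worker thread standing in for the OpenCL device; the function names and timings are hypothetical and do not reproduce the paper's implementation.

```python
# Generic sketch of hiding host (CPU) work behind device work by pipelining
# frames: while the "device" processes frame i, the host prepares frame i+1.
# prepare_frame/detect_faces are stand-ins, not the paper's OpenCL code.
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_frame(i):          # host-side work (decode, resize, integral image)
    time.sleep(0.01)
    return f"frame-{i}"

def detect_faces(frame):       # stand-in for an asynchronous device kernel
    time.sleep(0.02)
    return f"faces({frame})"

def pipelined(n_frames):
    results = []
    with ThreadPoolExecutor(max_workers=1) as device:
        pending = None
        for i in range(n_frames):
            frame = prepare_frame(i)              # CPU work overlaps with ...
            if pending is not None:
                results.append(pending.result())  # ... the previous device call
            pending = device.submit(detect_faces, frame)
        if pending is not None:
            results.append(pending.result())
    return results

print(pipelined(4))
```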

    Preparation and hydration of industrial solid waste—cement blends: A review

    No full text
    Industrial solid waste (ISW)—cement blends have the advantages of low carbon, low energy consumption, and low pollution, but their clinker replacement level in low carbon cement is generally low. To address this challenge, this study considers the latest progress and development trends in the ISW—cement blend research, focusing on the activation of ISWs, the formation of ISW—cement blends, and their associated hydration mechanisms. After the mechanical activation of ISWs, the D50 (average size) typically drops below 10 µm, and the specific surface area increases above 350 m2/kg. Thermal activation can increase the glassy-phase content and reactivity of ISWs, where the coal gangue activation temperature is usually set at 400–1000°C. Furthermore, the roles of ISWs in the hydration of ISW—cement blends are divided into physical and chemical roles. The physical action of ISWs usually acts in the early stage of the hydration of ISW—cement blends. Subsequently, ISWs participate in the hydration reaction of ISW—cement blends to generate products, such as C—(A)—S—H gels. Moreover, alkali activation affects the hydration kinetics of ISW—cement blends and modifies the proportion of gels. Environmental impacts and costs of ISW—cement blends have also been discussed to guide stakeholders in selecting sustainable ISWs.

    Phase evolution and properties of glass ceramic foams prepared by bottom ash, fly ash and pickling sludge

    No full text
    Municipal solid waste incineration products of bottom ash (BA) and fly ash (FA), together with pickling sludge (PS), which cause severe environmental pollution, were transformed into glass ceramic foams with the aid of CaCO3 as a pore-foaming agent during sintering. The effect of the BA/FA mass ratio on the phase composition, pore morphology, pore size distribution, physical properties, and glass structure was investigated. The results show that as the BA/FA ratio increases, the contents of the glass phase, Si-O-Si bonds, and Q3Si units decrease gradually, and the glass transition temperature of the mixture is also reduced. Consequently, the glass viscosity decreases, causing bubble coalescence and an uneven pore distribution. Glass ceramic foams with uniform spherical pores are fabricated when the contents of BA, FA, and PS are 35 wt%, 45 wt%, and 20 wt%, respectively, yielding high-performance glass ceramic foams with a bulk density of 1.76 g/cm3, a porosity of 56.01%, and a compressive strength exceeding 16.23 MPa. This versatile and low-cost approach provides new insight into synergistically recycling solid wastes.