687 research outputs found

    Mixed-Precision Random Projection for RandNLA on Tensor Cores

    Random projection can reduce the dimension of data while capturing its structure and is a fundamental tool for machine learning, signal processing, and information retrieval, which deal with a large amount of data today. RandNLA (Randomized Numerical Linear Algebra) leverages random projection to reduce the computational complexity of low-rank decomposition of tensors and to solve least-squares problems. While the computation of the random projection is a simple matrix multiplication, its asymptotic computational complexity is typically larger than that of other operations in a RandNLA algorithm. Therefore, various studies propose methods for reducing its computational complexity. We propose a fast mixed-precision random projection method on NVIDIA GPUs using Tensor Cores for single-precision tensors. We exploit the fact that the random matrix requires less precision, and develop a highly optimized matrix multiplication between FP32 and FP16 matrices -- SHGEMM (Single and Half-precision GEMM) -- on Tensor Cores, where the random matrix is stored in FP16. Our method can compute Randomized SVD 1.28 times faster and Random projection high order SVD 1.75 times faster than baseline single-precision implementations while maintaining accuracy. Comment: PASC'2
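
    The idea of keeping the data in FP32 while storing the random matrix in FP16 can be illustrated with a minimal pure-Python sketch. This is not the authors' SHGEMM kernel: the helper names (`to_fp16`, `to_fp32`, `gaussian_matrix_fp16`, `project`) are hypothetical, the Gaussian distribution for the random matrix is an assumption, and accumulating in Python floats before rounding the result to FP32 is a simplification of what Tensor Cores do.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gaussian_matrix_fp16(rows, cols, seed=0):
    """Hypothetical helper: a Gaussian random matrix whose entries are stored in FP16."""
    rng = random.Random(seed)
    return [[to_fp16(rng.gauss(0.0, 1.0)) for _ in range(cols)] for _ in range(rows)]


def project(a, omega):
    """Compute Y = A @ Omega with A rounded to FP32 and Omega held in FP16.

    Assumption: products are accumulated in Python floats (double precision)
    and only the final sum is rounded to FP32.
    """
    k = len(omega[0])
    result = []
    for row in a:
        out_row = []
        for j in range(k):
            acc = sum(to_fp32(x) * w[j] for x, w in zip(row, omega))
            out_row.append(to_fp32(acc))
        result.append(out_row)
    return result


# Project a 2 x 3 data matrix down to 2 columns.
a = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
omega = gaussian_matrix_fp16(3, 2, seed=1)
y = project(a, omega)
```

    Every entry of `omega` is exactly representable in FP16, so the random matrix needs half the storage of an FP32 matrix, which is the property the abstract exploits.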

    Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

    NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, whose theoretical peak performance is more than 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, the Bytes-per-Flops (B/F) ratio of the shared memory and Tensor Cores is small since the performance of Tensor Cores is high. Thus, it is important to reduce the shared memory footprint for efficient Tensor Core usage. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores with the roofline model and find that the bandwidth of shared memory might limit performance when using the WMMA API. To alleviate this issue, we provide a WMMA API extension library, which has two components, to boost the throughput of the computation. The first one allows for flexibly manipulating the array of registers that is input to Tensor Cores. We evaluate the performance improvement of this library. The outcome of our evaluation shows that our library reduces the shared memory footprint and speeds up the computation using Tensor Cores. The second one is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We have demonstrated that the single-precision emulating batch SGEMM implementation on Tensor Cores using this library achieves 54.2 TFlop/s on the A100 GPU, which outperforms the theoretical peak performance of FP32 SIMT Cores while achieving the same level of accuracy as cuBLAS. This throughput cannot be achieved without the shared memory footprint reduction provided by our library with the same amount of register usage. Comment: HPC Asia 202
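
    The roofline argument in this abstract can be sketched in a few lines: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The function names are hypothetical, and the bandwidth and intensity values below are illustrative placeholders rather than measured A100 figures; only the 300 TFlop/s peak is taken from the abstract.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity).

    TB/s multiplied by flop/byte gives TFlop/s, so the units line up.
    """
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)


def is_bandwidth_bound(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """True when the memory roof lies below the compute roof."""
    return bandwidth_tb_per_s * flops_per_byte < peak_tflops


PEAK = 300.0       # TFlop/s, "more than 300" per the abstract
BANDWIDTH = 10.0   # TB/s, placeholder value (assumption, not a measurement)

# Low intensity: every byte read from shared memory feeds few flops.
low = attainable_tflops(PEAK, BANDWIDTH, 8.0)
# Higher intensity: fewer shared memory bytes per flop, e.g. after footprint reduction.
high = attainable_tflops(PEAK, BANDWIDTH, 64.0)
```

    With these placeholder numbers the low-intensity case is capped by bandwidth, while the high-intensity case reaches the compute peak, which is why reducing the shared memory footprint matters.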

    Problems Surrounding the Modern Industrialization of China


    A Management Survey of Retail Merchants in the Asahikawa City Shopping District


    DGEMM on Integer Matrix Multiplication Unit

    Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMU). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication by using lower-precision computing units, and show the advantages and disadvantages of using IMMU. The experiment using integer Tensor Cores shows that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum circuit simulation by a factor of up to 4.33 while maintaining the FP64 accuracy.

    Quantum Circuit Simulation by SGEMM Emulation on Tensor Cores and Automatic Precision Selection

    Quantum circuit simulation provides the foundation for the development of quantum algorithms and the verification of quantum supremacy. Among the various methods for quantum circuit simulation, tensor network contraction has been increasing in popularity due to its ability to simulate a larger number of qubits. During tensor contraction, the input tensors are reshaped to matrices and computed by a GEMM operation, where these GEMM operations could reach up to 90% of the total calculation time. GEMM throughput can be improved by utilizing mixed-precision hardware such as Tensor Cores, but straightforward implementation results in insufficient fidelity for deep and large quantum circuits. Prior work has demonstrated that compensated summation with special care of the rounding mode can fully recover the FP32 precision of SGEMM even when using TF32 or FP16 Tensor Cores. The exponent range is a critical issue when applying such techniques to quantum circuit simulation. While TF32 supports almost the same exponent range as FP32, FP16 supports a much smaller exponent range. In this work, we use the exponent range statistics of input tensor elements to select which Tensor Cores we use for the GEMM. We evaluate our method on Random Circuit Sampling (RCS), including Sycamore's quantum circuit, and show that the throughput is 1.86 times higher at maximum while maintaining accuracy. Comment: This paper has been accepted to ISC'2
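
    Selecting a Tensor Core type from exponent range statistics might look roughly like the sketch below. The abstract does not give the actual selection rule, so the thresholds and the names `exponent_range` and `select_tensor_core` are assumptions for illustration only. The bounds are chosen conservatively: the smallest normal FP16 value is 2**-14, and the largest finite FP16 value is 65504, so magnitudes below 2**15 are safely representable.

```python
import math

# Assumed thresholds (not from the paper): accept FP16 only when every
# nonzero element has a base-2 exponent within [-14, 14].
FP16_MIN_EXP = -14
FP16_MAX_EXP = 14


def exponent_range(values):
    """Return (min, max) of floor(log2(|v|)) over nonzero values, or None if all are zero."""
    # math.frexp returns (m, e) with 0.5 <= |m| < 1, so floor(log2(|v|)) == e - 1.
    exps = [math.frexp(abs(v))[1] - 1 for v in values if v != 0.0]
    if not exps:
        return None
    return (min(exps), max(exps))


def select_tensor_core(values):
    """Pick "FP16" when all exponents fit the assumed FP16 window, else "TF32"."""
    stats = exponent_range(values)
    if stats is None:
        return "FP16"  # an all-zero tensor fits any format
    lo, hi = stats
    if lo >= FP16_MIN_EXP and hi <= FP16_MAX_EXP:
        return "FP16"
    return "TF32"


choice_small_range = select_tensor_core([0.5, 1.0, -3.0])
choice_wide_range = select_tensor_core([1e-10, 1.0])
```

    A real implementation would likely need a narrower FP16 window, since error-compensated schemes also have to represent small residual terms; this sketch ignores that.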

    Antibacterial Effects of Disulfiram in Helicobacter pylori

    Background: Helicobacter pylori infection increases the risk of gastrointestinal diseases, such as gastric cancer. The incidence of these diseases is significantly reduced by eradication, and eradication therapy is therefore generally performed. Disulfiram is an oral prescription drug mainly used for the treatment of alcohol dependence. In recent years, its anticancer and antibacterial effects have been reported, and it has attracted interest as a result. This study aimed to examine the antibacterial activity of disulfiram, investigate the presence or absence of its antibacterial activity on H. pylori, and determine whether it could be a new bactericidal drug against drug-resistant H. pylori. Materials and Methods: Drug-sensitive strains of H. pylori and amoxicillin-resistant, clarithromycin-resistant, and metronidazole-resistant strains were used, and a growth inhibition test of H. pylori using disulfiram was performed. Furthermore, the expression of urease, vacuolating cytotoxin A (VacA), and CagA, the virulence proteins of H. pylori, was quantitatively analyzed using the Western blotting method. In addition, for the H. pylori strains used in this study, the 16S rDNA sequence, a ribosomal gene involved in protein production, was analyzed to examine the presence or absence of gene mutation. Results: Disulfiram suppressed the growth of 7 out of 12 H. pylori strains at 1 μg/mL, and no correlation was observed between their susceptibility/resistance to current eradication antimicrobial drugs and disulfiram resistance. Disulfiram reduced the expression levels of urease, VacA, and CagA proteins. H. pylori strains that showed resistance to disulfiram tended to have fewer gene deletions/insertions in the 16S rDNA sequence; however, no specific mutation was detected. Conclusion: Disulfiram has a bactericidal effect on H. pylori at low concentrations, suggesting that it can be used as a supplement for current H. pylori eradication drugs.

    Development of Remotely Operated Vehicle for Small-size Jellyfish Extermination and its Evaluation of Extermination Motion Control

    In recent years, an increase in the number of jellyfish has caused damage to the fishery and tourism industries. Jellyfish are therefore exterminated by hand, but conventional extermination methods require a lot of time and manpower. In this paper, we propose a method for jellyfish extermination using an underwater robot. We also introduce the ROV-type underwater robot we developed, called J.E.N.O.S. (Jellyfish Extermination Nifty-robot for Ocean Sustentation), and its extermination motion control. The ROV was developed with attitude control during the extermination operation in mind, because the attitude of the ROV, such as surge and pitch angle, becomes unstable when performing jellyfish extermination. We therefore equipped the ROV with 8 thrusters to improve attitude stability during jellyfish extermination. As a result, surge acceleration is reduced to about 30.0%, and pitch angle velocity is reduced to about 25.8%. The 2022 International Conference on Artificial Life and Robotics (ICAROB 2022), January 20-23, 2022, online, Oita, Japa

    Clarithromycin Suppresses Human Respiratory Syncytial Virus Infection-Induced Streptococcus pneumoniae Adhesion and Cytokine Production in a Pulmonary Epithelial Cell Line

    Human respiratory syncytial virus (RSV) sometimes causes acute and severe lower respiratory tract illness in infants and young children. RSV strongly upregulates proinflammatory cytokines and the platelet-activating factor (PAF) receptor, which is a receptor for Streptococcus pneumoniae, in the pulmonary epithelial cell line A549. Clarithromycin (CAM), which is an antimicrobial agent and is also known as an immunomodulator, significantly suppressed RSV-induced production of interleukin-6, interleukin-8, and regulated on activation, normal T-cell expressed and secreted (RANTES). CAM also suppressed RSV-induced PAF receptor expression and adhesion of fluorescein-labeled S. pneumoniae cells to A549 cells. The RSV-induced S. pneumoniae adhesion was thought to be mediated by the host cell's PAF receptor. CAM, which exhibits antimicrobial and immunomodulatory activities, was found in this study to suppress the RSV-induced adhesion of the respiratory disease-causing bacterium S. pneumoniae to host cells. Thus, CAM might suppress immunological disorders and prevent secondary bacterial infections during RSV infection.