39 research outputs found

    FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

    Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, inference latency and large GPU memory consumption restrict their deployment. Recently, there have been efficient attempts to quantize LLMs, yet inference with large batch sizes or long sequences remains compute-bound. Fine-grained quantization methods have demonstrated their ability to achieve low-bit quantization for LLMs, but they require FP16 computation in linear layers, which is time-consuming for large batch sizes or long sequences. In this paper, we introduce a method called FlattenQuant, which significantly reduces the maximum value of a tensor by flattening its large channels, achieving low-bit per-tensor quantization with minimal accuracy loss. Our experiments show that FlattenQuant can directly apply 4-bit quantization to 48.29% of the linear-layer computation in LLMs, with the remaining layers using 8 bits. The 4-bit matrix multiplication introduced by FlattenQuant effectively addresses the compute-bound nature of large matrix calculations. Our work achieves up to 2× speedup and 2.3× memory reduction for LLMs with negligible loss in accuracy.
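    The channel-flattening idea described above can be illustrated with a minimal sketch (hypothetical function names, not the paper's implementation): an activation channel whose maximum magnitude exceeds a threshold is split into several equal-scaled copies, and the matching weight rows are duplicated, so the matrix product is preserved while the tensor's maximum shrinks enough for symmetric per-tensor low-bit quantization.

    ```python
    import numpy as np

    def flatten_channels(x, w, threshold):
        """Split activation channels whose max magnitude exceeds `threshold`
        into equal parts, duplicating the matching weight rows so that
        x @ w is preserved while max(|x|) shrinks."""
        cols, rows = [], []
        for c in range(x.shape[1]):
            col = x[:, c]
            n = max(1, int(np.ceil(np.abs(col).max() / threshold)))
            for _ in range(n):
                cols.append(col / n)   # each copy carries 1/n of the channel
                rows.append(w[c])      # matching weight row repeated n times
        return np.stack(cols, axis=1), np.stack(rows, axis=0)

    def quantize_per_tensor(t, bits=4):
        """Symmetric per-tensor quantization: one scale for the whole tensor."""
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(t).max() / qmax
        q = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale
    ```

    The point of the transformation is that a single outlier channel no longer dictates the per-tensor scale, so 4-bit levels are spent on the bulk of the distribution rather than on one extreme value.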

    ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

    The remarkable performance capabilities of AI accelerators offer promising opportunities for accelerating cryptographic algorithms, particularly in the context of lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation, overlooking the intricate internal mechanisms of these devices. Consequently, a significant portion of computational resources is underutilized. In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber's NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in cost compared to state-of-the-art scanning-based methods. Secondly, by reverse-engineering the internal mechanisms, we precisely manipulate the internal resources of Tensor Cores using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving remarkable speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Even when considering execution latency, our throughput-oriented full Kyber implementation maintains an acceptable execution latency. For instance, the execution latency ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 when achieving the peak throughput.
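    The core trick — expressing a number-theoretic transform as one dense matrix multiplication, the shape of workload Tensor Cores consume — can be sketched in a few lines. The sketch below is a simplified cyclic 128-point NTT over Kyber's modulus q = 3329 (using the standard root of unity ζ = 17), not the paper's negacyclic, Tensor-Core-level implementation:

    ```python
    import numpy as np

    Q = 3329                   # Kyber modulus
    N = 128                    # transform size
    ZETA = 17                  # primitive 256th root of unity mod 3329
    OMEGA = pow(ZETA, 2, Q)    # primitive 128th root -> cyclic NTT

    # Forward NTT as a single N x N matrix: V[i][j] = omega^(i*j) mod q.
    # A matmul of this shape is exactly what a Tensor Core accelerates.
    V = np.array([[pow(OMEGA, (i * j) % N, Q) for j in range(N)]
                  for i in range(N)], dtype=np.int64)

    def ntt(a):
        return (V @ a) % Q

    def intt(a_hat):
        n_inv = pow(N, Q - 2, Q)  # N^-1 mod Q (Q is prime)
        V_inv = np.array([[pow(OMEGA, (-i * j) % N, Q) for j in range(N)]
                          for i in range(N)], dtype=np.int64)
        return (n_inv * (V_inv @ a_hat)) % Q
    ```

    With this diagonalization, polynomial (cyclic) multiplication becomes pointwise products in the transform domain; the paper's contribution is decomposing the transform matrix into iterated small matmuls that fit Tensor Core tile sizes.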

    DeepOpht: Medical Report Generation for Retinal Images via Deep Models and Visual Explanation

    In this work, we propose an AI-based method that aims to improve the conventional retinal disease treatment procedure and help ophthalmologists increase diagnostic efficiency and accuracy. The proposed method is composed of a deep neural network-based (DNN-based) module, including a retinal disease identifier and a clinical description generator, and a DNN visual explanation module. To train and validate the effectiveness of our DNN-based module, we propose a large-scale retinal disease image dataset. As ground truth, we also provide a retinal image dataset manually labeled by ophthalmologists to qualitatively show that the proposed AI-based method is effective. Our experimental results show that the proposed method is quantitatively and qualitatively effective. Our method is capable of creating meaningful retinal image descriptions and visual explanations that are clinically relevant. Comment: Accepted to IEEE WACV 202

    Efficacy and safety of the compound Chinese medicine SaiLuoTong in vascular dementia: A randomized clinical trial

    Introduction: No licensed medications are available to treat vascular dementia (VaD). Methods: Patients were randomly assigned to experimental groups (SaiLuoTong [SLT] 360 mg or 240 mg for groups A and B, respectively, for 52 weeks) or a placebo group (group C, which received SLT 360 mg or 240 mg only from weeks 27 to 52). Results: Three hundred twenty-five patients were included in the final analysis. At week 26, the difference in VaD Assessment Scale-cognitive subscale scores was 2.67 (95% confidence interval, 1.54 to 3.81) for group A versus group C, and 2.48 (1.34 to 3.62) for group B versus group C (both ). Discussion: This study suggests that SLT is effective for the treatment of VaD, and this compound Chinese medicine may represent a better choice to treat VaD.

    Mechanisms and recent advances in the diagnosis and treatment of nitrous oxide-induced peripheral neuropathy: a narrative review

    Under standard conditions, nitrous oxide (N2O) is a colorless, odorless gas with a mildly sweet taste. The compound finds applications in various fields, including use as an aerosol propellant, an accelerant in motor racing, and an anesthetic in surgical procedures and dentistry. Unfortunately, recreational misuse of N2O has become prevalent among young people due to its euphoric and hallucinogenic effects. Compounding this issue, nitrous oxide can be easily obtained from over-the-counter household items, facilitating its non-medical use. The global community has witnessed a surge in the recreational use of nitrous oxide gas in recent years. Despite this widespread non-medical abuse, the potential adverse effects of N2O exposure remain inadequately understood. This paper provides an overview of the clinical presentations, laboratory and electrodiagnostic characteristics, and management of neurological disorders induced by nitrous oxide use.