FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
Large language models (LLMs) have demonstrated state-of-the-art performance
across various tasks. However, the latency of inference and the large GPU
memory consumption of LLMs restrict their deployment performance. Recently,
there have been some efficient attempts to quantize LLMs, yet inference with
large batch size or long sequence still has the issue of being compute-bound.
Fine-grained quantization methods have showcased their proficiency in achieving
low-bit quantization for LLMs, while requiring FP16 data type for linear layer
computations, which is time-consuming when dealing with large batch size or
long sequence. In this paper, we introduce a method called FlattenQuant, which
significantly reduces the maximum value of the tensor by flattening the large
channels in the tensor, to achieve low bit per-tensor quantization with minimal
accuracy loss. Our experiments show that FlattenQuant can directly apply 4-bit quantization to 48.29% of the linear layer calculations in LLMs, with the remaining layers using 8 bits. The 4-bit matrix multiplication introduced by FlattenQuant effectively addresses the compute-bound nature of large matrix calculations. Our work achieves up to 2x speedup and 2.3x memory reduction for LLMs with negligible loss in accuracy.
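The core idea, shrinking a tensor's global maximum by splitting outlier channels so that a single per-tensor scale suffices at low bit width, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `flatten_channels` helper, the threshold choice, and the symmetric 4-bit scheme are assumptions for demonstration only.

```python
import numpy as np

def flatten_channels(x, threshold):
    """Split any channel whose max |value| exceeds `threshold` into k
    repeated channels carrying x/k each, shrinking the tensor-wide max
    so one per-tensor scale can cover it at low bit width (assumed scheme)."""
    cols, mapping = [], []
    for j in range(x.shape[1]):
        col = x[:, j]
        k = max(1, int(np.ceil(np.abs(col).max() / threshold)))
        cols.extend([col / k] * k)   # k copies, each scaled by 1/k
        mapping.extend([j] * k)      # remember the source channel
    return np.stack(cols, axis=1), mapping

def quantize_per_tensor(x, bits=4):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale
```

Because each split channel carries 1/k of the original values, duplicating the matching weight rows (`W[mapping]`) preserves the linear layer's output: `x_flat @ W[mapping]` equals `x @ W`, while the flattened activations now fit a 4-bit per-tensor scale.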
ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches
The remarkable performance capabilities of AI accelerators offer promising opportunities for accelerating cryptographic algorithms, particularly in the context of lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation, overlooking the intricate internal mechanisms of these devices. Consequently, a significant portion of their computational resources goes underutilized.
In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber's NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in costs compared to the state-of-the-art scanning-based methods. Secondly, by reversing the internal mechanisms, we precisely manipulate the internal resources of Tensor Cores using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving remarkable speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Even when considering execution latency, our throughput-oriented full Kyber implementation maintains an acceptable execution latency. For instance, the execution latency ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 when achieving the peak throughput.
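The first contribution, recasting the NTT as matrix multiplications so it maps onto Tensor Core GEMMs, can be illustrated in miniature. The sketch below uses hypothetical toy parameters (q = 17, n = 8), not Kyber's actual 256-coefficient negacyclic NTT over q = 3329, and shows only the NTT-as-matrix-product idea, not the paper's iterative decomposition or assembly-level Tensor Core code.

```python
def ntt_matrix(n, g, q):
    """V[i][j] = g^(i*j) mod q: with this matrix, the transform becomes one
    matrix-vector (or, batched over many polynomials, matrix-matrix) product."""
    return [[pow(g, i * j, q) for j in range(n)] for i in range(n)]

def matvec_mod(V, a, q):
    """Plain modular matrix-vector product -- the operand shape GEMM hardware likes."""
    n = len(a)
    return [sum(V[i][j] * a[j] for j in range(n)) % q for i in range(n)]

# Toy parameters: 2 is a primitive 8th root of unity mod 17 (2^8 = 256 = 1 mod 17).
q, n, g = 17, 8, 2
forward = ntt_matrix(n, g, q)
inverse = ntt_matrix(n, 9, q)   # 9 = 2^(-1) mod 17

a = [1, 2, 3, 4, 0, 0, 0, 0]
a_hat = matvec_mod(forward, a, q)
# Undo the transform: apply the inverse matrix, then scale by n^(-1) = 15 mod 17.
a_back = [(15 * c) % q for c in matvec_mod(inverse, a_hat, q)]
assert a_back == a
```

Conceptually, stacking many coefficient vectors as matrix columns turns an entire polynomial-vector NTT into a few low-precision GEMMs, which is exactly the workload Tensor Cores accelerate.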
DeepOpht: Medical Report Generation for Retinal Images via Deep Models and Visual Explanation
In this work, we propose an AI-based method that intends to improve the
conventional retinal disease treatment procedure and help ophthalmologists
increase diagnosis efficiency and accuracy. The proposed method is composed of
a deep neural networks-based (DNN-based) module, including a retinal disease
identifier and clinical description generator, and a DNN visual explanation
module. To train and validate the effectiveness of our DNN-based module, we
propose a large-scale retinal disease image dataset. Also, as ground truth, we
provide a retinal image dataset manually labeled by ophthalmologists to
qualitatively show that the proposed AI-based method is effective. With our
experimental results, we show that the proposed method is quantitatively and
qualitatively effective. Our method is capable of creating meaningful retinal
image descriptions and visual explanations that are clinically relevant. Comment: Accepted to IEEE WACV 202
Efficacy and safety of the compound Chinese medicine SaiLuoTong in vascular dementia: A randomized clinical trial
Introduction: No licensed medications are available to treat vascular dementia (VaD).
Methods: Patients were randomly assigned to experimental groups (SaiLuoTong [SLT] 360 mg or 240 mg for groups A and B, respectively, for 52 weeks) or a placebo group (group C, which received SLT 360 mg or 240 mg only from weeks 27 to 52).
Results: Three hundred twenty-five patients were included in the final analysis. At week 26, the difference in VaD Assessment Scale-cognitive subscale scores was 2.67 (95% confidence interval, 1.54 to 3.81) for groups A versus C, and 2.48 (1.34 to 3.62) for groups B versus C (both
Discussion: This study suggests that SLT is effective for the treatment of VaD, and this compound Chinese medicine may represent a better choice for treating VaD.
Mechanisms and recent advances in the diagnosis and treatment of nitrous oxide-induced peripheral neuropathy: a narrative review
Under standard conditions, nitrous oxide (N2O) manifests as a colorless, odorless gas with a mildly sweet taste. The compound finds applications in various fields, including its use as an aerosol propellant, an accelerant in motor racing, and an anesthetic in surgical procedures and dentistry. Unfortunately, the recreational misuse of N2O has become prevalent among young individuals due to its euphoric and hallucinogenic effects. Compounding this issue is the fact that nitrous oxide can be easily obtained from over-the-counter household items, facilitating its non-medical use. The global community has witnessed a surge in the recreational use of nitrous oxide gas in recent years. Despite the widespread non-medical abuse of N2O, there remains inadequate understanding of the potential adverse effects resulting from exposure to it. This paper provides an overview of management findings, laboratory and electrodiagnostic characteristics, as well as clinical presentations associated with neurological disorders induced by nitrous oxide use.