444 research outputs found

    Performance and Microarchitectural Analysis for Image Quality Assessment

    This thesis presents a performance analysis of five mature Image Quality Assessment (IQA) algorithms: VSNR, MAD, MSSIM, BLIINDS, and VIF, using the VTune profiler from Intel. The main performance parameter considered is execution time. First, we conduct Hotspot Analysis to find the most time-consuming sections of the five algorithms. Second, we perform Microarchitectural Analysis to analyze the behavior of the algorithms on Intel's Sandy Bridge microarchitecture and to find architectural bottlenecks. Existing research on improving the performance of IQA algorithms is based on advanced signal processing techniques. Our research focuses instead on the interaction of IQA algorithms with the underlying hardware and architectural resources. We propose coding techniques that exploit hardware resources and consequently improve execution time and computational performance. Along with software tuning methods, we also propose a generic custom IQA hardware engine based on the microarchitectural analysis and the behavior of these five algorithms on the underlying microarchitectural resources.
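
    A quick way to see what Hotspot Analysis surfaces is to time candidate stages by hand. The sketch below is a minimal, illustrative stand-in (the stage workloads are hypothetical, not taken from the thesis); VTune does this automatically and at far finer granularity.

        // Minimal hotspot-timing sketch, assuming an IQA pipeline decomposed
        // into stage functions. The dummy workloads below are placeholders.
        #include <chrono>
        #include <cstdio>

        template <typename F>
        double timeStage(const char* name, F&& stage) {
            auto t0 = std::chrono::steady_clock::now();
            stage();                                    // run the candidate hotspot
            auto t1 = std::chrono::steady_clock::now();
            double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
            std::printf("%-16s %8.2f ms\n", name, ms);  // rank stages by cost
            return ms;
        }

        int main() {
            // Hypothetical stand-ins for real IQA stages such as a DWT pass
            // and the statistical computations that follow it.
            timeStage("DWT", [] { volatile double s = 0; for (int i = 0; i < 1 << 22; ++i) s += i; });
            timeStage("statistics", [] { volatile double s = 1; for (int i = 1; i < 1 << 20; ++i) s *= 1.0000001; });
        }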

    Analysis and Performance Optimization of a GPGPU Implementation of Image Quality Assessment (IQA) Algorithm VSNR

    Image processing has changed the way we store, view and share images. One important component of sharing images over networks is image compression. Lossy image compression techniques compromise the quality of images to reduce their size. To ensure that the distortion introduced by compression is not readily detectable by humans, the perceived quality of an image needs to be maintained above a certain threshold. Determining this threshold is best done using human subjects, but that is impractical in real-world scenarios. As a solution to this issue, image quality assessment (IQA) algorithms are used to automatically compute a fidelity score for an image. However, poor performance of IQA algorithms has been observed due to the complex statistical computations involved. General Purpose Graphics Processing Unit (GPGPU) programming is one of the solutions proposed to optimize the performance of these algorithms. This thesis presents a Compute Unified Device Architecture (CUDA) based optimized implementation of the full-reference IQA algorithm Visual Signal-to-Noise Ratio (VSNR), which uses an M-level 2D Discrete Wavelet Transform (DWT) with 9/7 biorthogonal filters among other statistical computations. The presented implementation is tested on four different image quality databases containing images with multiple distortions and sizes ranging from 512 x 512 to 1600 x 1280. The CUDA implementation of VSNR shows a speedup of over 32x for 1600 x 1280 images, and the speedup is observed to scale with image size. The results show that the implementation is fast enough to apply VSNR to high-definition video at a frame rate of 60 fps. This work presents the optimizations enabled by the GPU's constant memory and by reuse of allocated memory on the GPU, and shows the performance improvement obtained through profiler-driven GPGPU development in CUDA. The presented implementation can be deployed in production combined with existing applications.
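
    The constant-memory optimization mentioned above can be sketched as follows. This is a minimal illustration under assumed names (c_lowpass, rowConvolve) and an assumed 9-tap lowpass stage, not the thesis's actual kernels: filter taps that every thread reads are placed in __constant__ memory, which the hardware can broadcast, and device buffers are allocated once and reused across images rather than per call.

        // Minimal CUDA sketch of the constant-memory pattern described above.
        // Names and the 9-tap filter length are illustrative assumptions.
        #include <cuda_runtime.h>

        __constant__ float c_lowpass[9];   // filter taps, uploaded once and reused

        __global__ void rowConvolve(const float* in, float* out, int width, int height) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= width || y >= height) return;
            float acc = 0.0f;
            for (int k = -4; k <= 4; ++k) {
                int xi = min(max(x + k, 0), width - 1);   // clamp at the borders
                acc += c_lowpass[k + 4] * in[y * width + xi];
            }
            out[y * width + x] = acc;
        }

        // Host side (sketch): upload taps once, allocate buffers once, reuse
        // them across all images/frames instead of reallocating per call.
        //   float taps[9] = { /* 9/7 analysis lowpass coefficients */ };
        //   cudaMemcpyToSymbol(c_lowpass, taps, sizeof(taps));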

    Strategies for improving efficiency and efficacy of image quality assessment algorithms

    Image quality assessment (IQA) research aims to predict the quality of images in a manner that agrees with subjective quality ratings. Over the last several decades, the major impetus in IQA research has been improving prediction efficacy globally (across images) for distortion-specific or general distortion types; very few studies have explored local image quality (within images) or the use of IQA algorithms for improved JPEG2000 coding. Even fewer studies have focused on analyzing and improving the runtime performance of IQA algorithms. Moreover, reduced-reference (RR) IQA, in which side information about the original image is received along with the distorted image when transmission bandwidth is limited, also remains largely unexplored. This report explores these four topics. For local image quality, we provide a local sharpness database, and we analyze the database along with current sharpness metrics. We reveal that humans agree strongly when rating the sharpness of small blocks. Overall, this sharpness database is a true representation of human subjective ratings, and current sharpness algorithms reach an SROCC score of up to 0.87 on it. For JPEG2000 coding using IQA, we provide a new JPEG2000 image database, which includes only images with the same total distortion. Analysis of existing IQA algorithms on this database reveals that even though current algorithms perform reasonably well on JPEG2000-compressed images in popular image-quality databases, they often fail to predict the correct rankings on our database's images. Based on the framework of Most Apparent Distortion (MAD), a new algorithm, MADDWT, is then proposed, using local DWT coefficient statistics to predict the perceived distortion due to subband quantization. MADDWT outperforms all other algorithms on this database and shows promise for use in JPEG2000 coding. For efficiency of IQA algorithms, this report is the first to examine IQA algorithms from the perspective of their interaction with the underlying hardware and microarchitectural resources, and to perform a systematic performance analysis using state-of-the-art tools and techniques from other computing disciplines. We implemented four popular full-reference IQA algorithms and two no-reference algorithms in C++ based on the code provided by their respective authors. Hotspot analysis and microarchitectural analysis of each algorithm were performed and compared. Despite the fact that all six algorithms share common algorithmic operations (e.g., filterbanks and statistical computations), our results reveal that different IQA algorithms overwhelm different microarchitectural resources and give rise to different types of bottlenecks. For RR IQA, we provide a new framework that employs multiscale sharpness maps as the reduced information. As we demonstrate, our framework with 2% reduced information can outperform other frameworks that employ from 2% to 3% reduced information, and it is also competitive with current state-of-the-art FR algorithms.
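
    Since SROCC is the figure of merit quoted above, a minimal sketch of its computation may help. It assumes no tied ranks, so the closed form rho = 1 - 6*sum(d_i^2) / (n(n^2 - 1)) applies; with ties, the Pearson correlation of the ranks is used instead.

        // Minimal SROCC sketch, assuming no tied ranks.
        #include <algorithm>
        #include <cstdio>
        #include <numeric>
        #include <vector>

        std::vector<double> ranks(const std::vector<double>& v) {
            std::vector<int> idx(v.size());
            std::iota(idx.begin(), idx.end(), 0);
            std::sort(idx.begin(), idx.end(), [&](int a, int b) { return v[a] < v[b]; });
            std::vector<double> r(v.size());
            for (size_t i = 0; i < idx.size(); ++i) r[idx[i]] = double(i + 1);
            return r;
        }

        // objective: algorithm scores; subjective: human ratings (e.g., MOS/DMOS).
        double srocc(const std::vector<double>& objective, const std::vector<double>& subjective) {
            auto ro = ranks(objective), rs = ranks(subjective);
            double d2 = 0.0;
            for (size_t i = 0; i < ro.size(); ++i) d2 += (ro[i] - rs[i]) * (ro[i] - rs[i]);
            double n = double(ro.size());
            return 1.0 - 6.0 * d2 / (n * (n * n - 1.0));
        }

        int main() {
            std::vector<double> pred = {0.9, 0.7, 0.4, 0.2};  // toy predicted scores
            std::vector<double> mos  = {4.5, 3.9, 2.1, 1.3};  // toy subjective ratings
            std::printf("SROCC = %.3f\n", srocc(pred, mos));  // 1.000: identical ordering
        }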

    Best practices for HPM-assisted performance engineering on modern multicore processors

    Many tools and libraries employ hardware performance monitoring (HPM) on modern processors, and using this data for performance assessment and as a starting point for code optimizations is very popular. However, such data is only useful if it is interpreted with care, and if the right metrics are chosen for the right purpose. We demonstrate the sensible use of hardware performance counters in the context of a structured performance engineering approach for applications in computational science. Typical performance patterns and their respective metric signatures are defined, and some of them are illustrated using case studies. Although these generic concepts do not depend on specific tools or environments, we restrict ourselves to modern x86-based multicore processors and use the likwid-perfctr tool under the Linux OS.
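
    As a concrete illustration of the raw counter access that tools such as likwid-perfctr build on, here is a minimal Linux sketch (not from the paper; error handling elided) that counts cycles and retired instructions around a code region and reports IPC, one of the simplest metric signatures used in such pattern-based analysis.

        // Minimal hardware-counter sketch for Linux using perf_event_open.
        #include <linux/perf_event.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <cstdio>
        #include <cstring>

        static int openCounter(unsigned long long config) {
            perf_event_attr attr;
            std::memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = config;
            attr.disabled = 1;
            attr.exclude_kernel = 1;   // count user-space events only
            return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        }

        int main() {
            int cyc = openCounter(PERF_COUNT_HW_CPU_CYCLES);
            int ins = openCounter(PERF_COUNT_HW_INSTRUCTIONS);
            ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
            ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

            volatile double x = 0.0;                 // stand-in for the measured region
            for (long i = 0; i < 10000000; ++i) x += 1.0;

            ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
            ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);
            long long cycles = 0, instrs = 0;
            read(cyc, &cycles, sizeof(cycles));
            read(ins, &instrs, sizeof(instrs));
            std::printf("IPC ~ %.2f\n", double(instrs) / double(cycles));
            return 0;
        }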

    Determinants of the Mechanical Behavior of Human Lumbar Vertebrae After Simulated Mild Fracture

    The ability of a vertebra to carry load after an initial deformation and the determinants of this postfracture load-bearing capacity are critical but poorly understood. This study aimed to determine the mechanical behavior of vertebrae after simulated mild fracture and to identify the determinants of this postfracture behavior. Twenty-one human L3 vertebrae were analyzed for bone mineral density (BMD) by dual-energy X-ray absorptiometry (DXA) and for microarchitecture by micro–computed tomography (µCT). Mechanical testing was performed in two phases: initial compression of each vertebra to 25% deformity, followed, after 30 minutes of relaxation, by a similar test to failure to determine postfracture behavior. We assessed (1) initial and postfracture mechanical parameters, (2) changes in mechanical parameters, (3) postfracture elastic behavior by recovery of vertebral height after relaxation, and (4) postfracture plastic behavior by residual strength and stiffness. Postfracture failure load and stiffness were 11% ± 19% and 53% ± 18% lower than initial values (p = .021 and p < .0001, respectively), with 29% to 69% of the variation in the postfracture mechanical behavior explained by the initial values. Both initial and postfracture mechanical behaviors were significantly correlated with bone mass and microarchitecture. Vertebral deformation recovery averaged 31% ± 7% and was associated with trabecular and cortical thickness (r = 0.47 and r = 0.64; p = .03 and p = .002, respectively). Residual strength and stiffness were independent of bone mass and initial mechanical behavior but were related to trabecular and cortical microarchitecture (|r| = 0.50 to 0.58; p = .02 to .006). In summary, we found marked variation in the postfracture load-bearing capacity following simulated mild vertebral fractures. Bone microarchitecture, but not bone mass, was associated with postfracture mechanical behavior of vertebrae. © 2011 American Society for Bone and Mineral Research

    Design of secure and robust cognitive system for malware detection

    Machine learning based malware detection techniques rely on grayscale images of malware and tend to classify malware based on the distribution of textures in those grayscale images. Despite the advancement and promising results shown by machine learning techniques, attackers can exploit their vulnerabilities by generating adversarial samples, which are crafted by intelligently adding perturbations to the input samples. The majority of existing adversarial attacks and defenses are software-based. To defend against adversaries, existing malware detection based on machine learning and grayscale images needs a preprocessing step for adversarial data, which adds overhead and can prolong real-time malware detection. As an alternative, we explore an RRAM (Resistive Random Access Memory) based defense against adversaries. The aim of this thesis is therefore to address these critical system security issues by demonstrating proposed techniques for designing a secure and robust cognitive system. First, a novel technique to detect stealthy malware is proposed. The technique renders malware binaries as images, extracts features from them, and employs different ML classifiers on the dataset thus obtained. Results demonstrate that this technique successfully differentiates classes of malware based on the extracted features. Second, I demonstrate the effects of adversarial attacks on a reconfigurable RRAM-neuromorphic architecture with different learning algorithms and device characteristics. I also propose an integrated solution for mitigating the effects of adversarial attacks using the reconfigurable RRAM architecture.
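
    The binary-to-image step described above can be sketched minimally as follows; the fixed row width of 256 bytes and the function name binaryToGrayscale are illustrative assumptions, not the thesis's exact procedure. Each byte of the binary becomes one pixel intensity in the range 0-255.

        // Minimal sketch: render a malware binary as a grayscale image, one
        // byte per pixel. The width of 256 is an illustrative assumption.
        #include <cstdint>
        #include <fstream>
        #include <iterator>
        #include <vector>

        std::vector<std::vector<uint8_t>> binaryToGrayscale(const char* path, size_t width = 256) {
            std::ifstream f(path, std::ios::binary);
            std::vector<uint8_t> bytes((std::istreambuf_iterator<char>(f)),
                                       std::istreambuf_iterator<char>());
            size_t height = (bytes.size() + width - 1) / width;   // round up
            std::vector<std::vector<uint8_t>> img(height, std::vector<uint8_t>(width, 0));
            for (size_t i = 0; i < bytes.size(); ++i)
                img[i / width][i % width] = bytes[i];   // byte value = pixel intensity
            return img;
        }

        int main(int argc, char** argv) {
            if (argc < 2) return 1;                     // usage: ./a.out <binary>
            auto img = binaryToGrayscale(argv[1]);      // feed to texture features / ML
            return img.empty() ? 1 : 0;
        }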

    Role of Trabecular Microarchitecture and Its Heterogeneity Parameters in the Mechanical Behavior of Ex Vivo Human L3 Vertebrae

    Low bone mineral density (BMD) is a strong risk factor for vertebral fracture risk in osteoporosis. However, many fractures occur in people with moderately decreased or normal BMD. Our aim was to assess the contributions of trabecular microarchitecture and its heterogeneity to the mechanical behavior of human lumbar vertebrae. Twenty-one human L3 vertebrae were analyzed for BMD by dual-energy X-ray absorptiometry (DXA) and microarchitecture by high-resolution peripheral quantitative computed tomography (HR-pQCT) and then tested in axial compression. Microarchitecture heterogeneity was assessed using two vertically oriented virtual biopsies—one anterior (Ant) and one posterior (Post)—each divided into three zones (superior, middle, and inferior) and using the whole vertebral trabecular volume for the intraindividual distribution of trabecular separation (Tb.Sp*SD). Heterogeneity parameters were defined as (1) ratios of anterior to posterior microarchitectural parameters and (2) the coefficient of variation of microarchitectural parameters from the superior, middle, and inferior zones. BMD alone explained up to 44% of the variability in vertebral mechanical behavior, bone volume fraction (BV/TV) up to 53%, and trabecular architecture up to 66%. Importantly, bone mass (BMD or BV/TV) in combination with microarchitecture and its heterogeneity improved the prediction of vertebral mechanical behavior, together explaining up to 86% of the variability in vertebral failure load. In conclusion, our data indicate that regional variation of microarchitecture assessment expressed by heterogeneity parameters may enhance prediction of vertebral fracture risk. © 2010 American Society for Bone and Mineral Research
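
    For illustration only, the two kinds of heterogeneity parameters defined above reduce to simple formulas for any scalar microarchitectural parameter such as trabecular thickness. The sketch below uses names of my own and the population standard deviation, an assumption the paper may not share.

        // Illustrative sketch of the two heterogeneity parameter types:
        // (1) anterior/posterior ratio, (2) coefficient of variation across
        // the superior, middle, and inferior zones (CV = SD / mean).
        #include <cmath>

        double antPostRatio(double anterior, double posterior) {
            return anterior / posterior;
        }

        double zoneCV(double superior, double middle, double inferior) {
            double mean = (superior + middle + inferior) / 3.0;
            double var = ((superior - mean) * (superior - mean) +
                          (middle - mean) * (middle - mean) +
                          (inferior - mean) * (inferior - mean)) / 3.0;  // population variance
            return std::sqrt(var) / mean;
        }

        // Usage: zoneCV(tbThSup, tbThMid, tbThInf) for Tb.Th heterogeneity.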

    Towards Accurate Run-Time Hardware-Assisted Stealthy Malware Detection: A Lightweight, yet Effective Time Series CNN-Based Approach

    According to recent security analysis reports, malicious software (a.k.a. malware) is rising at an alarming rate in number, complexity, and harmful purpose, compromising the security of modern computer systems. Recently, malware detection based on low-level hardware features (e.g., Hardware Performance Counter (HPC) information) has emerged as an effective alternative to address the complexity and performance overheads of traditional software-based detection methods. Hardware-assisted Malware Detection (HMD) techniques depend on standard Machine Learning (ML) classifiers to detect signatures of malicious applications by monitoring built-in HPC registers during execution at run-time. Prior HMD methods, though effective, have limited their study to detecting malicious applications that are spawned as separate threads during application execution; hence, detecting stealthy malware patterns at run-time remains a critical challenge. Stealthy malware refers to harmful cyber attacks in which malicious code is hidden within benign applications and remains undetected by traditional malware detection approaches. In this paper, we first present a comprehensive review of recent advances in hardware-assisted malware detection studies that have used standard ML techniques to detect malware signatures. Next, to address the challenge of stealthy malware detection at the processor's hardware level, we propose StealthMiner, a novel specialized time series machine learning-based approach to accurately detect stealthy malware traces at run-time using branch instructions, the most prominent HPC feature. StealthMiner is based on a lightweight time series Fully Convolutional Neural Network (FCN) model that automatically identifies potentially contaminated samples in HPC-based time series data and utilizes them to accurately recognize the trace of stealthy malware. Our analysis demonstrates that state-of-the-art ML-based malware detection methods are not effective in detecting stealthy malware samples, since the captured HPC data not only represent malware but also carry benign applications' microarchitectural data. The experimental results demonstrate that with the aid of our novel intelligent approach, stealthy malware can be detected at run-time with 94% detection performance on average using only one HPC feature, outperforming state-of-the-art HMD and general time series classification methods by up to 42% and 36%, respectively.
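
    A minimal sketch of the kind of time series model described above may clarify the architecture: a bank of 1D convolution filters slid over the branch-instruction count series, ReLU, global average pooling per filter, then a dense layer and sigmoid score. The layer count, shapes, and names are illustrative assumptions; the actual StealthMiner network is not reproduced here.

        // Illustrative forward pass of a tiny time series FCN:
        // conv1d -> ReLU -> global average pooling -> dense -> sigmoid.
        #include <algorithm>
        #include <cmath>
        #include <vector>

        // series: branch-instruction counts sampled at fixed intervals.
        // kernels/bias: one weight vector and bias per convolution filter.
        // denseW/denseB: final dense layer. Assumes series longer than kernels.
        double fcnScore(const std::vector<double>& series,
                        const std::vector<std::vector<double>>& kernels,
                        const std::vector<double>& bias,
                        const std::vector<double>& denseW, double denseB) {
            double z = denseB;
            for (size_t f = 0; f < kernels.size(); ++f) {
                const auto& k = kernels[f];
                double pooled = 0.0;
                size_t n = 0;
                for (size_t t = 0; t + k.size() <= series.size(); ++t) {
                    double acc = bias[f];
                    for (size_t j = 0; j < k.size(); ++j) acc += k[j] * series[t + j];
                    pooled += std::max(acc, 0.0);       // ReLU activation
                    ++n;
                }
                z += denseW[f] * (pooled / double(n));  // GAP per filter -> dense
            }
            return 1.0 / (1.0 + std::exp(-z));          // sigmoid malware score
        }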

    Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units

    A key challenge in modern computing systems is to access data fast enough to fully utilize the computing elements in the chip. In Graphics Processing Units (GPUs), performance is often constrained by register file size, memory bandwidth, and the capacity of the main memory. One important technique for alleviating this challenge is data compression. By reducing the amount of data that needs to be communicated or stored, memory resources crucial for performance can be efficiently utilized. This thesis provides a set of approximation and compression techniques for GPUs, with the goal of efficiently utilizing the computational fabric and thereby increasing performance. The thesis shows that these techniques can substantially lower the amount of information the system has to process, and are thus important tools for meeting challenges in memory utilization. This thesis makes contributions within three areas: controlled floating-point precision reduction, lossless and lossy memory compression, and distributed training of neural networks. In the first area, the thesis shows that through automated and controlled floating-point approximation, the register file can be more efficiently utilized. This is achieved through a framework which establishes a cross-layer connection between the application and the microarchitecture layer, and a novel register file organization capable of leveraging low-precision floating-point values and narrow integers for increased capacity and performance. Within the area of compression, this thesis aims at increasing the effective bandwidth of GPUs by presenting a lossless and lossy memory compression algorithm to reduce the amount of transferred data. In contrast to state-of-the-art compression techniques such as Base-Delta-Immediate and Bitplane Compression, which use intra-block bases for compression, the proposed algorithm leverages multiple global base values to reach a higher compression ratio. The algorithm includes an optional approximation step for floating-point values which offers a higher compression ratio at a given, low error rate. Finally, within the area of distributed training of neural networks, this thesis proposes a subgraph approximation scheme for graph data which mitigates accuracy loss in a distributed setting. The scheme allows neural network models that use graphs as inputs to converge at single-machine accuracy, while minimizing synchronization overhead between the machines.
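
    The contrast drawn above, global bases versus intra-block bases, can be sketched minimally as follows; the base set, the 8-bit delta width, and the fallback policy are toy assumptions, not the thesis's actual algorithm. Each 32-bit value in a block is encoded as a narrow delta from one of a few shared global bases, and the block is stored uncompressed if no base fits.

        // Toy sketch of base-plus-delta compression with global bases: each
        // 32-bit value becomes an 8-bit signed delta from a shared base.
        #include <cstdint>
        #include <optional>
        #include <vector>

        struct Compressed {
            uint32_t base;               // chosen global base
            std::vector<int8_t> deltas;  // one narrow delta per value (4x smaller)
        };

        // Try each global base; succeed if every delta fits in a signed byte.
        std::optional<Compressed> compressBlock(const std::vector<uint32_t>& block,
                                                const std::vector<uint32_t>& globalBases) {
            for (uint32_t base : globalBases) {
                std::vector<int8_t> deltas;
                bool ok = true;
                for (uint32_t v : block) {
                    int64_t d = int64_t(v) - int64_t(base);
                    if (d < -128 || d > 127) { ok = false; break; }
                    deltas.push_back(int8_t(d));
                }
                if (ok) return Compressed{base, deltas};
            }
            return std::nullopt;   // no base fits: store the block uncompressed
        }

        // Usage: compressBlock(block, {0x00000000u, 0x40000000u, 0xFFFFFF00u});
        // the base list here is arbitrary; real schemes pick bases from data.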