4 research outputs found

    Avoiding conversion and rearrangement overhead in SIMD architectures

    In this dissertation, a novel SIMD extension called Modified MMX (MMMX) for multimedia computing is presented. Specifically, the MMX architecture is enhanced with the extended subwords and the matrix register file techniques. The extended subwords technique uses SIMD registers that are wider than the packed format used to store the data. This avoids data type conversion overhead and increases parallelism in SIMD architectures. Without it, the subwords of the source SIMD registers must be promoted to larger subwords before they can be processed, and the results must be demoted again before they can be written back to memory, which incurs conversion overhead. The matrix register file technique allows data that is stored consecutively in memory to be loaded into a column of the register file, where a column consists of the corresponding subwords of different registers. In other words, this technique provides both row-wise and column-wise access to the media register file. It is a useful approach for the matrix operations that are common in multimedia processing. In addition, new and general SIMD instructions addressing the multimedia application domain are investigated. The work does not consider an application-specific ISA; for example, special-purpose instructions are synthesized from a few general-purpose SIMD instructions. The performance of the MMMX architecture is compared to that of the MMX/SSE architecture for different multimedia applications and kernels using the sim-outorder simulator of the SimpleScalar toolset. Additionally, three issues related to the efficient implementation of the 2D Discrete Wavelet Transform (DWT) on general-purpose processors, in particular the Pentium 4, are discussed. These are 64K aliasing, cache conflict misses, and SIMD vectorization. 64K aliasing is a phenomenon on the Pentium 4 that can degrade performance by an order of magnitude. It occurs if two or more data items whose addresses differ by a multiple of 64K need to be cached simultaneously. There are also many cache conflict misses in the implementation of vertical filtering of the DWT if the filter length exceeds the number of cache ways. In this dissertation, techniques are proposed to avoid 64K aliasing and to mitigate cache conflict misses. Furthermore, the performance of the 2D DWT is improved by exploiting data-level parallelism using the SIMD instructions supported by most general-purpose processors.
    Electrical Engineering, Mathematics and Computer Science
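The 64K aliasing condition described in this abstract (addresses that differ by a multiple of 64K) can be illustrated with a small check. This is a minimal sketch, not the dissertation's method: the function names, the 4096-byte row pitch, and the 64-byte padding are all assumptions chosen for illustration.

```python
ALIAS_STRIDE = 64 * 1024  # 64K bytes, the distance named in the abstract


def may_alias_64k(addr_a: int, addr_b: int) -> bool:
    """Return True if two distinct addresses differ by a multiple of 64K."""
    diff = abs(addr_a - addr_b)
    return diff != 0 and diff % ALIAS_STRIDE == 0


def row_address(base: int, row: int, pitch_bytes: int) -> int:
    """Address of the first byte of a row in a row-major image buffer."""
    return base + row * pitch_bytes


# Assumed example: 1024 four-byte samples per row gives a 4096-byte pitch,
# so rows that are 16 apart are exactly 64K apart.
unpadded = may_alias_64k(row_address(0, 0, 4096), row_address(0, 16, 4096))

# Hypothetical mitigation: pad each row by 64 bytes so the same pair of rows
# no longer sits a multiple of 64K apart.
padded = may_alias_64k(row_address(0, 0, 4160), row_address(0, 16, 4160))
```

With these assumed sizes, `unpadded` is `True` and `padded` is `False`. Whether padding corresponds to either of the techniques the dissertation proposes is not stated in the abstract.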

    Parallel implementation of Gray Level Co-occurrence Matrices and Haralick texture features on cell architecture

    Texture feature extraction algorithms are key functions in various image processing applications such as medical imaging, remote sensing, and content-based image retrieval. The most common way to extract texture features is to use Gray Level Co-occurrence Matrices (GLCMs). A GLCM contains second-order statistical information about the spatial relationships of the pixels of an image. Haralick texture features are extracted from these GLCMs. However, GLCM computation and Haralick texture feature extraction are computationally intensive. In this paper, we apply different parallel techniques, such as task- and data-level parallelism, to exploit the available parallelism of these algorithms on the Cell multi-core processor. Experimental results show that our parallel implementations using 16 Synergistic Processor Elements reduce the computation time of the GLCM and texture feature extraction algorithms by a factor of 10 over non-parallel optimized implementations for image sizes from 128×128 to 1024×1024.
    Software Computer Technology; Electrical Engineering, Mathematics and Computer Science
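To make the GLCM idea concrete, the following is a minimal sequential sketch of a co-occurrence matrix for a single pixel offset, plus one Haralick-style feature (contrast) computed from it. The function names, the non-symmetric counting, and the tiny 4-level test image are assumptions for illustration; the paper's parallel Cell implementation is not reproduced here.

```python
def glcm(image, levels, d_row=0, d_col=1):
    """Count how often gray level i has gray level j at offset (d_row, d_col)."""
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Contrast feature: sum of (i - j)^2 weighted by normalized co-occurrence."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total


# Assumed 4x4 test image with gray levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
m = glcm(image, levels=4)   # horizontal neighbor, one pixel to the right
feature = contrast(m)
```

Each pixel pair is counted independently of the others, which is the kind of structure that data-level parallelism can exploit.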

    Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

    The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying loop interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.
    Microelectronics & Computer Engineering; Electrical Engineering, Mathematics and Computer Science
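The loop interchange mentioned in this abstract can be sketched as two orderings of the same vertical filter. This is an illustrative sketch only: the function names, the plain sliding-window filter (no downsampling or boundary extension, unlike a real DWT), and the test data are assumptions. Python itself will not show the cache effect; the point is the access order.

```python
def vertical_filter_column_order(img, taps):
    """Straightforward order: walk down each column (strided in row-major data)."""
    rows, cols = len(img), len(img[0])
    out_rows = rows - len(taps) + 1
    out = [[0.0] * cols for _ in range(out_rows)]
    for c in range(cols):
        for r in range(out_rows):
            acc = 0.0
            for k, t in enumerate(taps):
                acc += t * img[r + k][c]
            out[r][c] = acc
    return out


def vertical_filter_row_order(img, taps):
    """Interchanged order: the innermost loop sweeps along a contiguous row."""
    rows, cols = len(img), len(img[0])
    out_rows = rows - len(taps) + 1
    out = [[0.0] * cols for _ in range(out_rows)]
    for r in range(out_rows):
        dst = out[r]
        for k, t in enumerate(taps):
            src = img[r + k]
            for c in range(cols):
                dst[c] += t * src[c]
    return out


img = [[1, 2], [3, 4], [5, 6]]
taps = [1, 1]
a = vertical_filter_column_order(img, taps)
b = vertical_filter_row_order(img, taps)
```

Both functions compute the same result. In the interchanged version each output row touches one input row per filter tap, which is why the abstract notes that conflict misses can remain when the filter length exceeds the cache associativity.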

    A fuzzy fine-tuned model for COVID-19 diagnosis

    The COVID-19 pandemic spread rapidly worldwide and caused extensive loss of life and financial losses. Therefore, finding accurate, accessible, and inexpensive methods for diagnosing the disease has challenged researchers. To automate the process of diagnosing COVID-19 from images, several strategies based on deep learning, such as transfer learning and ensemble learning, have been presented. However, these techniques cannot deal with noise and its propagation through different layers. In addition, many of the datasets already in use are imbalanced, and most techniques perform only binary classification of COVID-19 versus normal cases. To address these issues, we use the blind/referenceless image spatial quality evaluator to filter out inappropriate data in the dataset. To increase the volume and diversity of the data, we merge two datasets. This combination allows multi-class classification between three states: normal, COVID-19, and pneumonia, including bacterial and viral types. A weighted multi-class cross-entropy is used to reduce the effect of data imbalance. In addition, a fuzzy fine-tuned Xception model is applied to reduce noise propagation in different layers. Quantitative analysis shows that our proposed model achieves 96.60% accuracy on the merged test set, which is more accurate than the previously mentioned state-of-the-art methods.
    Computer Engineering; Quantum Circuit Architectures and Technology
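A weighted multi-class cross-entropy of the kind this abstract mentions can be sketched as follows. The abstract does not give the weights or the normalization, so both are assumptions here: the weights are arbitrary illustrative values, and the loss is averaged by the sum of the weights of the samples seen.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Mean of -w[y] * log(p[y]) over samples, normalized by the sum of weights."""
    loss_sum = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        loss_sum += -w * math.log(p[y])
        weight_sum += w
    return loss_sum / weight_sum


# Assumed three-class example (e.g. normal, COVID-19, pneumonia) in which the
# second class is treated as under-represented and given twice the weight.
probs = [
    [0.5, 0.25, 0.25],   # predicted distribution for a sample of class 0
    [0.25, 0.5, 0.25],   # predicted distribution for a sample of class 1
]
labels = [0, 1]
class_weights = [1.0, 2.0, 1.0]
loss = weighted_cross_entropy(probs, labels, class_weights)
```

Errors on a heavily weighted class contribute more to the loss, which is how such a weighting can counteract class imbalance during training.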