
    Energy Efficient Computing on Multi-core Processors: Vectorization and Compression Techniques

    Over the past few years, energy consumption has become the main limiting factor for computing in general. This has led CPU vendors to aggressively promote parallel computing using multiple cores without significantly increasing the thermal design power of the processor. However, achieving maximum performance and energy efficiency from the available resources on multi-core and many-core platforms demands efficient exploitation of existing and emerging architectural features at the application level. This thesis studies some of these technologies to identify their potential for achieving high performance and energy efficiency for a set of Smart Grid applications on Intel multi-core and many-core platforms. The first part of the thesis explores the energy-efficiency impact of different multi-core programming techniques for a selected set of benchmarks and Smart Grid applications on Intel Sandy Bridge and Haswell multi-core processors. These techniques include thread-level parallelism using OpenMP, task-based parallelism using OmpSs, data parallelism using SIMD (Single Instruction Multiple Data) instruction sets, code optimizations, and the use of existing optimized math libraries. In our initial case studies, SIMD vectorization proves very effective in providing both high performance and energy efficiency. However, it can also put pressure on the available memory bandwidth for applications like the Powel Time-Series Kernel, causing under-utilization of the computing resources and thus energy-inefficient execution. In the second part of this research, we investigate opportunities for improving the performance of SIMD vectorization for memory-bound applications using SIMD data compression, SIMD software prefetching, SIMD shuffling, code blocking, and other code transformation techniques. The key idea is to reduce data movement across the memory hierarchy by using otherwise idle CPU time. We show that integrating data compression is feasible on Intel multi-core platforms, as long as the compression itself completes quickly enough to pay for itself. We present a comprehensive discussion of the SIMD compression techniques and the code transformations required for efficient SIMD computation in memory- and cache-bound applications, using the Powel Time-Series Kernel as a demonstrator application. Finally, we perform a feasibility study of the SIMD optimization and compression techniques in other application domains using the k-means clustering algorithm and the full-search motion estimation algorithm, and extend our experiments to the Intel many-core architecture using the Intel Xeon Phi coprocessor.
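
    As a rough illustration of the compression idea (not the thesis's actual AVX implementation), the sketch below uses NumPy vectorized operations as a stand-in for SIMD instructions: a delta-encoded series is stored in a narrower integer type and reconstructed on the fly with a prefix sum, trading spare CPU cycles for reduced memory traffic. The synthetic data and 8-bit packing are illustrative assumptions.

        import numpy as np

        # Monotone synthetic series standing in for time-series samples.
        samples = np.cumsum(np.random.randint(0, 16, size=1_000_000).astype(np.int64))

        # Delta encoding: consecutive differences need far fewer bits than raw values.
        deltas = np.diff(samples, prepend=0)
        assert int(deltas.max()) < 256

        # Pack deltas into the narrowest type that holds them, shrinking the
        # data that must cross the memory hierarchy by 8x versus int64.
        packed = deltas.astype(np.uint8)

        # "Decompress" on the fly with a prefix sum, spending compute cycles
        # to recover the original values instead of streaming them from memory.
        restored = np.cumsum(packed.astype(np.int64))
        assert np.array_equal(restored, samples)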

    Grad Queue: A probabilistic framework to reinforce sparse gradients

    Informative gradients are often lost in large batch updates. We propose a robust mechanism to reinforce the sparse components within a random batch of data points. A finite queue of online gradients is used to estimate their expected instantaneous statistics. We propose a function that measures the scarcity of incoming gradients using these statistics, and we establish the theoretical grounding of this mechanism. To minimize conflicting components within large mini-batches, samples with aligned objectives are grouped by clustering in the inherent feature space. Sparsity is measured for each centroid and weighted accordingly. A strong, intuitive criterion for squeezing redundant information out of each cluster is the backbone of the system. It makes rare information indifferent to aggressive momentum and also exhibits superior performance over larger mini-batch horizons. The effective length of the queue is kept variable to follow the local loss pattern. The contribution of our method is to restore intra-mini-batch diversity while widening the optimal batch boundary; together, these drive optimization deeper towards the minima. Our method shows superior performance on CIFAR10, MNIST, and the Reuters news category dataset compared to mini-batch gradient descent. Comment: 15 pages, 6 figures
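
    The abstract does not give the exact scarcity function, so the sketch below is a hypothetical reading of the mechanism: a finite queue of recent gradients supplies running statistics, and components that deviate from the queue's expectation are up-weighted before the optimizer step. The z-score-based weighting is a placeholder assumption, not the paper's formulation.

        from collections import deque
        import numpy as np

        class GradQueue:
            """Finite queue of online gradients; rare (sparse) components are
            reinforced relative to the queue's running statistics."""

            def __init__(self, maxlen=64, eps=1e-8):
                self.queue = deque(maxlen=maxlen)  # effective length is bounded
                self.eps = eps

            def reinforce(self, grad):
                self.queue.append(grad)
                stack = np.stack(self.queue)
                mean, std = stack.mean(axis=0), stack.std(axis=0)
                # Components far from the queue's expectation carry rare
                # information; boost them so large batches do not drown them.
                scarcity = np.abs(grad - mean) / (std + self.eps)
                return grad * (1.0 + scarcity)

        # Usage: wrap each mini-batch gradient before the parameter update.
        gq = GradQueue(maxlen=32)
        g_reinforced = gq.reinforce(np.random.randn(10))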

    TinyM²Net-V3: Memory-Aware Compressed Multimodal Deep Neural Networks for Sustainable Edge Deployment

    The advancement of sophisticated artificial intelligence (AI) algorithms has led to a notable increase in energy usage and carbon dioxide emissions, intensifying concerns about climate change. This growing problem has brought the environmental sustainability of AI technologies to the forefront, especially as they expand across various sectors. In response to these challenges, there is an urgent need for sustainable AI solutions. These solutions must focus on energy-efficient embedded systems capable of handling diverse data types even in resource-limited environments, ensuring both technological progress and environmental responsibility. Integrating complementary multimodal data into tiny machine learning models for edge devices is challenging due to increased complexity, latency, and power consumption. This work introduces TinyM²Net-V3, a system that processes different modalities of complementary data, designs deep neural network (DNN) models, and employs model compression techniques, including knowledge distillation and low bit-width quantization with memory-aware considerations, to fit models within the lower levels of the memory hierarchy, reducing latency and enhancing energy efficiency on resource-constrained devices. We evaluated TinyM²Net-V3 in two multimodal case studies: COVID-19 detection from cough, speech, and breathing audio, and pose classification from depth and thermal images. With tiny inference models (6 KB and 58 KB), we achieved accuracies of 92.95% and 90.7%, respectively. Our tiny machine learning models, deployed on resource-limited hardware, demonstrated latencies within milliseconds and very high power efficiency. Comment: Accepted at AAAI 2024 Workshop SA
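
    As a generic sketch of the low bit-width quantization step (the paper's exact scheme and bit widths are not specified in the abstract), symmetric per-tensor int8 quantization shrinks a weight tensor fourfold so it can sit in a lower level of the memory hierarchy:

        import numpy as np

        def quantize_int8(w):
            # Map floats onto [-127, 127] with a single per-tensor scale.
            scale = np.abs(w).max() / 127.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            return q, scale

        def dequantize(q, scale):
            return q.astype(np.float32) * scale

        weights = np.random.randn(256, 64).astype(np.float32)
        q, s = quantize_int8(weights)
        print("memory:", weights.nbytes, "->", q.nbytes, "bytes")  # 4x smaller
        print("max abs error:", np.abs(dequantize(q, s) - weights).max())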

    A vectorized k-means algorithm for compressed datasets: design and experimental analysis

    Clustering algorithms (e.g., Gaussian mixture models, k-means) tackle the problem of grouping a set of elements such that elements in the same group (or cluster) have more similar properties to each other than to elements in other clusters. This simple concept is the basis of complex algorithms in many application areas, including sequence analysis and genotyping in bioinformatics, medical imaging, antimicrobial activity, market research, and social networking. However, as data volumes continue to increase, the performance of clustering algorithms is heavily influenced by the memory subsystem. In this paper, we propose a novel and efficient implementation of Lloyd's k-means clustering algorithm that substantially reduces data movement along the memory hierarchy. Our contributions build on the fact that the vast majority of processors are equipped with powerful Single Instruction Multiple Data (SIMD) instructions that are, in most cases, underused. SIMD improves the CPU's computational power and, used wisely, is an opportunity to improve application data transfers by compressing and decompressing the data, especially for memory-bound applications. Our contributions include a SIMD-friendly data layout organization, in-register implementations of key functions, and SIMD-based compression. We demonstrate that our optimized SIMD-based compression method improves the performance and energy of k-means by factors of 4.5x and 8.7x, respectively, on an i7 Haswell machine, and by 22x and 22.2x on a Xeon Phi (KNL), running a single thread. This is a post-peer-review, pre-copyedit version of an article published in the Journal of Supercomputing; the final authenticated version is available online at: https://doi.org/10.1007/s11227-018-2310-
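
    A minimal sketch of the SIMD-friendly layout idea, with NumPy broadcasting standing in for the paper's AVX/AVX-512 intrinsics: storing points as a structure of arrays keeps each dimension contiguous, so the distance computation of the k-means assignment step maps directly onto vector lanes. The data shapes and sizes are illustrative.

        import numpy as np

        def assign_soa(dims, centroids):
            """dims: list of d contiguous arrays, one per coordinate (SoA layout).
            centroids: (k, d) array. Returns the nearest-centroid label per point."""
            n, k = dims[0].shape[0], centroids.shape[0]
            dist2 = np.zeros((k, n))
            for j, coord in enumerate(dims):          # one vectorized pass per dim
                diff = coord[None, :] - centroids[:, j][:, None]
                dist2 += diff * diff
            return dist2.argmin(axis=0)

        rng = np.random.default_rng(0)
        points = rng.random((10_000, 3))
        dims = [np.ascontiguousarray(points[:, j]) for j in range(3)]
        labels = assign_soa(dims, points[rng.choice(10_000, 8)])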

    High Quality Delay Testing Scheme for a Self-Timed Microprocessor

    The popularity of the Internet and the huge amount of data transferred between devices nowadays require very powerful servers that draw a great deal of power. Since higher power consumption means higher costs, demand for power-efficient processors is rising. One way to increase the power efficiency of processors is to adapt the processing speed and chip activity to the required computational load. Self-timed, or asynchronous, processors are one of the solutions that apply this principle of activity on demand. However, their unconventional design methodology introduces several challenges in terms of testability and design automation. This work develops a high-quality delay test for a specific self-timed processor architecture called the AnARM. The proposed delay test focuses on catching effective small-delay defects (SDDs) in the AnARM by taking advantage of its built-in configurable delay lines. Such defects are known to escape one of the most commonly used delay fault models (the transition delay fault model). This work focuses mainly on effective SDDs that can escape transition delay fault testing yet are large enough to fail the circuit under normal operating conditions. At the same time, the detection of very small delay defects is avoided where possible, to avoid falsely failing functional chips. To build the high-quality delay test, this work first develops an SDD test quality metric that is better suited to circuits with adaptable speeds, and then builds a delay test optimizer that adapts the built-in delay line speeds to a pre-existing at-speed pattern set to create a high-quality SDD test. This work presents a novel SDD test quality metric called the weighted slack percentage (WeSPer), along with a new SDD testing model (the ideal SDD test model). WeSPer is built to be a flexible metric capable of adapting to the availability of information about the circuit under test and the test environment. Since the AnARM can use multiple test speeds, the WeSPer computation takes special care in assessing the effects of test frequency changes on test quality. In particular, care is taken to avoid overtesting the circuit, since overtesting fails circuits due to defects that are too small to affect their functionality in their present state. A computation framework is built to compute WeSPer and compare it with other metrics from the literature over large sets of process-voltage-temperature points. Simulations are performed on a selected set of known benchmark circuits synthesized in the 28 nm FD-SOI technology from STMicroelectronics.
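
    The thesis's WeSPer formula is not reproduced in this abstract, so the following is purely an illustrative placeholder showing the kind of slack-weighted coverage score described: paths with little functional slack weigh most, and defects too small to matter functionally are excluded so that overtesting is not rewarded. All names, thresholds, and the weighting are assumptions.

        def wesper_like(path_delays, func_period, test_period, dmax):
            """Hypothetical slack-weighted SDD coverage score (0-100)."""
            total = covered = 0.0
            for d in path_delays:
                func_slack = func_period - d   # a defect must exceed this to fail
                test_slack = test_period - d   # the test exposes defects above this
                weight = 1.0 / max(func_slack, 1e-9)       # long paths weigh most
                relevant = max(dmax - func_slack, 0.0)     # defect sizes that matter
                caught = max(dmax - max(test_slack, func_slack), 0.0)
                total += weight * relevant
                covered += weight * caught
            return 100.0 * covered / total if total else 0.0

        # Three paths (delays in ns), functional and test periods, max defect size.
        print(wesper_like([0.8, 0.9, 0.6], func_period=1.0, test_period=0.95, dmax=0.3))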

    Decentralized Control of Unbalanced Hybrid AC/DC Microgrids for Maximum Loadability

    To realize a hybrid AC/DC microgrid, an interlinking converter (IC) is required to couple the AC microgrid (AMG) with the DC microgrid (DMG). While this configuration offers the merits of bidirectional support and cooperation between the two grids, which can improve the performance of the entire system, the ICs are often left underutilized. The IC is an AC/DC converter and thus has broad potential for AC-side support without any practical power cost on either side. This potential can be unleashed simply through enhanced control and coordination schemes, without physical or hardware modification. The IC can provide different modes of support, such as reactive power support and unbalanced power mitigation, countering conditions that would otherwise reduce the loadability of the system, specifically on the AC side. Several techniques have been demonstrated in the literature for using the IC to maximize the loadability of the AC side. However, all these methods are centralized or require extraneous hardware. The centralized approach, while more comprehensive, has several disadvantages, chief among them complexity, which in turn impacts speed, cost, and reliability. This thesis develops a communication-free, decentralized method that achieves comparable results. The three-phase voltages and currents are decomposed into positive- and negative-sequence components. By controlling these sequences, the ICs can relieve the unbalanced power from the distributed generators on the AC side, which helps maximize their loadability. Furthermore, using a decentralized coordination method, the burden of the unbalanced power can be shared fairly among multiple ICs.
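
    The sequence decomposition itself is the standard Fortescue transform; a minimal sketch follows, with the example phasor values assumed for illustration:

        import numpy as np

        a = np.exp(2j * np.pi / 3)   # 120-degree rotation operator

        def sequences(va, vb, vc):
            """Zero, positive, and negative sequence components of three phasors."""
            v0 = (va + vb + vc) / 3
            vp = (va + a * vb + a**2 * vc) / 3
            vn = (va + a**2 * vb + a * vc) / 3   # what the IC must suppress
            return v0, vp, vn

        # Slightly unbalanced phase voltages (per-unit phasors).
        va, vb, vc = 1.0 + 0j, 0.9 * a**2, 1.1 * a
        v0, vp, vn = sequences(va, vb, vc)
        print(abs(vp), abs(vn))   # a nonzero |vn| quantifies the unbalance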

    Design and Implementation of Integrated High Efficiency Low-voltage CMOS DC-DC Converters

    Today, battery-powered portable devices are used in many applications. In applications such as biomedical implants, battery life is a major concern. Since replacing the battery of an implant requires a surgical procedure, a long battery life is a goal all implants try to achieve. This is normally done by reducing the power dissipation of the implant's circuitry. One of the various techniques for reducing the power consumption of CMOS circuitry is dynamic voltage scaling (DVS). By reducing the supply voltage, the overall power consumption of the circuits can be decreased. This technique cannot be implemented without power management blocks, so the use of DC-DC converters becomes a must for saving battery power. The overall power reduction can be improved by introducing high-efficiency DC-DC converters. Moreover, to provide patients with the most comfort, small integrated circuits should be used in applications such as biomedical implants. The challenging aspect of designing integrated DC-DC converters is keeping the efficiency high while providing an adjustable output voltage. Additionally, devices such as electronic implants move in and out of stand-by mode to reduce power consumption; from the perspective of the DC-DC converter, the output load power varies with the implant's mode of operation. This adds the further challenge of sustaining a high DC-DC conversion efficiency under varying loading conditions; at very light loads, preserving a high conversion efficiency is particularly difficult. In this master's thesis, a detailed design of a high-efficiency, low-voltage, fully integrated DC-DC converter is presented. A novel structure for a fully integrated switched-capacitor (SC) DC-DC converter with asynchronous control is proposed. The efficiency of the converter is kept high by adjusting the converter topology and operating frequency according to the loading conditions. The proposed SC DC-DC converter uses three different topologies to achieve three different conversion ratios; by doing so, it maintains high conversion efficiency at various output voltage levels. An adaptive operating frequency is also used by the asynchronous control to reduce efficiency losses under various loading conditions.
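
    In spirit (not in circuit detail), the control policy can be pictured as below; the conversion ratios, thresholds, and frequency range are invented for illustration, since the abstract does not give the actual values.

        RATIOS = (1/3, 1/2, 2/3)   # three SC topologies -> three conversion ratios

        def select_topology(v_in, v_out_target):
            """Pick the smallest ratio whose ideal output still reaches the target."""
            for r in RATIOS:
                if r * v_in >= v_out_target:
                    return r
            return RATIOS[-1]

        def switching_freq(load_ma, f_min=50e3, f_max=5e6, full_load_ma=100):
            """Scale switching frequency with load current: the charge moved per
            cycle is roughly fixed, so a lighter load needs fewer cycles,
            cutting switching losses at light load."""
            return f_min + (f_max - f_min) * min(load_ma / full_load_ma, 1.0)

        print(select_topology(1.2, 0.55), switching_freq(3.0))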

    A Comparative Study of AHP and Fuzzy AHP Method for Inconsistent Data

    In many cases of decision analysis, two popular methods are used: the Analytical Hierarchy Process (AHP) and fuzzy-based AHP (Fuzzy AHP). Both methods deal with stochastic data and can determine a decision result through a Multi-Criteria Decision Making (MCDM) process. The two methods do not produce the same values even when the same set of data is fed into them. In this research work, we observe the similarities and dissimilarities between the two methods' outputs. For the same set of inconsistent input data, both methods show nearly the same trends and fluctuations in their outputs, and the up-and-down fluctuations of the two methods' outputs coincide in fifty percent of cases.
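
    The inconsistency the study feeds to both methods can be quantified with Saaty's standard consistency ratio; the sketch below computes AHP priority weights and flags a deliberately inconsistent pairwise-comparison matrix (the matrix values are an invented example):

        import numpy as np

        # Saaty's random indices for matrices of order 1..7.
        RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}

        def ahp_weights_and_cr(m):
            """Priority vector and consistency ratio; CR > 0.1 means inconsistent."""
            vals, vecs = np.linalg.eig(m)
            i = np.argmax(vals.real)
            w = np.abs(vecs[:, i].real)
            w /= w.sum()                              # priority weights
            n = m.shape[0]
            ci = (vals[i].real - n) / (n - 1)         # consistency index
            return w, ci / RI[n]

        m = np.array([[1, 3, 1/5],
                      [1/3, 1, 4],
                      [5, 1/4, 1]])                   # violates transitivity badly
        w, cr = ahp_weights_and_cr(m)
        print(w, cr)   # cr well above 0.1 flags the inconsistency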