10 research outputs found

    Efficient resources assignment schemes for clustered multithreaded processors

    New feature sizes provide a larger number of transistors per chip that architects can use to further exploit instruction level parallelism. However, these technologies also bring new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is yielding diminishing returns, so other sources of parallelism, such as thread level parallelism, must be exploited to keep raising performance at a reasonable hardware complexity. On the other hand, clustered architectures have been widely studied as a way to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand why conventional SMT resource assignment schemes are not as effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that achieves an average speedup of 17.6% over Icount while improving fairness by 24%.
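
    Since the proposed scheme is measured against Icount, a minimal sketch of that baseline fetch policy may help; this is a simplification under assumed names (in_flight, stalled), not the paper's clustered scheme.

    # Icount-style SMT fetch heuristic: give fetch priority to the thread
    # with the fewest instructions in flight in the front end.
    def icount_pick(in_flight, stalled):
        candidates = {t: n for t, n in in_flight.items() if t not in stalled}
        if not candidates:
            return None  # every thread is stalled this cycle
        return min(candidates, key=candidates.get)

    # Thread 1 holds the fewest in-flight instructions, so it fetches next.
    print(icount_pick({0: 12, 1: 3, 2: 7}, stalled={2}))  # -> 1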

    Why area might reduce power in nanoscale CMOS

    In this paper we explore the relationship between power and area. By exploiting parallelism (and thus using more area) one can reduce the switching frequency, allowing a reduction in VDD, which in turn reduces power. Under a scaling regime that allows the threshold voltage to increase as VDD decreases, we find that dynamic and subthreshold power loss in CMOS exhibit a dependence on area proportional to A^((s-3)/s), while gate leakage power ∝ A^((s-6)/s) and short-circuit power ∝ A^((s-8)/s). Thus, with the large number of devices at our disposal we can exploit techniques such as spatial computing, tailoring the program directly to the hardware, to overcome the negative effects of scaling. The value of s describes the effectiveness of the technique for a particular circuit and/or algorithm: for circuits that exhibit a value of s ≤ 3, power will be a constant or decreasing function of area. We briefly speculate on how s might be influenced by a move to nanoscale technology.
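
    To make the scaling relations above concrete, here is a small numeric sketch (our own illustration; the normalization to A = 1 is an assumption):

    # Relative power components versus area A for a given scaling value s,
    # following the abstract's relations A**((s-3)/s), A**((s-6)/s), A**((s-8)/s).
    def power_vs_area(A, s):
        return {
            "dynamic+subthreshold": A ** ((s - 3) / s),
            "gate_leakage": A ** ((s - 6) / s),
            "short_circuit": A ** ((s - 8) / s),
        }

    # At s = 3 the dynamic term is flat in area while both leakage terms fall,
    # illustrating why more area can mean less power.
    for A in (1, 4, 16):
        print(A, power_vs_area(A, s=3))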

    Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs)

    The general-purpose computing processor performs a wide range of functions. Although the performance of general-purpose processors has been steadily increasing, software domains such as multimedia and digital signal processing demand ever more computing power. Reconfigurable computing has emerged to combine the versatility of general-purpose processors with the customization ability of ASICs. The basic premise of reconfigurability is to provide better performance and higher computing density than fixed-configuration processors. Most research in reconfigurable computing is dedicated to on-chip functional logic. If computing resources are adaptable to the computing requirements, maximum performance can be achieved. To overcome the gap between processor and memory technology, the size of on-chip cache memory has been consistently increasing. A larger cache capacity, though beneficial in general, does not guarantee higher performance for all applications, as they may not utilize all of the cache efficiently. To utilize on-chip resources effectively, and specifically to accelerate multimedia applications, we propose a new architecture: Adaptive Balanced Computing (ABC). ABC dynamically configures on-chip cache memory by integrating Reconfigurable Functional Caches (RFC). An RFC can work as a conventional cache or as a specialized computing unit when necessary. To convert a cache memory into a computing unit, we include additional logic that embeds multi-bit-output LUTs into the cache structure. We add this reconfigurability of cache memory to a conventional processor with minimal modification to the load/store microarchitecture and minimal compiler assistance. The ABC architecture utilizes resources more efficiently by reconfiguring cache memory into computing units dynamically. The area penalty for this reconfiguration is about 50-60% of the area of a memory-cell-array-only cache, with a faster cache access time. For a base array cache (a parallel-decoding cache), the penalty is 10-20% of the data array area with a 1-2% increase in cache access time. However, compared with implementing these units separately (as in ASICs), we save 27% of area for FIR and 44% for DCT/IDCT relative to the memory-cell-array cache, and about 80% for both applications relative to the base array cache. Simulations with multimedia and DSP applications (DCT/IDCT and FIR/IIR) show that resource configuration with the RFC achieves speedups ranging from 1.04x to 3.94x for whole applications and from 2.61x to 27.4x for the core computations. Simulations with various parameters indicate that the impact of reconfiguration can be minimized if an appropriate cache organization is selected.
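
    The core RFC idea, computing in a cache array by treating its lines as lookup-table contents, can be illustrated with a toy example (the table size and FIR-tap framing are our own, not the paper's microarchitecture):

    # A 256-entry table standing in for a cache array reconfigured as a
    # multi-bit-output LUT: one FIR tap becomes a lookup instead of a multiply.
    COEFF = 7
    lut = [COEFF * x for x in range(256)]  # results preloaded into the "cache"

    def fir_tap(sample):
        return lut[sample & 0xFF]  # 8-bit operand indexes the LUT

    print([fir_tap(s) for s in (3, 200, 17)])  # -> [21, 1400, 119]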

    Instruction scheduling in micronet-based asynchronous ILP processors


    Exact and approximation algorithms for scheduling and placement problems

    In this thesis we address several combinatorial optimization problems, treated in two parts. First, we study optimization problems arising from scheduling a set of tasks on computing machines, where we seek to minimize the total energy consumed by these machines while maintaining an acceptable quality of service. Second, we address two classical optimization problems: a scheduling problem on parallel machines with communication delays, and a data placement problem in graphs modeling peer-to-peer networks, where the goal is to minimize the total cost of data access.
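
    One classical model behind such energy-aware scheduling (our assumption; the thesis's exact formulation may differ) is dynamic speed scaling, where power grows as speed**alpha, so spreading work across machines lets each run slower and saves energy:

    ALPHA = 3  # conventional CMOS dynamic-power exponent

    # Energy to finish W units of work on m machines by deadline T when each
    # machine runs at the minimum speed that meets the deadline.
    def total_energy(W, m, T, alpha=ALPHA):
        speed = W / (m * T)
        return m * (speed ** alpha) * T

    # Doubling the machine count divides energy by four at alpha = 3.
    print(total_energy(100, 1, 10), total_energy(100, 2, 10))  # 10000.0 2500.0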

    Design of a robust architecture for acquiring physical quantities in a critical aeronautical system: application to measuring the temperature, pressure, torque, and speed of a turbomachine

    The acquisition of physical parameters such as temperature, pressure, torque, and speed is necessary for flight-critical systems to reach and maintain the required levels of safety and availability. It therefore calls for highly proven technologies and techniques able to operate in harsh environments. The aim of our work is to design a new sensor-acquisition architecture to be integrated into a flight-critical system. The goal of the architecture is to ensure data integrity while maintaining the availability and safety levels expected of airborne critical systems. The solution adds fault tolerance to the signal-conditioning chain. To do so, we implement additional functionalities, such as a mathematical model of the signal-conditioning chain, making the acquisition system more intelligent. Our research is partially based on the technical specifications of the SYRENA project, a typical example of the flight-critical systems that are the focus of this work.
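
    A minimal sketch of the model-based check the abstract alludes to: compare each measurement against a mathematical model of the conditioning chain and flag it when the residual exceeds a tolerance (the linear model and all names here are hypothetical, not the SYRENA design):

    # Hypothetical linear model of the signal-conditioning chain.
    def conditioned(raw, gain=2.0, offset=0.5):
        return gain * raw + offset

    # A measurement is accepted only if it agrees with the model prediction.
    def plausible(raw, measured, tol=0.1):
        return abs(measured - conditioned(raw)) <= tol

    print(plausible(1.0, 2.52))  # True: close to the predicted 2.5
    print(plausible(1.0, 3.40))  # False: likely a sensor or conditioning fault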

    A hardware-software codesign framework for cellular computing

    Until recently, the ever-increasing demand for computing power has been met on one hand by increasing the operating frequency of processors and on the other by designing architectures capable of exploiting instruction-level parallelism through hardware mechanisms such as superscalar execution. However, both approaches seem to have reached a plateau, mainly due to issues of design complexity and cost-effectiveness. Faced with the stagnating performance of single-threaded processors, the current trend in processor design favors a switch to coarser-grain parallelization, typically at the thread level. In other words, high computational power is achieved not by a single, very fast and very complex processor, but through the parallel operation of several processors, each executing a different thread. Extrapolating this trend to the vast amount of on-chip hardware resources that will be available in the next few decades (either through further shrinkage of silicon fabrication processes or through the introduction of molecular-scale devices), together with the predicted features of such devices (e.g., the impossibility of global synchronization and higher failure rates), it seems reasonable to foresee that current design techniques will not cope with the requirements of next-generation electronic devices and that novel design tools and programming methods will have to be devised. A tempting source of inspiration for solving the problems implied by a massively parallel organization and inherently error-prone substrates is biology. Living beings possess characteristics, such as robustness to damage and self-organization, that previous research has shown to be worth implementing in hardware; it has been possible, for instance, to realize relatively simple systems such as a self-repairing watch. Overall, these bio-inspired approaches seem very promising, but their appeal to a wider audience is limited because their heavily hardware-oriented designs lack some of the flexibility achievable with a general-purpose processor. In this thesis, we introduce a processor-grade processing element at the heart of a bio-inspired hardware system. This processor, based on a single instruction, features key properties that allow it to retain the versatility required to implement bio-inspired mechanisms while supporting general computation. We also demonstrate that the flexibility of such a processor enables it to be evolved so it can be tailored to different types of applications. In the second half of the thesis, we analyze how a large number of these processors can be implemented on a hardware platform to explore various bio-inspired mechanisms. Built on an extensible platform of many FPGAs configured as a networked structure of processors, the hardware part of this computing framework is backed by an open library of software components that provides primitives for efficient inter-processor communication and distributed computation. We show that this dual software-hardware approach allows very quick exploration of different ways to solve computational problems using bio-inspired techniques, and that its flexibility allows it to exploit replication as a solution to issues that concern standard embedded applications.
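
    The communication primitives such a library might provide can be sketched as queue-based message passing between processing elements (the Cell class and its API are invented for illustration, not the thesis's library):

    import queue

    # One processing element in a networked grid of simple processors.
    class Cell:
        def __init__(self):
            self.inbox = queue.Queue()

        def send(self, other, msg):
            other.inbox.put(msg)  # deliver into the neighbor's inbox

        def recv(self):
            return self.inbox.get()  # block until a message arrives

    a, b = Cell(), Cell()
    a.send(b, "state-update")
    print(b.recv())  # -> state-update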