
    Improving the performance of asymmetric multicore processors by transactions’ migration and the adaptation of the cache subsystem

    Existing trends in computer design aim to raise the performance of computer systems to the highest possible level in order to meet the need to process large amounts of data. Attention is focused on the design of the processor as the main actor in data processing. The improvement trend in processor performance predicted by Moore’s Law has been slowing recently due to the physical limitations of semiconductor technology, and increasing performance is becoming harder and harder. Various techniques attempt to compensate for this problem by improving performance without increasing transistor count and power consumption. This thesis considers asymmetric multicore processors with support for transactional memory and proposes two new techniques to increase their performance. The first technique aims to reduce transaction congestion under high parallelism by migrating transactions to a faster core. The transactions that contribute the most to congestion are selected for migration; executing them on a faster core reduces their chances of conflicting with other transactions and thus increases the chance of avoiding congestion. The second technique adjusts the cache subsystem to reduce cache access latency and the chances of false conflicts while reducing the number of transistors required to implement the cache, which can be achieved by using small and simple caches. Detailed implementation proposals are given for both techniques. Prototypes of both were built in the Gem5 simulator, which models the processor system in detail, and evaluated by simulating a large number of applications from a standard benchmark suite for transactional memory. The analysis of the simulation results yielded guidelines on how and when the proposed techniques should be used.
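The migration heuristic described above can be sketched in a few lines: rank transactions by how much contention they generate and move the worst offenders to the faster cores. The class and function names below are illustrative assumptions; the thesis targets hardware transactional memory modeled in Gem5, not Python objects.

```python
# Hedged sketch of congestion-driven transaction migration.
# Txn, contention_score and pick_migration_candidates are illustrative
# names, not the thesis's actual mechanism.

from dataclasses import dataclass

@dataclass
class Txn:
    tid: int
    aborts: int = 0             # times this transaction was rolled back
    conflicts_caused: int = 0   # aborts it inflicted on other transactions

def contention_score(t: Txn) -> int:
    # A transaction contributes to congestion both by aborting itself and
    # by forcing others to abort; this sketch weighs both equally.
    return t.aborts + t.conflicts_caused

def pick_migration_candidates(txns, big_cores_free):
    # Migrate the worst offenders, at most one per free fast core, and
    # only transactions that have actually caused contention.
    ranked = sorted(txns, key=contention_score, reverse=True)
    return [t.tid for t in ranked[:big_cores_free] if contention_score(t) > 0]
```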

    Advanced satellite communication system

    The objective of this research program was to develop an innovative advanced satellite receiver/demodulator utilizing a surface acoustic wave (SAW) chirp transform processor and coherent BPSK demodulation. The algorithm of this SAW chirp Fourier transformer is of the Convolve-Multiply-Convolve (CMC) type, utilizing off-the-shelf reflective array compressor (RAC) chirp filters. This satellite receiver, if fully developed, was intended to be used as an on-board multichannel communications repeater. The Advanced Communications Receiver consists of four units: (1) the CMC processor, (2) the single sideband modulator, (3) the demodulator, and (4) the chirp waveform generator and individual channel processors. The input signal is composed of multiple user transmission frequencies operating independently from remotely located ground terminals. This signal is Fourier transformed by the CMC processor into a unique time slot for each user frequency. The CMC processor is driven by a waveform generator through a single sideband (SSB) modulator. The output of the coherent demodulator is composed of positive and negative pulses, which are the envelopes of the chirp transform processor output; these pulses correspond to the data symbols. Following the demodulator, a logic circuit reconstructs the pulses into data, which are subsequently differentially decoded to form the transmitted data. The coherent demodulation and detection of BPSK signals derived from a CMC chirp transform processor were experimentally demonstrated, and bit error rate (BER) testing was performed. To assess the feasibility of such an advanced receiver, the results were compared with the theoretical analysis and plotted as average BER as a function of signal-to-noise ratio. Another goal of this SBIR program was the development of a commercial product; the product developed was an arbitrary waveform generator, and successful sales began with the delivery of the first arbitrary waveform generator.
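The chirp Fourier transform underlying such processors rests on the identity 2kn = k² + n² − (k−n)², which turns a DFT into chirp multiplications around a convolution with the conjugate chirp (the Bluestein identity; the hardware realizes the convolution with RAC chirp filters). A minimal software check of the identity, not a model of the SAW hardware:

```python
import cmath

def naive_dft(x):
    # Direct O(N^2) DFT for reference.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def chirp_dft(x):
    # Bluestein form: since -2kn = (k-n)^2 - k^2 - n^2, the DFT becomes
    # premultiply by a chirp, convolve with the conjugate chirp, postmultiply.
    N = len(x)
    w = [cmath.exp(-1j * cmath.pi * n * n / N) for n in range(N)]  # chirp
    a = [x[n] * w[n] for n in range(N)]                            # multiply
    out = []
    for k in range(N):                                             # convolve
        acc = sum(a[n] * cmath.exp(1j * cmath.pi * (k - n) ** 2 / N)
                  for n in range(N))
        out.append(w[k] * acc)                                     # multiply
    return out
```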

    Energy/power consumption model for an embedded processor board

    This dissertation, whose research was conducted in the Group of Electronic and Microelectronic Design (GDEM) within the framework of the project Power Consumption Control in Multimedia Terminals (PCCMUTE), focuses on the development of an energy estimation model for a battery-powered embedded processor board. The main objectives and contributions of the work are summarized as follows. A model is proposed to obtain accurate energy estimates based on the linear correlation between performance monitoring counters (PMCs) and energy consumption. Considering the uniqueness of the appropriate PMCs for each system, the modeling methodology is improved to obtain stable accuracy with only slight variation across multiple scenarios and to be repeatable on other systems. It includes two steps: first, a PMC filter to identify the most suitable set among the available PMCs of a system, and second, k-fold cross-validation to avoid bias during the model training stage. The methodology is implemented on a commercial embedded board running the 2.6.34 Linux kernel and PAPI, a cross-platform interface for configuring and accessing PMCs. The results show that the methodology maintains good stability across different scenarios and provides robust estimates, with an average relative error of less than 5%.
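The two-step methodology outlined above amounts to least-squares fitting plus k-fold cross-validation. A minimal single-counter sketch with synthetic data; `fit_linear` and `kfold_relative_error` are illustrative names, and a real model would use the PMC set selected by the PMC filter rather than one counter:

```python
# Sketch: energy ~ a * pmc + b fitted by least squares, validated with
# k-fold cross-validation (every k-th sample held out per fold).

def fit_linear(xs, ys):
    # Ordinary least squares for one predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def kfold_relative_error(xs, ys, k=5):
    # Average relative error over k held-out folds, so the reported
    # accuracy is not biased by training on the evaluation samples.
    n = len(xs)
    errs = []
    for i in range(k):
        test = set(range(i, n, k))
        tr_x = [x for j, x in enumerate(xs) if j not in test]
        tr_y = [y for j, y in enumerate(ys) if j not in test]
        a, b = fit_linear(tr_x, tr_y)
        errs += [abs((a * xs[j] + b) - ys[j]) / abs(ys[j]) for j in test]
    return sum(errs) / len(errs)
```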

    Circuit simulation using distributed waveform relaxation techniques

    Simulation plays an important role in the design of integrated circuits. Due to the high costs and long delays involved in fabrication, simulation is commonly used to verify functionality and to predict performance before fabrication. This thesis describes the analysis, implementation and performance evaluation of a distributed-memory parallel waveform relaxation technique for the electrical simulation of MOS VLSI circuits. The waveform relaxation technique exhibits inherent parallelism due to the partitioning of a circuit into a number of subcircuits, which can be simulated concurrently on parallel processors. Different forms of parallelism in the direct method and the waveform relaxation technique are studied. Single-queue and distributed-queue approaches to implementing parallel waveform relaxation on distributed-memory machines are analyzed and their performance implications studied. The distributed-queue approach, selected for exploiting the coarse-grain parallelism across subcircuits, is described. Parallel waveform relaxation programs based on the Gauss-Seidel and Gauss-Jacobi techniques are implemented on a network of eight Transputers. Static and dynamic load balancing strategies are studied, and a dynamic load balancing algorithm is developed and implemented. Results of the parallel implementation are analyzed to identify sources of bottlenecks. This thesis demonstrates the applicability of a low-cost distributed-memory multicomputer system to the simulation of MOS VLSI circuits. Speed-up measurements show that a five-fold improvement in the speed of calculations can be achieved using a full-window parallel Gauss-Jacobi waveform relaxation algorithm. Analysis of overheads shows that load imbalance is the major source of overhead and that the fraction of the computation which must be performed sequentially is very low. Communication overhead depends on the nature of the parallel architecture and the design of the communication mechanisms. The run-time environment (parallel processing framework) developed in this research exploits features of the Transputer architecture to reduce the effect of communication overhead by overlapping computation with communication and by running communication processes at a higher priority. This research will contribute to the development of low-cost, high-performance workstations for computer-aided design and analysis of VLSI circuits.
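The Gauss-Jacobi variant described above can be sketched on a toy two-node circuit: each subcircuit integrates its own equation over the whole time window while reading the other subcircuit's waveform from the previous relaxation iteration. The coupled first-order equations and all constants below are assumptions chosen for illustration, not circuits from the thesis:

```python
# Gauss-Jacobi waveform relaxation on dv1/dt = -v1 + 0.5*v2,
# dv2/dt = -v2 + 0.5*v1 (a stand-in for two partitioned subcircuits),
# integrated with forward Euler over a fixed window until the
# waveforms stop changing between relaxation iterations.

def relax(steps=200, dt=0.01, tol=1e-9, max_iter=100):
    v1 = [1.0] * (steps + 1)   # initial guess: hold the initial condition
    v2 = [0.0] * (steps + 1)
    for it in range(max_iter):
        # Each "subcircuit" sweeps the whole window using the OTHER
        # node's waveform from the previous iteration (Jacobi style;
        # Gauss-Seidel would use the freshly computed waveform).
        n1, n2 = [1.0], [0.0]
        for t in range(steps):
            n1.append(n1[-1] + dt * (-n1[-1] + 0.5 * v2[t]))
            n2.append(n2[-1] + dt * (-n2[-1] + 0.5 * v1[t]))
        delta = max(max(abs(a - b) for a, b in zip(n1, v1)),
                    max(abs(a - b) for a, b in zip(n2, v2)))
        v1, v2 = n1, n2
        if delta < tol:
            return v1, v2, it + 1
    return v1, v2, max_iter
```

In the Jacobi form both window sweeps are independent, which is exactly the coarse-grain parallelism the thesis distributes across Transputers.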

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Embedded software development has changed recently with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, embedded and mobile devices are now designed as platforms on which multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, the traditional constraints of embedded systems design (i.e., performance, memory and energy consumption, and real-time guarantees) are more difficult to address, and new concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues through services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to its performance, memory and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems that addresses these challenges with novel techniques. To reduce DBT overhead, StrataX loads code from NAND flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has access latency similar to a hardware cache but consumes less power and chip area. StrataX manages the SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and to capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code between the SPM and main memory. With these techniques, StrataX achieves low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications, without any special hardware support. StrataX achieves sufficiently low overhead to make DBT feasible in embedded systems as a way to address important design goals and requirements.
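The SPM management scheme described above (a software instruction cache with pinning and victim compression) can be approximated by a small sketch. The class and method names, and the use of `zlib` for victim compression, are assumptions for illustration; StrataX manages translated machine code, not Python byte strings:

```python
# Sketch of an SPM managed as a software instruction cache: LRU
# eviction of unpinned blocks, with evicted code kept compressed in a
# victim list so it can be restored more cheaply than retranslating.

import zlib
from collections import OrderedDict

class SpmCache:
    def __init__(self, capacity):
        self.capacity = capacity     # max resident code blocks in the SPM
        self.spm = OrderedDict()     # addr -> code bytes, in LRU order
        self.pinned = set()          # hot blocks that are never evicted
        self.victims = {}            # addr -> compressed evicted code

    def pin(self, addr):
        self.pinned.add(addr)

    def fetch(self, addr, translate):
        if addr in self.spm:                      # hit: refresh LRU position
            self.spm.move_to_end(addr)
            return self.spm[addr]
        if addr in self.victims:                  # cheaper than retranslation
            code = zlib.decompress(self.victims.pop(addr))
        else:
            code = translate(addr)                # full (expensive) translation
        if len(self.spm) >= self.capacity:        # evict oldest unpinned block
            victim = next((a for a in self.spm if a not in self.pinned), None)
            if victim is not None:
                self.victims[victim] = zlib.compress(self.spm.pop(victim))
        self.spm[addr] = code
        return code
```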

    Performance of sequential batching-based methods of output data analysis in distributed steady-state stochastic simulation

    We investigated the feasibility of sequential methods of analysis of stochastic simulation under an environment of Multiple Replications in Parallel (MRIP). The main idea is twofold: the automation of statistical control and the speedup of simulation experiments. The methods of analysis suggested in the literature were conceived for a single-processor environment, and very little is known about the application of procedures based on such methods under MRIP. At first glance the two goals are in opposition, since one needs a large number of observations to achieve good quality of the results, i.e., the simulation frequently takes a long time. However, a careful design together with a robust simulation tool based on independent replications can produce an efficient instrument for analyzing simulation results. This research began with a sequential version of the classical method of Nonoverlapping Batch Means (NOBM). Although intuitive and popular, under high traffic intensity NOBM offers no good solution to the problem of strong correlation among the observations, and it is not worthwhile to apply more computing power to diminish this negative effect. We confirmed this claim by means of a detailed and exhaustive analysis of four queueing systems. Therefore, we designed sequential versions of several Batch Means variants and investigated their statistical properties under MRIP. Among the implemented procedures one is very attractive: Overlapping Batch Means (OBM). OBM makes better use of the collected data, since each observation initiates a new (overlapped) batch; that is, the number of batches is much larger, which yields a smaller variance. In this case MRIP is highly recommended, since the combination requires fewer observations and therefore achieves higher speedup. During the research we also investigated a class of methods based on Standardized Time Series (STS), which theoretically produces better asymptotic results than NOBM. The undesired effect of STS is the larger number of observations it requires compared to NOBM, but that is no obstacle when STS is applied together with MRIP; the experimental investigation confirmed this hypothesis. The next phase was to tune OBM and STS to work with the largest possible number of processors. A case study showed that both procedures are suitable for the MRIP environment.
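The two estimators at the heart of the comparison can be stated compactly: NOBM averages k = n/m disjoint batches, while OBM starts a batch at every observation and so forms n − m + 1 overlapping batches from the same data. The formulas below follow the standard textbook definitions (the OBM scaling is Meketon and Schmeiser's), not code from the dissertation:

```python
# Batch-means estimators of the variance of the grand sample mean.

def nobm_var(xs, m):
    # Nonoverlapping batch means: k = n // m disjoint batches of size m.
    k = len(xs) // m
    means = [sum(xs[i * m:(i + 1) * m]) / m for i in range(k)]
    g = sum(means) / k
    # Sample variance of batch means, scaled to the variance of the mean.
    return sum((b - g) ** 2 for b in means) / (k * (k - 1))

def obm_var(xs, m):
    # Overlapping batch means: n - m + 1 batches, one per starting index,
    # so far more batches are formed from the same observations.
    n = len(xs)
    means = [sum(xs[i:i + m]) / m for i in range(n - m + 1)]
    g = sum(xs) / n
    s = sum((b - g) ** 2 for b in means)
    # Meketon-Schmeiser scaling, divided by n to estimate Var(mean).
    return (m / ((n - m + 1) * (n - m))) * s
```

For independent data both estimators agree with the classical s²/n; the advantage of OBM appears as a smaller variance of the estimator itself, which is why it pairs well with MRIP.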