
    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore, with multiple memory/processor nodes on each chip, and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real-time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
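    The message-driven, multithreaded, split-transaction style the abstract describes can be illustrated with a small sketch. The following Python code is an illustrative analogy only; all names (Parcel, Node, remote_read) are invented here rather than taken from MIND. Each node's incoming messages spawn lightweight threads, and a remote read is split into a request and a later reply so the requester never stalls.

```python
# Hedged sketch of message-driven, split-transaction execution in the
# style the MIND abstract describes. All names are illustrative
# inventions, not the actual MIND interfaces.
import queue
import threading

class Parcel:
    """A message carrying an action to run at a destination node."""
    def __init__(self, action, *args):
        self.action, self.args = action, args

class Node:
    """One memory/processor node: each parcel arriving at its queue
    spawns a short-lived thread, so no request blocks the node."""
    def __init__(self, name):
        self.name = name
        self.memory = {}
        self.inbox = queue.Queue()
        threading.Thread(target=self._dispatch, daemon=True).start()

    def _dispatch(self):
        while True:
            parcel = self.inbox.get()
            threading.Thread(target=parcel.action, args=parcel.args).start()

    def remote_read(self, other, addr, on_reply):
        # Split transaction: send a read request and return immediately;
        # the reply arrives later as a separate parcel.
        def serve():
            value = other.memory.get(addr)
            self.inbox.put(Parcel(on_reply, value))
        other.inbox.put(Parcel(serve))

a, b = Node("A"), Node("B")
b.memory[0x10] = 42
done = threading.Event()

def handle(value):
    print("reply:", value)  # continuation runs when the reply parcel lands
    done.set()

a.remote_read(b, 0x10, handle)
done.wait()
```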

    Empirical and Statistical Application Modeling Using On-Chip Performance Monitors

    To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive performance-monitoring hardware, only recently available on microprocessors, to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool that automates the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical, and hybrid methods are discussed and explained, with case-study results provided. These methods add to the wealth of tools available to programmers and architects for understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model quantifies the hierarchical memory performance of applications by inferring the latencies incurred by codes after the effect of latency-hiding techniques is realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide further methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters across platforms, making the modeling process easier still. These methods provide alternatives to performance modeling and categorization not previously available, in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.
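    As one hedged illustration of the kind of empirical memory model described above, the sketch below infers an effective memory latency from counter readings by weighting each hierarchy level's nominal latency by the fraction of loads it served. The counter names and per-level cycle counts are assumptions for the example, not values from the paper.

```python
# Hedged sketch of an empirical memory model in the spirit of the
# abstract: infer the effective (post-latency-hiding) memory latency of
# a code from hardware counter readings. The counter names and nominal
# latencies below are illustrative assumptions, not measured values.

def effective_memory_latency(counters, nominal_cycles):
    """Weight each memory-hierarchy level's nominal latency by the
    fraction of loads it served, per the counter readings."""
    total = counters["loads"]
    hits = {
        "l1": counters["loads"] - counters["l1_misses"],
        "l2": counters["l1_misses"] - counters["l2_misses"],
        "mem": counters["l2_misses"],
    }
    return sum(hits[lvl] / total * nominal_cycles[lvl] for lvl in hits)

counters = {"loads": 1_000_000, "l1_misses": 50_000, "l2_misses": 5_000}
nominal = {"l1": 2, "l2": 12, "mem": 150}   # assumed cycles per level
print(f"effective latency: {effective_memory_latency(counters, nominal):.2f} cycles")
```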

    Description and Optimization of Abstract Machines in a Dialect of Prolog

    In order to achieve competitive performance, abstract machines for Prolog and related languages end up being large and intricate, and incorporate sophisticated optimizations, both at the design and at the implementation levels. At the same time, efficiency considerations make it necessary to use low-level languages in their implementation. This makes them laborious to code, optimize, and, especially, maintain and extend. Writing the abstract machine (and ancillary code) in a higher-level language can help tame this inherent complexity. We show how the semantics of most basic components of an efficient virtual machine for Prolog can be described using (a variant of) Prolog. These descriptions are then compiled to C and assembled to build a complete bytecode emulator. Thanks to the high level of the language used and its closeness to Prolog, the abstract machine description can be manipulated using standard Prolog compilation and optimization techniques with relative ease. We also show how, by applying program transformations selectively, we obtain abstract machine implementations whose performance can match and even exceed that of state-of-the-art, highly tuned, hand-crafted emulators. Comment: 56 pages, 46 figures, 5 tables. To appear in Theory and Practice of Logic Programming (TPLP).
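    The core idea, describing instruction semantics in a high-level language and deriving an emulator from those descriptions, can be sketched compactly. The sketch below uses Python rather than the paper's Prolog dialect, and its three-instruction bytecode is invented for illustration.

```python
# Hedged sketch (in Python rather than the paper's Prolog dialect) of
# describing instruction semantics at a high level and deriving an
# emulator from those descriptions. The instruction set is invented.
SEMANTICS = {}

def instruction(name):
    """Register a high-level description of one bytecode instruction."""
    def register(fn):
        SEMANTICS[name] = fn
        return fn
    return register

@instruction("push")
def _push(vm, arg):
    vm["stack"].append(arg)

@instruction("add")
def _add(vm, _):
    b, a = vm["stack"].pop(), vm["stack"].pop()
    vm["stack"].append(a + b)

@instruction("print")
def _print(vm, _):
    print(vm["stack"][-1])

def run(bytecode):
    """The emulator is just a dispatch loop over SEMANTICS; a compiler
    could instead emit C from each registered body."""
    vm = {"stack": []}
    for op, arg in bytecode:
        SEMANTICS[op](vm, arg)

run([("push", 2), ("push", 3), ("add", None), ("print", None)])  # prints 5
```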

    Immutable data types in concurrent programming on the basis of the Clojure language

    Concurrent programming addresses problems in which different resources must be shared between threads. In the simplest case this means sharing processor time, but modern multi-core processors add a new dimension in which access to shared memory becomes the central problem. This work concludes that concurrent programming in Java inherits most of its problems from the direct incorporation of the shared-memory model. Because Java allows shared memory to be accessed without properly applied mutual exclusion, it can produce hard-to-detect software bugs. Moreover, Java's lock-based mutual exclusion can introduce additional hard-to-detect problems; most notoriously, a program can deadlock when lock acquisition is not correctly ordered. Locks also make it difficult to compose separate thread-safe atomic operations into a new atomic operation, because combining multiple method calls requires additional, complex synchronization. The main focus of this work is on the innovations provided by the Clojure language. Clojure, a Lisp-inspired functional language implemented on top of the Java platform, introduces a more restricted rule set for sharing data between threads. Most importantly, all data sharing must be expressed explicitly, which arguably reduces the set of possible programming errors. Clojure offers two methods for data sharing: asynchronously through agents, or synchronously through either software transactional memory (STM), for operations that must update multiple values in one atomic step, or simpler atomic updates when only a single value is shared. Clojure's STM provides a syntactically simple way to combine several separate atomic operations into a new atomic operation by wrapping them in a transaction, and it can reduce programming errors further by verifying at runtime that no updates are performed outside a transaction. It was concluded that, regardless of the above, Clojure does not free the programmer from correctly identifying the set of operations that should execute atomically.
    Clojure's approach to concurrent programming relies heavily on immutable data structures. Immutability lets complex data structures be treated as simple values whose state cannot change outside the control of the reference holder, so it is important that Clojure provide a rich set of data structures that follow these principles. One such structure is the Persistent Vector from the Clojure collections library, whose internal workings were explored in this thesis. In summary, it is a bit-mapped trie with a high branching factor that defers additions to the end of the vector by collecting new elements in a tail buffer before pushing them into the trie as a whole. A Persistent Vector shares the bulk of its internal structure with its previous versions, making it an efficient immutable data structure. Its performance was evaluated against Java's ArrayList: addition to the end of the list and iteration perform comparably, while update by index is two orders of magnitude slower. The performance of lookup by index could not be conclusively determined, probably because JIT compilation made ArrayList index lookups difficult to measure reliably; the measurements nevertheless suggest that the index-lookup difference between Persistent Vector and ArrayList is similar to the update-by-index difference. Several performance improvements were also evaluated, and it was concluded that addition performance could be roughly doubled by using an additional thread-confined flag that allows the tail buffer to be shared between versions. Arguably, the relatively good addition and iteration performance makes Persistent Vector suitable for a range of practical problems; for example, it could hold a list of records loaded from a database while a web page is built by iterating over it. Because properly benchmarking parallel operations is hard, such tests were not included in this work; measuring the performance of sharing a persistent vector between multiple threads is suggested as future work. In summary, Clojure shows that concurrent programming can be made relatively safer when a set of design principles is changed, and it can be argued that the difficulty of concurrent programming in Java will not improve unless its memory-access principles are considerably reevaluated.

    Improving Microcontroller and Computer Architecture Education through Software Simulation

    In this thesis, we aim to improve the outcomes of students learning Computer Architecture and Embedded Systems topics within Software and Computer Engineering programs. We develop a processor simulation that attempts to improve the visibility of hardware within the simulation environment and to replace existing solutions in use within the classroom. We designate a series of requirements for a successful simulation suite based on current state-of-the-art simulations within the literature. Given these requirements, we build a quantitative rating of the same set of simulations, and we also rate our previously implemented tool, hc12sim, against current solutions. Using the gaps in implementations from our state-of-the-art survey, we develop two solutions. First, we develop a web-based solution using the Scala.js compiler for Scala with an event-driven simulation engine built on Akka; this Scala model implements a VHDL-like DSL for instruction control definition. Next, we propose tools for developing cross-platform native applications through a project-based build system within CMake and a continuous-integration pipeline using Vagrant, Oracle VirtualBox, and Jenkins. Lastly, we propose a configuration-driven processor simulation, built from the original hc12sim project, that uses a Lua-based scripting interface for processor configuration. While we considered other high-level languages, Lua best fit our requirements, allowing students to use a modern high-level programming language for processor configuration. Instruction controls are defined through Lua functions using high-level constructs that implicitly trigger low-level simulation events. We conclude with suggestions for building a new solution that would better meet the requirements set forth in our research question, building on the successful aspects of this work.
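    The configuration-driven idea, high-level instruction handlers that implicitly produce low-level simulation events, can be sketched briefly. The sketch below uses Python in place of the thesis's Lua interface, and the Processor class and mnemonics are invented for illustration rather than taken from hc12sim.

```python
# Hedged sketch, in Python rather than the thesis's Lua, of the
# configuration-driven idea: a processor is configured by registering
# high-level instruction handlers, and each register access implicitly
# records a low-level simulation event. All names are illustrative.
class Processor:
    def __init__(self, registers):
        self.regs = dict.fromkeys(registers, 0)
        self.events = []          # low-level trace produced implicitly
        self.isa = {}

    def define(self, mnemonic, handler):
        self.isa[mnemonic] = handler   # the "configuration" step

    def read(self, r):
        self.events.append(("read", r, self.regs[r]))
        return self.regs[r]

    def write(self, r, v):
        self.events.append(("write", r, v))
        self.regs[r] = v

    def step(self, mnemonic, *operands):
        self.isa[mnemonic](self, *operands)

cpu = Processor(["A", "B"])
cpu.define("lda", lambda p, v: p.write("A", v))                   # load immediate
cpu.define("aba", lambda p: p.write("A", p.read("A") + p.read("B")))
cpu.regs["B"] = 5
cpu.step("lda", 2)
cpu.step("aba")
print(cpu.regs["A"], cpu.events)   # 7, plus the implicit event trace
```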

    Dynamic Dependency Collapsing

    In this dissertation, we explore the concept of dynamic dependency collapsing. When clock speed is fixed, performance increases in computer architecture come from exploiting additional parallelism. We show that further improvements are possible even when the available parallelism in a program is exhausted. This improvement comes from executing, in parallel, instructions that would ordinarily have been serialized; we call this concept dependency collapsing. We explore existing techniques that exploit parallelism and show which of them fall under the umbrella of dependency collapsing. We then introduce two dependency-collapsing techniques of our own. The first collapses data dependencies by fusing two normally dependent instructions so that they execute together; we show that exploiting the additional parallelism generated by collapsing these dependencies results in a performance increase. The second collapses resource dependencies to execute instructions that would normally have been serialized due to resource constraints in the processor; we show that it is possible to take advantage of larger in-processor structures while avoiding the power and area penalty this often implies.
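    A hedged sketch of the data-dependency-collapsing idea: when one instruction's result feeds the next, the pair is issued as a single fused operation rather than serialized. The instruction encoding and fusion rule below are illustrative assumptions, not the dissertation's actual design.

```python
# Hedged sketch of data-dependency collapsing: when the result of one
# instruction feeds the very next one, issue them as a single fused
# operation instead of serializing them. The four-tuple encoding and
# the pairwise fusion rule are illustrative only.
def fuse(window):
    """Scan an issue window of (dest, op, src1, src2) instructions and
    collapse producer/consumer pairs into one fused instruction."""
    fused, i = [], 0
    while i < len(window):
        cur = window[i]
        nxt = window[i + 1] if i + 1 < len(window) else None
        if nxt and cur[0] in (nxt[2], nxt[3]):
            # nxt consumes cur's result: collapse the dependency so the
            # pair can execute together in one (wider) functional unit
            fused.append(("fused", cur, nxt))
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused

window = [
    ("r1", "add", "r2", "r3"),   # r1 = r2 + r3
    ("r4", "mul", "r1", "r5"),   # depends on r1: candidate for fusion
    ("r6", "sub", "r7", "r8"),   # independent: issues on its own
]
for slot in fuse(window):
    print(slot)
```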

    PyDac: A Distributed Runtime System and Programming Model for a Heterogeneous Many-Core Architecture

    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput, but they impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model: at the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy; at the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architecture support. PyDac has been implemented both on a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and on a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and improves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, although (predictably) serial operation limits some of them, and (c) the degree of protection versus speed can be varied in redundant threading that is transparent to programmers.
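    The two-level model can be sketched in ordinary Python. In the sketch below, a high-level divide step recursively splits a problem into many small tasks, and a thread pool stands in for the small, energy-efficient cores; this illustrates the model as the abstract describes it and is not PyDac's actual API.

```python
# Hedged sketch of a two-level divide-and-conquer programming model in
# the spirit of the PyDac abstract. The function names and the use of a
# thread pool as the "runtime" are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def divide(lo, hi, grain):
    """High level: split [lo, hi) divide-and-conquer style into many
    small ranges until each is at most `grain` wide."""
    if hi - lo <= grain:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return divide(lo, mid, grain) + divide(mid, hi, grain)

def leaf_task(bounds):
    """Low level: an ordinary imperative task (here, a partial sum)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

tasks = divide(0, 100_000, grain=1_000)          # many small tasks
with ThreadPoolExecutor(max_workers=4) as pool:  # stand-in for small cores
    partials = pool.map(leaf_task, tasks)        # runtime schedules tasks
print(sum(partials))                             # combine step
```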