132 research outputs found

    Recursos anchos: una técnica de bajo coste para explotar paralelismo agresivo en códigos numéricos

    Get PDF
    Els bucles son la part que més temps consumeix en les aplicacions numèriques. El rendiment dels bucles està limitat tant pels recursos oferts per l'arquitectura com per les recurrències del bucle en la computació. Per executar més operacions per cicle, els processadors actuals es dissenyen amb graus creixents de replicació de recursos (tècnica de replicació) para ports de memòria i unitats funcionals. En canvi, el gran cost en termes d'àrea i temps de cicle d'aquesta tècnica limita tenir alts graus de replicació: alts valors en temps de cicle contraresten els guanys deguts al decrement en el nombre de cicles, mentre que alts valors en l'àrea requerida poden portar a configuracions impossibles d'implementar. Una alternativa a la replicació de recursos, és fer los més amples (tècnica que anomenem "widening"), i que ha estat usada en alguns dissenys recents. Amb aquesta tècnica, l'amplitud dels recursos s'amplia, fent una mateixa operació sobre múltiples dades. Per altra banda, alguns microprocessadors escalars de propòsit general han estat implementats amb unitats de coma flotants que implementen la instrucció sumar i multiplicar unificada (tècnica de fusió), el que redueix la latència de la operació combinada, tanmateix com el nombre de recursos utilitzats. A aquest treball s'avaluen un ampli conjunt d'alternatives de disseny de processadors VLIW que combinen les tres tècniques. S'efectua una projecció tecnològica de les noves generacions de processadors per predir les possibles alternatives implementables. Com a conclusió, demostrem que tenint en compte el cost, combinar certs graus de replicació i "widening" als recursos hardware és més efectiu que aplicar únicament replicació. Així mateix, confirmem que fer servir unitats que fusionen multiplicació i suma pot tenir un impacte molt significatiu en l'increment de rendiment en futures arquitectures de processadors a un cost molt raonable.Loops are the main time-consuming part of numerical applications. The performance of the loops is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. However, the high cost in terms of area and cycle time of this technique precludes the use of high degrees of replication. High values for the cycle time may clearly offset any gain in terms of number of execution cycles. High values for the area may lead to an unimplementable configuration. An alternative to resource replication is resource widening (widening technique), which has also been used in some recent designs in which the width of the resources is increased (i.e., a single operation is performed over multiple data). Moreover, several general-purpose superscalar microprocessors have been implemented with multiply-add fused floating point units (fusion technique), which reduces the latency of the combined operation and the number of resources used. On this thesis, we evaluate a broad set of VLIW processor design alternatives that combine the three techniques. We perform a technological projection for the next processor generations in order to foresee the possible implementable alternatives. From this study, we conclude that if the cost is taken into account, combining certain degrees of replication and widening in the hardware resources is more effective than applying only replication. Also, we confirm that multiply-add fused units will have a significant impact in raising the performance of future processor architectures with a reasonable increase in cost

    An investigation into computer and network curricula

    Get PDF
    This thesis consists of a series of internationally published, peer reviewed, journal and conference research papers that analyse the educational and training needs of undergraduate Information Technology (IT) students within the area of Computer and Network Technology (CNT) Education. Research by Maj et al has found that accredited computing science curricula can fail to meet the expectations of employers in the field of CNT: “It was found that none of these students could perform first line maintenance on a Personal Computer (PC) to a professional standard with due regard to safety, both to themselves and the equipment. Neither could they install communication cards, cables and network operating system or manage a population of networked PCs to an acceptable commercial standard without further extensive training. It is noteworthy that none of the students interviewed had ever opened a PC. It is significant that all those interviewed for this study had successfully completed all the units on computer architecture and communication engineering (Maj, Robbins, Shaw, & Duley, 1998). The students\u27 curricula at that time lacked units in which they gained hands-on experience in modern PC hardware or networking skills. This was despite the fact that their computing science course was level one accredited, the highest accreditation level offered by the Australian Computer Society (ACS). The results of the initial survey in Western Australia led to the introduction of two new units within the Computing Science Degree at Edith Cowan University (ECU), Computer Installation & Maintenance (CIM) and Network Installation & Maintenance (NIM) (Maj, Fetherston, Charlesworth, & Robbins, 1998). Uniquely within an Australian university context these new syllabi require students to work on real equipment. Such experience excludes digital circuit investigation, which is still a recommended approach by the Association for Computing Machinery (ACM) for computer architecture units (ACM, 2001, p.97). Instead, the CIM unit employs a top-down approach based initially upon students\u27 everyday experiences, which is more in accordance with constructivist educational theory and practice. These papers propose an alternate model of IT education that helps to accommodate the educational and vocational needs of IT students in the context of continual rapid changes and developments in technology. The ACM have recognised the need for variation noting that: There are many effective ways to organize a curriculum even for a particular set of goals and objectives (Tucker et al., 1991, p.70). A possible major contribution to new knowledge of these papers relates to how high level abstract bandwidth (B-Node) models may contribute to the understanding of why and how computer and networking technology systems have developed over time. Because these models are de-coupled from the underlying technology, which is subject to rapid change, these models may help to future-proof student knowledge and understanding of the ongoing and future development of computer and networking systems. The de-coupling is achieved through abstraction based upon bandwidth or throughput rather than the specific implementation of the underlying technologies. One of the underlying problems is that computing systems tend to change faster than the ability of most educational institutions to respond. Abstraction and the use of B-Node models could help educational models to more quickly respond to changes in the field, and can also help to introduce an element of future-proofing in the education of IT students. The importance of abstraction has been noted by the ACM who state that: Levels of Abstraction: the nature and use of abstraction in computing; the use of abstraction in managing complexity, structuring systems, hiding details, and capturing recurring patterns; the ability to represent an entity or system by abstractions having different levels of detail and specificity (ACM, 1991b). Bloom et al note the importance of abstraction, listing under a heading of: “Knowledge of the universals and abstractions in a field” the objective: Knowledge of the major schemes and patterns by which phenomena and ideas arc organized. These are large structures, theories, and generalizations which dominate a subject or field or problems. These are the highest levels of abstraction and complexity\u27\u27 (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956, p. 203). Abstractions can be applied to computer and networking technology to help provide students with common fundamental concepts regardless of the particular underlying technological implementation to help avoid the rapid redundancy of a detailed knowledge of modem computer and networking technology implementation and hands-on skills acquisition. Again the ACM note that: “Enduring computing concepts include ideas that transcend any specific vendor, package or skill set... While skills are fleeting, fundamental concepts are enduring and provide long lasting benefits to students, critically important in a rapidly changing discipline (ACM, 2001, p.70) These abstractions can also be reinforced by experiential learning to commercial practices. In this context, the other possibly major contribution of new knowledge provided by this thesis is an efficient, scalable and flexible model for assessing hands-on skills and understanding of IT students. This is a form of Competency-Based Assessment (CBA), which has been successfully tested as part of this research and subsequently implemented at ECU. This is the first time within this field that this specific type of research has been undertaken within the university sector within Australia. Hands-on experience and understanding can become outdated hence the need for future proofing provided via B-Nodes models. The three major research questions of this study are: •Is it possible to develop a new, high level abstraction model for use in CNT education? •Is it possible to have CNT curricula that are more directly relevant to both student and employer expectations without suffering from rapid obsolescence? •Can WI effective, efficient and meaningful assessment be undertaken to test students\u27 hands-on skills and understandings? The ACM Special Interest Group on Data Communication (SJGCOMM) workshop report on Computer Networking, Curriculum Designs and Educational Challenges, note a list of teaching approaches: ... the more \u27hands-on\u27 laboratory approach versus the more traditional in-class lecture-based approach; the bottom-up approach towards subject matter verus the top-down approach (Kurose, Leibeherr, Ostermann, & Ott-Boisseau, 2002, para 1). Bandwidth considerations are approached from the PC hardware level and at each of the seven layers of the International Standards Organisation (ISO) Open Systems Interconnection (OSI) reference model. It is believed that this research is of significance to computing education. However, further research is needed

    Survey on Instruction Selection: An Extensive and Modern Literature Review

    Full text link
    Instruction selection is one of three optimisation problems involved in the code generator backend of a compiler. The instruction selector is responsible of transforming an input program from its target-independent representation into a target-specific form by making best use of the available machine instructions. Hence instruction selection is a crucial part of efficient code generation. Despite on-going research since the late 1960s, the last, comprehensive survey on the field was written more than 30 years ago. As new approaches and techniques have appeared since its publication, this brings forth a need for a new, up-to-date review of the current body of literature. This report addresses that need by performing an extensive review and categorisation of existing research. The report therefore supersedes and extends the previous surveys, and also attempts to identify where future research should be directed.Comment: Major changes: - Merged simulation chapter with macro expansion chapter - Addressed misunderstandings of several approaches - Completely rewrote many parts of the chapters; strengthened the discussion of many approaches - Revised the drawing of all trees and graphs to put the root at the top instead of at the bottom - Added appendix for listing the approaches in a table See doc for more inf

    STAPL-RTS: A Runtime System for Massive Parallelism

    Get PDF
    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we prove that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized, but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning) which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm

    Decision Support Database Management System Acceleration Using Vector Processor

    Get PDF
    English: This work takes a top-down approach to accelerating decision support systems (DSS) on x86-64 microprocessors using true vector ISA extensions. First, a state of art DSS database management system (DBMS) is pro led and bottlenecks are identi ed. From this, the bottlenecked functions are analysed for data-level parallelism and a discussion is given as to why the existing multimedia SIMD extensions (SSE) are not suitable for capturing this parallelism. A vector ISA is derived from what is found to be necessary in these functions; additionally, a complementary microarchitecture is proposed that draws on prior research done in vector microprocessors but is also optimised for the properties found in the pro led application. Finally, the ISA and microarchitecture are implemented and evaluated using a cycle-accurate x86-64 microarchitecture simulator

    Engineering derivatives from biological systems for advanced aerospace applications

    Get PDF
    The present study consisted of a literature survey, a survey of researchers, and a workshop on bionics. These tasks produced an extensive annotated bibliography of bionics research (282 citations), a directory of bionics researchers, and a workshop report on specific bionics research topics applicable to space technology. These deliverables are included as Appendix A, Appendix B, and Section 5.0, respectively. To provide organization to this highly interdisciplinary field and to serve as a guide for interested researchers, we have also prepared a taxonomy or classification of the various subelements of natural engineering systems. Finally, we have synthesized the results of the various components of this study into a discussion of the most promising opportunities for accelerated research, seeking solutions which apply engineering principles from natural systems to advanced aerospace problems. A discussion of opportunities within the areas of materials, structures, sensors, information processing, robotics, autonomous systems, life support systems, and aeronautics is given. Following the conclusions are six discipline summaries that highlight the potential benefits of research in these areas for NASA's space technology programs

    Predicated execution and register windows for out-of-order processors

    Get PDF
    ISA extensions are a very powerful approach to implement new hardware techniques that require or benefit from compiler support: decisions made at compile time can be complemented at runtime, achieving a synergistic effect between the compiler and the processor. This thesis is focused on two ISA extensions: predicate execution and register windows. Predicate execution is exploited by the if-conversion compiler technique. If-conversion removes control dependences by transforming them to data dependences, which helps to exploit ILP beyond a single basic-block. Register windows help to reduce the amount of loads and stores required to save and restore registers across procedure calls by storing multiple contexts into a large architectural register file.In-order processors specially benefit from using both ISA extensions to overcome the limitations that control dependences and memory hierarchy impose on static scheduling. Predicate execution allows to move control dependence instructions past branches. Register windows reduce the amount of memory operations across procedure calls. Although if-conversion and register windows techniques have not been exclusively developed for in-order processors, their use for out-of-order processors has been studied very little. In this thesis we show that the uses of if-conversion and register windows introduce new performance opportunities and new challenges to face in out-of-order processors.The use of if-conversion in out-of-order processors helps to eliminate hard-to-predict branches, alleviating the severe performance penalties caused by branch mispredictions. However, the removal of some conditional branches by if-conversion may adversely affect the predictability of the remaining branches, because it may reduce the amount of correlation information available to the branch predictor. Moreover, predicate execution in out-of-order processors has to deal with two performance issues. First, multiple definitions of the same logical register can be merged into a single control flow, where each definition is guarded with a different predicate. Second, instructions whose guarding predicate evaluates to false consume unnecessary resources. This thesis proposes a branch prediction scheme based on predicate prediction that solves the three problems mentioned above. This scheme, which is built on top of a predicated ISA that implement a compare-and-branch model such as the one considered in this thesis, has two advantages: First, the branch accuracy is improved because the correlation information is not lost after if-conversion and the mechanism we propose permits using the computed value of the branch predicate when available, achieving 100% of accuracy. Second it avoids the predicate out-of-order execution problems.Regarding register windows, we propose a mechanism that reduces physical register requirements of an out-of-order processor to the bare minimum with almost no performance loss. The mechanism is based on identifying which architectural registers are in use by current in-flight instructions. The registers which are not in use, i.e. there is no in-flight instruction that references them, can be early released.In this thesis we propose a very efficient and low-cost hardware implementation of predicate execution and register windows that provide important benefits to out-of-order processors
    corecore