1,869 research outputs found

    Clustered VLIW architecture based on queue register files

    Get PDF
    Institute for Computing Systems ArchitectureInstruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. IN spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons we understand that the benefits of having parallel FUs, which have motivated the investigation of alternative machine designs. This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture

    HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems

    Get PDF
    Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy still remain challenges of implementing STT-RAMs in the CPU cache. An increasingly popular method for addressing this challenge involves trading off the non-volatility for reduced write speed and write energy by relaxing the STT-RAM's data retention time. However, in order to maximize energy saving potential, the cache configurations, including STT-RAM's retention time, must be dynamically adapted to executing applications' variable memory needs. In this paper, we propose a highly adaptable last level STT-RAM cache (HALLS) that allows the LLC configurations and retention time to be adapted to applications' runtime execution requirements. We also propose low-overhead runtime tuning algorithms to dynamically determine the best (lowest energy) cache configurations and retention times for executing applications. Compared to prior work, HALLS reduced the average energy consumption by 60.57% in a quad-core system, while introducing marginal latency overhead.Comment: To Appear on IEEE Transactions on Computers (TC

    ENERGY-EFFICIENT LIGHTWEIGHT ALGORITHMS FOR EMBEDDED SMART CAMERAS: DESIGN, IMPLEMENTATION AND PERFORMANCE ANALYSIS

    Get PDF
    An embedded smart camera is a stand-alone unit that not only captures images, but also includes a processor, memory and communication interface. Battery-powered, embedded smart cameras introduce many additional challenges since they have very limited resources, such as energy, processing power and memory. When camera sensors are added to an embedded system, the problem of limited resources becomes even more pronounced. Hence, computer vision algorithms running on these camera boards should be light-weight and efficient. This thesis is about designing and developing computer vision algorithms, which are aware and successfully overcome the limitations of embedded platforms (in terms of power consumption and memory usage). Particularly, we are interested in object detection and tracking methodologies and the impact of them on the performance and battery life of the CITRIC camera (embedded smart camera employed in this research). This thesis aims to prolong the life time of the Embedded Smart platform, without affecting the reliability of the system during surveillance tasks. Therefore, the reader is walked through the whole designing process, from the development and simulation, followed by the implementation and optimization, to the testing and performance analysis. The work presented in this thesis carries out not only software optimization, but also hardware-level operations during the stages of object detection and tracking. The performance of the algorithms introduced in this thesis are comparable to state-of-the-art object detection and tracking methods, such as Mixture of Gaussians, Eigen segmentation, color and coordinate tracking. Unlike the traditional methods, the newly-designed algorithms present notable reduction of the memory requirements, as well as the reduction of memory accesses per pixel. To accomplish the proposed goals, this work attempts to interconnect different levels of the embedded system architecture to make the platform more efficient in terms of energy and resource savings. Thus, the algorithms proposed are optimized at the API, middleware, and hardware levels to access the pixel information of the CMOS sensor directly. Only the required pixels are acquired in order to reduce the unnecessary communications overhead. Experimental results show that when exploiting the architecture capabilities of an embedded platform, 41.24% decrease in energy consumption, and 107.2% increase in battery-life can be accomplished. Compared to traditional object detection and tracking methods, the proposed work provides an additional 8 hours of continuous processing on 4 AA batteries, increasing the lifetime of the camera to 15.5 hours

    Space Applications of Automation, Robotics and Machine Intelligence Systems (ARAMIS). Volume 4: Supplement, Appendix 4.3: Candidate ARAMIS Capabilities

    Get PDF
    Potential applications of automation, robotics, and machine intelligence systems (ARAMIS) to space activities, and to their related ground support functions, in the years 1985-2000, so that NASA may make informed decisions on which aspects of ARAMIS to develop. The study first identifies the specific tasks which will be required by future space projects. It then defines ARAMIS options which are candidates for those space project tasks, and evaluates the relative merits of these options. Finally, the study identifies promising applications of ARAMIS, and recommends specific areas for further research. The ARAMIS options defined and researched by the study group span the range from fully human to fully machine, including a number of intermediate options (e.g., humans assisted by computers, and various levels of teleoperation). By including this spectrum, the study searches for the optimum mix of humans and machines for space project tasks

    Automation of the longwall mining system

    Get PDF
    Cost effective, safe, and technologically sound applications of automation technology to underground coal mining were identified. The longwall analysis commenced with a general search for government and industry experience of mining automation technology. A brief industry survey was conducted to identify longwall operational, safety, and design problems. The prime automation candidates resulting from the industry experience and survey were: (1) the shearer operation, (2) shield and conveyor pan line advance, (3) a management information system to allow improved mine logistics support, and (4) component fault isolation and diagnostics to reduce untimely maintenance delays. A system network analysis indicated that a 40% improvement in productivity was feasible if system delays associated with all of the above four areas were removed. A technology assessment and conceptual system design of each of the four automation candidate areas showed that state of the art digital computer, servomechanism, and actuator technologies could be applied to automate the longwall system

    Towards Accurate Estimation of Error Sensitivity in Computer Systems

    Get PDF
    Fault injection is an increasingly important method for assessing, measuringand observing the system-level impact of hardware and software faults in computer systems. This thesis presents the results of a series of experimental studies in which fault injection was used to investigate the impact of bit-flip errors on program execution. The studies were motivated by the fact that transient hardware faults in microprocessors can cause bit-flip errors that can propagate to the microprocessors instruction set architecture registers and main memory. As the rate of such hardware faults is expected to increase with technology scaling, there is a need to better understand how these errors (known as ‘soft errors’) influence program execution, especially in safety-critical systems.Using ISA-level fault injection, we investigate how five aspects, or factors, influence the error sensitivity of a program. We define error sensitivity as the conditional probability that a bit-flip error in live data in an ISA-register or main-memory word will cause a program to produce silent data corruption (SDC; i.e., an erroneous result). We also consider the estimation of a measure called SDC count, which represents the number of ISA-level bit flips that cause an SDC.The five factors addressed are (a) the inputs processed by a program, (b) the level of compiler optimization, (c) the implementation of the program in the source code, (d) the fault model (single bit flips vs double bit flips) and (e)the fault-injection technique (inject-on-write vs inject-on-read). Our results show that these factors affect the error sensitivity in many ways; some factors strongly impact the error sensitivity or SDC count whereas others show a weaker impact. For example, our experiments show that single bit flips tend to cause SDCs more than double bit flips; compiler optimization positively impacts the SDC count but not necessarily the error sensitivity; the error sensitivity varies between 20% and 50% among the programs we tested; and variations in input affect the error sensitivity significantly for most of the tested programs

    Is there a Moore's law for quantum computing?

    Full text link
    There is a common wisdom according to which many technologies can progress according to some exponential law like the empirical Moore's law that was validated for over half a century with the growth of transistors number in chipsets. As a still in the making technology with a lot of potential promises, quantum computing is supposed to follow the pack and grow inexorably to maturity. The Holy Grail in that domain is a large quantum computer with thousands of errors corrected logical qubits made themselves of thousands, if not more, of physical qubits. These would enable molecular simulations as well as factoring 2048 RSA bit keys among other use cases taken from the intractable classical computing problems book. How far are we from this? Less than 15 years according to many predictions. We will see in this paper that Moore's empirical law cannot easily be translated to an equivalent in quantum computing. Qubits have various figures of merit that won't progress magically thanks to some new manufacturing technique capacity. However, some equivalents of Moore's law may be at play inside and outside the quantum realm like with quantum computers enabling technologies, cryogeny and control electronics. Algorithms, software tools and engineering also play a key role as enablers of quantum computing progress. While much of quantum computing future outcomes depends on qubit fidelities, it is progressing rather slowly, particularly at scale. We will finally see that other figures of merit will come into play and potentially change the landscape like the quality of computed results and the energetics of quantum computing. Although scientific and technological in nature, this inventory has broad business implications, on investment, education and cybersecurity related decision-making processes.Comment: 32 pages, 24 figure

    Cross-Layer Approaches for an Aging-Aware Design of Nanoscale Microprocessors

    Get PDF
    Thanks to aggressive scaling of transistor dimensions, computers have revolutionized our life. However, the increasing unreliability of devices fabricated in nanoscale technologies emerged as a major threat for the future success of computers. In particular, accelerated transistor aging is of great importance, as it reduces the lifetime of digital systems. This thesis addresses this challenge by proposing new methods to model, analyze and mitigate aging at microarchitecture-level and above

    Leveraging register windows to reduce physical registers to the bare minimum

    Get PDF
    Register window is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverage register windows to drastically reduce the register requirements, and hence, reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version
    • …
    corecore