9 research outputs found

    Ultra-thin chips for high-performance flexible electronics

    Get PDF
    Flexible electronics has significantly advanced over the last few years, as devices and circuits from nanoscale structures to printed thin films have started to appear. Simultaneously, the demand for high-performance electronics has also increased because flexible and compact integrated circuits are needed to obtain fully flexible electronic systems. It is challenging to obtain flexible and compact integrated circuits as the silicon based CMOS electronics, which is currently the industry standard for high-performance, is planar and the brittle nature of silicon makes bendability difficult. For this reason, the ultra-thin chips from silicon is gaining interest. This review provides an in-depth analysis of various approaches for obtaining ultra-thin chips from rigid silicon wafer. The comprehensive study presented here includes analysis of ultra-thin chips properties such as the electrical, thermal, optical and mechanical properties, stress modelling, and packaging techniques. The underpinning advances in areas such as sensing, computing, data storage, and energy have been discussed along with several emerging applications (e.g., wearable systems, m-Health, smart cities and Internet of Things etc.) they will enable. This paper is targeted to the readers working in the field of integrated circuits on thin and bendable silicon; but it can be of broad interest to everyone working in the field of flexible electronics

    Addressing Memory Bottlenecks for Emerging Applications

    Full text link
    There has been a recent emergence of applications from the domain of machine learning, data mining, numerical analysis and image processing. These applications are becoming the primary algorithms driving many important user-facing applications and becoming pervasive in our daily lives. Due to their increasing usage in both mobile and datacenter workloads, it is necessary to understand the software and hardware demands of these applications, and design techniques to match their growing needs. This dissertation studies the performance bottlenecks that arise when we try to improve the performance of these applications on current hardware systems. We observe that most of these applications are data-intensive, i.e., they operate on a large amount of data. Consequently, these applications put significant pressure on the memory. Interestingly, we notice that this pressure is not just limited to one memory structure. Instead, different applications stress different levels of the memory hierarchy. For example, training Deep Neural Networks (DNN), an emerging machine learning approach, is currently limited by the size of the GPU main memory. On the other spectrum, improving DNN inference on CPUs is bottlenecked by Physical Register File (PRF) bandwidth. Concretely, this dissertation tackles four such memory bottlenecks for these emerging applications across the memory hierarchy (off-chip memory, on-chip memory and physical register file), presenting hardware and software techniques to address these bottlenecks and improve the performance of the emerging applications. For on-chip memory, we present two scenarios where emerging applications perform at a sub-optimal performance. First, many applications have a large number of marginal bits that do not contribute to the application accuracy, wasting unnecessary space and transfer costs. We present ACME, an asymmetric compute-memory paradigm, that removes marginal bits from the memory hierarchy while performing the computation in full precision. Second, we tackle the contention in shared caches for these emerging applications that arise in datacenters where multiple applications can share the same cache capacity. We present ShapeShifter, a runtime system that continuously monitors the runtime environment, detects changes in the cache availability and dynamically recompiles the application on the fly to efficiently utilize the cache capacity. For physical register file, we observe that DNN inference on CPUs is primarily limited by the PRF bandwidth. Increasing the number of compute units in CPU requires increasing the read ports in the PRF. In this case, PRF quickly reaches a point where latency could no longer be met. To solve this problem, we present LEDL, locality extensions for deep learning on CPUs, that entails a rearchitected FMA and PRF design tailored for the heavy data reuse inherent in DNN inference. Finally, a significant challenge facing both the researchers and industry practitioners is that as the DNNs grow deeper and larger, the DNN training is limited by the size of the GPU main memory, restricting the size of the networks which GPUs can train. To tackle this challenge, we first identify the primary contributors to this heavy memory footprint, finding that the feature maps (intermediate layer outputs) are the heaviest contributors in training as opposed to the weights in inference. Then, we present Gist, a runtime system, that uses three efficient data encoding techniques to reduce the footprint of DNN training.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/146016/1/anijain_1.pd

    Hardware/Software Co-design for Multicore Architectures

    Get PDF
    Siirretty Doriast

    Performance Analysis of Complex Shared Memory Systems

    Get PDF
    Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to which extend the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment for the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node leads to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations

    Generating and auto-tuning parallel stencil codes

    Get PDF
    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving a high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation in a concise way independently of hardware architecture-specific details. Thus, it increases the programmer productivity by disburdening her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning — which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration, — the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code

    Adaptive Distributed Architectures for Future Semiconductor Technologies.

    Full text link
    Year after year semiconductor manufacturing has been able to integrate more components in a single computer chip. These improvements have been possible through systematic shrinking in the size of its basic computational element, the transistor. This trend has allowed computers to progressively become faster, more efficient and less expensive. As this trend continues, experts foresee that current computer designs will face new challenges, in utilizing the minuscule devices made available by future semiconductor technologies. Today's microprocessor designs are not fit to overcome these challenges, since they are constrained by their inability to handle component failures by their lack of adaptability to a wide range of custom modules optimized for specific applications and by their limited design modularity. The focus of this thesis is to develop original computer architectures, that can not only survive these new challenges, but also leverage the vast number of transistors available to unlock better performance and efficiency. The work explores and evaluates new software and hardware techniques to enable the development of novel adaptive and modular computer designs. The thesis first explores an infrastructure to quantitatively assess the fallacies of current systems and their inadequacy to operate on unreliable silicon. In light of these findings, specific solutions are then proposed to strengthen digital system architectures, both through hardware and software techniques. The thesis culminates with the proposal of a radically new architecture design that can fully adapt dynamically to operate on the hardware resources available on chip, however limited or abundant those may be.PHDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/102405/1/apellegr_1.pd

    Ultra-thin silicon technology for tactile sensors

    Get PDF
    In order to meet the requirements of high performance flexible electronics in fast growing portable consumer electronics, robotics and new fields such as Internet of Things (IoT), new techniques such as electronics based on nanostructures, molecular electronics and quantum electronics have emerged recently. The importance given to the silicon chips with thickness below 50 μm is particularly interesting as this will advance the 3D IC technology as well as open new directions for high-performance flexible electronics. This doctoral thesis focusses on the development of silicon–based ultra-thin chip (UTC) for the next generation flexible electronics. UTCs, on one hand can provide processing speed at par with state-of-the-art CMOS technology, and on the other provide the mechanical flexibility to allow smooth integration on flexible substrates. These development form the motivation behind the work presented in this thesis. As the thickness of any silicon piece decreases, the flexural rigidity decreases. The flexural rigidity is defined as the force couple required to bend a non-rigid structure to a unit curvature, and therefore the flexibility increases. The new approach presented in this thesis for achieving thin silicon exploits existing and well-established silicon infrastructure, process, and design modules. The thin chips of thicknesses ranging between 15 μm – 30 μm, were obtained from processed bulk wafer using anisotropic chemical etching. The thesis also presents thin wafer transfer using two-step transfer printing approach, packaging by lamination or encapsulation between two flexible layerand methods to get the electrical connections out of the chip. The devices realised on the wafer as part of front-end processing, consisted capacitors and transistors, have been tested to analyse the effect of bending on the electrical characteristics. The capacitance of metal-oxide-semiconductor (MOS) capacitors increases by ~5% during bending and similar shift is observed in flatband and threshold voltages. Similarly, the carrier mobility in the channel region of metal-oxide-semiconductor field effect transistor (MOSFET) increases by 9% in tensile bending and decreases by ~5% in compressive bending. The analytical model developed to capture the effect of banding on device performance showed close matching with the experimental results. In order to employ these devices as tactile sensors, two types of piezoelectric materials are investigated, and used in extended gate configuration with the MOSFET. Firstly, a nanocomposite of Poly(vinylidene fluoride-co-trifluoroethylene), P(VDF-TrFE) and barium titanate (BT) was developed. The composite, due to opposite piezo and pyroelectric coefficients of constituents, was able to suppress the sensitivity towards temperature when force and temperature varied together, The sensitivity to force in extended gate configuration was measured to be 630 mV/N, and sensitivity to temperature was 6.57 mV/oC, when it was varied during force application. The process optimisation for sputtering piezoelectric Aluminium Nitride (AlN) was also carried out with many parametric variation. AlN does not require poling to exhibit piezoelectricity and therefore offers an attractive alternative for the piezoelectric layer used in devices such as POSFET (where piezoelectric material is directly deposited over the gate area of MOSFET). The optimised process gave highly orientated columnar structure AlN with piezoelectric coefficient of 5.9 pC/N and when connected in extended gate configuration, a sensitivity (normalised change in drain current per unit force) of 2.65 N-1 was obtained

    RUNTIME METHODS TO IMPROVE ENERGY EFFICIENCY IN SUPERCOMPUTING APPLICATIONS

    Get PDF
    Energy efficiency in supercomputing is critical to limit operating costs and carbon footprints. While the energy efficiency of future supercomputing centers needs to improve at all levels, the energy consumed by the processing units is a large fraction of the total energy consumed by High Performance Computing (HPC) systems. HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically-changing factors both in hardware and software affecting energy usage of processors, there exists a need for power monitoring and regulation at runtime to achieve savings in energy. This dissertation highlights an adaptive runtime framework that enables processors with core-specific power control by dynamically adapting to workload characteristics to reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified - computational workload imbalance and waiting on memory. Monitoring of performance and power regulation is performed by the framework transparently within the MPI runtime system, eliminating the need for code changes to MPI applications. The effect of enforcing power limits (capping) on processors is also investigated. Experiments on 32 nodes (1024 cores) show that in presence of workload imbalance, the runtime reduces Central Processing Unit (CPU) frequency on cores not on the critical path, thereby reducing power and hence energy usage without deteriorating performance. Using this runtime, six MPI mini-applications and a full MPI application show an overall 20% decrease in energy use with less than 1% increase in execution time. In addition, the lowering of frequency on non-critical cores reduces run-to-run performance variation and improves performance. For the full application, an average speedup of 11% is seen, while the power is lowered by about 31% for an energy savings of up to 42%. Another experiment on 16 nodes (256 cores) that are power capped also shows performance improvement along with power reduction. Thus, energy optimization can also be a performance optimization. For applications that are limited by memory access times, memory metrics identified facilitate lowering of power by up to 32% without adversely impacting performance.Doctor of Philosoph
    corecore