
    Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data

    Due to the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have significantly slowed in recent years. While increasing the number of cores has been a viable approach for further performance increases, Amdahl's Law and its implications on parallelization also limit further performance gains. Consequently, research has shifted towards different approaches, including domain-specific custom architectures tailored to specific workloads. This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, which has spawned several new architectures and architectural advances specifically targeted at currently prominent workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU towards more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data or In-Network Processing, as the most complex architectural improvements. This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that both minor and major architectural changes can significantly increase performance in Machine Learning applications. In addition, this thesis presents recent work on Near-Data Processing, which introduces Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can be applied to improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks. Using Near-Data Processing as a use case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from the development of novel architectures, and that Machine Learning can in turn be applied to improve the applications of those novel architectures.
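
    To make the compute pattern behind such accelerators concrete, the following sketch evaluates a toy Sum-Product Network bottom-up. It is an illustration only: the node structure, weights, and leaf probabilities are invented here rather than taken from the dissertation, but the regular multiply-accumulate dataflow it exposes is what makes SPN inference a natural candidate for hardware offloading.

        # Illustrative only: a generic bottom-up evaluation of a tiny Sum-Product
        # Network (SPN). Node structure, weights, and leaf likelihoods are
        # made-up examples, not values from the dissertation.

        class Leaf:
            def __init__(self, prob):
                self.prob = prob            # likelihood of the observed value
            def value(self):
                return self.prob

        class Product:
            def __init__(self, children):
                self.children = children
            def value(self):
                v = 1.0
                for c in self.children:     # product nodes multiply child values
                    v *= c.value()
                return v

        class Sum:
            def __init__(self, weighted_children):
                self.weighted_children = weighted_children
            def value(self):                # sum nodes form a weighted mixture
                return sum(w * c.value() for w, c in self.weighted_children)

        # P(A, B) for one observation: two mixture components over independent leaves.
        spn = Sum([(0.6, Product([Leaf(0.9), Leaf(0.2)])),
                   (0.4, Product([Leaf(0.1), Leaf(0.7)]))])
        print(spn.value())  # about 0.136 -- one pass of multiply-accumulate operations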

    Method for the legalization of circuits with multi-row-height cells

    Since the 1970s, new semiconductor technologies have impacted our society. Since then, the number of components on a single circuit has doubled every two years, following Moore's Law. With this advancement, today's microprocessors contain billions of transistors. However, these advances imposed design rules that brought new challenges to the optimization steps. EDA (Electronic Design Automation) software became necessary to cope with these obstacles. Today, EDA tools are used in standard-cell design flows from their initial to their final stages. The standard-cell flow comprises a sequence of flows used to elaborate and synthesize the circuit. Among these is the physical synthesis flow, which contains the placement step. One of the placement stages is legalization, whose goal is to move cells to valid positions and remove their overlaps. Legalization faces challenges such as the number of cells, mixed cell heights, routability, wirelength, and complex design rules such as fence regions. Legalization methods are categorized into heuristic and analytical approaches to address these challenges. This work proposes a heuristic legalization method for mixed-height cells, based on steps from existing works in the literature. Additionally, our legalization is applied incrementally through stages that first prioritize cells violating their fence regions and then prioritize groups of overlapping cells. The experiments show that our proposed method reduces execution time by more than 15%, while the difference in wirelength is minimal.
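
    A minimal sketch of the incremental prioritization described above, assuming a simple rectangle representation for cells and fence regions; the legalize_cell callback stands in for whatever placement move the legalizer applies and is purely hypothetical.

        # Illustrative sketch (not the thesis implementation) of the incremental
        # ordering: fence-region violators first, then groups of overlapping cells.
        # Cell/region fields and legalize_cell() are hypothetical placeholders.

        def inside(region, c):
            return (region["x"] <= c["x"] and c["x"] + c["w"] <= region["x"] + region["w"]
                    and region["y"] <= c["y"] and c["y"] + c["h"] <= region["y"] + region["h"])

        def overlaps(a, b):
            return not (a["x"] + a["w"] <= b["x"] or b["x"] + b["w"] <= a["x"]
                        or a["y"] + a["h"] <= b["y"] or b["y"] + b["h"] <= a["y"])

        def incremental_legalize(cells, legalize_cell):
            # Pass 1: cells that violate their fence region get priority.
            for c in cells:
                if c.get("fence") is not None and not inside(c["fence"], c):
                    legalize_cell(c)
            # Pass 2: groups of mutually overlapping cells are legalized together.
            for i, a in enumerate(cells):
                group = [b for b in cells[i + 1:] if overlaps(a, b)]
                if group:
                    for c in [a] + group:
                        legalize_cell(c)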

    Custom Cell Placement Automation for Asynchronous VLSI

    Asynchronous Very-Large-Scale-Integration (VLSI) circuits have demonstrated many advantages over their synchronous counterparts, including low power consumption, elastic pipelining, robustness against manufacturing and temperature variations, etc. However, the lack of dedicated electronic design automation (EDA) tools, especially physical layout automation tools, largely limits the adoption of asynchronous circuits. Existing commercial placement tools are optimized for synchronous circuits and require a standard cell library provided by semiconductor foundries to complete the physical design. The physical layouts of cells in this library have the same height to simplify the placement problem and the power distribution network. Although the standard cell methodology also works for asynchronous designs, the performance is inferior compared with counterparts designed using the full-custom design methodology. To tackle this challenge, we propose a gridded cell layout methodology for asynchronous circuits, in which the cell height and cell width can be any integer multiple of two grid values. The gridded cell approach combines the shape regularity of standard cells with the size flexibility of full-custom layouts. Therefore, this approach can achieve a better space utilization ratio and lower wirelength for asynchronous designs. Experiments have shown that the gridded cell placement approach reduces area without impacting routability. We have also used this placer to tape out a chip in a 65 nm process technology, demonstrating that our placer generates design-rule-clean results.
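
    The size rule at the heart of the gridded cell methodology can be illustrated in a few lines of Python; the grid pitches below are invented for illustration and are not values from the paper.

        # Minimal sketch of the gridded-cell size rule: a cell's width and height
        # must each be an integer multiple of a grid pitch. Pitch values are
        # assumptions for illustration only.

        import math

        X_GRID_UM = 0.2   # horizontal grid pitch (assumed)
        Y_GRID_UM = 0.6   # vertical grid pitch (assumed)

        def is_gridded(width_um, height_um, eps=1e-9):
            """True if the cell outline lies on both placement grids."""
            return (abs(width_um / X_GRID_UM - round(width_um / X_GRID_UM)) < eps and
                    abs(height_um / Y_GRID_UM - round(height_um / Y_GRID_UM)) < eps)

        def snap_up(width_um, height_um):
            """Round a full-custom outline up to the nearest legal gridded size."""
            # The small offset guards against float noise pushing ceil() one step up.
            return (math.ceil(width_um / X_GRID_UM - 1e-9) * X_GRID_UM,
                    math.ceil(height_um / Y_GRID_UM - 1e-9) * Y_GRID_UM)

        print(is_gridded(1.0, 1.2))   # True: 5 x 2 grid units
        print(snap_up(0.95, 1.25))    # approximately (1.0, 1.8): 5 x 3 grid units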

    Developing Trustworthy Hardware with Security-Driven Design and Verification

    Over the past several decades, computing hardware has evolved to become smaller, yet more performant and energy-efficient. Unfortunately, these advancements have come at a cost of increased complexity, both physically and functionally. Physically, the nanometer-scale transistors used to construct Integrated Circuits (ICs) have become astronomically expensive to fabricate. Functionally, ICs have become increasingly dense and feature-rich to optimize application-specific tasks. To cope with these trends, IC designers outsource both fabrication and portions of Register-Transfer Level (RTL) design. Outsourcing, combined with the increased complexity of modern ICs, presents a security risk: we must trust that our ICs have been designed and fabricated to specification, i.e., that they do not contain any hardware Trojans. Working in a bottom-up fashion, I initially study the threat of outsourcing fabrication. While prior work demonstrates fabrication-time attacks (modifications) on IC layouts, it is unclear what makes a layout vulnerable to attack. To answer this, in my IC Attack Surface (ICAS) work, I develop a framework that quantifies the security of IC layouts. Using ICAS, I show that modern ICs leave a plethora of both placement and routing resources available for attackers to exploit. Next, to plug these gaps, I construct the first routing-centric defense (T-TER) against fabrication-time Trojans. T-TER wraps security-critical interconnects in IC layouts with tamper-evident guard wires to prevent foundry-side attackers from modifying a design. After hardening layouts against fabrication-time attacks, outsourced designs become the most critical threat. To address this, I develop a dynamic verification technique (Bomberman) to vet untrusted third-party RTL hardware for Ticking Timebomb Trojans (TTTs). By targeting a specific type of Trojan behavior, Bomberman does not suffer from false negatives (missed TTTs), and therefore systematically reduces the overall design-time attack surface. Lastly, to generalize the Bomberman approach to automatically discover other behaviorally defined classes of malicious logic, I adapt coverage-guided software fuzzers to the RTL verification domain. Leveraging software fuzzers for RTL verification enables IC design engineers to optimize test coverage of third-party designs without intimate implementation knowledge. Overall, this dissertation aims to make security a first-class design objective, alongside power, performance, and area, throughout the hardware development process. (PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies: http://deepblue.lib.umich.edu/bitstream/2027.42/169761/1/trippel_1.pd)
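
    A conceptual sketch of the kind of invariant a Bomberman-style analysis looks for, under the assumption that candidate timebomb state is register state whose value sequence never repeats during simulation; this is not the Bomberman implementation, and the trace format is hypothetical.

        # Conceptual sketch only (not the Bomberman code): flag registers whose
        # simulated value sequence never repeats, i.e. state that only marches
        # toward some trigger condition. Trace format is a hypothetical
        # {reg_name: [value_at_cycle_0, value_at_cycle_1, ...]} mapping.

        def timebomb_candidates(trace):
            candidates = []
            for reg, values in trace.items():
                seen = set()
                repeats = False
                for v in values:
                    if v in seen:           # a repeated value rules the register out
                        repeats = True
                        break
                    seen.add(v)
                if not repeats and len(values) > 1:
                    candidates.append(reg)
            return candidates

        # A free-running counter never repeats within the trace -> flagged;
        # an FSM that revisits states is ruled out.
        trace = {"timer":  [0, 1, 2, 3, 4, 5, 6, 7],
                 "fsm_st": [0, 1, 2, 0, 1, 2, 0, 1]}
        print(timebomb_candidates(trace))   # ['timer']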

    Rapid SoC Design: On Architectures, Methodologies and Frameworks

    Modern applications like machine learning, autonomous vehicles, and 5G networking require an order-of-magnitude boost in processing capability. For several decades, chip designers have relied on Moore's Law, the doubling of transistor count every two years, to deliver improved performance, higher energy efficiency, and an increase in transistor density. With the end of Dennard scaling and a slowdown in Moore's Law, system architects have developed several techniques to deliver on the traditional performance and power improvements we have come to expect. More recently, chip designers have turned towards heterogeneous systems composed of more specialized processing units that buttress the traditional processing units. These specialized units improve the overall performance, power, and area (PPA) metrics across a wide variety of workloads and applications. While the GPU serves as a classical example, accelerators for machine learning, approximate computing, graph processing, and database applications have become commonplace. This has led to an exponential growth in the variety (and count) of these compute units found in modern embedded and high-performance computing platforms. The various techniques adopted to combat the slowing of Moore's Law directly translate to an increase in complexity for modern system-on-chips (SoCs). This increase in complexity in turn leads to an increase in design effort and validation time for hardware and the accompanying software stacks. This is further aggravated by fabrication challenges (photolithography, tooling, and yield) faced at advanced technology nodes (below 28 nm). The inherent complexity in modern SoCs translates into increased costs and time-to-market delays. This holds true across the spectrum, from mobile/handheld processors to high-performance data-center appliances. This dissertation presents several techniques to address the challenges of rapidly birthing complex SoCs. The first part of this dissertation focuses on foundations and architectures that aid in rapid SoC design. It presents a variety of architectural techniques that were developed and leveraged to rapidly construct complex SoCs at advanced process nodes. The next part of the dissertation focuses on the gap between a completed design model (in RTL form) and its physical manifestation (a GDS file that will be sent to the foundry for fabrication). It presents methodologies and a workflow for rapidly walking a design through to completion at arbitrary technology nodes. It also presents progress on creating tools and a flow that rely entirely on open-source tools. The last part presents a framework that not only speeds up the integration of a hardware accelerator into an SoC ecosystem, but also emphasizes software adoption and usability. (PhD dissertation, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies: http://deepblue.lib.umich.edu/bitstream/2027.42/168119/1/ajayi_1.pd)

    Dynamic reconfiguration frameworks for high-performance reliable real-time reconfigurable computing

    The sheer hardware-based computational performance and programming flexibility offered by reconfigurable hardware like Field-Programmable Gate Arrays (FPGAs) make them attractive for computing in applications that require high performance, availability, reliability, real-time processing, and high efficiency. Fueled by fabrication process scaling, modern reconfigurable devices come with ever greater quantities of on-chip resources, allowing a more complex variety of applications to be developed. Thus, the trend is that technology giants like Microsoft, Amazon, and Baidu now embrace reconfigurable computing devices like FPGAs to meet their critical computing needs. In addition, the capability to autonomously reprogramme these devices in the field is being exploited for reliability in application domains like aerospace, defence, military, and nuclear power stations. In such applications, real-time computing is important and is often a necessity for reliability. As such, applications and algorithms resident on these devices must be implemented with sufficient consideration for real-time processing and reliability. Often, to manage a reconfigurable hardware device as a computing platform for a multiplicity of homogeneous and heterogeneous tasks, reconfigurable operating systems (ROSes) have been proposed to give a software look to hardware-based computation. The key requirements of a ROS include partitioning, task scheduling and allocation, task configuration or loading, and inter-task communication and synchronization. Existing ROSes have met these requirements to varied extents. However, they are limited in reliability, especially regarding the flexibility of placing the hardware circuits of tasks on the device's chip area, a problem arising mostly from the partitioning approaches used. Indeed, this problem is deeply rooted in the static nature of the on-chip inter-communication among tasks, hampering the flexibility of runtime task relocation for reliability. This thesis proposes enabling frameworks for reliable, available, real-time, efficient, secure, and high-performance reconfigurable computing by providing techniques and mechanisms for reliable runtime reconfiguration, and dynamic inter-circuit communication and synchronization for circuits on reconfigurable hardware. This work provides task configuration infrastructures for reliable reconfigurable computing. Key features, especially reliability-enabling functionalities that have been given little or no attention in the state of the art, are implemented. These features include internal register read and write for device diagnosis, a configuration operation abort mechanism, and tightly integrated selective-area scanning, which aims to optimize access to the device's reconfiguration port for both task loading and error mitigation. In addition, this thesis proposes a novel reliability-aware inter-task communication framework that exploits the availability of dedicated clocking infrastructures in a typical FPGA to provide inter-task communication and synchronization. The clock buffers and networks of an FPGA use dedicated routing resources, which are distinct from the general routing resources. As such, deploying these dedicated resources for communication sidesteps the restriction of static routes and allows a better relocation of circuits for reliability purposes.
For evaluation, a case study that uses a NASA/JPL spectrometer data processing application is employed to demonstrate the improved reliability brought about by the implemented configuration controller and the reliability-aware dynamic communication infrastructure. It is observed that a time saving of up to 74% can be achieved for selective-area error mitigation when compared to state-of-the-art vendor implementations. Moreover, an improvement in overall system reliability is observed when the proposed dynamic communication scheme is deployed in the data processing application. Finally, one area of reconfigurable computing that has received insufficient attention is security. Considering the nature of the applications that now turn to reconfigurable computing for accelerating compute-intensive processes, a high premium is placed on security, not only of the device but also of the applications, from loading to runtime execution. To address security concerns, a novel secure and efficient task configuration technique for task relocation is also investigated, providing configuration time savings of up to 32% or 83%, depending on the device, and resource usage savings in excess of 90% compared to the state of the art.

    A Metaheuristic Method for Fast Multi-Deck Legalization

    In the field of circuit design, decreasing the transistor size is getting harder and harder; hence, improving circuit performance is also becoming difficult. For better circuit performance, various technologies are being tried, and multi-deck standard cell technology is one of them. The standard cell methodology is a fundamental structure of EDA (Electronic Design Automation). Using the standard cell library, EDA tools can easily design and optimize the physical design of chips. In contrast to conventional standard cells, multi-deck standard cells occupy multiple rows on the chip. This multi-row occupation increases the complexity of the physical design for EDA tools. Thus, the legalization problem has become more challenging for multi-deck standard cells. Recently, various multi-deck legalization methods have been proposed because conventional single-deck legalization methods are not effective for multi-deck legalization. A state-of-the-art legalization method is based on quadratic programming with the linear complementarity problem (LCP). However, these previous approaches can only cover the double-deck case because of their runtime burden. In this thesis, we propose a fast and enhanced multi-deck standard cell legalization algorithm that can handle cases beyond double-deck standard cells. The proposed legalization method achieves the fastest runtime for the majority of the ICCAD 2017 Contest benchmarks [1] compared with the top-three results. (Department of Electrical Engineering)
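
    A minimal illustration of why multi-deck cells complicate legalization: a cell spanning several rows must be kept overlap-free in every row it touches. The row pitch and cell tuples below are invented for illustration and are not taken from the thesis.

        # Invented numbers, illustration only: overlap checking for cells that
        # span multiple placement rows ("decks").

        ROW_HEIGHT = 1.0   # assumed placement-row pitch

        def rows_touched(y, decks):
            """Row indices occupied by a cell placed at y spanning `decks` rows."""
            first = int(y // ROW_HEIGHT)
            return set(range(first, first + decks))

        def conflict(cell_a, cell_b):
            """Cells conflict if they share a row and their x-spans overlap."""
            xa, wa, ya, da = cell_a
            xb, wb, yb, db = cell_b
            share_row = rows_touched(ya, da) & rows_touched(yb, db)
            overlap_x = not (xa + wa <= xb or xb + wb <= xa)
            return bool(share_row) and overlap_x

        # A triple-deck cell covering rows {2,3,4} clashes with a single-deck cell in row 3.
        print(conflict((0.0, 2.0, 2.0, 3), (1.0, 1.0, 3.0, 1)))   # True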

    Layout automation in analog IC design with formalized and non-formalized expert knowledge

    After more than three decades of electronic design automation, most layouts for analog integrated circuits are still handcrafted in a laborious manual fashion today. In contrast to the highly automated synthesis tools in the digital domain (which cope with the quantitative difficulty of packing more and more components onto a single chip, a desire well known as More Moore), analog layout automation struggles with the many diverse and heavily correlated functional requirements that turn the analog design problem into a More than Moore challenge. Facing this qualitative complexity, seasoned layout engineers rely on their comprehensive expert knowledge to consider all design constraints that uncompromisingly need to be satisfied. This usually involves both formally specified and non-formally communicated pieces of expert knowledge, which entails an explicit and an implicit consideration of design constraints, respectively. Existing automation approaches can basically be divided into optimization algorithms (where constraint consideration occurs explicitly) and procedural generators (where constraints can only be taken into account implicitly). As investigated in this thesis, these two automation strategies follow two fundamentally different paradigms, denoted as top-down automation and bottom-up automation. The major trait of top-down automation is that it requires a thorough formalization of the problem to enable self-intelligent solution finding, whereas a bottom-up automatism, controlled by parameters, merely reproduces solutions that have been preconceived by a layout expert in advance. Since the strengths of one paradigm may compensate for the weaknesses of the other, it is assumed that a combination of both paradigms, called bottom-up meets top-down, has much more potential to tackle the analog design problem in its entirety than either optimization-based or generator-based approaches alone. Against this background, the thesis at hand presents Self-organized Wiring and Arrangement of Responsive Modules (SWARM), an interdisciplinary methodology that addresses the design problem with a decentralized multi-agent system. Its basic principle, similar to the roundup of a sheep herd, is to let responsive mobile layout modules (implemented as context-aware procedural generators) interact with each other inside a user-defined layout zone. Each module is allowed to autonomously move, rotate, and deform itself, while a supervising control organ successively tightens the layout zone to steer the interaction towards increasingly compact (and constraint-compliant) layout arrangements. Considering various principles of self-organization and incorporating ideas from existing decentralized systems, SWARM is able to evoke the phenomenon of emergence: although each module only has a limited viewpoint and selfishly pursues its personal objectives, remarkable overall solutions can emerge on the global scale. Several examples exhibit this emergent behavior in SWARM, and it is particularly interesting that even optimal solutions can arise from the module interaction. Further examples demonstrate SWARM's suitability for floorplanning purposes and its application to practical place-and-route problems. The latter illustrates how the interacting modules take care of their respective design requirements implicitly (i.e., bottom-up) while simultaneously paying respect to high-level constraints (such as the layout outline imposed top-down by the supervising control organ).
Experimental results show that SWARM can outperform optimization algorithms and procedural generators both in terms of layout quality and design productivity. From an academic point of view, SWARM's grand achievement is to tap fertile virgin soil for future works on novel bottom-up meets top-down automatisms. These may one day be the key to closing the automation gap in analog layout design.
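
    The shrinking-zone principle can be illustrated with a toy interaction loop: rectangles nudge themselves out of overlaps while a supervising loop successively tightens the zone. This is not the SWARM implementation, only a sketch of the basic mechanism, with invented module sizes and shrink rate.

        # Toy illustration of the shrinking-zone idea; module sizes, step sizes,
        # and the shrink factor are arbitrary assumptions.

        import random

        def step(modules, zone_w, zone_h):
            for i, a in enumerate(modules):
                # Selfish move: drift away from the first overlapping neighbour.
                for j, b in enumerate(modules):
                    if i != j and (abs(a["x"] - b["x"]) < (a["w"] + b["w"]) / 2
                                   and abs(a["y"] - b["y"]) < (a["h"] + b["h"]) / 2):
                        a["x"] += 0.1 if a["x"] >= b["x"] else -0.1
                        a["y"] += 0.1 if a["y"] >= b["y"] else -0.1
                        break
                # Stay inside the (shrinking) layout zone.
                a["x"] = min(max(a["x"], a["w"] / 2), zone_w - a["w"] / 2)
                a["y"] = min(max(a["y"], a["h"] / 2), zone_h - a["h"] / 2)

        random.seed(0)
        modules = [{"x": random.uniform(1, 9), "y": random.uniform(1, 9),
                    "w": 2.0, "h": 1.0} for _ in range(4)]
        zone_w = zone_h = 10.0
        for _ in range(200):                 # supervising control loop
            step(modules, zone_w, zone_h)
            zone_w *= 0.995                  # successively tighten the zone
            zone_h *= 0.995
        print(round(zone_w, 2), [(round(m["x"], 2), round(m["y"], 2)) for m in modules])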