
4 research outputs found

Design and Algorithms for Joint Optimization of Clock Gating and Flip-Flops (클럭 게이팅 및 플립 플롭 동시 최적화를 위한 설계 및 알고리즘)

Author: 양기용
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/02/2019

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. 김태환.

Abstract (translated from Korean): This thesis introduces techniques for optimizing the dynamic power of a chip at various design stages, from standard cells to placement. The study first investigates how data-driven (i.e., toggling-based) clock gating can be tightly integrated with flip-flop synthesis, which prior clock gating techniques never addressed. The key observation is that some internal components of a flip-flop cell can be reused to generate the clock gating enable signal. Based on this, we propose a newly optimized flip-flop wiring structure called eXOR-FF. In this structure, internal logic is reused in every clock cycle to decide, through clock gating, whether to activate or deactivate the flip-flop. Saving area on every pair of a flip-flop and its toggling detection logic yields savings in both leakage and dynamic power. We then propose a comprehensive methodology for placement/timing-aware clock gating exploration that offers two unique strengths. The methodology is best suited to maximizing the benefit of eXOR-FF, and it performs a precise analysis of the decomposition of power consumption and timing impact and translates it into cost functions in the core engine of clock gating exploration. Experiments with benchmark circuits from ISCAS89, ITC89, ITC99, and IWLS 2005 show that the proposed method reduces total power by 5.6% and area by 5.3% compared with the previous data-driven clock gating scheme.

Abstract (English): In this paper, we introduce dynamic power optimization techniques applicable to various design stages, from the standard cell to the placement stage. This work first investigates how data-driven (i.e., toggling-based) clock gating can be closely integrated with the synthesis of flip-flops, a problem that has not been addressed in prior clock gating work. Our key observation is that some internal parts of a flip-flop cell can be reused to generate its clock gating enable signal. Based on this, we propose a newly optimized flip-flop wiring structure, called eXOR-FF, in which internal logic is reused in every clock cycle to decide whether the flip-flop is to be activated or deactivated through clock gating, thereby saving area (and thus leakage as well as dynamic power) on every pair of a flip-flop and its toggling detection logic. Then, we propose a comprehensive methodology of placement/timing-aware clock gating exploration that provides two unique strengths: it is best suited for maximally exploiting the benefit of eXOR-FFs, and it performs precise analyses of the decomposition of power consumption and timing impact, translating them into cost functions in the core engine of clock gating exploration. Through experiments with benchmark circuits in ISCAS89, ITC89, ITC99 and IWLS 2005, it is shown that our proposed method is able to reduce the total power by 5.6% and the total cell area by 5.3% compared with the previous data-driven clock gating method in [1].

Table of contents:
Abstract
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Power Consumption in CMOS Digital Design
  1.2 Low Power Design Methodologies
  1.3 Contribution of This Thesis
2 Preliminary and Motivations
  2.1 Background
  2.2 Observation on Area and Power Saving
  2.3 Observation on Timing Impact
3 Redesign of Flip-flops Specialized for Clock Gating
  3.1 Observation on Area Impact
4 Placement-aware Clock Gating Methodology Utilizing eXOR-FF Cells
  4.1 Overall Design Flow
  4.2 Cost Formulation for Conventional Clock Gating
  4.3 Cost Formulation for Our Clock Gating using eXOR-FFs
5 Experiments
  5.1 Experimental Setup
  5.2 Experimental Results
  5.3 Comparing with Industry Algorithm
6 Conclusion
Abstract (In Korean)
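
The data-driven (toggling-based) clock gating that this abstract builds on can be illustrated with a small behavioral simulation. This is only a sketch under stated assumptions: the function name `simulate_data_driven_gating` and the input stream are hypothetical, and the model captures the generic idea of deriving the enable signal from an XOR of the flip-flop's input D and state Q, not the eXOR-FF circuit itself.

```python
from typing import List, Tuple


def simulate_data_driven_gating(d_inputs: List[int], q_init: int = 0) -> Tuple[List[int], int]:
    """Simulate one flip-flop whose clock is gated unless D differs from Q.

    Returns the Q value after each cycle and the number of clock pulses
    that actually reached the flip-flop.
    """
    q = q_init
    q_trace = []
    clock_pulses = 0
    for d in d_inputs:
        enable = d ^ q  # toggle detection: 1 only when the state would change
        if enable:
            q = d  # clock pulse delivered, flip-flop captures D
            clock_pulses += 1
        q_trace.append(q)
    return q_trace, clock_pulses


# Hypothetical input stream, one bit per clock cycle.
d_stream = [0, 0, 1, 1, 1, 0, 0, 1]
q_trace, pulses = simulate_data_driven_gating(d_stream)
print(q_trace, pulses)
```

In this toy run the flip-flop still follows its input exactly while receiving a clock pulse only in the cycles where the input differs from the stored state, which is the source of the dynamic power saving.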

SNU Open Repository and Archive

Synthesis of Clock Gating Based on Accurate and Learning-Based Power Analysis (정확하고 학습 기반 전력 분석을 기반으로 하는 클록 게이팅의 합성)

Author: 박소라
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/02/2023

Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2023. 2. 김태환.

Abstract (English): In this paper, we introduce two techniques to efficiently apply clock gating in the synthesis stage. First, we propose a new clock gating methodology based on a precise power saving analysis to overcome the ineffectiveness of the conventional logic structure based clock gating. Two new features exploited in our proposed clock gating are (i) the multiplexer selection signal probability that a flip-flop with a multiplexer feedback loop receives a new input and (ii) the joint probability of selection signals that two flip-flops with different multiplexer selection signals both receive new inputs at the same clock cycle. In summary, our method reduces the total power consumption by 2.46% on average (up to 5.00%) over the conventional clock gating method. In the second work, we address a new problem of transforming the long toggling/untoggling sequences of flip-flops' cycle-accurate activities into short embedding vectors, so that the flip-flop grouping for clock gating is practically feasible in terms of the memory usage and run time for checking activity similarity among flip-flops. To this end, we propose a machine learning based generation of embedding vectors which are accurate enough to predict the original flip-flop toggling sequences. Precisely, we develop a neural network model of an LSTM (long short-term memory) based AE (autoencoder) combined with an SDAE (stacked denoising autoencoder) to take into account the time-series (i.e., clock cycle) similarity feature among the toggling sequences, which is essential to determine which flip-flops should be grouped together for clock gating. By integrating (1) our LSTM based embedding vector generation model, we propose two additional ML models for clock gating: (2) a joint state probability predictor (JSP) model for generating the 0-state probability of two embedding vectors, and (3) a joint feature predictor (JFP) model for generating a new embedding vector that combines two embedding vectors. Through experiments, it is confirmed that our proposed LSTM combined with AutoEnc improves the toggling sequence prediction accuracy up to 0.88, whereas an LSTM based AE model reaches an accuracy of 0.72, thereby enabling our ML based clock gating framework to save more dynamic power than the state-of-the-art commercial clock gating tool, which relies on the flip-flops' toggling probability for grouping flip-flops. Through experiments with benchmark circuits in IWLS, it is shown that our method is able to reduce the dynamic power by 14.0% on average over that of the conventional toggling-driven clock gating.

Abstract (translated from Korean): This thesis introduces two techniques for applying clock gating efficiently at the synthesis stage. First, to overcome the inefficiency of conventional logic structure based clock gating, we propose a new clock gating methodology based on precise power saving analysis. The two new features exploited in the proposed method are (i) the multiplexer selection signal probability of a flip-flop with a feedback loop and (ii) the joint probability of the multiplexer selection signals of two flip-flops with different multiplexer selection signals. We aim to reduce total dynamic power by applying clock gating only where there is a power gain and by merging different clock gating groups. Experiments confirm that the method reduces total power consumption by 2.46% on average (up to 5.00%) compared with the conventional clock gating method. Second, we address the problem of transforming the long toggling/untoggling sequences that represent flip-flops' per-clock-cycle states into short embedding vectors. Applying this to flip-flop grouping for toggling-based clock gating makes checking state similarity among flip-flops practically feasible in terms of memory usage and run time. To this end, we propose machine learning based generation of low-dimensional embedding vectors that are accurate enough to predict the original flip-flop toggling sequences. To account for time-series similarity among toggling sequences, we develop a neural network model that uses a denoising autoencoder to compress toggling sequences of 5000 clock cycles into 10 dimensions and feeds the result into a long short-term memory (LSTM) autoencoder to generate a low-dimensional embedding vector representing the whole sequence. We also propose two additional neural network models for clock gating: (1) a joint probability prediction model for generating the 0-state probability of two embedding vectors and (2) a joint feature prediction model that predicts a new embedding vector by combining two embedding vectors. Through experiments with IWLS benchmark circuits, we confirm that combining the LSTM-based autoencoder yields better input reconstruction accuracy than using the denoising autoencoder alone. We also confirm that our method reduces dynamic power by 14.0% on average compared with conventional toggling-based clock gating.

Table of contents:
1 Selective Clock Gating Based on Comprehensive Power Saving Analysis
  1.1 Introduction
  1.2 Preliminary and Motivation
  1.3 Selective Clock Gating
    1.3.1 Concept of Selective Clock Gating
    1.3.2 Joint probability of selection signals
  1.4 Experimental Results
    1.4.1 Experimental Setup
    1.4.2 Experimental Result
  1.5 Conclusion
2 Machine Learning Based Flip-Flop Grouping for Toggling Driven Clock Gating
  2.1 Introduction
  2.2 Preliminaries and Prior Works
    2.2.1 Preliminary and Motivation
    2.2.2 Prior Works
  2.3 Machine Learning Based Clock Gating Framework
    2.3.1 Primary Model: Embedding Vector Generation
    2.3.2 Secondary Models: Joint State Probability and Joint Feature Prediction
    2.3.3 Distance Analysis Between Embedding Vectors
    2.3.4 Power Analysis Model
    2.3.5 Overall Flow of Flip-flop Grouping
  2.4 Experimental Results
    2.4.1 Comparison of Dynamic Power Saving
    2.4.2 Performance of Auto-encoder Reconstruction Model
  2.5 Conclusion
Abstract (In Korean)
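
The two probabilities named in this abstract, the selection signal probability of one flip-flop and the joint probability for two flip-flops, can be estimated from per-cycle traces. The sketch below assumes hypothetical 0/1 traces of the multiplexer selection signals and hypothetical function names. The last function uses plain inclusion-exclusion to estimate how often a clock shared by both flip-flops could stay gated, which is an illustration and not the power model of the thesis.

```python
from typing import Sequence


def selection_probability(sel: Sequence[int]) -> float:
    """Fraction of cycles in which the flip-flop loads a new input."""
    return sum(sel) / len(sel)


def joint_selection_probability(sel_a: Sequence[int], sel_b: Sequence[int]) -> float:
    """Fraction of cycles in which both flip-flops load a new input."""
    both = sum(1 for a, b in zip(sel_a, sel_b) if a and b)
    return both / len(sel_a)


def shared_gate_idle_probability(sel_a: Sequence[int], sel_b: Sequence[int]) -> float:
    """Fraction of cycles in which neither flip-flop loads, i.e. 1 - P(a or b)."""
    p_a = selection_probability(sel_a)
    p_b = selection_probability(sel_b)
    p_ab = joint_selection_probability(sel_a, sel_b)
    return 1.0 - (p_a + p_b - p_ab)


# Hypothetical selection-signal traces, one bit per clock cycle.
sel_a = [1, 0, 0, 1, 0, 0, 0, 1]
sel_b = [1, 0, 1, 0, 0, 0, 0, 1]
p_joint = joint_selection_probability(sel_a, sel_b)
p_idle = shared_gate_idle_probability(sel_a, sel_b)
print(p_joint, p_idle)
```

The more the two selection signals coincide, the larger the joint probability and the more cycles a merged gating group can leave the clock off, which is why the joint term matters when deciding whether to merge groups.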

SNU Open Repository and Archive

Lottery Aware Sparsity Hunting: Enabling Federated Learning on Resource-Limited Edge

Author: Avestimehr Salman, Babakniya Sara, Kundu Souvik, Niu Yue, Prakash Saurav
Publication date: 24/10/2023

Edge devices can benefit remarkably from federated learning due to their distributed nature; however, their limited resources and computing power pose limitations in deployment. A possible solution to this problem is to utilize off-the-shelf sparse learning algorithms at the clients to meet their resource budget. However, such naive deployment in the clients causes significant accuracy degradation, especially for highly resource-constrained clients. In particular, our investigations reveal that the lack of consensus in the sparsity masks among the clients may potentially slow down the convergence of the global model and cause a substantial accuracy drop. With these observations, we present federated lottery aware sparsity hunting (FLASH), a unified sparse learning framework for training a sparse sub-model that maintains the performance under ultra-low parameter density while yielding proportional communication benefits. Moreover, given that different clients may have different resource budgets, we present hetero-FLASH, where clients can take different density budgets based on their device resource limitations instead of supporting only one target parameter density. Experimental analysis on diverse models and datasets shows the superiority of FLASH in closing the gap with an unpruned baseline while yielding up to ~10.1% improved accuracy with ~10.26× less communication, compared to existing alternatives, at similar hyperparameter settings. Code is available at https://github.com/SaraBabakN/flash_fl.

Comment: Accepted in TMLR, https://openreview.net/forum?id=iHyhdpsny
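
The "lack of consensus in the sparsity masks among the clients" mentioned in this abstract can be made concrete with a small measurement. The sketch below is an assumption-laden illustration: it uses simple magnitude-based top-k pruning as a stand-in for whatever sparse learning algorithm a client runs (the abstract does not specify FLASH's mask selection), and the function names and weight vectors are hypothetical.

```python
from typing import List, Set


def topk_mask(weights: List[float], density: float) -> Set[int]:
    """Indices of the largest-magnitude weights kept at the given density."""
    keep = max(1, round(density * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return set(ranked[:keep])


def mask_agreement(mask_a: Set[int], mask_b: Set[int]) -> float:
    """Jaccard overlap between two sparsity masks (1.0 means identical masks)."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


# Hypothetical flattened weights of the same layer on two clients.
client_1 = [0.9, -0.1, 0.05, -0.8, 0.02, 0.3]
client_2 = [0.1, -0.7, 0.04, -0.9, 0.6, 0.01]

m1 = topk_mask(client_1, density=0.5)
m2 = topk_mask(client_2, density=0.5)
agreement = mask_agreement(m1, m2)
print(sorted(m1), sorted(m2), agreement)
```

A low overlap means the clients update largely disjoint parameter subsets, so the server aggregates few parameters that more than one client actually trained.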

arXiv.org e-Print Archive

Cost-Effective Clock and Power Gating Design Methodology (비용 효율적인 클럭 및 파워 게이팅 설계 방법론)

Author: 현경환
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/02/2020

Thesis (Doctoral) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 2. 김태환.

Abstract (translated from Korean): Low power design is one of the most important factors in modern system-on-chip (SoC) design. This dissertation discusses low power design methodologies for reducing dynamic and static power consumption. Specifically, it proposes two new techniques for cost-effective low power design. First, it proposes a new clock gating method that reduces dynamic power consumption. Conventional clock gating based on flip-flop input data toggling is one of the most widely used clock gating techniques. However, it has a fundamental limitation: the additional circuitry required for clock gating grows sharply as the method is applied to more flip-flops. To overcome this limitation, the dissertation proposes a new clock gating method as follows. First, it analyzes the circuit resources required by conventional input data toggling based clock gating to show the method's inefficiency, and proposes a new clock gating method called "flip-flop state based clock gating" that completely eliminates the XOR gates that are essential to input data toggle detection in the conventional method but costly. Second, it presents the supporting circuitry for the proposed XOR-free clock gating method and shows through various timing analyses that the circuitry can be applied safely. Third, based on the flip-flop state profile of a circuit, it proposes a clock gating methodology that seamlessly integrates the proposed clock gating technique with conventional clock gating. Experimental results on several benchmark circuits show that, whereas conventional input data toggling based clock gating misses power saving opportunities, the proposed method is very effective in reducing power consumption while satisfying all timing constraints. Next, to reduce static power consumption, the dissertation proposes a method that resolves two important limitations of existing methods for allocating state retention storage in power gated circuits. The first limitation is the long wakeup delay caused by indiscriminate use of multi-bit state retention flip-flops, and the second is the inability to optimize state retention flip-flops that have a multiplexer feedback loop. In existing methods, a long wakeup delay was unavoidable in order to minimize the storage for state retention, and flip-flops with a feedback loop were treated as non-optimizable. However, flip-flops with a feedback loop generated from hardware description language (HDL) code are generally not few enough to be negligible. To resolve the first limitation, the dissertation shows that using multi-bit state retention flip-flops of at most 2 bits limits the wakeup delay to two clock cycles while still saving state retention storage efficiently. To overcome the second limitation, it proposes a two-phase state retention control scheme that can restore the state of a pair of flip-flops that includes a flip-flop with a feedback loop. It also proposes an algorithm based on the independent set problem for extracting the maximum number of flip-flop pairs that can coexist in a given circuit without conflict. Experimental results on benchmark circuits show that the proposed method is very effective in reducing the storage and power required for state retention while limiting the wakeup delay to two clock cycles.

Abstract (English): Low power design is of great importance in modern system-on-chips (SoCs). This dissertation studies low power design methodologies for saving dynamic and static power consumption. Precisely, we unveil two novel techniques of cost-effective low power design. Firstly, we propose a novel clock gating method for reducing the dynamic power consumption. Clock gating based on the toggling of flip-flops' input data is one of the most commonly used clock gating methods, but it has one critical and inherent limitation: the gating logic grows sharply as more flip-flops are involved in gating. In this dissertation, we propose a new clock gating method to overcome this limitation. Specifically, (1) we analyze the resources of gating logic in the input data toggling based clock gating, observe an ineffectiveness in resource utilization, and propose a new clock gating technique called flip-flop state driven clock gating, which completely eliminates the essential and expensive XOR gates for detecting input toggling of flip-flops; (2) we provide the supporting logic circuitry of our proposed XOR-free clock gating, confirming its safe applicability through a comprehensive timing analysis; (3) we propose, based on the flip-flops' state profile, a clock gating methodology that seamlessly combines our flip-flop state based clock gating with the toggling based clock gating. Through experiments with benchmark circuits, it is confirmed that our clock gating method is very effective in reducing power, capturing power saving opportunities that the toggling based clock gating would otherwise miss, while meeting all timing constraints. Secondly, for reducing the static power consumption, we solve two critical limitations of the conventional approaches to the allocation of state retention storage for power gated circuits. Those are (1) the long wakeup delay caused by the indiscriminate use of multi-bit retention flip-flops (MBRFFs) and (2) the inability to optimize retention flip-flops for the flip-flops with mux-feedback loop. It should be noted that the conventional approaches have regarded the long wakeup delay as an inevitable consequence of maximizing the reduction of total storage size for state retention, while they have treated the flip-flops with mux-feedback loop (called self-loop flip-flops) as non-optimizable components. In practice, however, the self-loop flip-flops synthesized from hardware description language (HDL) code are far from few and thus can in no way be neglected. More precisely, for solving (1), we show that using MBRFFs with up to two bits, which constrains the wakeup delay to no more than two clock cycles, is enough to maintain the high reduction of total retention storage. For solving (2), we devise a 2-phase retention control mechanism for a pair of flip-flops, one of which has a self-loop, by which just a single retention bit can be used to restore the state of the two flip-flops, and we propose an independent set based algorithm for maximally extracting the non-conflict pairs from circuits. Through experiments with benchmark circuits, it is shown that our proposed method is very effective in reducing the state retention storage and the power consumption compared with the existing best MBRFF allocation, while the wakeup delay is strictly limited to two clock cycles.

Table of contents:
1 INTRODUCTION
  1.1 Clock Gating
  1.2 Power Gating and State Retention
  1.3 Multi-bit Retention Registers
  1.4 Contributions of This Dissertation
2 FLIP-FLOP STATE DRIVEN CLOCK GATING: CONCEPT, DESIGN, AND METHODOLOGY
  2.1 Motivations
    2.1.1 Toggling based Clock Gating
    2.1.2 Area and Power by Clock Gating
  2.2 The Proposed Clock Gating
    2.2.1 Concept of Flip-flop State Driven Clock Gating
    2.2.2 Design of Gating Logic Circuitry
    2.2.3 Integrated Clock Gating Methodology
    2.2.4 Cost Formulation
  2.3 Experiments
    2.3.1 Experimental Setup
    2.3.2 Experimental Results
3 ALGORITHM AND DESIGN OPTIMIZATION OF ALLOCATING MULTI-BIT RETENTION FLIP-FLOPS FOR POWER GATED CIRCUITS
  3.1 Motivations
    3.1.1 Flip-flops with Mux-feedback Loop
    3.1.2 Impact of Wakeup Delay
  3.2 The Proposed Allocation Algorithm
  3.3 Design of Multi-Bit Retention Flip-Flop and Multi-Bit Extension
    3.3.1 Multi-Bit Retention Flip-Flop
    3.3.2 Multi-Bit Flip-Flop Extension
  3.4 Experiments
    3.4.1 Experimental Setup
    3.4.2 Experimental Results
4 CONCLUSIONS
  4.1 Flip-flop State Driven Clock Gating: Concept, Design, and Methodology
  4.2 Algorithm and Design Optimization of Allocating Multi-bit Retention Flip-flops for Power Gated Circuits
Abstract (In Korean)
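
The abstract's "independent set based algorithm for maximally extracting the non-conflict pairs" can be pictured with a generic conflict-graph heuristic. The sketch below is not the dissertation's algorithm: it assumes, for illustration only, that two candidate flip-flop pairs conflict when they share a flip-flop, and it uses a simple greedy minimum-degree rule, which yields a maximal but not necessarily maximum independent set. All names and the candidate list are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def build_conflicts(pairs: List[Pair]) -> Dict[Pair, Set[Pair]]:
    """Conflict graph: an edge joins two pairs that share a flip-flop (assumed rule)."""
    conflicts: Dict[Pair, Set[Pair]] = {p: set() for p in pairs}
    for i, p in enumerate(pairs):
        for q in pairs[i + 1:]:
            if set(p) & set(q):
                conflicts[p].add(q)
                conflicts[q].add(p)
    return conflicts


def greedy_independent_pairs(pairs: List[Pair]) -> List[Pair]:
    """Repeatedly pick the pair with the fewest remaining conflicts."""
    conflicts = build_conflicts(pairs)
    remaining = set(pairs)
    chosen: List[Pair] = []
    while remaining:
        # Tie-break on the pair itself so the result is deterministic.
        best = min(remaining, key=lambda p: (len(conflicts[p] & remaining), p))
        chosen.append(best)
        remaining -= conflicts[best] | {best}
    return chosen


# Hypothetical candidate pairs of flip-flops that could share one retention bit.
candidates = [("ff1", "ff2"), ("ff2", "ff3"), ("ff3", "ff4"), ("ff5", "ff6")]
selected = greedy_independent_pairs(candidates)
print(selected)
```

Here three of the four candidate pairs can coexist, because ("ff2", "ff3") overlaps with both of its neighbours and is the one dropped.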

SNU Open Repository and Archive