32 research outputs found
DLWUC: Distance and Load Weight Updated Clustering-Based Clock Distribution for SOC Architecture
High clock-skew variations and degraded buffer driving ability lead to additional power dissipation in the Clock Distribution Network (CDN), which increases the dimensionality of buffers and the coordination required among flip-flops. A manually set threshold level for predicting the Region of Interest (ROI) is not applicable in the clustering process because of the complexities of excessive wire length and critical delay. This paper proposes the Distance and Load Weight Updated Clustering (DLWUC) to determine the suitable position of logical components. Initially, the DLWUC utilizes the Hybrid Weighted Distance (HWD) to estimate the distance and construct the distance matrix. The weight value extracted from the sorted distance matrix facilitates the projection of buffers. The updated weight value serves as the base for clustering with labeled outputs. Placing each buffer at a suitable position obtained from the load weight updated clustering provides the necessary trade-off between clock provision and load balance. The DLWUC discussed in this paper reduces the size of buffers, skew, power, and latency compared to the existing topologies.
Recovering complete and draft population genomes from metagenome datasets.
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
Exploring Adaptive Implementation of On-Chip Networks
As technology geometries have shrunk to the deep submicron regime, the communication delay and power consumption of global interconnections in high performance Multi-Processor Systems-on-Chip (MPSoCs) are becoming a major bottleneck. The Network-on-Chip (NoC) architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication issues such as performance limitations of long interconnects and integration of a large number of Processing Elements (PEs) on a chip. The choice of routing protocol and NoC structure can have a significant impact on performance and power consumption in on-chip networks. In addition, building a high performance, area and energy efficient on-chip network for multicore architectures requires a novel on-chip router allowing a larger network to be integrated on a single die with reduced power consumption. On top of that, network interfaces are employed to decouple computation resources from communication resources, to provide the synchronization between them, and to achieve backward compatibility with existing IP cores.
Three adaptive routing algorithms are presented as a part of this thesis. The first presented routing protocol is a congestion-aware adaptive routing algorithm for 2D mesh NoCs which does not support multicast (one-to-many) traffic while the other two protocols are adaptive routing models supporting both unicast (one-to-one) and multicast traffic. A streamlined on-chip router architecture is also presented for avoiding congested areas in 2D mesh NoCs via employing efficient input and output selection. The output selection utilizes an adaptive routing algorithm based on the congestion condition of neighboring routers while the input selection allows packets to be serviced from each input port according to its congestion level. Moreover, in order to increase memory parallelism and bring compatibility with existing IP cores in network-based multiprocessor architectures, adaptive network interface architectures are presented to use multiple SDRAMs which can be accessed simultaneously. In addition, a smart memory controller is integrated in the adaptive network interface to improve the memory utilization and reduce both memory and network latencies.
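The congestion-based output selection described above can be illustrated with a minimal sketch. It is not the thesis's router: the port names, the `congestion` dictionary (imagined as buffer occupancy reported by neighboring routers), and the tie-breaking rule are all assumptions made for illustration, and deadlock avoidance (for example via virtual channels or a turn model) is deliberately left out.

```python
def minimal_directions(cur, dst):
    """Output ports that move a packet closer to dst on a 2D mesh."""
    (x, y), (dx, dy) = cur, dst
    dirs = []
    if dx > x:
        dirs.append("E")
    if dx < x:
        dirs.append("W")
    if dy > y:
        dirs.append("N")
    if dy < y:
        dirs.append("S")
    return dirs


def select_output(cur, dst, congestion):
    """Pick the least congested port among the minimal-path candidates.

    congestion maps a port name to a hypothetical congestion value
    (e.g. occupied buffer slots at the neighbor); missing ports count as 0.
    """
    candidates = minimal_directions(cur, dst)
    if not candidates:
        return "LOCAL"  # packet has arrived; deliver to the local core
    # min() keeps the first candidate on ties, so ties favor the X dimension.
    return min(candidates, key=lambda d: congestion.get(d, 0))


if __name__ == "__main__":
    print(select_output((1, 1), (3, 3), {"E": 5, "N": 2}))
```

Restricting the choice to minimal directions keeps every path shortest while still letting the packet steer around a congested neighbor whenever two productive ports exist.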
Three Dimensional Integrated Circuits (3D ICs) have been emerging as a viable candidate to achieve better performance and package density as compared to traditional 2D ICs. In addition, combining the benefits of 3D IC and NoC schemes provides a significant performance gain for 3D architectures. In recent years, inter-layer communication across multiple stacked layers (vertical channel) has attracted a lot of interest. In this thesis, a novel adaptive pipeline bus structure is proposed for inter-layer communication to improve the performance by reducing the delay and complexity of traditional bus arbitration. In addition, two mesh-based topologies for 3D architectures are also introduced to mitigate the inter-layer footprint and power dissipation on each layer with a small performance penalty.
Facility Location and Clock Tree Synthesis
The construction of clock trees and repeater trees are major challenges in chip design. Such trees distribute an electrical clock signal from a source to a set of sinks on a chip. On recent designs there can be millions of repeater trees with only a few up to some hundred sinks and several clock trees with up to some hundred thousand of sinks. In repeater trees the signal has to arrive at each sink not later than an individual required arrival time, while in clock trees it has to arrive at each sink within an individual required arrival time window. In this thesis, we present new theory and algorithms for the construction of clock trees and repeater trees and an essential sub-problem, the Sink Clustering Problem. We also describe our clock tree construction tool BonnClock, which has been used by IBM Microelectronics for the design of hundreds of most complex chips. First, we introduce the Sink Clustering Problem, the main sub-problem of clock tree design. Given a metric space (V,c), a finite set D of terminals with positions p(v) ∈ V and demands d(v) ∈ R ≥ 0 for all v ∈ D, a facility opening cost f ∈ R>0 and a load limit u ∈ R>0 , the task is to find a partition D=D1 ∪ ... ∪ Dk of D and, for all 1 ≤ i ≤ k, a Steiner tree Si for {p(v)| v ∈ Di }. Each cluster (Di ,Si ), 1 ≤ i ≤ k, has to keep the load limit, that means ∑e ∈ E(Si) c(e) +∑s ∈ Di d(s) ≤ u. The goal is to minimize the weighted sum of the length of all Steiner trees plus the number of clusters, i.e. minimize ∑i=1,...,k (∑e ∈ E(Si ) c(e)) +kf. We present the first constant-factor approximation algorithm for the Sink Clustering Problem. It is based on decomposing a minimum spanning tree on the sinks and has an approximation guarantee of 1+2α, where α is the Steiner ratio of the underlying metric. Moreover, we introduce two variants of the algorithm that rely on decomposing an approximate minimum Steiner tree and an approximate minimum traveling salesman tour. 
These algorithms have approximation guarantees of 3β and 3γ, respectively, where β and γ are the approximation guarantees of the Steiner tree and TSP approximation algorithms, respectively. We also propose two post-optimization algorithms that can further improve an existing clustering. We analyze the structure of the Sink Clustering Problem and exhibit its connections to matroid theory. In particular, we use the property of matroids that for any two bases B1 , B2 there is a bijection p : B1 → B2 so that (B1 \ {b}) ∪ {p(b)} is again a basis for each b ∈ B1. We replace each Steiner tree of an optimum solution by a minimum spanning tree and connect all trees to a new artificial vertex s and get a tree S. In a modified metric the total length of S is a good lower bound for the cost of an optimum solution. Due to the matroid property we can compare a minimum spanning tree T on D ∪ {s} with S; the length of any edge of T is bounded by the length of an edge of S. We introduce the concept of K-dominated functions that helps us to increase the `cost' of certain edges of T while still having the property that the total length of all edges of T ending in a vertex of K ⊆ D is bounded by the total length of all edges of S ending in a vertex of K. Applying this procedure to the sets of a laminar family on D yields an improved lower bound. The bound can be further improved by combining it with a lower bound for the length of a minimum Steiner tree on D. For this bound we prove the following lemma: For any family of trees T = {T1 ,..., Tk } with V(Ti ) ⊂ D, 1 ≤ i ≤ k, with the property that for any subset T' ⊆ T the trees in T' cover at least | T' |+1 vertices, there exists an edge ei ∈ E(Ti ) for i=1,..., k such that these edges E={ei | 1 ≤ i ≤ k} form a forest, i.e. the set does not contain an edge twice and it does not contain a circuit. 
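To make the Sink Clustering Problem defined above concrete, the following sketch evaluates a given clustering: it checks the load limit u for every cluster and computes the objective (total tree length plus k·f). It is only an evaluator, not one of the approximation algorithms of the thesis. As assumptions, the metric is taken to be the Manhattan (L1) plane and each Steiner tree Si is replaced by a minimum spanning tree on the cluster's sinks, which is a valid but possibly longer tree; all function names are hypothetical.

```python
def manhattan(p, q):
    """L1 distance between two points in the plane (assumed metric c)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def mst_length(points):
    """Length of a minimum spanning tree on points (Prim's algorithm).

    Used here as a stand-in for the Steiner tree S_i of a cluster.
    """
    if len(points) < 2:
        return 0.0
    # best[i] = cheapest connection from point i to the tree built so far
    best = {i: manhattan(points[0], points[i]) for i in range(1, len(points))}
    total = 0.0
    while best:
        j = min(best, key=best.get)
        total += best.pop(j)
        for i in best:
            best[i] = min(best[i], manhattan(points[j], points[i]))
    return total


def cluster_load(cluster):
    """Tree length plus the sum of demands d(s) for a list of (position, demand)."""
    return mst_length([p for p, _ in cluster]) + sum(d for _, d in cluster)


def clustering_cost(clusters, f, u):
    """Objective value: sum of tree lengths + k * f; raises if a load exceeds u."""
    total_length = 0.0
    for cluster in clusters:
        if cluster_load(cluster) > u:
            raise ValueError("cluster violates the load limit u")
        total_length += mst_length([p for p, _ in cluster])
    return total_length + len(clusters) * f


if __name__ == "__main__":
    s0, s1, s2 = ((0, 0), 1.0), ((0, 2), 1.0), ((10, 0), 1.0)
    print(clustering_cost([[s0, s1], [s2]], f=3.0, u=5.0))
```

In the small example, putting all three sinks into one cluster would save one facility cost f but exceeds the load limit, which is exactly the trade-off the problem formalizes.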
Our experimental results on real-world instances from clock tree design show that the cost of the solutions computed by our algorithms is on average only 10% over the best lower bound. Moreover, we compare our algorithm to another clustering algorithm used in industry. The results show that the total cost of our solutions is 10% less than the cost of the solutions computed by the competing tool. Clock trees have to satisfy several timing constraints. More precisely, the signal has to reach each sink within an individual required arrival time window. Sinks can only be clustered together if their required arrival time windows have a point of time in common. Typically, all required arrival time windows are the same. In this case we have the Sink Clustering Problem defined above. However, there are clock trees where the sinks have different required arrival time windows. This motivates a generalization of the Sink Clustering Problem where each sink additionally has an individual time window. As a further constraint, the time windows of the sinks of a cluster must have at least one point of time in common. We study the Sink Clustering Problem with Time Windows and present a polynomial O(log s)-approximation algorithm for this problem, where s is the size of a minimum clique partition in the interval graph induced by the time windows. Our algorithm is based on a divide and conquer approach and uses the approximation algorithms for the Sink Clustering Problem on subsets of the instance. We show that the approximation guarantee of the algorithm is tight. For the practical construction of clock trees we present our algorithm BonnClock. BonnClock builds a clock tree combining a bottom-up clustering and a top-down partitioning strategy. In the bottom-up phase BonnClock uses the Sink Clustering Algorithm in order to determine the drivers of unconnected sinks or inverters. 
The `global' topology of the tree is determined by the top-down partitioning considering big blockages and timing restrictions. BonnClock uses a dynamic program in order to determine the sizes of the inverters that are inserted. All components of the algorithm are discussed in detail. As part of this thesis, we have also implemented this algorithm. BonnClock has become the standard tool to construct clock trees within IBM. We show experimental results with comparisons to another industrial clock tree construction tool and to lower bounds for the power consumption. It turns out that - mainly due to the Sink Clustering Algorithm - our power consumption is much smaller than with the other tool and only one third over the lower bound. Finally, we consider the repeater tree construction problem. In contrast to clock trees, each sink has a latest required arrival time instead of a time window. We describe a simple algorithm to build such trees where we insert the sinks one by one into an existing tree. Depending on the optimization goal we show a variant of the algorithm computing trees of almost optimal length or trees with guaranteed best possible performance. Moreover, we analyze the topology of trees with best or almost best performance more closely. Such trees are equivalent to minimax and almost minimax trees: Let a1 , ... , an ∈ N ≥ 0 be a set of numbers. The weight of a tree with n leaves is the maximum over all leaves i of the depth of leaf i plus ai. For a non-negative integral constant c the goal is to build a binary tree with weight at most the optimum weight plus c. This problem can be solved optimally by a greedy algorithm. However, we are interested in the online version of this problem where we have to insert the leaf i with weight ai into the tree without knowing n and the following weights aj, j> i. We give necessary and sufficient conditions for an online algorithm to compute trees of weight at most the optimum weight plus c. 
Moreover, we show how these conditions can be verified efficiently. We obtain an online algorithm that computes an optimum tree in O(n log n) time. Finally, we study a further mathematical model of repeater trees that considers that additional delay caused by a bifurcation of a tree can be distributed partially to the two branches. For c ∈ R>0 and a set L ⊆ {(l1, l2) ∈ (R≥0)^2 | l1 + l2 = c} of two-element sets of non-negative real numbers we consider rooted binary trees with the property that the two edges emanating from every non-leaf are assigned lengths l1 and l2 with {l1, l2} ∈ L. We study the asymptotic growth of the maximum number of leaves of bounded depths in such trees and the existence of such trees with leaves at individually specified maximum depths. Our results yield better lower bounds for repeater trees.
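The offline minimax tree problem mentioned above (minimize the maximum over all leaves i of depth plus ai) can be illustrated with one well-known greedy rule: repeatedly merge the two smallest weights x and y into a new node of weight max(x, y) + 1. The sketch below computes only the optimum weight for the case c = 0; whether this is the exact greedy variant used in the thesis is an assumption, and the online algorithm of the thesis is not reproduced here.

```python
import heapq


def minimax_tree_weight(a):
    """Optimum weight of a binary tree with leaf weights a (offline greedy).

    Merging two subtrees of weights x and y under a common parent pushes
    every leaf one level deeper, so the merged subtree weighs max(x, y) + 1.
    Always merging the two smallest weights yields the optimum weight.
    """
    if not a:
        raise ValueError("at least one leaf weight is required")
    heap = list(a)
    heapq.heapify(heap)
    while len(heap) > 1:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        heapq.heappush(heap, max(x, y) + 1)
    return heap[0]


if __name__ == "__main__":
    print(minimax_tree_weight([3, 0, 0]))
```

With a binary heap the greedy runs in O(n log n) time for n leaves. In the repeater tree setting, ai can be read as how urgent sink i is, so a heavy leaf ends up close to the root.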
Methods for Epigenetic Analyses from Long-Read Sequencing Data
Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease.
DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity.
Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have evolved the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads.
With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to another.
Basecalling and methylation calling of Nanopore reads is a computationally expensive task which requires complex machine learning architectures.
Read-level methylation calls require different approaches to data management and analysis than ones developed for methylation frequencies measured from short-read technologies or array data.
Read- and genome-associated DNA methylation calls are two-dimensional and, together with methylation caller uncertainties, are much more costly to store than one-dimensional methylation frequencies.
Methods for storage, retrieval, and analysis of such data therefore require careful consideration.
Downstream analysis tasks, such as methylation segmentation or differential methylation calling, have the potential of benefiting from read information and allow uncertainty propagation.
These avenues had not been considered in existing tools.
In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state of the art software architecture and machine learning methods.
I defined a storage standard for reference anchored and read assigned DNA methylation calls, including methylation calling uncertainties and read annotations such as haplotype or sample information.
This storage container is defined as a schema for the hierarchical data format version 5, includes an index for rapid access to genomic coordinates, and is optimized for parallel computing with even load balancing.
It further includes a Python API for creation, modification, and data access, including convenience functions for the extraction of important quality statistics via a command line interface.
Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing.
This implementation takes advantage of the performance benefits provided by my high performance storage container.
It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample and/or haplotype assigned DNA methylation profiles, while considering methylation calling uncertainties.
Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction.
I benchmarked all tools on both simulated and publicly available real data, and show the performance benefits compared to previously existing and concurrently developed solutions.
Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma.
I here report regulatory genomic regions differentially methylated before and after treatment, allele-specific methylation in the tumor, as well as methylation on chromothriptic structures.
Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation.
These callers can be used to measure chromatin accessibility in a NOMe-seq like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding.
In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing.
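The difference between two-dimensional read-level calls and one-dimensional methylation frequencies discussed above can be sketched as follows. This is not the storage container or API of the thesis. As assumptions, each call is a (read name, genomic position, log-likelihood ratio) tuple where a positive ratio means "methylated", and the confidence threshold of 2.0 is a hypothetical value chosen for illustration.

```python
from collections import defaultdict


def site_frequencies(calls, llr_threshold=2.0):
    """Collapse read-level calls into per-site methylation frequencies.

    calls: iterable of (read_id, position, llr) tuples; llr > 0 is taken to
    mean methylated (assumed convention). Calls with |llr| below the
    threshold are treated as uncertain and dropped.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated, confident]
    for _read_id, pos, llr in calls:
        if abs(llr) < llr_threshold:
            continue
        counts[pos][1] += 1
        if llr > 0:
            counts[pos][0] += 1
    return {pos: meth / total for pos, (meth, total) in counts.items()}


if __name__ == "__main__":
    calls = [
        ("read1", 100, 3.5),
        ("read2", 100, -4.0),
        ("read3", 100, 0.5),  # uncertain call, dropped by the threshold
        ("read1", 105, 2.5),
    ]
    print(site_frequencies(calls))
```

The aggregation discards which read each call came from and how certain it was, which is the information that read-aware segmentation, allele-specific analysis, and uncertainty propagation rely on.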
Adaptive Routing Approaches for Networked Many-Core Systems
Through advances in technology, System-on-Chip design is moving towards integrating tens to hundreds of intellectual property blocks into a single chip. In such a many-core system, on-chip communication becomes a performance bottleneck for high performance designs. Network-on-Chip (NoC) has emerged as a viable solution for the communication challenges in highly complex chips. The NoC architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication challenges such as wiring complexity, communication latency, and bandwidth. Furthermore, the combined benefits of 3D IC and NoC schemes provide the possibility of designing a high performance system in a limited chip area. The major advantages of 3D NoCs are the considerable reductions in average latency and power consumption.
There are several factors degrading the performance of NoCs. In this thesis, we investigate three main performance-limiting factors: network congestion, faults, and the lack of efficient multicast support. We address these issues by the means of routing algorithms.
Congestion of data packets may lead to increased network latency and power consumption. Thus, we propose three different approaches for alleviating such congestion in the network. The first approach is based on measuring the congestion information in different regions of the network, distributing the information over the network, and utilizing this information when making a routing decision. The second approach employs a learning method to dynamically find the less congested routes according to the underlying traffic. The third approach is based on a fuzzy-logic technique to perform better routing decisions when traffic information of different routes is available.
Faults affect performance significantly, because packets must then take longer paths in order to be routed around the faults, which in turn increases congestion around the faulty regions. We propose four methods to tolerate faults at the link and switch level by using only the shortest paths as long as such a path exists. The unique characteristic among these methods is the toleration of faults while also maintaining the performance of NoCs. To the best of our knowledge, these algorithms are the first approaches to bypassing faults prior to reaching them while avoiding unnecessary misrouting of packets.
Current implementations of multicast communication result in a significant performance loss for unicast traffic. This is due to the fact that the routing rules of multicast packets limit the adaptivity of unicast packets. We present an approach in which both unicast and multicast packets can be efficiently routed within the network. While suggesting a more efficient multicast support, the proposed approach does not affect the performance of unicast routing at all. In addition, in order to reduce the overall path length of multicast packets, we present several partitioning methods along with their analytical models for latency measurement. This approach is discussed in the context of 3D mesh networks.
Advanced Image Acquisition, Processing Techniques and Applications
"Advanced Image Acquisition, Processing Techniques and Applications" is the first book of a series that provides image processing principles and practical software implementation on a broad range of applications. The book integrates material from leading researchers on Applied Digital Image Acquisition and Processing. An important feature of the book is its emphasis on software tools and scientific computing in order to enhance results and arrive at problem solutions.
Machine Learning for AI-Augmented Design Space Exploration of Computer Systems
Advanced and emerging computer systems, ranging from supercomputers to embedded systems, feature high performance, energy efficiency, acceleration, and specialization. Design of such systems involves ever-increasing circuit complexity and architectural diversity. Commercial high-end processors, realized as very-large-scale integrated circuits, have integrated an exponentially increasing number of transistors on a chip over many decades. Along with the evolution of semiconductor manufacturing technology, another driving force behind the progress of processors has been the development of computer-aided design (CAD) software tools. Logic synthesis and physical design (LSPD) tool-chains allow designers to describe the computer system at the register-transfer level of abstraction and automatically convert the description into an integrated circuit layout. The slowdown of technology scaling, on the other hand, has motivated the emergence of dark silicon and heterogeneous architectures with application-specific hardware accelerators. Design of various accelerators is facilitated by high-level synthesis (HLS) tools that translate a behavioral description of a computer system into a structural register-transfer level one. CAD approaches have evolved towards raising the level of design abstraction and providing more options to optimize the architecture.
For each system synthesized via advanced CAD tools, designers explore the design space in search of optimal configurations of the tool options and architectural choices, also called knobs. These knobs affect the execution of CAD algorithms and eventually impact the multi-dimensional quality of results (QoR) of the final implementation. During design-space exploration (DSE), designers leverage their experience and expertise in determining the relationship between knobs and QoR. To further reduce the number of time- and resource-consuming CAD runs during DSE, a large number of heuristic and model-based approaches have been proposed. More recently, the rise of machine learning (ML) and artificial intelligence (AI) has prompted the possibility of AI-augmented DSE which exploits ML techniques to predict the knobs-QoR relationship. Yet, existing heuristic and ML-based approaches still require a sufficient number of CAD runs for each system because they do not accumulate and exploit experiential knowledge across the systems as designers would do.
To expand the potential of AI-augmented DSE and push the frontier forward, multiple challenges arise due to the characteristics of CAD flows. 1) Whereas many ML applications utilize data obtained from huge collections of users' input and public databases for a single problem, the QoR-prediction problem for each system suffers from limited availability of data obtained from expensive CAD runs. Especially, an industrial LSPD tool-chain specifies hundreds of separate knobs, resulting in an extreme curse of dimensionality. 2) Different systems exhibit different knobs-QoR relationship. Hence, learning from previously explored systems needs to be preceded by identifying distinct systems and relating them to one another. Often, it is difficult to obtain an efficient representation of a system. 3) Designers often apply different sets of knob configurations to different systems, which makes it harder to learn from previous DSE results. Especially in HLS, the heterogeneity of various systems leads to broad knob heterogeneity across them. To address these challenges and boost the ML performance, I propose to flexibly connect the elements of the many QoR-prediction problems with one another. My thesis is that .
For LSPD of industrial high-performance processors, I propose a novel collaborative recommender system approach that learns hidden features from the interactions (CAD runs) of many users (systems) and items (knob configurations). To cope with the curse of dimensionality, the item features are decomposed into features of item attributes (knobs). The combined model predicts QoR for each user-item pair. For HLS of application-specific accelerators, I present a series of neural network models in the order of evolution towards the proposed mixed-sharing transfer learning model. Transfer learning aims at leveraging knowledge gained from previous problems; however, due to the system and knob heterogeneities, the model needs to distinguish which piece of that knowledge should be transferred. The proposed ML approaches aim to not only use experiential knowledge as designers do but also to ultimately assist designers by providing alternative insights and suggesting optimization possibilities for new systems. As an effort in this direction, I develop an AI-augmented DSE tool that exploits the aforementioned models and generates recommended knob configurations for new target systems. Through this research, I investigate the potential of next-level AI-augmented DSE with the goal of promoting secure collaborative engineering in the CAD community without the need of sharing confidential information and intellectual property.
Adaptive Hybrid Switching Technique for Parallel Computing System
Parallel processing accelerates computations by solving a single problem using multiple compute nodes interconnected by a network. The scalability of a parallel system is limited by its ability to communicate and coordinate processing. Circuit switching, packet switching, and wormhole routing are dominant switching techniques. Our simulation results show that wormhole routing and circuit switching each excel under different types of traffic. This dissertation presents a hybrid switching technique that combines wormhole routing with circuit switching in a single switch using virtual channels and time division multiplexing. The performance of this hybrid switch is significantly impacted by the efficiency of traffic scheduling and thus, this dissertation also explores the design and scalability of hardware scheduling for the hybrid switch. In particular, we introduce two schedulers for crossbar networks: a greedy scheduler and an optimal scheduler that improves upon the results provided by the greedy scheduler. For the time division multiplexing portion of the hybrid switch, this dissertation presents three allocation methods that combine wormhole switching with predictive circuit switching. We further extend this research from crossbar networks to fat tree interconnected networks with virtual channels. The global "level-wise" scheduling algorithm is presented and improves network utilization by 30% when compared to a switch-level algorithm. The performance of the hybrid switching is evaluated on a cycle accurate simulation framework that is also part of this dissertation research. Our experimental results demonstrate that the hybrid switch is capable of transferring both predictable and unpredictable traffic successfully. By dynamically selecting the proper switching technique based on the type of communication traffic, the hybrid switch improves communication for most types of traffic.
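Why a greedy crossbar scheduler leaves room for an optimal one can be seen in a small sketch. This is not the dissertation's hardware scheduler: the request format (a list of hypothetical (input port, output port) pairs for one time slot) and the first-come grant order are assumptions made for illustration.

```python
def greedy_schedule(requests):
    """Grant requests in arrival order while input and output ports are free.

    requests: list of (input_port, output_port) pairs for one time slot.
    Returns the granted pairs; each port appears in at most one grant.
    """
    used_inputs, used_outputs, grants = set(), set(), []
    for inp, out in requests:
        if inp not in used_inputs and out not in used_outputs:
            grants.append((inp, out))
            used_inputs.add(inp)
            used_outputs.add(out)
    return grants


if __name__ == "__main__":
    # Greedy grants (0, 0) first, which blocks both remaining requests,
    # although granting (0, 1) and (1, 0) would serve two connections.
    print(greedy_schedule([(0, 0), (0, 1), (1, 0)]))
```

The greedy pass needs only one scan over the requests, which suits hardware, but as the example shows it can grant fewer connections than a maximum matching between inputs and outputs would.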