2,889 research outputs found
Automated CNN pipeline generation for heterogeneous architectures
Heterogeneity is a vital feature in emerging processor chip designing. Asymmetric multicore-clusters such as high-performance cluster and power efficient cluster are common in modern edge devices. One example is Intel\u27s Alder Lake featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows organization of multi cores in form of multi-chip-modules, thus housing large number of cores in a processor. Interposer based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and reduced transmission latency and energy consumption of chiplet-chiplet interconnect.\ua0For Instance Intel\u27s XeHPC Ponte Vecchio package integrates multi-chip GPU organization along with HBM modules.Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at application level in order to leverage the new heterogeneous, high-performing and power-efficient features of underlying computing platforms. An important high-performance application paradigm is Convolution Neural Networks (CNN). CNNs are widely used in many practical applications. The pipelined parallel implementation of CNN is favored for inference on edge devices. In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that provides information on depth of pipeline, grouping of CNN layers into pipeline stages and mapping of pipeline stages onto computing units. We utilize simple compile-time hints which consists of workload information of individual CNN layers and performance hints of computing units.The proposed approach provides near optimal solution for a throughput maximizing pipeline. We model the problem as a design space exploration technique. We developed a time-efficient design space navigation through heuristics extracted from the knowledge of CNN structure and underlying computing platform. The proposed search scheme converges faster and utilizes real-time performance measurements as fitness values. The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms. Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that on average, ~80\% of the tested configurations are sub-optimal solutions.Another challenge is to reduce convergence time. The experiments show that proposed approach is 35x faster than stochastic optimization algorithms. Since the design space is large and complex, We show that the proposed scheme explores only ~0.1% of the total design space in case of large CNNs (having 50+ layers) and results in near-optimal solution
III-V-on-silicon photonic devices for optical communication and sensing
In the paper, we review our work on heterogeneous III-V-on-silicon photonic components and circuits for applications in optical communication and sensing. We elaborate on the integration strategy and describe a broad range of devices realized on this platform covering a wavelength range from 850 nm to 3.85 μm
Optimization of Discrete-parameter Multiprocessor Systems using a Novel Ergodic Interpolation Technique
Modern multi-core systems have a large number of design parameters, most of
which are discrete-valued, and this number is likely to keep increasing as chip
complexity rises. Further, the accurate evaluation of a potential design choice
is computationally expensive because it requires detailed cycle-accurate system
simulation. If the discrete parameter space can be embedded into a larger
continuous parameter space, then continuous space techniques can, in principle,
be applied to the system optimization problem. Such continuous space techniques
often scale well with the number of parameters.
We propose a novel technique for embedding the discrete parameter space into
an extended continuous space so that continuous space techniques can be applied
to the embedded problem using cycle accurate simulation for evaluating the
objective function. This embedding is implemented using simulation-based
ergodic interpolation, which, unlike spatial interpolation, produces the
interpolated value within a single simulation run irrespective of the number of
parameters. We have implemented this interpolation scheme in a cycle-based
system simulator. In a characterization study, we observe that the interpolated
performance curves are continuous, piece-wise smooth, and have low statistical
error. We use the ergodic interpolation-based approach to solve a large
multi-core design optimization problem with 31 design parameters. Our results
indicate that continuous space optimization using ergodic interpolation-based
embedding can be a viable approach for large multi-core design optimization
problems.Comment: A short version of this paper will be published in the proceedings of
IEEE MASCOTS 2015 conferenc
Impact of gate-level clustering on automated system partitioning of 3D-ICs
When partitioning gate-level netlists using graphs, it is beneficial to
cluster gates to reduce the order of the graph and preserve some
characteristics of the circuit that the partitioning might degrade. Gate
clustering is even more important for netlist partitioning targeting 3D system
integration. In this paper, we make the argument that the choice of clustering
method for 3D-ICs partitioning is not trivial and deserves careful
consideration. To support our claim, we implemented three clustering methods
that were used prior to partitioning two synthetic designs representing two
extremes of the circuits medium/long interconnect diversity spectrum.
Automatically partitioned netlists are then placed and routed in 3D to compare
the impact of clustering methods on several metrics. From our experiments, we
see that the clustering method indeed has a different impact depending on the
design considered and that a circuit-blind, universal partitioning method is
not the way to go, with wire-length savings of up to 31%, total power of up to
22%, and effective frequency of up to 15% compared to other methods.
Furthermore, we highlight that 3D-ICs open new opportunities to design systems
with a denser interconnect, drastically reducing the design utilization of
circuits that would not be considered viable in 2D.Comment: 8 pages, 6 figure
- …