179 research outputs found

    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Automated code generation and performance-tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3–182× for a Xilinx Virtex-5 LX330T, 1.3–33× for an IBM Cell, and 3–131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
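    The configuration exploration the abstract describes can be sketched as a simple exhaustive tuner. The kernel, the configuration names (`unroll`, `vector_length`), and the timing loop below are illustrative assumptions, not the compiler's actual interface:

    ```python
    import itertools
    import time

    def tune(kernel, data, unroll_factors, vector_lengths):
        """Time each (unroll, vector_length) configuration of `kernel` on
        `data` and return the fastest one (hypothetical tuner loop)."""
        best_cfg, best_time = None, float("inf")
        for unroll, vlen in itertools.product(unroll_factors, vector_lengths):
            t0 = time.perf_counter()
            kernel(data, unroll=unroll, vector_length=vlen)
            dt = time.perf_counter() - t0
            if dt < best_time:
                best_cfg, best_time = (unroll, vlen), dt
        return best_cfg

    # Toy stand-in kernel: sums `data` in `unroll`-sized strides.
    # A real kernel would be generated code whose speed depends on the config.
    def toy_kernel(data, unroll, vector_length):
        total = 0.0
        for i in range(0, len(data), unroll):
            total += sum(data[i:i + unroll])
        return total

    cfg = tune(toy_kernel, list(range(1000)), [1, 2, 4, 8], [4, 8])
    ```

    In the real system the measured kernel would be the generated architecture-specific code, and the search space would be pruned rather than exhaustively enumerated.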

    Optimistic Parallelization of Floating-Point Accumulation

    Floating-point arithmetic is notoriously non-associative due to the limited-precision representation, which demands that intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device, where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16-PE design, we demonstrate an average speedup of 6× with randomly generated data and 3–7× with summations extracted from Conjugate Gradient benchmarks.
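    The non-associativity, and the "compute optimistically, then propagate errors until correct" idea, can be illustrated in a few lines. The distillation loop below uses the classic TwoSum error-free transformation; it is a software sketch of the error-propagation concept, not the paper's FPGA datapath:

    ```python
    def two_sum(a, b):
        """Error-free transformation: returns (s, err) with a + b == s + err exactly."""
        s = a + b
        bp = s - a
        err = (a - (s - bp)) + (b - bp)
        return s, err

    def sum_with_error_propagation(xs):
        """Sum xs, then iteratively re-sum the rounding errors until they vanish
        (or can no longer change the result). Sketch only: termination for all
        inputs needs more care than shown here."""
        xs = list(xs)
        while True:
            s, errs = 0.0, []
            for x in xs:
                s, e = two_sum(s, x)
                if e != 0.0:
                    errs.append(e)
            if not errs or (len(errs) == 1 and s + errs[0] == s):
                return s
            xs = errs + [s]

    # Naive left-to-right summation loses the small terms entirely...
    naive = sum([1e16, 1.0, 1.0, -1e16])        # 0.0
    # ...but propagating the rounding errors recovers the exact result.
    exact = sum_with_error_propagation([1e16, 1.0, 1.0, -1e16])   # 2.0
    ```

    The parallel-friendly part is that the first pass can be computed as independent partial sums (a tree reduction); the error terms then carry exactly the information the rounding discarded.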

    Pipelining Saturated Accumulation

    Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal-processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3 (XC3S5000-4) FPGA, the maximum frequency supported by the component's DCM.
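    The reformulation can be sketched concretely. Each saturated-add step s → min(max(s + x, lo), hi) is a function of the running sum representable by a triple (a, b, c) meaning f(s) = min(max(s + a, b), c), and these triples are closed under an associative composition, which is what enables the parallel-prefix scan. This is a plain-Python sketch of the algebra; the design in the abstract is a fixed-point hardware implementation:

    ```python
    from functools import reduce

    def step(x, lo, hi):
        # triple representing f(s) = min(max(s + x, lo), hi)
        return (x, lo, hi)

    def compose(f, g):
        """Triple for g∘f (apply f first, then g). Associative."""
        a1, b1, c1 = f
        a2, b2, c2 = g
        return (a1 + a2, max(b1 + a2, b2), min(max(c1 + a2, b2), c2))

    def apply_f(f, s):
        a, b, c = f
        return min(max(s + a, b), c)

    xs, lo, hi = [5, 7, -20, 3], 0, 10

    # Serial reference: s = min(max(s + x, lo), hi) at each step.
    serial, s = [], 0
    for x in xs:
        s = min(max(s + x, lo), hi)
        serial.append(s)

    # Associative reformulation: reduce (or prefix-scan) over composed triples.
    combined = reduce(compose, [step(x, lo, hi) for x in xs])
    ```

    Because `compose` is associative, the triples can be combined in any bracketing, so a parallel-prefix network produces every intermediate saturated sum, breaking the serial dependence.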

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8×) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution, for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5× (1.4–23×) across a range of non-linear device models and Matrix-Solve by 2.4× (0.6–13×) across various benchmark matrices, while delivering a mean combined speedup of 2.8× (0.2–11×) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm).
    With our high-level framework, we can also accelerate single-precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.

    Accelerating SPICE Model-Evaluation using FPGAs

    Single-FPGA spatial implementations can provide an order of magnitude speedup over sequential microprocessor implementations for data-parallel, floating-point computation in SPICE model-evaluation. Model-evaluation is a key component of the SPICE circuit simulator, and it is characterized by large irregular floating-point compute graphs. We show how to exploit the parallelism available in these graphs on single-FPGA designs with a low-overhead VLIW-scheduled architecture. Our architecture uses spatial floating-point operators coupled to local high-bandwidth memories and interconnected by a time-shared network. We retime operation inputs in the model-evaluation to allow independent scheduling of computation and communication. With this approach, we demonstrate speedups of 2–18× over a dual-core 3 GHz Intel Xeon 5160 when using a Xilinx Virtex-5 LX330T for a variety of SPICE device models.

    Management Strategies for Oral Cancer Subsites

    Oral cancers are the most common cancers in India, especially in males. This can be attributed primarily to consumption of tobacco and areca-related products. Surgery is the mainstay of treatment for oral cancers, with subtle subsite-specific nuances. The oral cavity starts at the mucocutaneous junction of the lips (the vermilion border), extending posteriorly to the junction of the hard and soft palate superiorly, the anterior fauces laterally, and the junction of the anterior two-thirds and posterior third of the tongue inferiorly. The oral cavity is lined by stratified squamous epithelium of varying degrees of keratinization. Primary tumors of the oral cavity may be derived from the mucosa, salivary glands, neurovascular tissues, bone or dental tissues. Over 90% of malignant tumors of the oral cavity are squamous cell carcinomas. Certain basic principles of oncology hold true regardless of the disease subsite and pathology. Stage I and II disease should be managed with single-modality treatment, whereas Stage III and IV warrant a combined-modality approach. Choice of modality (surgical versus non-surgical) depends on the intent of treatment, chances of cure, accessibility and resectability of disease, impact on quality of life, and the patient's general health profile.

    Algorithms on Evolving Graphs

    Today's applications process large-scale graphs which are evolving in nature. We study a new computational and data model for such graphs. In this framework, the algorithms are unaware of the changes happening in the evolving graphs. The algorithms are restricted to probe only a limited portion of the graph data and are expected to produce a solution close to the optimal one, and to do so at each time step. This framework assumes no constraints on resources like memory and computation time. The limited resource for such algorithms is the portion of the graph they are allowed to probe (e.g. the number of queries an algorithm can make in order to learn about the graph). We apply this framework to two classical graph theory problems: the Shortest Path problem and the Maximum Flow problem. We study how the algorithm behaves under the evolving model and how the evolving nature of the graph affects the solution given by the algorithm.
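    The probe-bounded model can be sketched as follows. The toy graph, the probe budget k, and the random cache-refresh policy below are illustrative assumptions about the framework, not the paper's actual algorithms:

    ```python
    import heapq
    import random

    def dijkstra(adj, src):
        """Shortest distances computed on the algorithm's *cached view*."""
        dist = {src: 0}
        pq = [(0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj[u].items():
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
        return dist

    def evolving_shortest_path(true_adj, evolve, src, dst, steps, k, seed=0):
        """At each time step the graph evolves unseen; the algorithm may probe
        only k edges to refresh its cached weights, then answers from the cache."""
        rng = random.Random(seed)
        cached = {u: dict(nbrs) for u, nbrs in true_adj.items()}  # stale snapshot
        edges = [(u, v) for u in true_adj for v in true_adj[u]]
        answers = []
        for _ in range(steps):
            evolve(true_adj)                   # graph changes underneath us
            for u, v in rng.sample(edges, k):  # limited probe budget
                cached[u][v] = true_adj[u][v]
            answers.append(dijkstra(cached, src).get(dst, float("inf")))
        return answers

    # Toy instance: a 4-node graph whose edge weights drift each step.
    g = {0: {1: 1, 2: 4}, 1: {2: 1, 3: 5}, 2: {3: 1}, 3: {}}
    def drift(adj, rng=random.Random(1)):
        u = rng.choice([u for u in adj if adj[u]])
        v = rng.choice(list(adj[u]))
        adj[u][v] = max(1, adj[u][v] + rng.choice([-1, 1]))

    ests = evolving_shortest_path(g, drift, 0, 3, steps=5, k=2)
    ```

    The interesting quantity in this model is how far `ests` drifts from the true shortest-path values as the probe budget k shrinks relative to the rate of change.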

    FRICTIONLESS ONBOARDING FOR END-TO-END ENCRYPTED COLLABORATIVE SYSTEMS

    Many current systems require a user account to access system features, or allow a guest mode that skips account creation. In collaboration software that includes a guest mode, two features are currently supported: (1) joining a meeting via guest mode, and (2) guest mode for an entire application using an anonymous token, which is valid for a particular session only; the guest session is not persisted, so re-entry requires starting another guest session. As a result, it is critical to solve the onboarding problem: allowing users to enter the system easily in a Try Now or Guest Mode that provides the full, feature-rich experience a user would get if the user had signed up for an account. The problem is even more challenging for collaboration software that utilizes end-to-end encryption. This proposal provides techniques that leverage the existing Open Authorization (OAuth) flow by deferring email verification and password creation to reduce the time involved to join a guest session in an end-to-end encrypted collaborative system. By utilizing the techniques of this proposal, a persistent guest session can be facilitated on a given client device and a clear path can be provided to upgrade to a full free and/or paid account.

    Saliency on a chip: a digital approach with an FPGA

    Selective-visual-attention algorithms have been successfully implemented in analog VLSI circuits [1]. However, in addition to the usual issues of analog VLSI, such as the need to fine-tune a large number of biases, these implementations lack the spatial resolution and pre-processing capabilities to be truly useful for image-processing applications. Here we take an alternative approach and implement a neuro-mimetic algorithm for selective visual attention in digital hardware.
