Cache complexity and multicore implementation for univariate real root isolation

Author: Changbo Chen
Decker T.
Frigo M.
Marc Moreno Maza
Yuzhen Xie
Publication venue: 'Association for Computing Machinery (ACM)'

Parallel Integer Polynomial Multiplication

Author: Chen Changbo
Covanov Svyatoslav
Mansouri Farnam
Maza Marc Moreno
Xie Ning
Xie Yuzhen
Publication date: 24/09/2016

We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

The Basic Polynomial Algebra Subprograms

Author: Chen Changbo
Covanov Svyatoslav
Mansouri Farnam
Moir Robert H. C.
Moreno Maza Marc
Xie Ning
Xie Yuzhen
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/09/2015

The Basic Polynomial Algebra Subprograms (BPAS) provides arithmetic operations (multiplication, division, root isolation, etc.) for univariate and multivariate polynomials over common types of coefficients (prime fields, complex rational numbers, rational functions, etc.). The code is mainly written in CilkPlus [10] targeting multicore processors. The current distribution focuses on dense polynomials and the sparse case is work in progress. A strong emphasis is put on adaptive algorithms as the library aims at supporting a wide variety of situations in terms of problem sizes and available computing resources. The BPAS library is publicly available in source at www.bpaslib.org

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

On The Parallelization Of Integer Polynomial Multiplication

Author: Mansouri Farnam
Publication venue: Scholarship@Western
Publication date: 01/01/2014

With the advent of hardware accelerator technologies, multi-core processors and GPUs, much effort has gone into designing parallel algorithms that take advantage of those architectures. To achieve this goal, one needs to consider both algebraic complexity and parallelism, while making efficient use of memory traffic and cache and reducing overheads in the implementations. Polynomial multiplication is at the core of many algorithms in symbolic computation, such as real root isolation, which will be our main application for now. In this thesis, we first investigate the multiplication of dense univariate polynomials with integer coefficients targeting multi-core processors. Some of the proposed methods are based on well-known serial classical algorithms, whereas a novel algorithm is designed to make efficient use of the targeted hardware. Experimentation confirms our theoretical analysis. Second, we report on the first implementation of subproduct tree techniques on many-core architectures. These techniques are essentially another application of polynomial multiplication, but over a prime field. They are used in multi-point evaluation and interpolation of polynomials with coefficients over a prime field.
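As a rough illustration of the subproduct tree idea mentioned in this abstract, the following serial Python sketch evaluates a polynomial at several points over a prime field by taking remainders down a tree of products of (x - u). It is not the thesis's many-core implementation: the prime 97, the schoolbook multiplication and all function names are assumptions chosen for readability.

```python
P = 97  # small prime, chosen only for illustration (assumption)

def poly_mul(a, b, p=P):
    """Schoolbook product of coefficient lists (lowest degree first) mod p."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def poly_rem(a, m, p=P):
    """Remainder of a modulo a monic polynomial m of degree >= 1, mod p."""
    a = [c % p for c in a]
    d = len(m) - 1
    for i in range(len(a) - 1, d - 1, -1):
        c = a[i]
        if c:
            # Subtract c * x^(i-d) * m, which cancels the coefficient a[i].
            for j in range(d + 1):
                a[i - d + j] = (a[i - d + j] - c * m[j]) % p
    return a[:d]

def build_tree(points, p=P):
    """Node = (product of (x - u) over the node's points, left child, right child)."""
    if len(points) == 1:
        return ([(-points[0]) % p, 1], None, None)
    mid = len(points) // 2
    left, right = build_tree(points[:mid], p), build_tree(points[mid:], p)
    return (poly_mul(left[0], right[0], p), left, right)

def eval_down(f, node, p=P):
    """Reduce f modulo each node on the way down; leaves yield f(u) mod p."""
    poly, left, right = node
    rem = poly_rem(f, poly, p)
    if left is None:
        return [rem[0]]
    return eval_down(rem, left, p) + eval_down(rem, right, p)

def multi_eval(f, points, p=P):
    return eval_down(f, build_tree(points, p), p)
```

For example, `multi_eval([3, 2, 0, 1], [0, 1, 2, 5, 10])` evaluates 3 + 2x + x^3 at the five points modulo 97. The two recursive calls in each function are independent, which is where a parallel implementation would find its concurrency.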

CiteSeerX

Scholarship@Western

Computation of risk measures in finance and parallel real-time scheduling

Author: Li Yajuan
Publication venue: Digital Commons @ NJIT
Publication date: 31/08/2022

Many application areas employ various risk measures, such as a quantile, to assess risks. For example, in finance, risk managers employ a quantile to help determine appropriate levels of capital needed to be able to absorb (with high probability) large unexpected losses in credit portfolios comprising loans, bonds, and other financial instruments subject to default. This dissertation discusses the computation of risk measures in finance and parallel real-time scheduling. Firstly, two estimation approaches are compared for one risk measure, a quantile, via randomized quasi-Monte Carlo (RQMC) in an asymptotic setting where the number of randomizations for RQMC grows large, but the size of the low-discrepancy point set remains fixed. The first method computes, for each randomization, an estimator of the cumulative distribution function (CDF), which is inverted to obtain a quantile estimator; the overall quantile estimator is the sample average of the quantile estimators across randomizations. The second approach instead computes a single quantile estimator by inverting one CDF estimator across all randomizations. Because quantile estimators are generally biased, the first method leads to an estimator that does not converge to the true quantile as the number of randomizations goes to infinity. In contrast, the second estimator does, and a central limit theorem is established for it. To get an improvement, we use conditional Monte Carlo (CMC) to obtain a smoother estimate of the distribution function, and we combine this with the second RQMC approach to further reduce the variance. The result is a much more accurate quantile estimator, whose mean square error can converge even faster than the canonical rate of O(1/n). Secondly, another risk measure is estimated, namely economic capital (EC), which is defined as the difference between a quantile and the mean of the loss distribution, given a stochastic model for a portfolio’s loss over a given time horizon. This work applies measure-specific importance sampling to separately estimate the two components of the EC, which can lead to a much smaller variance than when estimating both terms simultaneously. Finally, for parallel real-time tasks, the federated scheduling paradigm, which assigns each parallel task a set of dedicated cores, achieves good theoretical bounds by ensuring exclusive use of processing resources to reduce interference. However, because cores share the last-level cache and memory bandwidth resources, in practice tasks may still interfere with each other despite executing on dedicated cores. To tackle this issue, this work presents a holistic resource allocation framework for parallel real-time tasks under federated scheduling. Under the proposed framework, in addition to dedicated cores, each parallel task is also assigned dedicated cache and memory bandwidth resources. This work also studies the characteristics of parallel tasks under different resource allocations following a measurement-based approach, and proposes a technique to handle the challenge of the extensive profiling needed to cover all resource allocation combinations under this approach. Further, it proposes a holistic resource allocation algorithm that balances the allocation between different resources to achieve good schedulability. Additionally, this work provides a full implementation of the framework by extending the federated scheduling system with Intel’s Cache Allocation Technology and MemGuard. It also demonstrates the practicality of the proposed framework via extensive numerical evaluations and empirical experiments using real benchmark programs. Finally, the application of risk measures to real-time scheduling is discussed as future work.
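The difference between the two RQMC quantile estimators described above can be seen in a toy experiment. The sketch below is only an illustration under stated assumptions, not the dissertation's setup: it uses a one-dimensional grid {i/n} with a random shift modulo 1 as the randomized point set, an Exp(1) loss obtained by inversion, and hypothetical function names.

```python
import math
import random

def loss(u):
    """Toy loss: inverse CDF of an Exp(1) variable (assumption for illustration)."""
    return -math.log(1.0 - u)

def empirical_quantile(values, p):
    """Invert the empirical CDF: smallest value whose CDF is at least p."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(p * len(ordered)) - 1)]

def rqmc_samples(n, rng):
    """One randomization: the fixed grid {i/n}, shifted by one uniform draw mod 1."""
    shift = rng.random()
    return [loss((i / n + shift) % 1.0) for i in range(n)]

def compare_estimators(p=0.9, n=16, randomizations=2000, seed=1):
    rng = random.Random(seed)
    batches = [rqmc_samples(n, rng) for _ in range(randomizations)]
    # Method 1: one quantile estimate per randomization, then average them.
    averaged = sum(empirical_quantile(b, p) for b in batches) / randomizations
    # Method 2: a single CDF estimate over all randomizations, inverted once.
    pooled = empirical_quantile([y for b in batches for y in b], p)
    return averaged, pooled
```

Calling `compare_estimators()` returns the pair (averaged, pooled). In this toy setting the true 0.9-quantile is ln 10, and with the point-set size n held fixed the averaged estimator should stay visibly away from it while the pooled estimator moves close to it, in line with the convergence behaviour the abstract reports.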

Digital Commons @ New Jersey Institute of Technology (NJIT)

Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

Author: Xie Ning
Publication venue: Scholarship@Western
Publication date: 08/11/2016

The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP. In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can be determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). This generation of parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phase. To carry out these algebraic calculations, we take advantage of techniques from computer algebra, in particular the RegularChains library of Maple. Various illustrative examples are provided together with performance evaluation.

Scholarship@Western

Fast Fourier Transforms over Prime Fields of Large Characteristic and their Implementation on Graphics Processing Units

Author: Mohajerani Davood
Publication venue: Scholarship@Western
Publication date: 20/12/2016

Prime field arithmetic plays a central role in computer algebra and supports computation in Galois fields, which are essential to coding theory and cryptography algorithms. The prime fields that are used in computer algebra systems, in particular in the implementation of modular methods, are often of small characteristic, that is, based on prime numbers that fit in a machine word. Increasing precision beyond the machine word size can be done via the Chinese Remainder Theorem or Hensel's Lemma. In this thesis, we consider prime fields of large characteristic, typically fitting in n machine words, where n is a power of 2. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, we show that arithmetic operations in such fields offer attractive performance both in terms of algebraic complexity and parallelism. In particular, these operations can be vectorized, leading to efficient implementation of fast Fourier transforms on graphics processing units.
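For readers unfamiliar with FFTs over prime fields, here is a minimal serial sketch of a radix-2 transform modulo the Fermat prime 257. It only illustrates the general technique: the thesis targets much larger, multi-word characteristics and GPU execution, and the choice of 257, the generator 3 and the function names are assumptions made for this example.

```python
P_FERMAT = 257  # 2^8 + 1, a small Fermat prime (illustrative choice)
GENERATOR = 3   # 3 generates the multiplicative group mod 257

def root_of_unity(n, p=P_FERMAT, g=GENERATOR):
    """Primitive n-th root of unity mod p; n must divide p - 1."""
    assert (p - 1) % n == 0
    return pow(g, (p - 1) // n, p)

def ntt(a, w, p=P_FERMAT):
    """Recursive radix-2 FFT over Z/pZ; len(a) is a power of 2, w has order len(a)."""
    n = len(a)
    if n == 1:
        return a[:]
    w2 = w * w % p
    even = ntt(a[0::2], w2, p)
    odd = ntt(a[1::2], w2, p)
    out = [0] * n
    t = 1  # runs through w^0, w^1, ..., w^(n/2 - 1)
    for k in range(n // 2):
        x = t * odd[k] % p
        out[k] = (even[k] + x) % p
        out[k + n // 2] = (even[k] - x) % p  # uses w^(n/2) = -1 mod p
        t = t * w % p
    return out

def intt(values, w, p=P_FERMAT):
    """Inverse transform: run the FFT with w^(-1), then scale by n^(-1)."""
    n = len(values)
    inv_n = pow(n, p - 2, p)  # Fermat's little theorem gives the inverse
    inv_w = pow(w, p - 2, p)
    return [c * inv_n % p for c in ntt(values, inv_w, p)]
```

With `w = root_of_unity(8)`, `intt(ntt([1, 2, 3, 4, 5, 6, 7, 8], w), w)` recovers the input list. The butterflies inside each loop iteration are independent of one another, which is the property that makes such transforms amenable to vectorized execution.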

Scholarship@Western

Design Space Exploration and Resource Management of Multi/Many-Core Systems

Publication venue: 'MDPI AG'
Publication date: 11/01/2022

The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends.

Directory of Open Access Books (DOAB)

Applying Front End Compiler Process to Parse Polynomials in Parallel

Author: Tsegaye Amha W
Publication venue: Scholarship@Western
Publication date: 16/12/2020

Parsing large expressions, in particular large polynomial expressions, is an important task for computer algebra systems. Despite the apparent simplicity of the problem, its efficient software implementation brings various challenges. Among them is the fact that this is a memory-bound application for which a multi-threaded implementation is necessarily limited by the characteristics of the memory organization of the supporting hardware. In this thesis, we design, implement and experiment with a multi-threaded parser for large polynomial expressions. We extract parallelism by splitting the input character string into meaningful sub-strings that can be parsed concurrently before being merged into a single polynomial. Our implementation targeting multi-core processors is realized with the Basic Polynomial Algebra Subprograms (BPAS). Experimental results show that the approach is promising both in terms of speedup factors and memory consumption.
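The split, parse and merge structure described in this abstract can be sketched as follows. This is a hypothetical Python illustration, not the BPAS parser: it assumes a simple univariate input format such as "3*x^2 - 5*x + 7", cuts the string at term boundaries, and uses a thread pool only to show the shape of the computation (Python threads will not reproduce the reported speedups).

```python
import re
from concurrent.futures import ThreadPoolExecutor

def parse_term(term):
    """Parse one term such as '-5*x^3', 'x' or '+7' into (exponent, coefficient)."""
    sign = -1 if term.startswith('-') else 1
    body = term.lstrip('+-')
    if 'x' not in body:
        return 0, sign * int(body)
    coeff_text, _, exp_text = body.partition('x')
    coeff = int(coeff_text.rstrip('*')) if coeff_text else 1
    exp = int(exp_text.lstrip('^')) if exp_text else 1
    return exp, sign * coeff

def parse_chunk(terms):
    """Parse a group of terms into a partial polynomial {exponent: coefficient}."""
    poly = {}
    for term in terms:
        exp, coeff = parse_term(term)
        poly[exp] = poly.get(exp, 0) + coeff
    return poly

def parse_polynomial(text, workers=4):
    # Split at top-level signs so that every sub-string is a complete term.
    terms = re.findall(r'[+-]?[^+-]+', text.replace(' ', ''))
    size = max(1, -(-len(terms) // workers))  # ceiling division
    chunks = [terms[i:i + size] for i in range(0, len(terms), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(parse_chunk, chunks))
    # Merge the partial polynomials by adding coefficients of equal exponents.
    merged = {}
    for part in partials:
        for exp, coeff in part.items():
            merged[exp] = merged.get(exp, 0) + coeff
    return {e: c for e, c in merged.items() if c != 0}
```

For instance, `parse_polynomial("3*x^2 - 5*x + 7 + x^2 - 7")` yields a dictionary mapping exponent 2 to coefficient 4 and exponent 1 to coefficient -5. The merge step is sequential here; a real implementation could merge partial results pairwise in parallel as well.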

Scholarship@Western