Applications and accuracy of the parallel diagonal dominant algorithm
The Parallel Diagonal Dominant (PDD) algorithm is a highly efficient, ideally scalable tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric and anti-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error and that the algorithm is a good candidate for emerging massively parallel machines.
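The abstract gives no implementation details, so the following is a minimal serial sketch of the PDD idea under its usual formulation for diagonally dominant systems: partition the matrix into blocks, solve each block independently, then restore the coupling between neighbouring blocks with decoupled 2x2 interface systems (the terms PDD drops are the spike entries that decay under diagonal dominance). Function names and test values are illustrative, not taken from the paper.

import numpy as np

def thomas(a, b, c, d):
    # Standard sequential tridiagonal solve (sub-diagonal a, diagonal b,
    # super-diagonal c, right-hand side d); a[0] and c[-1] are unused.
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        den = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / den
        dp[i] = (d[i] - a[i] * dp[i - 1]) / den
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def pdd_solve(a, b, c, d, p):
    # Serial emulation of the PDD algorithm with p blocks of size n/p.
    # Each block is solved independently; the coupling between neighbouring
    # blocks is restored by decoupled 2x2 interface systems, which is accurate
    # when the matrix is diagonally dominant (the dropped spike terms decay).
    n = len(b); m = n // p
    x_t = np.empty(n)
    v = np.zeros((p, m))   # right spikes: A_j^{-1} (c_last  * e_last)
    w = np.zeros((p, m))   # left  spikes: A_j^{-1} (a_first * e_first)
    for j in range(p):
        s = slice(j * m, (j + 1) * m)
        aj, bj, cj = a[s].copy(), b[s], c[s].copy()
        aj[0] = cj[-1] = 0.0                # cut couplings to neighbour blocks
        x_t[s] = thomas(aj, bj, cj, d[s])
        if j < p - 1:
            e = np.zeros(m); e[-1] = c[(j + 1) * m - 1]
            v[j] = thomas(aj, bj, cj, e)
        if j > 0:
            e = np.zeros(m); e[0] = a[j * m]
            w[j] = thomas(aj, bj, cj, e)
    y = np.zeros(2 * (p - 1))               # interface correction unknowns
    for i in range(1, p):
        Z = np.array([[1.0, w[i][0]], [v[i - 1][-1], 1.0]])
        h = np.array([x_t[i * m], x_t[i * m - 1]])
        y[2 * i - 2], y[2 * i - 1] = np.linalg.solve(Z, h)
    x = x_t.copy()
    for j in range(p):                      # apply the spike corrections
        s = slice(j * m, (j + 1) * m)
        if j < p - 1:
            x[s] -= v[j] * y[2 * j]
        if j > 0:
            x[s] -= w[j] * y[2 * j - 1]
    return x

# Illustrative check on a strongly diagonally dominant system.
n, p = 1024, 8
a = np.ones(n); c = np.ones(n); b = np.full(n, 4.0)
d = np.random.rand(n)
print(np.max(np.abs(pdd_solve(a, b, c, d, p) - thomas(a, b, c, d))))

In a true parallel run each block would live on its own processor, and the 2x2 interface solves need only a single exchange with one neighbour, which is the source of the algorithm's scalability.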
A simple parallel prefix algorithm for compact finite-difference schemes
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the systems resulting from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Exploiting the almost-symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experiments were carried out on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. The results show that the simple parallel prefix algorithm is a good algorithm for compact schemes on high-performance computers.
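The truncation result rests on a decay property: for a diagonally dominant Toeplitz tridiagonal operator, the entries of the inverse fall off geometrically away from the diagonal, so each unknown depends only on nearby right-hand-side values. The sketch below illustrates that property (it is not the SPP prefix formulation itself); the coefficients and tolerance are illustrative assumptions.

import numpy as np

# Symmetric Toeplitz tridiagonal system [a, b, a] with |b| > 2|a|, as arises
# from diagonally dominant compact schemes (coefficients are illustrative).
n, a, b = 64, 1.0, 4.0
A = (np.diag(np.full(n, b)) + np.diag(np.full(n - 1, a), 1)
     + np.diag(np.full(n - 1, a), -1))
d = np.random.rand(n)
x_exact = np.linalg.solve(A, d)

# For the infinite Toeplitz operator, A^{-1}[i, j] = r**|i-j| / (b + 2*a*r),
# where r is the root of a*r**2 + b*r + a = 0 with |r| < 1.
r = (-b + np.sqrt(b * b - 4 * a * a)) / (2 * a)
k = int(np.ceil(np.log(1e-12) / np.log(abs(r))))      # truncation radius
g = r ** np.abs(np.arange(-k, k + 1)) / (b + 2 * a * r)

# Truncated solve: x[i] uses only right-hand-side entries within distance k,
# which is what allows both the computation and the communication to be cut off.
x_trunc = np.convolve(d, g, mode="same")
print(k, np.max(np.abs(x_trunc - x_exact)[k:-k]))     # interior rows agree to ~1e-12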
Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications
Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Intel Paragon, IBM SP2, and Cray Origin2000, have successfully delivered high-performance computing power for solving some of the so-called "grand-challenge" problems. Despite this initial success, parallel machines have not been widely accepted in production engineering environments because of the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. To achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan was to 1) develop highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, and 3) incorporate the newly developed algorithms into actual simulation packages. This plan has been achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) adopting a mathematical geometry that better describes the fluid, and (2) using a compact scheme to gain high-order accuracy in the numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm and Reduced Parallel Diagonal Dominant (RPDD) algorithm have been carefully studied on different parallel platforms for different applications, and a NASA simulation code developed by Man M. Rai and his colleagues has been parallelized and implemented based on data dependency analysis. These achievements are addressed in detail in the paper.
Optimal cube-connected cube multiprocessors
Many CFD (computational fluid dynamics) and other scientific applications can be partitioned into subproblems. In general, however, the partitioned subproblems are still very large. They demand high-performance computing power themselves, and the solutions of the subproblems have to be combined at each time step. The cube-connected cube (CCCube) architecture is studied. The CCCube architecture is an extended hypercube structure with each node replaced by a cube. It requires fewer physical links between nodes than the hypercube, yet provides the same communication support as the hypercube for many applications. The removed physical links can be used to enhance the bandwidth of the remaining links and, therefore, the overall performance. The concept of optimal CCCubes, which are CCCubes with a minimum number of links for a given total number of nodes, and a method to obtain them are proposed. The superiority of optimal CCCubes over standard hypercubes is also shown in terms of link usage in the embedding of a binomial tree. A useful computation structure based on a semi-binomial tree for divide-and-conquer parallel algorithms is identified, and it is shown that this structure can be implemented in optimal CCCubes without performance degradation compared with regular hypercubes. The results presented should provide a useful approach to the design of scientific parallel computers.
Distributed computing feasibility in a non-dedicated homogeneous distributed system
The low cost and availability of clusters of workstations have led researchers to re-explore distributed computing using independent workstations. This approach may provide better cost/performance than tightly coupled multiprocessors. In practice, it often utilizes otherwise wasted cycles to run parallel jobs. The feasibility of such a non-dedicated parallel processing environment, assuming workstation processes have preemptive priority over parallel tasks, is addressed. An analytical model is developed to predict parallel job response times. The model provides insight into how significantly workstation owner interference degrades parallel program performance. A new term, the task ratio, which relates the parallel task demand to the mean service demand of non-parallel workstation processes, is introduced. It is proposed that the task ratio is a useful metric for determining how large the demand of a parallel application must be to make efficient use of a non-dedicated distributed system.
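A toy Monte Carlo sketch of the effect described above: a parallel step is spread over several workstations whose owners' processes preempt the parallel tasks, and the larger the task ratio (task demand divided by the mean demand of owner processes), the closer the relative slowdown gets to the owner utilization. This is an illustrative simulation, not the paper's analytical model; all distributions and parameter values are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def parallel_slowdown(task_demand, p=16, owner_mean=1.0, owner_rate=0.2, trials=2000):
    # One parallel step: p tasks of CPU demand `task_demand`, one per workstation.
    # Owner processes arrive as a Poisson stream during each task (rate per unit
    # of task progress), have exponential demands, and preempt the parallel task.
    # The step finishes when the slowest workstation does; the slowdown is
    # measured relative to a dedicated system.
    finish = np.empty((trials, p))
    for t in range(trials):
        n_owner = rng.poisson(owner_rate * task_demand, p)
        finish[t] = task_demand + np.array(
            [rng.exponential(owner_mean, k).sum() for k in n_owner])
    return finish.max(axis=1).mean() / task_demand

# Larger task ratio (task demand / mean owner-process demand): the interference
# averages out across workstations and the slowdown approaches 1 + utilization.
for demand in (1.0, 10.0, 100.0):
    print(f"task ratio {demand:6.1f}: slowdown {parallel_slowdown(demand):.2f}")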
Hawking radiation-quasinormal modes correspondence for large AdS black holes
It is well known that the non-strictly thermal character of the Hawking radiation spectrum generates a natural correspondence between Hawking radiation and black hole quasinormal modes. This issue has been analyzed in the framework of Schwarzschild black holes, Kerr black holes, and nonextremal Reissner-Nordstrom black holes. In this paper, by introducing the effective temperature, we reanalyze the non-strictly thermal character of large AdS black holes. The results show that the effective mass corresponding to the effective temperature is approximately the average mass in any dimension, and the other effective quantities can also be obtained. Based on the known forms of the quasinormal-mode frequencies, we reanalyze the asymptotic frequencies of large AdS black holes in three and five dimensions. We then obtain formulas for the Bekenstein-Hawking entropy and the quantization of the horizon area as functions of the quantum "overtone" number.
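For background, in the Schwarzschild framework in which this correspondence was first developed, the non-strictly thermal tunneling probability fixes an effective temperature whose associated effective mass is exactly the average of the masses before and after the emission; the large AdS analysis of the paper generalizes this picture. A minimal statement of the known Schwarzschild-case relations, in natural units (this is background, not the paper's AdS result):

\Gamma \sim \exp\!\left[-8\pi\omega\left(M-\frac{\omega}{2}\right)\right]
       \equiv \exp\!\left[-\frac{\omega}{T_E(\omega)}\right],
\qquad
T_E(\omega) = \frac{1}{8\pi\left(M-\frac{\omega}{2}\right)},
\qquad
M_E \equiv M-\frac{\omega}{2} = \frac{M+(M-\omega)}{2}.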
Dynamic response and dangerous point stress analysis of gear transmission system
Gear transmission is the principal power transmission mode of many machines, and the reliability of the transmission system has an important influence on the accomplishment of daily tasks. This paper takes a gear transmission system as the research object: we build a two-stage gear transmission system model and calculate its dynamic response theoretically. We then study the mesh stiffness of the gears as it varies with the mesh position in the transmission system. On this basis, we establish a finite element simulation model of the gear system that accounts for the tooth contact of the internal gear system. From the simulation, we obtain the contact response and the time history of the equivalent stress at several important areas. Through this work, the contact stress of the two-stage gear system is studied both theoretically and by finite element simulation, which has guiding significance for the optimal structural design of two-stage gear transmission systems.
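A minimal sketch of the kind of calculation involved: a single gear pair reduced to one degree of freedom along the line of action, with a mesh stiffness that varies with the mesh position (here a rectangular wave between single- and double-tooth-pair contact), integrated explicitly to obtain the dynamic response. The one-stage reduction and all parameter values are illustrative assumptions, not the two-stage model of the paper.

import numpy as np

# One-DOF gear-mesh model along the line of action:
#   m_e * x'' + c * x' + k(t) * x = F_m
# with k(t) alternating between single- and double-tooth-pair contact.
m_e, c, F_m = 0.05, 500.0, 500.0         # equivalent mass [kg], damping [N·s/m], mesh force [N]
k1, k2 = 2.0e8, 3.5e8                    # single / double pair mesh stiffness [N/m]
f_mesh, duty = 1000.0, 0.4               # mesh frequency [Hz], single-pair fraction

def k_mesh(t):
    # Rectangular-wave mesh stiffness as the contact point moves along the tooth.
    phase = (t * f_mesh) % 1.0
    return k1 if phase < duty else k2

dt, T = 1.0e-7, 0.02
n = int(T / dt)
x, v = F_m / k2, 0.0                     # start near the static deflection
x_hist = np.empty(n)
for i in range(n):                       # semi-implicit Euler integration
    acc = (F_m - c * v - k_mesh(i * dt) * x) / m_e
    v += acc * dt
    x += v * dt
    x_hist[i] = x

# Dynamic transmission error statistics after the start-up transient.
steady = x_hist[n // 2:]
print(f"mean deflection {steady.mean():.3e} m, "
      f"peak-to-peak {steady.max() - steady.min():.3e} m")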
TunIO: An AI-powered Framework for Optimizing HPC I/O
I/O operations are a known performance bottleneck of HPC applications. To achieve good performance, users often employ an iterative multistage tuning process to find an optimal I/O stack configuration. However, an I/O stack contains multiple layers, such as high-level I/O libraries, I/O middleware, and parallel file systems, and each layer has many parameters. These parameters and layers are entangled and influence each other, making the tuning process time-consuming and complex. In this work, we present TunIO, an AI-powered I/O tuning framework that implements several techniques to balance tuning cost against performance gain, including tuning the high-impact parameters first. Furthermore, TunIO analyzes the application source code to extract its I/O kernel while retaining all statements necessary to perform I/O. It makes a smart selection of high-impact configuration parameters for the given tuning objective. Finally, it uses a novel Reinforcement Learning (RL)-driven early stopping mechanism to balance cost and performance gain. Experimental results show that TunIO reduces tuning time by up to ≈73% while achieving the same performance gain as H5Tuner. It achieves a significant performance gain per unit cost of 208.4 MBps/min (I/O bandwidth gained for each minute spent in tuning) over existing approaches in our tests.
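A schematic of the cost/gain trade-off described above, reduced to its simplest form: tune the highest-impact parameters first and stop once the marginal gain from a whole parameter falls below a threshold. The parameter names, the impact ranking, and the benchmark stub are assumptions for illustration; TunIO's actual RL-driven stopping policy and I/O-kernel extraction are not reproduced here.

import random

# Hypothetical I/O-stack parameters, listed from high to low assumed impact
# (e.g. ranked by a prior sensitivity study); names and values are illustrative.
SEARCH_SPACE = [
    ("stripe_count",   [1, 4, 8, 16, 32]),
    ("stripe_size_mb", [1, 4, 16, 64]),
    ("collective_io",  [True, False]),
    ("chunk_cache_mb", [4, 32, 128]),
]

def run_io_kernel(config):
    # Stand-in for replaying the extracted I/O kernel; returns bandwidth (MB/s).
    # A real framework would execute the application's I/O phase here.
    base = 200 + 30 * config["stripe_count"] + 2 * config["stripe_size_mb"]
    return base * (1.2 if config["collective_io"] else 1.0) + random.uniform(-20, 20)

def tune(min_gain=5.0):
    config = {name: values[0] for name, values in SEARCH_SPACE}
    best = run_io_kernel(config)
    for name, values in SEARCH_SPACE:          # high-impact parameters first
        gained = 0.0
        for v in values[1:]:
            trial = dict(config, **{name: v})
            bw = run_io_kernel(trial)
            if bw > best:
                gained += bw - best
                best, config = bw, trial
        if gained < min_gain:                  # early stop: marginal gain too small
            break
    return config, best

random.seed(0)
cfg, bw = tune()
print(cfg, f"{bw:.0f} MB/s")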
AST: Adaptive Self-supervised Transformer for optical remote sensing representation
Due to the variation in spatial resolution and the diversity of object scales, the interpretation of optical remote sensing images is extremely challenging. Deep learning has become the mainstream solution for interpreting such complex scenes. However, the explosion of deep learning model architectures has created a need for hundreds of millions of remote sensing images, for which labels are very costly or often publicly unavailable. This paper provides an in-depth analysis of the main reasons for this data thirst, i.e., (i) limited representational power for model learning, and (ii) underutilization of unlabeled remote sensing data. To overcome these difficulties, we present a scalable and adaptive self-supervised Transformer (AST) for optical remote sensing image interpretation. By performing masked image modeling in pre-training, the proposed AST releases the rich supervision signals in massive unlabeled remote sensing data and learns useful multi-scale semantics. Specifically, a cross-scale Transformer architecture is designed to collaboratively learn global dependencies and local details by introducing a pyramid structure, facilitating multi-granular feature interactions and generating scale-invariant representations. Furthermore, a masking token strategy relying on correlation mapping is proposed to achieve adaptive masking of partial patches without affecting key structures, which enhances the understanding of visually important regions. Extensive experiments on various optical remote sensing interpretation tasks show that AST has good generalization capability and competitiveness.
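A small sketch of correlation-guided adaptive masking in the spirit described above: patches are scored by how strongly they correlate with the image-level mean feature, and the mask is drawn preferentially from low-score patches so that visually important structures stay visible to the encoder. The scoring rule, mask ratio, and tensor shapes are illustrative assumptions, not the AST implementation.

import numpy as np

rng = np.random.default_rng(0)

def adaptive_mask(patch_feats, mask_ratio=0.6):
    # patch_feats: (num_patches, dim) features of one image's patches.
    # Score each patch by its cosine correlation with the mean feature; patches
    # with low scores are more likely to be masked, keeping key structures visible.
    num_patches = patch_feats.shape[0]
    mean_feat = patch_feats.mean(axis=0)
    norm = np.linalg.norm(patch_feats, axis=1) * np.linalg.norm(mean_feat) + 1e-8
    score = patch_feats @ mean_feat / norm            # correlation map, shape (N,)
    # Turn scores into masking probabilities: low correlation -> high probability.
    prob = np.exp(-score) / np.exp(-score).sum()
    n_mask = int(mask_ratio * num_patches)
    masked = rng.choice(num_patches, size=n_mask, replace=False, p=prob)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked] = True
    return mask                                       # True = patch is masked out

# Example: a 14x14 grid of 768-dim patch embeddings from a ViT-style tokenizer.
feats = rng.normal(size=(196, 768))
print(adaptive_mask(feats).sum(), "of", feats.shape[0], "patches masked")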