On the construction of an efficient finite-element solver for phase-field simulations of many-particle solid-state-sintering processes
We present an efficient solver for the simulation of many-particle solid-state-sintering processes. The microstructure evolution is described by a system of equations consisting of one Cahn–Hilliard equation and a set of Allen–Cahn equations to distinguish neighboring particles. The particle packing is discretized in space via multicomponent linear adaptive finite elements and implicitly in time with variable time-step sizes, resulting in a large nonlinear system of equations with strong coupling between all components to be solved. Since on average 10k degrees of freedom per particle are necessary to accurately capture the interface dynamics in 3D, we propose strategies to solve the resulting large and challenging systems. This includes the efficient evaluation of the Jacobian matrix as well as the implementation of Jacobian-free methods by applying state-of-the-art matrix-free algorithms for high and dynamic numbers of components, advances regarding preconditioning, and a fully distributed grain-tracking algorithm. We validate the obtained results, examine the node-level performance in detail, and demonstrate scalability up to 10k particles on modern supercomputers. Such numbers of particles are sufficient to simulate the sintering process in (statistically meaningful) representative volume elements. Our framework thus forms a valuable tool for the virtual design of solid-state-sintering processes for pure metals and their alloys.
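The Jacobian-free approach mentioned above replaces explicit assembly of the Jacobian with finite-difference directional derivatives. A minimal sketch of that matrix-vector product, using a toy two-component residual as a stand-in for the coupled Cahn–Hilliard/Allen–Cahn system (the residual and step size here are illustrative assumptions, not the paper's solver):

```python
def residual(u):
    # Toy nonlinear residual F(u); stands in for the coupled
    # Cahn-Hilliard / Allen-Cahn system of the paper.
    return [u[0] ** 2 + u[1] - 3.0, u[0] + u[1] ** 2 - 5.0]

def jacobian_free_matvec(F, u, v, eps=1e-7):
    # Approximate the Jacobian action without assembling J:
    #   J(u) @ v  ~=  (F(u + eps*v) - F(u)) / eps
    Fu = F(u)
    Fup = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(a - b) / eps for a, b in zip(Fup, Fu)]

u = [1.0, 2.0]
v = [1.0, 0.0]
Jv = jacobian_free_matvec(residual, u, v)
# close to the analytic J(u) @ v = [2*u0, 1] = [2.0, 1.0]
```

Krylov methods such as GMRES only need this product, which is what makes the matrix-free formulation attractive when the number of components is high and dynamic.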
PI-BA Bundle Adjustment Acceleration on Embedded FPGAs with Co-observation Optimization
Bundle adjustment (BA) is a fundamental optimization technique used in many
crucial applications, including 3D scene reconstruction, robotic localization,
camera calibration, autonomous driving, space exploration, street view map
generation etc. Essentially, BA is a joint non-linear optimization problem, and
one which can consume a significant amount of time and power, especially for
large optimization problems. Previous approaches to optimizing BA performance
heavily rely on parallel processing or distributed computing, which trade
higher power consumption for higher performance. In this paper we propose
π-BA, the first hardware-software co-designed BA engine on an embedded
FPGA-SoC that exploits custom hardware for higher performance and power
efficiency. Specifically, based on our key observation that not all points
appear on all images in a BA problem, we designed and implemented a
Co-Observation Optimization technique to accelerate BA operations with
optimized usage of memory and computation resources. Experimental results
confirm that π-BA outperforms the existing software implementations in
terms of performance and power consumption.
Comment: in Proceedings of IEEE FCCM 201
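The key observation above, that not all points appear in all images, can be sketched in a few lines: residuals (and the corresponding Jacobian blocks) are evaluated only for observed (camera, point) pairs, and grouping observations per camera lets one camera's data stay in fast local memory. The observation table and names below are illustrative assumptions, not π-BA's actual data layout:

```python
from collections import defaultdict

# (camera_id, point_id) -> measured pixel coordinate (illustrative data)
observations = {
    (0, 0): (100.0, 120.0),
    (0, 1): (200.0, 80.0),
    (1, 1): (190.0, 85.0),
    (2, 0): (105.0, 118.0),
}

def group_by_camera(obs):
    # Group observed points per camera; a dense formulation would
    # touch every (camera, point) pair regardless of visibility.
    by_cam = defaultdict(list)
    for (cam, pt), uv in sorted(obs.items()):
        by_cam[cam].append((pt, uv))
    return dict(by_cam)

groups = group_by_camera(observations)
dense_work = 3 * 2               # cameras x points if visibility is ignored
sparse_work = len(observations)  # actual residual evaluations needed
print(sparse_work, dense_work)   # 4 6
```

On real BA problems the gap is far larger, since each point is typically seen by only a handful of the cameras, which is exactly the sparsity a co-observation-aware hardware pipeline can exploit.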
Problems in the optimal control of finite and infinite dimensional linear systems
A review of optimal control theory for linear systems with quadratic cost functions is presented. Some of the theoretical and practical limitations are discussed with special reference to distributed parameter systems. First, a procedure is described for finding the optimal control by constructing a sequence of controllers that converges to the optimal; this method is valid for systems of infinite dimension provided that the operators in the state differential equation satisfy certain conditions. The proof is carried out both for the finite and infinite time interval and the connection is shown with the Riccati equation. The main problem in implementation is that one needs complete knowledge of the state at all times in order to build the optimal controller; this is almost certainly impossible for distributed parameter systems. When the state cannot be measured completely it is proved that an optimal control is realisable for time-invariant finite dimensional systems.
The problems of finding this control are then investigated and computational methods discussed. If the optimal control with complete knowledge of the state cannot be implemented, a method is presented whereby one can find bounds on the possible increase in the value of the cost function arising from the use of some sub-optimal control; several examples are considered. The constrained optimal control depends on the initial state and new optimisation criteria must be put forward to deal with the case in which the initial state is unknown; the most common consist of minimising the cost that can result from the worst initial state. It is then shown how the controllers designed according to these criteria may be improved by using one's limited observation at time zero to place some constraints on the initial state. The Liapunov matrix equation plays an important part in calculating the cost of any control so reducing the computational effort in its solution is useful. It is shown how this can be done and it is of special relevance for distributed parameter systems with their states expressed as an infinite series of eigenfunctions; the results are applied to a diffusion equation example.
Finally, it is shown how optimal control theory may be applied to the design of proportional-integral-derivative controllers. This is done from two standpoints and the resulting controllers are shown to be identical, though the second method of proof is valid for infinite dimensional systems. The results are then applied to a simple example and to a distributed population dynamics system. The methods of the thesis are then tested for practicality on a system with realistic parameters, and recommendations are made as to the best approaches.
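The linear-quadratic setting the review covers is easiest to see in the scalar case, where the algebraic Riccati equation reduces to a quadratic that can be solved in closed form. A minimal sketch (the system constants are illustrative, not taken from the thesis):

```python
import math

def scalar_lqr(a, b, q, r):
    # For x' = a*x + b*u with cost integral of (q*x**2 + r*u**2),
    # the scalar continuous algebraic Riccati equation is
    #   2*a*p - (b**2 / r) * p**2 + q = 0,
    # and we take the positive (stabilizing) root.
    p = (r / b ** 2) * (a + math.sqrt(a ** 2 + b ** 2 * q / r))
    k = b * p / r  # optimal state feedback u = -k*x
    return p, k

p, k = scalar_lqr(a=-1.0, b=1.0, q=1.0, r=1.0)
print(round(p, 4), round(k, 4))  # 0.4142 0.4142
```

With these constants the closed-loop dynamics become x' = (a - b*k) x = -sqrt(2) x, which is stable, matching the guarantee the Riccati-based construction provides.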
A general framework for efficient FPGA implementation of matrix product
Original article can be found at: http://www.medjcn.com/ Copyright Softmotor Limited.
High performance systems are required by developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks for constructing high performance systems at an economical price. Given the importance and use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for matrix algorithm core generation is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, spanning three matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or a specific application core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second is a parallel floating-point matrix multiplier designed for 3D affine transformations.
Peer reviewed
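As a software model of the fixed-coefficient matrix multiply such a colour-space conversion core performs, here is a sketch assuming the standard BT.601 RGB-to-YCbCr coefficients; the paper's actual conversion matrix and scaling may differ:

```python
# BT.601 RGB -> YCbCr conversion matrix (standard coefficients;
# their use here is an illustrative assumption about the core).
YCBCR = [
    [ 0.299,      0.587,      0.114    ],
    [-0.168736,  -0.331264,   0.5      ],
    [ 0.5,       -0.418688,  -0.081312 ],
]

def matvec3(m, v):
    # One output pixel = constant 3x3 matrix times the RGB vector.
    # A distributed-arithmetic core replaces these multiplies with
    # small lookup tables indexed by the input bits, which is why a
    # fixed coefficient matrix maps so well to FPGA fabric.
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

y, cb, cr = matvec3(YCBCR, [255.0, 255.0, 255.0])  # pure white
print(round(y), round(cb), round(cr))  # 255 0 0
```

Because the coefficients never change at run time, the multiplier array can be fully pipelined: one pixel enters and one converted pixel leaves every clock cycle once the pipeline is full.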
Hamiltonian System Approach to Distributed Spectral Decomposition in Networks
Because of the significant increase in the size and complexity of networks,
the distributed computation of eigenvalues and eigenvectors of graph matrices
has become very challenging, yet it remains as important as ever. In this
paper we develop efficient distributed algorithms to detect, with higher
resolution, closely situated eigenvalues and corresponding eigenvectors of
symmetric graph matrices. We model the system of graph spectral computation as
physical systems with Lagrangian and Hamiltonian dynamics. The spectrum of
the Laplacian matrix, in particular, is framed as a classical spring-mass system
with Lagrangian dynamics. The spectrum of any general symmetric graph matrix
turns out to have a simple connection with quantum systems, and it can thus be
formulated as a solution to a Schrödinger-type differential equation. Taking
into account the higher resolution requirement in the spectrum computation and
the related stability issues in the numerical solution of the underlying
differential equation, we propose the application of symplectic integrators to
the calculation of eigenspectrum. The effectiveness of the proposed techniques
is demonstrated with numerical simulations on real-world networks of different
sizes and complexities.
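The symplectic integration the abstract proposes can be sketched on a single spring-mass mode x'' = -λx, whose oscillation frequency sqrt(λ) is what the spectral method reads off as an eigenvalue. The Störmer–Verlet scheme below and its constants are an illustrative assumption, not the paper's exact integrator:

```python
def leapfrog(lam, x0, v0, dt, steps):
    # Stoermer-Verlet (symplectic) integration of x'' = -lam * x.
    # Symplectic schemes nearly conserve energy over long runs, which
    # is the stability property needed for high-resolution spectra.
    x, v = x0, v0
    for _ in range(steps):
        v -= 0.5 * dt * lam * x  # half kick
        x += dt * v              # drift
        v -= 0.5 * dt * lam * x  # half kick
    return x, v

# eigenvalue lam = 4 -> angular frequency 2 -> period pi ~ 3142 steps
x, v = leapfrog(lam=4.0, x0=1.0, v0=0.0, dt=1e-3, steps=3142)
energy = 0.5 * v * v + 0.5 * 4.0 * x * x  # started at 2.0
```

After one full period the state returns close to (x0, v0) and the energy stays essentially at its initial value of 2.0; a non-symplectic scheme such as explicit Euler would drift, blurring closely spaced eigenvalues.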
Scaling Deep Learning on GPU and Knights Landing clusters
The speed of deep neural network training has become a major bottleneck in
deep learning research and development. For example, training GoogleNet on the
ImageNet dataset with a single Nvidia K20 GPU takes 21 days. To speed up the training
process, the current deep learning systems heavily rely on the hardware
accelerators. However, these accelerators have limited on-chip memory compared
with CPUs. To handle large datasets, they need to fetch data from either CPU
memory or remote processors. We use both self-hosted Intel Knights Landing
(KNL) clusters and multi-GPU clusters as our target platforms. From an
algorithmic perspective, current distributed machine learning systems are mainly
designed for cloud systems. These methods are asynchronous because of the slow
network and high fault-tolerance requirements of cloud systems. We focus on
Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. The original
EASGD used a round-robin method for communication and updating. The communication
is ordered by machine rank ID, which is inefficient on HPC clusters.
First, we redesign four efficient algorithms for HPC systems to improve
EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD
are faster than their existing counterparts (Async SGD,
Async MSGD, and Hogwild SGD, respectively) in all the comparisons. Finally, we design
Sync EASGD, which ties for the best performance among all the methods while
being deterministic. In addition to the algorithmic improvements, we use some
system-algorithm codesign techniques to scale up the algorithms. By reducing
the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x
speedup over the original EASGD on the same platform. We achieve 91.5% weak scaling
efficiency on 4253 KNL cores, which is higher than the state-of-the-art
implementation.
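One synchronous elastic-averaging step, following the original EASGD rule of Zhang et al. that the abstract builds on, can be sketched as follows; the scalar parameters, quadratic toy loss, and constants are illustrative assumptions:

```python
def sync_easgd_step(workers, center, grads, lr=0.1, alpha=0.05):
    # Each worker descends its local gradient and is elastically
    # pulled toward the shared center variable; the center is in
    # turn pulled toward the workers. In a synchronous HPC variant
    # all workers apply this update in the same round.
    new_workers = [x - lr * g - alpha * (x - center)
                   for x, g in zip(workers, grads)]
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center

workers = [1.0, 2.0, 3.0]            # scalar parameters, one per worker
grads = [2 * x for x in workers]     # gradient of the toy loss f(x) = x**2
workers, center = sync_easgd_step(workers, center=0.0, grads=grads)
```

Because every worker's update in a round is independent of the others, the round maps naturally onto a collective communication step rather than the rank-ordered round-robin exchange of the original formulation.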