
    On the construction of an efficient finite-element solver for phase-field simulations of many-particle solid-state-sintering processes

    We present an efficient solver for the simulation of many-particle solid-state-sintering processes. The microstructure evolution is described by a system of equations consisting of one Cahn–Hilliard equation and a set of Allen–Cahn equations to distinguish neighboring particles. The particle packing is discretized in space via multicomponent linear adaptive finite elements and implicitly in time with variable time-step sizes, resulting in a large nonlinear system of equations with strong coupling between all components to be solved. Since on average 10k degrees of freedom per particle are necessary to accurately capture the interface dynamics in 3D, we propose strategies to solve the resulting large and challenging systems. These include the efficient evaluation of the Jacobian matrix as well as the implementation of Jacobian-free methods by applying state-of-the-art matrix-free algorithms for high and dynamic numbers of components, advances regarding preconditioning, and a fully distributed grain-tracking algorithm. We validate the obtained results, examine the node-level performance in detail, and demonstrate scalability up to 10k particles on modern supercomputers. Such numbers of particles are sufficient to simulate the sintering process in (statistically meaningful) representative volume elements. Our framework thus forms a valuable tool for the virtual design of solid-state-sintering processes for pure metals and their alloys.
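A minimal sketch of the Jacobian-free idea mentioned in the abstract: the Jacobian-vector product is approximated by a finite difference of the nonlinear residual, so a Krylov solver such as GMRES never needs the assembled matrix. The residual function `F` (standing in for the discretized Cahn–Hilliard/Allen–Cahn system) is an assumed input; this illustrates the general technique, not the paper's solver.

```python
# Minimal Jacobian-free Newton-Krylov sketch: the action of the Jacobian on a
# vector is approximated by a finite difference of the residual F, so GMRES
# never sees an assembled matrix. F is an assumed user-supplied residual.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def jfnk_step(F, u):
    r = F(u)                                   # nonlinear residual at u
    eps = 1e-7 * (1.0 + np.linalg.norm(u))     # finite-difference step size

    def jv(v):                                 # J(u) @ v without assembling J
        return (F(u + eps * v) - r) / eps

    J = LinearOperator((u.size, u.size), matvec=jv)
    du, info = gmres(J, -r)                    # inexact Newton step J du = -r
    return u + du

# Toy check on F(u) = u**3 - 1 (componentwise), root at u = 1:
u = np.full(4, 2.0)
for _ in range(30):
    u = jfnk_step(lambda w: w**3 - 1.0, u)
print(u)                                       # approaches [1. 1. 1. 1.]
```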

    PI-BA Bundle Adjustment Acceleration on Embedded FPGAs with Co-observation Optimization

    Bundle adjustment (BA) is a fundamental optimization technique used in many crucial applications, including 3D scene reconstruction, robotic localization, camera calibration, autonomous driving, space exploration, and street view map generation. Essentially, BA is a joint non-linear optimization problem, and one which can consume a significant amount of time and power, especially for large problems. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper we propose π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC that exploits custom hardware for higher performance and power efficiency. Specifically, based on our key observation that not all points appear in all images in a BA problem, we designed and implemented a Co-Observation Optimization technique to accelerate BA operations with optimized usage of memory and computation resources. Experimental results confirm that π-BA outperforms existing software implementations in terms of performance and power consumption.
    Comment: in Proceedings of IEEE FCCM 2019
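The co-observation structure the authors exploit can be illustrated in a few lines: in the reduced (Schur-complement) camera system of BA, the (i, j) camera block is nonzero only when cameras i and j observe a common point, so only co-observing pairs generate work. The observation-list format below is a hypothetical illustration, not π-BA's actual data layout.

```python
# Which camera pairs are coupled in the reduced (Schur-complement) camera
# system: only pairs that co-observe at least one point. The (cam_id,
# point_id) observation list is a hypothetical format for illustration.
from collections import defaultdict
from itertools import combinations

def coupled_camera_pairs(observations):
    cams_seeing = defaultdict(set)             # point -> cameras observing it
    for cam, pt in observations:
        cams_seeing[pt].add(cam)
    pairs = set()
    for cams in cams_seeing.values():          # each co-observing group
        pairs.update(combinations(sorted(cams), 2))
    return pairs

obs = [(0, 7), (1, 7), (1, 9), (2, 9)]         # toy observations
print(coupled_camera_pairs(obs))               # {(0, 1), (1, 2)}; no (0, 2)
```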

    Problems in the optimal control of finite and infinite dimensional linear systems

    A review of optimal control theory for linear systems with quadratic cost functions is presented. Some of the theoretical and practical limitations are discussed, with special reference to distributed parameter systems. First, a procedure is described for finding the optimal control by constructing a sequence of controllers that converges to the optimal one; this method is valid for systems of infinite dimension provided that the operators in the state differential equation satisfy certain conditions. The proof is carried out for both the finite and infinite time interval, and the connection with the Riccati equation is shown. The main problem in implementation is that one needs complete knowledge of the state at all times in order to build the optimal controller; this is almost certainly impossible for distributed parameter systems. When the state cannot be measured completely, it is proved that an optimal control is realisable for time-invariant finite-dimensional systems. The problems of finding this control are then investigated and computational methods discussed. If the optimal control with complete knowledge of the state cannot be implemented, a method is presented whereby one can find bounds on the possible increase in the value of the cost function arising from the use of some sub-optimal control; several examples are considered. The constrained optimal control depends on the initial state, so new optimisation criteria must be put forward for the case in which the initial state is unknown; the most common consist of minimising the cost that can result from the worst initial state. It is then shown how controllers designed according to these criteria may be improved by using one's limited observation at time zero to place some constraints on the initial state. The Liapunov matrix equation plays an important part in calculating the cost of any control, so reducing the computational effort of its solution is useful. It is shown how this can be done; this is of special relevance for distributed parameter systems, whose states are expressed as an infinite series of eigenfunctions, and the results are applied to a diffusion equation example. Finally, it is shown how optimal control theory may be applied to the design of proportional-integral-derivative controllers. This is done from two standpoints, and the resulting controllers are shown to be identical, though the second method of proof is valid for infinite-dimensional systems. The results are applied to a simple example and to a distributed population dynamics system. The methods of the thesis are also tested on a system with realistic parameters, and recommendations are made as to the best approaches.
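As a numerical illustration of the finite-dimensional results reviewed here, the sketch below solves the algebraic Riccati equation for the optimal linear-quadratic gain and then uses a Liapunov matrix equation to evaluate the quadratic cost of the resulting feedback. The system matrices are made up for the example.

```python
# Illustration with made-up matrices: Riccati equation for the optimal LQ
# gain, then a Liapunov equation for the closed-loop quadratic cost.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0],
              [0.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                                  # state weighting
R = np.array([[1.0]])                          # control weighting

P = solve_continuous_are(A, B, Q, R)           # A'P + PA - PBR^{-1}B'P + Q = 0
K = np.linalg.solve(R, B.T @ P)                # optimal feedback u = -Kx

# Cost of any stabilizing gain K via the Liapunov matrix equation:
# (A - BK)'X + X(A - BK) = -(Q + K'RK), cost from x0 is x0'Xx0.
Acl = A - B @ K
X = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
x0 = np.array([1.0, 0.0])
print("optimal cost:", x0 @ X @ x0)            # equals x0' P x0 at the optimum
```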

    A general framework for efficient FPGA implementation of matrix product

    High performance systems are required by developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for the generation of matrix algorithm cores is described. The system provides a catalog of efficient, user-customizable cores designed for FPGA implementation, covering three matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general-purpose or an application-specific core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second is a parallel floating-point matrix multiplier designed for 3D affine transformations.
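A software reference for the kind of constant-coefficient matrix product the first core computes: a fixed-point RGB-to-YCbCr conversion (BT.601 coefficients, chosen here for illustration; the article does not specify the colour space). On an FPGA, distributed arithmetic would replace the multiplies with precomputed look-up tables; this sketch only pins down the arithmetic.

```python
# Fixed-point reference for a constant-matrix colour-space conversion
# (RGB -> YCbCr, BT.601 coefficients chosen for illustration). Distributed
# arithmetic would realize the same MACs as LUT lookups on the FPGA.
import numpy as np

M = np.array([[ 0.299,  0.587,  0.114],
              [-0.169, -0.331,  0.500],
              [ 0.500, -0.419, -0.081]])
OFFSET = np.array([0, 128, 128])

def rgb_to_ycbcr(rgb):
    """rgb: uint8 array with shape (..., 3); returns uint8 YCbCr."""
    Mq = np.round(M * 256).astype(np.int32)    # Q8.8 fixed-point coefficients
    y = (rgb.astype(np.int32) @ Mq.T) >> 8     # multiply-accumulate, rescale
    return np.clip(y + OFFSET, 0, 255).astype(np.uint8)

print(rgb_to_ycbcr(np.array([255, 0, 0])))     # pure red -> [ 76  85 255]
```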

    Hamiltonian System Approach to Distributed Spectral Decomposition in Networks

    Because of the significant increase in the size and complexity of networks, the distributed computation of eigenvalues and eigenvectors of graph matrices has become very challenging, yet it remains as important as before. In this paper we develop efficient distributed algorithms to detect, with higher resolution, closely situated eigenvalues and the corresponding eigenvectors of symmetric graph matrices. We model graph spectral computation as a physical system with Lagrangian and Hamiltonian dynamics. The spectrum of the Laplacian matrix, in particular, is framed as a classical spring-mass system with Lagrangian dynamics. The spectrum of any general symmetric graph matrix turns out to have a simple connection with quantum systems, and it can thus be formulated as the solution to a Schrödinger-type differential equation. Taking into account the higher resolution required in the spectrum computation and the related stability issues in the numerical solution of the underlying differential equation, we propose the application of symplectic integrators to the calculation of the eigenspectrum. The effectiveness of the proposed techniques is demonstrated with numerical simulations on real-world networks of different sizes and complexities.
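The spring-mass formulation can be sketched directly: integrating x'' = -Lx with a symplectic leapfrog integrator and Fourier-transforming the trajectory recovers the Laplacian eigenvalues as spectral peaks at omega = sqrt(lambda). The toy path graph and parameters below are assumptions for illustration; the paper's networks and integrator details differ.

```python
# Toy illustration: recover Laplacian eigenvalues from a leapfrog-integrated
# spring-mass trajectory. Path graph and all parameters are assumptions.
import numpy as np

n, dt, steps = 8, 0.05, 4096
A = np.eye(n, k=1) + np.eye(n, k=-1)           # path-graph adjacency
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
rng = np.random.default_rng(0)
x, v = rng.standard_normal(n), np.zeros(n)

traj = np.empty((steps, n))
for t in range(steps):                         # leapfrog (kick-drift-kick),
    v -= 0.5 * dt * (L @ x)                    # a symplectic integrator
    x += dt * v
    v -= 0.5 * dt * (L @ x)
    traj[t] = x

# Peaks of the windowed spectrum sit near omega = sqrt(eigenvalues of L);
# the symplectic scheme keeps these frequencies stable over long runs.
sig = traj[:, 0] - traj[:, 0].mean()           # drop the zero-eigenvalue mode
spec = np.abs(np.fft.rfft(sig * np.hanning(steps)))
omega = 2 * np.pi * np.fft.rfftfreq(steps, d=dt)
print("expected peaks:", np.sqrt(np.linalg.eigvalsh(L)))
print("strongest peak:", omega[np.argmax(spec)])
```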

    Scaling Deep Learning on GPU and Knights Landing clusters

    The speed of training deep neural networks has become a major bottleneck in deep learning research and development. For example, training GoogLeNet on the ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirements of cloud systems. We focus on Elastic Averaging SGD (EASGD) in designing algorithms for HPC clusters. The original EASGD used a round-robin method for communication and updating, ordered by machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster than their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, respectively) in all our comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use system-algorithm co-design techniques to scale up the algorithms. By reducing the share of communication from 87% to 14%, our Sync EASGD achieves a 5.3x speedup over the original EASGD on the same platform. We achieve 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.
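For reference, a sketch of the synchronous EASGD update that Sync EASGD builds on (following the original EASGD rule of Zhang et al., 2015): each worker takes a gradient step plus an elastic pull toward a center variable, and the center moves toward the workers. The quadratic toy loss and hyperparameters below are illustrative assumptions, not the paper's settings.

```python
# Sketch of one synchronous EASGD step (after Zhang et al., 2015): workers
# take a gradient step plus an elastic pull toward the center variable, and
# the center moves toward the workers. All hyperparameters are illustrative.
import numpy as np

def sync_easgd_step(workers, center, grads, lr=0.01, rho=0.1):
    alpha = lr * rho                           # elastic coupling strength
    new_workers = [x - lr * g - alpha * (x - center)
                   for x, g in zip(workers, grads)]
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center

# Toy usage: two workers minimizing 0.5 * ||x - 1||^2.
rng = np.random.default_rng(0)
workers = [rng.standard_normal(3) for _ in range(2)]
center = np.zeros(3)
for _ in range(2000):
    grads = [x - np.ones(3) for x in workers]  # exact gradient of the toy loss
    workers, center = sync_easgd_step(workers, center, grads)
print(center)                                  # approaches [1. 1. 1.]
```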