
    Improving Efficiency in Deep Learning for Large Scale Visual Recognition

    Emerging large-scale visual recognition methods, in particular deep Convolutional Neural Networks (CNNs), promise to revolutionize many computer-vision-based artificial intelligence applications, such as autonomous driving and online image retrieval systems. One of the main challenges in large-scale visual recognition is the complexity of the corresponding algorithms, which is further exacerbated by the fact that in most real-world scenarios they must run in real time and on platforms with limited computational resources. This dissertation improves the efficiency of such large-scale visual recognition algorithms from several perspectives. First, to reduce the complexity of large-scale classification to sub-linear in the number of classes, a probabilistic label tree framework is proposed. A test sample is classified by traversing the label tree from the root node; each node in the tree is associated with a probabilistic estimate over all labels, and the tree is learned recursively with iterative maximum likelihood optimization. Compared to the hard label partition proposed previously, the probabilistic framework classifies more accurately with similar efficiency. Second, we explore the redundancy of parameters in CNNs and employ sparse decomposition to significantly reduce both the number of parameters and the computational complexity. Both inter-channel and inner-channel redundancy are exploited to achieve more than 90% sparsity with approximately a 1% drop in classification accuracy. We also propose an efficient CPU-based sparse matrix multiplication algorithm to reduce the actual running time of CNN models with sparse convolutional kernels. Third, we propose a multi-stage framework based on CNNs that achieves better efficiency than a single traditional CNN model.
    By combining a cascade model with the label tree framework, the proposed method divides the input images in both the image space and the label space, and processes each image with the CNN models that are most suitable and efficient for it. The average complexity of the framework is significantly reduced, while the overall accuracy remains the same as that of the single complex model.
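    The label tree idea above can be sketched as a greedy root-to-leaf traversal. The node structure and gating-classifier interface below are illustrative assumptions, not the dissertation's actual implementation; the point is that classification costs O(tree depth) model evaluations rather than one evaluation per class:

```python
import numpy as np

class TreeNode:
    """Hypothetical label-tree node: internal nodes hold a gating model
    that scores their children; leaves hold a single class label."""
    def __init__(self, children=None, label=None, child_scorer=None):
        self.children = children or []
        self.label = label
        self.child_scorer = child_scorer  # callable: x -> scores over children

def classify(root, x):
    # Greedy root-to-leaf traversal: cost is O(depth) gating evaluations,
    # i.e. sub-linear in the total number of classes, instead of scoring
    # every class as a flat classifier does.
    node = root
    while node.children:
        scores = node.child_scorer(x)
        node = node.children[int(np.argmax(scores))]
    return node.label
```

    A probabilistic variant would propagate the product of branch probabilities along the path and could explore several high-probability branches (beam search) to recover accuracy lost to greedy hard decisions, which is in the spirit of the probabilistic framework's advantage over hard label partitions.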

    Parallel finite element modeling of the hydrodynamics in agitated tanks

    Contents:
    - Mixing in the transition flow regime
    - Technology to mix in the transition flow regime
    - Methods to characterize mixing hydrodynamics
    - Challenges to numerically model the transition flow regime in agitated tanks
    - Transition flow regime in agitated tanks
    - Parallel computing
    - Numerical modeling of the agitators' motion
    - Overall methodological approach
    - Computational resources
    - Program development strategy
    - Parallel finite element simulations of incompressible viscous fluid flow by domain decomposition with Lagrange multipliers
    - Parallel numerical model
    - Parallel implementation
    - Three-dimensional benchmark cases
    - A parallel finite element sliding mesh technique for the Navier-Stokes equations
    - Numerical method
    - Parallel implementation
    - Numerical examples
    - Parallel performance
    - Finite element modeling of the laminar and transition flow of the Superblend dual shaft coaxial mixer on parallel computers
    - Superblend coaxial mixer configuration
    - Numerical model
    - Hydrodynamics in the Superblend coaxial mixer
    - Mixing
    - Mixing efficiency
    - Parallel finite element solver
    - Parallel sliding mesh technique
    - Simulation of the hydrodynamics of a stirred tank in the transition regime
    - Recommendations for future research
    - Parallel algorithms
    - Simulation of agitated tanks in the transition flow regime

    Non-oscillatory forward-in-time method for incompressible flows

    This research extends the capabilities of Non-oscillatory Forward-in-Time (NFT) solvers operating on unstructured meshes to allow for accurate simulation of incompressible turbulent flows. This is achieved by developing Large Eddy Simulation (LES) and Detached Eddy Simulation (DES) turbulent flow methodologies and a parallel option for the flow solver. The effective use of LES and DES requires a subgrid-scale model; several subgrid-scale models are implemented and studied, and their efficacy is assessed. The NFT solvers employed in this work are based on the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), which facilitates a novel implicit Large Eddy Simulation (ILES) approach to treating turbulence. The flexibility and robustness of the new NFT MPDATA solver are studied and successfully validated using well-established benchmarks, concentrating on the flow past a sphere. The flow statistics from the solutions are compared against existing experimental and numerical data and fully confirm the validity of the approach. The parallel implementation of the flow solver is also documented and verified, showing a substantial speedup of computations. The proposed method lays the foundations for further studies and developments, especially for exploring the potential of MPDATA in the context of ILES and the associated treatment of boundary conditions at solid boundaries.
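    For a sense of the MPDATA building block, here is a minimal one-dimensional, constant-velocity sketch of the scheme (a textbook rendering with periodic boundaries and a single corrective pass; the thesis solver operates on unstructured meshes and is far more general): a first-order donor-cell pass followed by a second donor-cell pass driven by an antidiffusive pseudo-velocity that cancels the leading truncation error.

```python
import numpy as np

def donor_cell(psi, c):
    # First-order upwind (donor-cell) step in flux form; c is the Courant
    # number at interfaces i+1/2 (scalar or per-interface array).
    # Periodic boundaries via np.roll.
    psi_r = np.roll(psi, -1)                       # psi[i+1]
    flux = np.maximum(c, 0.0) * psi + np.minimum(c, 0.0) * psi_r
    return psi - (flux - np.roll(flux, 1))         # conservative update

def mpdata_step(psi, c, eps=1e-15):
    # Pass 1: diffusive first-order upwind transport.
    psi1 = donor_cell(psi, c)
    # Pass 2: antidiffusive pseudo-velocity at interface i+1/2 compensates
    # the implicit diffusion of pass 1 (sign-preserving for psi >= 0,
    # |c| <= 1); eps guards against division by zero.
    psi_r = np.roll(psi1, -1)
    c_anti = (np.abs(c) - c**2) * (psi_r - psi1) / (psi_r + psi1 + eps)
    return donor_cell(psi1, c_anti)
```

    The flux-form updates make the scheme exactly conservative, and the bounded antidiffusive Courant number keeps it positive-definite, which is the property the "non-oscillatory" family and its ILES capability build on.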

    Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. For systems that are ill-conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) using different programming paradigms and architectures. Results show that, for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems; cache reuse may be more important than reducing communication; it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI+OpenMP paradigm increases programming complexity with little performance gain. An implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
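    For reference, the unpreconditioned textbook CG iteration that the paper's parallel variants build on can be sketched as follows (a dense NumPy version for clarity; the paper's solvers operate on distributed sparse matrices, where the ordering and partitioning strategies under study determine the cost of the matrix-vector product):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                 # the dominant cost: one matvec per step
        alpha = rs / (p @ Ap)      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged on residual norm
            break
        p = r + (rs_new / rs) * p  # A-conjugate direction update
        rs = rs_new
    return x
```

    In the parallel setting, each iteration needs one matrix-vector product plus two inner products, so data ordering governs both the communication in the matvec and the cache behavior the paper measures.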