
    Parallel growing and training of neural networks using output parallelism

    In order to find an appropriate architecture for a large-scale real-world application automatically and efficiently, a natural approach is to divide the original problem into a set of sub-problems. In this paper, we propose a simple neural network task decomposition method based on output parallelism. Using this method, a problem can be divided flexibly into a chosen number of sub-problems, each of which is composed of the whole input vector and a fraction of the output vector. Each module (one per sub-problem) is responsible for producing a fraction of the output vector of the original problem, so the hidden structures serving the original problem’s output units are decoupled. These modules can be grown and trained in parallel on parallel processing elements. Combined with a constructive learning algorithm, our method requires neither excessive computation nor any prior knowledge about the decomposition. The feasibility of output parallelism is analyzed and proved, and several benchmarks are implemented to test the validity of the method. The results show that it can reduce computational time, increase learning speed and improve generalization accuracy for both classification and regression problems.
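
    As a rough illustration of this idea, the hypothetical sketch below (PyTorch, toy dimensions and an arbitrary three-way output grouping) trains one small module per output slice: every module sees the full input vector but learns only its own fraction of the output vector, so the modules are independent and could be trained on separate processing elements. The paper’s constructive growing of hidden units is not reproduced here.

    # Minimal sketch of output parallelism: each module sees the full input
    # vector but is trained only on a slice of the output vector. Dimensions,
    # grouping and the training loop are illustrative, not the paper's method.
    import torch
    import torch.nn as nn

    D_IN, D_OUT = 16, 6                   # hypothetical problem dimensions
    groups = [(0, 2), (2, 4), (4, 6)]     # output slices, one per module

    X = torch.randn(512, D_IN)            # toy data standing in for a benchmark
    Y = torch.randn(512, D_OUT)

    def make_module(n_out):
        # one small feedforward module per sub-problem
        return nn.Sequential(nn.Linear(D_IN, 8), nn.Tanh(), nn.Linear(8, n_out))

    def train_module(net, y_slice, epochs=200):
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        for _ in range(epochs):
            opt.zero_grad()
            nn.functional.mse_loss(net(X), y_slice).backward()
            opt.step()

    # Each call is independent, so the modules could run on separate
    # processing elements; here they are trained in a simple loop.
    modules = [make_module(b - a) for a, b in groups]
    for net, (a, b) in zip(modules, groups):
        train_module(net, Y[:, a:b])

    # The overall output is the concatenation of the modules' outputs.
    Y_hat = torch.cat([net(X) for net, (a, b) in zip(modules, groups)], dim=1)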

    Task decomposition using pattern distributor

    In this paper, we propose a new task decomposition method for multilayered feedforward neural networks, namely Task Decomposition with Pattern Distributor, intended to shorten the training time and improve the generalization accuracy of a network under training. The new method combines modules (small feedforward networks) in parallel and in series to produce the overall solution to a complex problem. Following a “divide-and-conquer” strategy, the original problem is decomposed into several simpler sub-problems by a pattern distributor module in the network, where each sub-problem is composed of the whole input vector and a fraction of the output vector of the original problem. These sub-problems are then solved by the corresponding groups of modules, where each group is connected in series with the pattern distributor module and the modules within a group are connected in parallel. The design details and implementation of this new method are introduced in this paper. Several benchmark classification problems are used to test it, and the analysis and experimental results show that the method can reduce training time and improve generalization accuracy.
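
    The series/parallel wiring described above can be illustrated with a short, hypothetical sketch (PyTorch, six classes assumed to be grouped into two groups): at inference time the pattern distributor module first assigns each pattern to a group, and the selected group module, connected in series after the distributor, then predicts the class within that group. Network sizes, the grouping and the training of the modules are assumptions, not details from the paper.

    # Hypothetical sketch of inference with a pattern distributor: the
    # distributor routes each pattern to a group, and that group's module
    # (in series with the distributor) produces the final class.
    import torch
    import torch.nn as nn

    D_IN = 16
    class_groups = [[0, 1, 2], [3, 4, 5]]          # assumed class grouping

    distributor = nn.Sequential(nn.Linear(D_IN, 12), nn.ReLU(),
                                nn.Linear(12, len(class_groups)))
    group_nets = [nn.Sequential(nn.Linear(D_IN, 12), nn.ReLU(),
                                nn.Linear(12, len(g))) for g in class_groups]

    def predict(x):
        # Series step: the distributor chooses a group for each pattern.
        g = distributor(x).argmax(dim=1)
        out = torch.empty(x.size(0), dtype=torch.long)
        # Parallel step: each group module handles only its own patterns.
        for gi, (net, classes) in enumerate(zip(group_nets, class_groups)):
            mask = g == gi
            if mask.any():
                local = net(x[mask]).argmax(dim=1)
                out[mask] = torch.tensor(classes)[local]
        return out

    # predict(torch.randn(4, D_IN)) returns one class label per pattern;
    # the modules would of course have to be trained first.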

    Reduced pattern training based on task decomposition using pattern distributor

    Task Decomposition with Pattern Distributor (PD) is a new task decomposition method for multilayered feedforward neural networks, and a pattern distributor network is proposed to implement it. We propose a theoretical model to analyze the performance of the pattern distributor network, and we introduce a method named Reduced Pattern Training that aims to improve the performance of pattern distribution. Our analysis and experimental results show that reduced pattern training improves the performance of the pattern distributor network significantly, and that the distributor module’s classification accuracy dominates the whole network’s performance. Two combination methods, namely cross-talk-based combination and genetic-algorithm-based combination, are presented to find a suitable grouping for the distributor module. Experimental results show that the new method can reduce training time and improve network generalization accuracy when compared to a conventional method such as constructive backpropagation or a task decomposition method such as Output Parallelism.
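
    The reduced-pattern-training idea can be sketched as follows, again with hypothetical sizes and grouping (PyTorch): the distributor module is trained on the full pattern set with group labels as targets, while each non-distributor module is trained only on the reduced pattern set whose true classes fall inside its group. The cross-talk-based and genetic-algorithm-based grouping searches are not shown; the grouping below is simply assumed.

    # Hypothetical sketch of reduced pattern training: full pattern set for
    # the distributor, reduced (per-group) pattern sets for the group modules.
    import torch
    import torch.nn as nn

    D_IN = 16
    class_groups = [[0, 1, 2], [3, 4, 5]]          # assumed class grouping
    X = torch.randn(600, D_IN)                     # toy training patterns
    y = torch.randint(0, 6, (600,))                # toy class labels

    group_of = {c: gi for gi, g in enumerate(class_groups) for c in g}
    y_group = torch.tensor([group_of[int(c)] for c in y])

    def train(net, inputs, targets, epochs=100):
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        for _ in range(epochs):
            opt.zero_grad()
            nn.functional.cross_entropy(net(inputs), targets).backward()
            opt.step()

    # Distributor module: trained on every pattern, with group labels.
    distributor = nn.Sequential(nn.Linear(D_IN, 12), nn.ReLU(),
                                nn.Linear(12, len(class_groups)))
    train(distributor, X, y_group)

    # Group modules: trained only on their own (reduced) pattern sets.
    for gi, classes in enumerate(class_groups):
        mask = y_group == gi
        local = torch.tensor([classes.index(int(c)) for c in y[mask]])
        net = nn.Sequential(nn.Linear(D_IN, 12), nn.ReLU(),
                            nn.Linear(12, len(classes)))
        train(net, X[mask], local)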

    Distributed learning of CNNs on heterogeneous CPU/GPU architectures

    Convolutional Neural Networks (CNNs) have proven to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs and longer training times that not even the adoption of Graphics Processing Units (GPUs) can keep up with. This problem is partially solved by using more processing units and the distributed training methods offered by several frameworks dedicated to neural network training. However, these techniques do not take full advantage of the parallelization possible within CNNs or of the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, memory sizes, among other characteristics. This paper presents a new method for the parallel training of CNNs that can be considered a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent 60-90% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices (including their processing capabilities) and other parameters. Results show that this technique is capable of reducing training time without affecting classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers of 500 and 1500 kernels, respectively, the best speedups reach 3.28× using four CPUs and 2.45× with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly spend more than 60-90% of processing time calculating convolutions, and the speedups will tend to increase accordingly.
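
    The convolution-only model parallelism described above can be illustrated with a minimal sketch (PyTorch, toy sizes, a two-way split of the kernels): the layer’s kernels are divided into chunks, each chunk is convolved with the same input (in a real heterogeneous setup each chunk would sit on a different CPU or GPU), and the partial feature maps are concatenated along the channel axis, which is numerically equivalent to one larger convolutional layer.

    # Minimal sketch of distributing a convolutional layer by splitting its
    # kernels across devices; sizes and the two-way split are illustrative.
    import torch
    import torch.nn as nn

    in_ch, out_ch, k = 3, 64, 3
    splits = [32, 32]                     # kernels assigned to each device

    # One Conv2d per device, each holding a subset of the output channels.
    branches = [nn.Conv2d(in_ch, n, k, padding=1) for n in splits]
    # In a heterogeneous setup each branch would be moved to its own device,
    # e.g. branches[1].to("cuda:0"); everything stays on the CPU here.

    x = torch.randn(8, in_ch, 32, 32)     # a CIFAR-10-sized toy batch

    # The same input goes to every branch; concatenating the partial feature
    # maps reproduces the output of a single Conv2d with out_ch kernels.
    y = torch.cat([b(x) for b in branches], dim=1)
    assert y.shape == (8, out_ch, 32, 32)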