665 research outputs found

    Distributed learning of CNNs on heterogeneous CPU/GPU architectures

    Convolutional Neural Networks (CNNs) have proven to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs and to longer training times that not even the adoption of Graphics Processing Units (GPUs) could keep up with. This problem is partially solved by using more processing units and the distributed training methods offered by several frameworks dedicated to neural network training. However, these techniques do not take full advantage of the parallelization offered by CNNs or of the cooperative use of heterogeneous devices that differ in processing capability, clock speed, and memory size, among other characteristics. This paper presents a new method for the parallel training of CNNs that can be considered a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent 60-90% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices (including their processing capabilities), and other parameters. Results show that this technique is capable of reducing the training time without affecting the classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers with 500 and 1500 kernels, respectively, the best speedups reach 3.28× using four CPUs and 2.45× with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly require more than 60-90% of processing time for calculating convolutions, and speedups will tend to increase accordingly.
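
The abstract does not say how the kernels of a convolutional layer are assigned to devices. As a minimal sketch, assuming each device receives a share of the kernels proportional to a measured relative speed, the split could look like the following. The function name `split_kernels` and the speed values are hypothetical and not taken from the paper.

```python
def split_kernels(num_kernels, device_speeds):
    """Split num_kernels among devices in proportion to their relative speeds.

    device_speeds is a hypothetical list of positive relative throughputs,
    one per device; the paper's actual balancing scheme may differ.
    """
    total = float(sum(device_speeds))
    # Ideal (fractional) share of kernels for each device.
    exact = [num_kernels * s / total for s in device_speeds]
    # Start from the integer part of each share.
    shares = [int(x) for x in exact]
    leftover = num_kernels - sum(shares)
    # Give the remaining kernels to the devices with the largest fractional parts.
    order = sorted(range(len(exact)),
                   key=lambda i: exact[i] - shares[i],
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Four equal devices sharing a 500-kernel layer.
print(split_kernels(500, [1, 1, 1, 1]))
# Three devices, the first assumed twice as fast, sharing a 1500-kernel layer.
print(split_kernels(1500, [2, 1, 1]))
```

A proportional split like this is only one way to balance heterogeneous devices. It ignores communication cost, which the abstract lists (as bandwidth) among the parameters that influence the result.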

    Pragma-Oriented Parallelization of the Direct Sparse Odometry SLAM Algorithm

    Monocular 3D reconstruction is a challenging computer vision task that becomes even more stimulating when we aim at real-time performance. One way to obtain 3D reconstruction maps is through the use of Simultaneous Localization and Mapping (SLAM), a recurrent engineering problem, mainly in the area of robotics. It consists of building and updating a consistent map of the unknown environment and, simultaneously, saving the pose of the robot, or the camera, at every given time instant. A variety of algorithms has been proposed to address this problem, namely the Large Scale Direct Monocular SLAM (LSD-SLAM), ORB-SLAM, Direct Sparse Odometry (DSO) or Parallel Tracking and Mapping (PTAM), among others. However, despite the fact that these algorithms provide good results, they are computationally intensive. Hence, in this paper, we propose a modified version of DSO SLAM, which implements code parallelization techniques using OpenMP, an API for introducing parallelism in C, C++ and Fortran programs, that supports multi-platform shared memory multi-processing programming. With this approach we propose multiple directive-based code modifications, in order to make the SLAM algorithm execute considerably faster. The performance of the proposed solution was evaluated on standard datasets and provides speedups above 40% without significant extra parallel programming effort.

    On the Evaluation of Energy-Efficient Deep Learning Using Stacked Autoencoders on Mobile GPUs

    Over the last years, deep learning architectures have gained attention by winning important international detection and classification challenges. However, due to high levels of energy consumption, the need to use low-power devices at acceptable throughput performance is higher than ever. This paper tries to solve this problem by introducing energy efficient deep learning based on local training and using low-power mobile GPU parallel architectures, all conveniently supported by the same high-level description of the deep network. Also, it proposes to discover the maximum dimensions that a particular type of deep learning architecture—the stacked autoencoder—can support by finding the hardware limitations of a representative group of mobile GPUs and platforms.

    Optimized Voronoi-based algorithms for parallel shortest vector computations

    This paper addresses Voronoi cell-based algorithms, specifically the "Relevant Vectors" algorithm, used to solve the Shortest Vector Problem, a fundamental challenge in lattice-based cryptanalysis. Several optimizations are proposed to reduce the execution time of the original algorithm. It is also shown that the algorithm is highly suited for parallel execution on both CPUs and GPUs. The proposed optimizations are based on pruning, i.e., avoiding computations that will not, with high probability, improve the solution. The pruning criterion is related to the target vector's norm relative to the norm of the current best solution vector. When pruning is performed without pre-processing, speedups of up to 69× are observed compared to the original algorithm. If a pre-processing sorting step is performed, which requires storing the norm-ordered target vectors and therefore significantly more memory, this speedup increases to 77×. On the parallel processing side, the multi-core version of the optimized algorithm exhibits linear scalability on a CPU with up to 28 threads and keeps scaling, albeit at a lower rate, with Simultaneous Multi-Threading with up to 56 threads. The lack of support for efficient global synchronization among threads in GPUs does not allow for a scalable implementation of the pruning optimization on these devices. Nevertheless, a parallel GPU version of the non-optimized algorithm is demonstrated to be competitive with the parallel non-optimized CPU version, although the latter outperforms the former when using 56 threads. It is argued that the GPU version would outperform the CPU for higher lattice dimensions, although this statement cannot be experimentally verified due to the limited memory available on current GPU boards.

    Observation of the doubly charmed baryon decay Ξcc++→Ξc′+π+

    The Ξcc++→Ξc′+π+ decay is observed using proton-proton collisions collected by the LHCb experiment at a centre-of-mass energy of 13 TeV, corresponding to an integrated luminosity of 5.4 fb−1. The Ξcc++→Ξc′+π+ decay is reconstructed partially, where the photon from the Ξc′+→Ξc+γ decay is not reconstructed and the pK−π+ final state of the Ξc+ baryon is employed. The Ξcc++→Ξc′+π+ branching fraction relative to that of the Ξcc++→Ξc+π+ decay is measured to be 1.41 ± 0.17 ± 0.10, where the first uncertainty is statistical and the second systematic.

    Test of lepton universality in $b \rightarrow s \ell^+ \ell^-$ decays

    The first simultaneous test of muon-electron universality using $B^{+}\rightarrow K^{+}\ell^{+}\ell^{-}$ and $B^{0}\rightarrow K^{*0}\ell^{+}\ell^{-}$ decays is performed, in two ranges of the dilepton invariant-mass squared, $q^{2}$. The analysis uses beauty mesons produced in proton-proton collisions collected with the LHCb detector between 2011 and 2018, corresponding to an integrated luminosity of 9 $\mathrm{fb}^{-1}$. Each of the four lepton universality measurements reported is either the first in the given $q^{2}$ interval or supersedes previous LHCb measurements. The results are compatible with the predictions of the Standard Model. Comment: All figures and tables, along with any supplementary material and additional information, are available at https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2022-046.html (LHCb public pages).

    Study of charmonium and charmonium-like contributions in B+ → J/ψηK+ decays

    A study of B+ → J/ψηK+ decays, followed by J/ψ → μ+μ− and η → γγ, is performed using a dataset collected with the LHCb detector in proton-proton collisions at centre-of-mass energies of 7, 8 and 13 TeV, corresponding to an integrated luminosity of 9 fb−1. The J/ψη mass spectrum is investigated for contributions from charmonia and charmonium-like states. Evidence is found for the B+ → (ψ2(3823) → J/ψη)K+ and B+ → (ψ(4040) → J/ψη)K+ decays with significances of 3.4 and 4.7 standard deviations, respectively. This constitutes the first evidence for the ψ2(3823) → J/ψη decay.

    Second asymptomatic carotid surgery trial (ACST-2): a randomised comparison of carotid artery stenting versus carotid endarterectomy

    Background: Among asymptomatic patients with severe carotid artery stenosis but no recent stroke or transient cerebral ischaemia, either carotid artery stenting (CAS) or carotid endarterectomy (CEA) can restore patency and reduce long-term stroke risks. However, from recent national registry data, each option causes about 1% procedural risk of disabling stroke or death. Comparison of their long-term protective effects requires large-scale randomised evidence. Methods: ACST-2 is an international multicentre randomised trial of CAS versus CEA among asymptomatic patients with severe stenosis thought to require intervention, interpreted with all other relevant trials. Patients were eligible if they had severe unilateral or bilateral carotid artery stenosis and both doctor and patient agreed that a carotid procedure should be undertaken, but they were substantially uncertain which one to choose. Patients were randomly allocated to CAS or CEA and followed up at 1 month and then annually, for a mean of 5 years. Procedural events were those within 30 days of the intervention. Intention-to-treat analyses are provided. Analyses including procedural hazards use tabular methods. Analyses and meta-analyses of non-procedural strokes use Kaplan-Meier and log-rank methods. The trial is registered with the ISRCTN registry, ISRCTN21144362. Findings: Between Jan 15, 2008, and Dec 31, 2020, 3625 patients in 130 centres were randomly allocated, 1811 to CAS and 1814 to CEA, with good compliance, good medical therapy and a mean of 5 years of follow-up. Overall, 1% had disabling stroke or death procedurally (15 allocated to CAS and 18 to CEA) and 2% had non-disabling procedural stroke (48 allocated to CAS and 29 to CEA). Kaplan-Meier estimates of 5-year non-procedural stroke were 2·5% in each group for fatal or disabling stroke, and 5·3% with CAS versus 4·5% with CEA for any stroke (rate ratio [RR] 1·16, 95% CI 0·86–1·57; p=0·33). Combining RRs for any non-procedural stroke in all CAS versus CEA trials, the RR was similar in symptomatic and asymptomatic patients (overall RR 1·11, 95% CI 0·91–1·32; p=0·21). Interpretation: Serious complications are similarly uncommon after competent CAS and CEA, and the long-term effects of these two carotid artery procedures on fatal or disabling stroke are comparable. Funding: UK Medical Research Council and Health Technology Assessment Programme.
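
The rounded procedural percentages in the Findings can be checked against the counts quoted there. The short sketch below recomputes them, assuming the denominators are the numbers of patients allocated to each arm (the intention-to-treat reading). The per-arm percentages it prints are derived here for illustration and are not quoted in the abstract.

```python
def percent(events, allocated):
    """Return events as a percentage of the patients allocated."""
    return 100.0 * events / allocated


# Counts quoted in the Findings paragraph.
allocated = {"CAS": 1811, "CEA": 1814}
disabling_stroke_or_death = {"CAS": 15, "CEA": 18}
non_disabling_stroke = {"CAS": 48, "CEA": 29}

total = sum(allocated.values())
overall_disabling = percent(sum(disabling_stroke_or_death.values()), total)
overall_non_disabling = percent(sum(non_disabling_stroke.values()), total)

for arm in allocated:
    print(arm,
          "disabling stroke or death: %.2f%%"
          % percent(disabling_stroke_or_death[arm], allocated[arm]),
          "non-disabling stroke: %.2f%%"
          % percent(non_disabling_stroke[arm], allocated[arm]))
print("overall: %.2f%% and %.2f%%" % (overall_disabling, overall_non_disabling))
```

The overall figures round to the 1% and 2% stated in the abstract.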

    Precision measurement of $CP$ violation in the penguin-mediated decay $B_s^{0}\rightarrow\phi\phi$

    A flavor-tagged time-dependent angular analysis of the decay $B_s^{0}\rightarrow\phi\phi$ is performed using $pp$ collision data collected by the LHCb experiment at a center-of-mass energy of 13 TeV, corresponding to an integrated luminosity of 6 $\mathrm{fb}^{-1}$. The $CP$-violating phase and direct $CP$-violation parameter are measured to be $\phi_{s\bar{s}s} = -0.042 \pm 0.075 \pm 0.009$ rad and $|\lambda| = 1.004 \pm 0.030 \pm 0.009$, respectively, assuming the same values for all polarization states of the $\phi\phi$ system. In these results, the first uncertainties are statistical and the second systematic. These parameters are also determined separately for each polarization state, showing no evidence for polarization dependence. The results are combined with previous LHCb measurements using $pp$ collisions at center-of-mass energies of 7 and 8 TeV, yielding $\phi_{s\bar{s}s} = -0.074 \pm 0.069$ rad and $|\lambda| = 1.009 \pm 0.030$. This is the most precise study of time-dependent $CP$ violation in a penguin-dominated $B$ meson decay. The results are consistent with $CP$ symmetry and with the Standard Model predictions. Comment: All figures and tables, along with any supplementary material and additional information, are available at https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2023-001.html (LHCb public pages).

    Observation of the Decay Λb0→Λc+τ−ν¯τ

    The first observation of the semileptonic b-baryon decay Λb0→Λc+τ−ν¯τ, with a significance of 6.1σ, is reported using a data sample corresponding to 3 fb−1 of integrated luminosity, collected by the LHCb experiment at center-of-mass energies of 7 and 8 TeV at the LHC. The τ− lepton is reconstructed in the hadronic decay to three charged pions. The ratio K = B(Λb0→Λc+τ−ν¯τ)/B(Λb0→Λc+π−π+π−) is measured to be 2.46±0.27±0.40, where the first uncertainty is statistical and the second systematic. The branching fraction B(Λb0→Λc+τ−ν¯τ) = (1.50±0.16±0.25±0.23)% is obtained, where the third uncertainty is from the external branching fraction of the normalization channel Λb0→Λc+π−π+π−. The ratio of semileptonic branching fractions R(Λc+) ≡ B(Λb0→Λc+τ−ν¯τ)/B(Λb0→Λc+μ−ν¯μ) is derived to be 0.242±0.026±0.040±0.059, where the external branching fraction uncertainty from the channel Λb0→Λc+μ−ν¯μ contributes to the last term. This result is in agreement with the standard model prediction.
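
The three central values in this abstract are linked by simple ratios: the branching fraction is K times the normalization-channel branching fraction, and R(Λc+) is the branching fraction divided by that of the muonic channel. As a rough illustration, the sketch below back-computes the external branching fractions these central values imply. Those implied values are not quoted in the abstract, they inherit the rounding of the quoted numbers, and the variable names are hypothetical.

```python
# Central values quoted in the abstract (uncertainties omitted).
k_ratio = 2.46            # B(tau channel) / B(three-pion normalization channel)
bf_tau_percent = 1.50     # B(tau channel), in percent
r_lambda_c = 0.242        # B(tau channel) / B(muon channel)

# B(tau) = K * B(norm)  =>  B(norm) = B(tau) / K
implied_norm_bf_percent = bf_tau_percent / k_ratio
# R = B(tau) / B(mu)    =>  B(mu) = B(tau) / R
implied_mu_bf_percent = bf_tau_percent / r_lambda_c

print("implied normalization branching fraction: %.2f%%" % implied_norm_bf_percent)
print("implied muonic branching fraction: %.2f%%" % implied_mu_bf_percent)
```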