665 research outputs found
Distributed learning of CNNs on heterogeneous CPU/GPU architectures
Convolutional Neural Networks (CNNs) have proven to be powerful classification
tools in tasks that range from check reading to medical diagnosis, reaching
close to human perception, and in some cases surpassing it. However, the
problems to solve are becoming larger and more complex, which translates to
larger CNNs, leading to longer training times that not even the adoption of
Graphics Processing Units (GPUs) could keep up with. This problem is partially
solved by using more processing units and distributed training methods that are
offered by several frameworks dedicated to neural network training. However,
these techniques do not take full advantage of the possible parallelization
offered by CNNs and the cooperative use of heterogeneous devices with different
processing capabilities, clock speeds, memory size, among others. This paper
presents a new method for the parallel training of CNNs that can be considered
as a particular instantiation of model parallelism, where only the
convolutional layer is distributed. In fact, the convolutions processed during
training (forward and backward propagation included) represent from -\%
of global processing time. The paper analyzes the influence of network size,
bandwidth, batch size, number of devices, including their processing
capabilities, and other parameters. Results show that this technique is capable
of diminishing the training time without affecting the classification
performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with
two convolutional layers, and and kernels, respectively, best
speedups achieve using four CPUs and with three GPUs.
Modern imaging datasets, larger and more complex than CIFAR-10, will certainly
require more than -\% of processing time calculating convolutions, and
speedups will tend to increase accordingly.
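As a rough illustration of the idea in the abstract above (not the authors' implementation), the following Python sketch splits the kernels of one convolutional layer across heterogeneous devices in proportion to an assumed relative speed, and bounds the achievable speedup with Amdahl's law. The function names, the relative speeds and the convolution fraction used in the example are hypothetical, and communication costs are ignored.

```python
def split_kernels(num_kernels, relative_speeds):
    """Assign a kernel count to each device, proportional to its relative speed.

    relative_speeds holds hypothetical positive weights, e.g. measured
    throughput of each CPU or GPU on a small convolution benchmark.
    """
    total = sum(relative_speeds)
    shares = [int(num_kernels * s / total) for s in relative_speeds]
    # Rounding down may leave a few kernels unassigned; give them to the
    # fastest devices first.
    leftover = num_kernels - sum(shares)
    fastest_first = sorted(range(len(relative_speeds)),
                           key=lambda i: relative_speeds[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def amdahl_speedup(conv_fraction, num_devices):
    """Upper bound on speedup when only the convolution part is distributed.

    conv_fraction is the share of training time spent in convolutions
    (an assumed value here); the remainder stays sequential.
    """
    return 1.0 / ((1.0 - conv_fraction) + conv_fraction / num_devices)


if __name__ == "__main__":
    # Hypothetical setup: two slow devices and one device twice as fast.
    print(split_kernels(100, [1, 1, 2]))
    # Hypothetical convolution share of 80% distributed over four devices.
    print(amdahl_speedup(0.8, 4))
```

The bound makes the abstract's closing remark concrete: the larger the share of time spent in convolutions, the closer the speedup can get to the number of devices.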
Pragma-Oriented Parallelization of the Direct Sparse Odometry SLAM Algorithm
Monocular 3D reconstruction is a challenging computer
vision task that becomes even more stimulating when we
aim at real-time performance. One way to obtain 3D reconstruction
maps is through the use of Simultaneous Localization
and Mapping (SLAM), a recurrent engineering problem, mainly
in the area of robotics. It consists of building and updating a
consistent map of the unknown environment and, simultaneously,
saving the pose of the robot, or the camera, at every given time
instant. A variety of algorithms has been proposed to address
this problem, namely the Large Scale Direct Monocular SLAM
(LSD-SLAM), ORB-SLAM, Direct Sparse Odometry (DSO) or
Parallel Tracking and Mapping (PTAM), among others. However,
despite the fact that these algorithms provide good results, they
are computationally intensive.
Hence, in this paper, we propose a modified version of DSO
SLAM, which implements code parallelization techniques using
OpenMP, an API for introducing parallelism in C, C++ and
Fortran programs, that supports multi-platform shared memory
multi-processing programming. With this approach we propose
multiple directive-based code modifications, in order to make the
SLAM algorithm execute considerably faster. The performance
of the proposed solution was evaluated on standard datasets and
provides speedups above 40% without significant extra parallel
programming effort.
On the Evaluation of Energy-Efficient Deep Learning Using Stacked Autoencoders on Mobile GPUs
In recent years, deep learning architectures have
gained attention by winning important international detection
and classification challenges. However, due to high levels of
energy consumption, the need to use low-power devices at
acceptable throughput performance is higher than ever. This
paper addresses this problem by introducing energy-efficient
deep learning based on local training and using low-power mobile
GPU parallel architectures, all conveniently supported by the
same high-level description of the deep network. It also proposes
to discover the maximum dimensions that a particular type
of deep learning architecture, the stacked autoencoder, can
support by finding the hardware limitations of a representative
group of mobile GPUs and platforms.
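To give a feel for how a memory limit constrains autoencoder dimensions, here is a minimal Python sketch. It is an assumption-laden estimate, not the paper's method: it counts only the weights and biases of a single autoencoder with untied encoder and decoder weights, stored as 4-byte floats, and ignores activations, gradients and framework overhead. All names and the example figures are hypothetical.

```python
def autoencoder_params(n_in, n_hidden):
    """Parameter count of one autoencoder with untied weights.

    Encoder: n_in * n_hidden weights + n_hidden biases.
    Decoder: n_hidden * n_in weights + n_in biases.
    """
    return 2 * n_in * n_hidden + n_hidden + n_in


def largest_hidden_size(n_in, budget_bytes, bytes_per_param=4):
    """Largest hidden layer whose parameters fit in budget_bytes.

    Solves 2*n_in*h + h + n_in <= budget_bytes / bytes_per_param for h.
    Returns 0 when even the output biases do not fit.
    """
    max_params = budget_bytes // bytes_per_param
    if max_params <= n_in:
        return 0
    return (max_params - n_in) // (2 * n_in + 1)


if __name__ == "__main__":
    # Hypothetical example: 784 inputs and a 256 MiB parameter budget.
    h = largest_hidden_size(784, 256 * 1024 * 1024)
    print(h, autoencoder_params(784, h))
```

In a stacked autoencoder the same check would be repeated per layer, with each layer's hidden size becoming the next layer's input size.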
Optimized Voronoi-based algorithms for parallel shortest vector computations
This paper addresses Voronoi cell-based algorithms, specifically the "Relevant Vectors" algorithm, used to solve the Shortest Vector Problem, a fundamental challenge in lattice-based cryptanalysis. Several optimizations are proposed to reduce the execution time of the original algorithm. It is also shown that the algorithm is highly suited for parallel execution on
both CPUs and GPUs. The proposed optimizations are based on pruning, i.e., avoiding computations that will not,
with high probability, improve the solution. The pruning criterion is related to the norm of the target vector
relative to the norm of the current best solution vector. When pruning is performed without pre-processing, speedups of up to 69× are observed compared to the original algorithm. If a pre-processing sorting step is performed, which requires storing the norm-ordered target vectors and therefore significantly more memory, this speedup increases to 77×.
On the parallel processing side, the multi-core version of the optimized algorithm exhibits
linear scalability on a CPU with up to 28 threads and keeps scaling, albeit at a lower rate,
with Simultaneous Multi-Threading with up to 56 threads. The lack of support for efficient
global synchronization among threads in GPUs does not allow for a scalable implementation of
the pruning optimization using these devices. Nevertheless, a parallel GPU version of the
non-optimized algorithm is demonstrated to be competitive with the parallel non-optimized CPU
version, although the latter outperforms the former when using 56 threads. It is argued that the
GPU version would outperform the CPU for higher lattice dimensions, although this statement
cannot be experimentally verified due to the limited memory available on current GPU boards.
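The pruning idea described above can be illustrated with a toy Python sketch. This is not the "Relevant Vectors" algorithm and not the paper's criterion: it is a simplified, exact variant that, for a fixed vector c, searches a list of target vectors for the shortest difference c - t and skips targets using the reverse triangle inequality ||c - t|| >= | ||t|| - ||c|| |. The function names and the sort_first flag are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a sequence of numbers."""
    return math.sqrt(sum(x * x for x in v))


def shortest_difference(c, targets, sort_first=False):
    """Find the target t minimising ||c - t||, pruning by norm.

    Returns (best_norm, best_target, evaluated), where evaluated counts
    how many full difference norms were actually computed.
    """
    c_norm = norm(c)
    items = [(norm(t), t) for t in targets]
    if sort_first:
        # Pre-processing step: keep targets ordered by norm (costs memory).
        items.sort(key=lambda pair: pair[0])
    best, best_target, evaluated = math.inf, None, 0
    for t_norm, t in items:
        lower_bound = abs(t_norm - c_norm)
        if lower_bound >= best:
            if sort_first and t_norm > c_norm:
                # Every later target has a larger norm, hence a larger bound.
                break
            continue  # this target cannot improve the current best
        evaluated += 1
        d = norm([a - b for a, b in zip(c, t)])
        if d < best:
            best, best_target = d, t
    return best, best_target, evaluated


if __name__ == "__main__":
    c = (1.0, 0.0)
    targets = [(1.0, 0.5), (10.0, 0.0), (0.0, 0.0), (-8.0, 3.0)]
    print(shortest_difference(c, targets))
    print(shortest_difference(c, targets, sort_first=True))
```

With sorted targets the loop can stop early instead of testing every remaining vector, which is the trade-off between extra memory and extra speedup that the abstract reports.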
Observation of the doubly charmed baryon decay Ξcc++→Ξc′+π+
The Ξcc++→Ξc′+π+ decay is observed using proton-proton collisions collected by the LHCb experiment at a centre-of-mass energy of 13 TeV, corresponding to an integrated luminosity of 5.4 fb−1. The Ξcc++→Ξc′+π+ decay is reconstructed partially, where the photon from the Ξc′+→Ξc+γ decay is not reconstructed and the pK−π+ final state of the Ξc+ baryon is employed. The Ξcc++→Ξc′+π+ branching fraction relative to that of the Ξcc++→Ξc+π+ decay is measured to be 1.41 ± 0.17 ± 0.10, where the first uncertainty is statistical and the second systematic.
Test of lepton universality in decays
The first simultaneous test of muon-electron universality using
and decays is performed, in two ranges of the dilepton
invariant-mass squared, . The analysis uses beauty mesons produced in
proton-proton collisions collected with the LHCb detector between 2011 and
2018, corresponding to an integrated luminosity of 9 fb^{-1}. Each
of the four lepton universality measurements reported is either the first in
the given interval or supersedes previous LHCb measurements. The
results are compatible with the predictions of the Standard Model.
Comment: All figures and tables, along with any supplementary material and
additional information, are available at
https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2022-046.html (LHCb
public pages)
Study of charmonium and charmonium-like contributions in B+ → J/ψηK+ decays
A study of B+→ J/ψηK+ decays, followed by J/ψ → μ+μ− and η → γγ, is performed using a dataset collected with the LHCb detector in proton-proton collisions at centre-of-mass energies of 7, 8 and 13 TeV, corresponding to an integrated luminosity of 9 fb−1. The J/ψη mass spectrum is investigated for contributions from charmonia and charmonium-like states. Evidence is found for the B+→ (ψ2(3823) → J/ψη)K+ and B+→ (ψ(4040) → J/ψη)K+ decays with significances of 3.4 and 4.7 standard deviations, respectively. This constitutes the first evidence for the ψ2(3823) → J/ψη decay.
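For readers less used to significances quoted in standard deviations, the short Python sketch below converts the two values from the abstract above into tail probabilities of a standard normal distribution. The one-sided convention is an assumption made here for illustration; the abstract does not state which convention the analysis uses, and the function name is hypothetical.

```python
import math


def one_sided_p_value(z):
    """Upper-tail probability of a standard normal beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Significances quoted in the abstract above.
    for label, z in [("psi2(3823)", 3.4), ("psi(4040)", 4.7)]:
        print(label, z, one_sided_p_value(z))
```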
Second asymptomatic carotid surgery trial (ACST-2): a randomised comparison of carotid artery stenting versus carotid endarterectomy
Background: Among asymptomatic patients with severe carotid artery stenosis but no recent stroke or transient cerebral ischaemia, either carotid artery stenting (CAS) or carotid endarterectomy (CEA) can restore patency and reduce long-term stroke risks. However, from recent national registry data, each option causes about 1% procedural risk of disabling stroke or death. Comparison of their long-term protective effects requires large-scale randomised evidence. Methods: ACST-2 is an international multicentre randomised trial of CAS versus CEA among asymptomatic patients with severe stenosis thought to require intervention, interpreted with all other relevant trials. Patients were eligible if they had severe unilateral or bilateral carotid artery stenosis and both doctor and patient agreed that a carotid procedure should be undertaken, but they were substantially uncertain which one to choose. Patients were randomly allocated to CAS or CEA and followed up at 1 month and then annually, for a mean 5 years. Procedural events were those within 30 days of the intervention. Intention-to-treat analyses are provided. Analyses including procedural hazards use tabular methods. Analyses and meta-analyses of non-procedural strokes use Kaplan-Meier and log-rank methods. The trial is registered with the ISRCTN registry, ISRCTN21144362. Findings: Between Jan 15, 2008, and Dec 31, 2020, 3625 patients in 130 centres were randomly allocated, 1811 to CAS and 1814 to CEA, with good compliance, good medical therapy and a mean 5 years of follow-up. Overall, 1% had disabling stroke or death procedurally (15 allocated to CAS and 18 to CEA) and 2% had non-disabling procedural stroke (48 allocated to CAS and 29 to CEA). Kaplan-Meier estimates of 5-year non-procedural stroke were 2·5% in each group for fatal or disabling stroke, and 5·3% with CAS versus 4·5% with CEA for any stroke (rate ratio [RR] 1·16, 95% CI 0·86–1·57; p=0·33). 
Combining RRs for any non-procedural stroke in all CAS versus CEA trials, the RR was similar in symptomatic and asymptomatic patients (overall RR 1·11, 95% CI 0·91–1·32; p=0·21). Interpretation: Serious complications are similarly uncommon after competent CAS and CEA, and the long-term effects of these two carotid artery procedures on fatal or disabling stroke are comparable. Funding: UK Medical Research Council and Health Technology Assessment Programme.
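The rounded procedural percentages in the Findings can be reproduced from the counts given in the abstract. The Python sketch below is simple arithmetic on those published counts; the variable and function names are illustrative, and the per-arm percentages it prints are derived here rather than quoted from the trial report.

```python
# Counts taken from the Findings paragraph of the abstract above.
ALLOCATED = {"CAS": 1811, "CEA": 1814}
DISABLING_STROKE_OR_DEATH = {"CAS": 15, "CEA": 18}
NON_DISABLING_STROKE = {"CAS": 48, "CEA": 29}


def percent(events, patients):
    """Events as a percentage of patients."""
    return 100.0 * events / patients


def overall_percent(events_by_arm):
    """Pooled percentage across both allocation arms."""
    return percent(sum(events_by_arm.values()), sum(ALLOCATED.values()))


if __name__ == "__main__":
    print("disabling stroke or death, overall %:",
          round(overall_percent(DISABLING_STROKE_OR_DEATH), 2))
    print("non-disabling stroke, overall %:",
          round(overall_percent(NON_DISABLING_STROKE), 2))
    for arm in ALLOCATED:
        print(arm,
              round(percent(DISABLING_STROKE_OR_DEATH[arm], ALLOCATED[arm]), 2),
              round(percent(NON_DISABLING_STROKE[arm], ALLOCATED[arm]), 2))
```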
Precision measurement of violation in the penguin-mediated decay
A flavor-tagged time-dependent angular analysis of the decay
is performed using collision data collected
by the LHCb experiment at the center-of-mass energy of
13 TeV, corresponding to an integrated luminosity of 6 fb^{-1}. The
-violating phase and direct -violation parameter are measured
to be rad and
, respectively, assuming the same values
for all polarization states of the system. In these results, the
first uncertainties are statistical and the second systematic. These parameters
are also determined separately for each polarization state, showing no evidence
for polarization dependence. The results are combined with previous LHCb
measurements using collisions at center-of-mass energies of 7 and 8 TeV,
yielding rad and . This is the most precise study of time-dependent violation
in a penguin-dominated meson decay. The results are consistent with
symmetry and with the Standard Model predictions.
Comment: All figures and tables, along with any supplementary material and
additional information, are available at
https://cern.ch/lhcbproject/Publications/p/LHCb-PAPER-2023-001.html (LHCb
public pages)
Observation of the Decay Λb0→Λc+τ-ν¯τ
The first observation of the semileptonic b-baryon decay Λb0→Λc+τ-ν¯τ, with a significance of 6.1σ, is reported using a data sample corresponding to 3 fb-1 of integrated luminosity, collected by the LHCb experiment at center-of-mass energies of 7 and 8 TeV at the LHC. The τ- lepton is reconstructed in the hadronic decay to three charged pions. The ratio K=B(Λb0→Λc+τ-ν¯τ)/B(Λb0→Λc+π-π+π-) is measured to be 2.46±0.27±0.40, where the first uncertainty is statistical and the second systematic. The branching fraction B(Λb0→Λc+τ-ν¯τ)=(1.50±0.16±0.25±0.23)% is obtained, where the third uncertainty is from the external branching fraction of the normalization channel Λb0→Λc+π-π+π-. The ratio of semileptonic branching fractions R(Λc+)≡B(Λb0→Λc+τ-ν¯τ)/B(Λb0→Λc+μ-ν¯μ) is derived to be 0.242±0.026±0.040±0.059, where the external branching fraction uncertainty from the channel Λb0→Λc+μ-ν¯μ contributes to the last term. This result is in agreement with the standard model prediction.
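When several uncertainty components are quoted separately, as in the abstract above, a common way to get a single overall figure is to add them in quadrature. The Python sketch below does this for the quoted values of K and R(Λc+), under the assumption (made here, not stated in the abstract) that the components are uncorrelated. The function name is hypothetical.

```python
import math


def combine_in_quadrature(*uncertainties):
    """Root-sum-of-squares of uncertainty components assumed uncorrelated."""
    return math.sqrt(sum(u * u for u in uncertainties))


if __name__ == "__main__":
    # Values quoted in the abstract above: central value, then components.
    k_value, k_parts = 2.46, (0.27, 0.40)
    r_value, r_parts = 0.242, (0.026, 0.040, 0.059)
    print("K =", k_value, "+/-", round(combine_in_quadrature(*k_parts), 2))
    print("R =", r_value, "+/-", round(combine_in_quadrature(*r_parts), 3))
```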
- …