We describe the design, implementation and performance of the new hybrid
parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which
has been used extensively for modeling the continuum radiation of dusty
astrophysical systems including late-type galaxies and dusty tori. The hybrid
scheme combines distributed memory parallelization, using the standard Message
Passing Interface (MPI) to communicate between processes, and shared memory
parallelization, providing multiple execution threads within each process to
avoid duplication of data structures. The synchronization between multiple
threads is accomplished through atomic operations without high-level locking
(also called lock-free programming). This improves the scaling behavior of the
code and substantially simplifies the implementation of the hybrid scheme. The
result is an extremely flexible solution that adjusts to the number of
available nodes, processors and memory, and consequently performs well on a
wide variety of computing architectures.Comment: 21 pages, 20 figure