NBBS: A Non-blocking Buddy System for Multi-core Machines
Common implementations of core memory allocation components, like the Linux buddy system, handle concurrent allocation/release requests by synchronizing threads via spinlocks. This approach does not scale well with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or by replicating the core allocators, the bottommost ones within the layered architecture. Both solutions tend to reduce the pressure of actual concurrent accesses on each individual core allocator. In this article we explore an alternative approach to the scalability of memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system, where threads performing concurrent allocations/releases do not undergo any spinlock-based synchronization. Our solution allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata. Conflict detection relies on conventional atomic machine instructions in the Read-Modify-Write (RMW) class. Beyond improving scalability and performance, our solution also avoids wasting clock cycles on spinlock operations by threads that could in principle carry out their memory allocation/release in full concurrency. Thus, it is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
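The conflict-detection idea in this abstract (read the metadata, attempt an atomic Read-Modify-Write, retry if another thread got there first) can be illustrated with a minimal sketch. This is not the NBBS algorithm: it is a much simpler single-level bitmap allocator, and the names `AtomicWord` and `BitmapAllocator` are hypothetical. Python exposes no hardware compare-and-swap, so a private lock inside `AtomicWord` merely emulates the atomicity of that one instruction; the allocator logic itself never waits on a lock.

```python
import threading


class AtomicWord:
    """Emulates one machine word that supports compare-and-swap (CAS).

    Assumption: the private lock only stands in for the atomicity of a
    single hardware RMW instruction, which Python does not expose.
    """

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Succeeds only if nobody changed the word since it was read.
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


class BitmapAllocator:
    """Hypothetical allocator: one bit per equally sized block."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bitmap = AtomicWord(0)

    def allocate(self):
        while True:
            snapshot = self.bitmap.load()
            # Look for the lowest free block in this snapshot.
            for index in range(self.num_blocks):
                if not snapshot & (1 << index):
                    break
            else:
                return None  # every block is taken in this snapshot
            # Commit the allocation unless a conflict is detected.
            if self.bitmap.compare_and_swap(snapshot, snapshot | (1 << index)):
                return index
            # CAS failed: another thread updated the metadata, so retry.

    def release(self, index):
        while True:
            snapshot = self.bitmap.load()
            if self.bitmap.compare_and_swap(snapshot, snapshot & ~(1 << index)):
                return
```

A failed CAS costs one retry of the scan rather than a wait on a lock holder, which is the property the abstract contrasts with spinlock-based synchronization.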
Optical interconnection networks based on microring resonators
Optical microring resonators can be integrated on a chip to perform switching operations directly in the optical domain. Thus they become a building block for creating switching elements in on-chip optical interconnection networks, which promise to overcome some of the limitations of current electronic networks. However, the peculiar asymmetric power losses of microring resonators impose new constraints on the design and control of on-chip optical networks. In this work, we study the design of multistage interconnection networks optimized for a particular metric that we name the degradation index, which characterizes the asymmetric behavior of microrings. We also propose a routing control algorithm to maximize the overall throughput, considering the maximum allowed degradation index as a constraint.
Safety Analysis of Parameterised Networks with Non-Blocking Rendez-Vous
We consider networks of processes that all execute the same finite-state protocol and communicate via a rendez-vous mechanism. When a process requests a rendez-vous, another process can respond to it and they both change their control states accordingly. We focus here on a specific semantics, called non-blocking, where the process requesting a rendez-vous can change its state even if no process can respond to it. In this context, we study the parameterised coverability problem of a configuration, which consists in determining whether there is an initial number of processes and an execution that reaches a configuration bigger than a given one. We show that this problem is EXPSPACE-complete and can be solved in polynomial time if the protocol is partitioned into two sets of states, the states from which a process can request a rendez-vous and the ones from which it can answer one. We also prove that the problem of the existence of an execution bringing all the processes to a final state is undecidable in our context. These two problems can be solved in polynomial time with the classical rendez-vous semantics.
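The non-blocking semantics described above can be made concrete with a small one-step successor function. This is an illustrative sketch, not the paper's formal definition. It assumes that a configuration is a multiset of control states, that a requester moves alone only when no other process can respond, and that the names `successors`, `requests` and `responses` are hypothetical.

```python
from collections import Counter


def successors(config, requests, responses):
    """Yield the configurations reachable in one step.

    config:    Counter mapping control state -> number of processes
    requests:  list of (state, message, next_state) for requesters
    responses: list of (state, message, next_state) for responders
    """
    for q, msg, q_next in requests:
        if config[q] == 0:
            continue  # no process can issue this request
        # Remove the requester first: a responder must be another process.
        after_send = config.copy()
        after_send[q] -= 1
        partners = [
            (p, p_next)
            for p, m, p_next in responses
            if m == msg and after_send[p] > 0
        ]
        if partners:
            # Classical rendez-vous: requester and responder move together.
            for p, p_next in partners:
                nxt = after_send.copy()
                nxt[p] -= 1
                nxt[q_next] += 1
                nxt[p_next] += 1
                yield +nxt  # unary plus drops zero counts
        else:
            # Non-blocking case: nobody can respond, the requester moves alone.
            after_send[q_next] += 1
            yield +after_send
```

With a single process, the request goes unanswered and the process still advances; with two processes in the same state, one requests and the other responds. Coverability then asks whether repeated application of `successors`, from some initial number of processes, reaches a configuration at least as large as the target.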
A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines
Common implementations of core memory allocation components handle concurrent allocation/release requests by synchronizing threads via spinlocks. This approach does not scale well with large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or by replicating the core allocators, the bottommost ones within the layered architecture. Both solutions tend to reduce the pressure of actual concurrent accesses on each individual core allocator. In this article we explore an alternative approach to the scalability of memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata. Beyond improving scalability and performance, it is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
Time4: Time for SDN
With the rise of Software Defined Networks (SDN), there is growing interest
in dynamic and centralized traffic engineering, where decisions about
forwarding paths are taken dynamically from a network-wide perspective.
Frequent path reconfiguration can significantly improve the network
performance, but should be handled with care, so as to minimize disruptions
that may occur during network updates.
In this paper we introduce Time4, an approach that uses accurate time to
coordinate network updates. Time4 is a powerful tool in softwarized
environments that can be used for various network update scenarios.
Specifically, we characterize a set of update scenarios called flow swaps, for
which Time4 is the optimal update approach, yielding less packet loss than
existing update approaches. We define the lossless flow allocation problem, and
formally show that in environments with frequent path allocation, scenarios
that require simultaneous changes at multiple network devices are inevitable.
We present the design, implementation, and evaluation of a Time4-enabled
OpenFlow prototype. The prototype is publicly available as open source. Our
work includes an extension to the OpenFlow protocol that has been adopted by
the Open Networking Foundation (ONF), and is now included in OpenFlow 1.5. Our
experimental results show the significant advantages of Time4 compared to other
network update approaches, and demonstrate an SDN use case that is infeasible
without Time4.
Comment: This report is an extended version of "Software Defined Networks:
It's About Time", which was accepted to IEEE INFOCOM 2016. A preliminary
version of this report was published in arXiv in May, 201
Using GPI-2 for Distributed Memory Paralleliziation of the Caffe Toolbox to Speed up Deep Neural Network Training
Deep Neural Networks (DNNs) are currently of great interest in research and
application. The training of these networks is a compute-intensive and
time-consuming task. To reduce training times to a bearable amount at reasonable
cost, we extend the popular Caffe toolbox for DNNs with an efficient distributed
memory communication pattern. To achieve good scalability, we emphasize the
overlap of computation and communication and prefer fine-granular
synchronization patterns over global barriers. To implement these
communication patterns, we rely on the Global address space Programming
Interface version 2 (GPI-2) communication library. This interface provides a
lightweight set of asynchronous one-sided communication primitives
supplemented by non-blocking fine-granular data synchronization mechanisms.
Therefore, CaffeGPI is the name of our parallel version of Caffe. First
benchmarks demonstrate better scaling behavior compared with other
extensions, e.g., Intel(TM) Caffe. Even within a single symmetric
multiprocessing machine with four graphics processing units, CaffeGPI
scales better than the standard Caffe toolbox. These first results
demonstrate that the use of standard High Performance Computing (HPC) hardware
is a valid cost-saving approach to train large DNNs. I/O is another
bottleneck when working with DNNs in a standard parallel HPC setting, which we
will consider in more detail in a forthcoming paper.
MPWide: a light-weight library for efficient message passing over wide area networks
We present MPWide, a lightweight communication library which allows
efficient message passing over a distributed network. MPWide has been designed
to connect applications running on distributed (super)computing resources, and
to maximize the communication performance on wide area networks for those
without administrative privileges. It can be used to provide message passing
between applications, move files, and make very fast connections in
client-server environments. MPWide has already been applied to enable
distributed cosmological simulations across up to four supercomputers on two
continents, and to couple two different bloodflow simulations to form a
multiscale simulation.
Comment: accepted by the Journal Of Open Research Software, 13 pages, 4
figures, 1 table