Push-Pull Messaging: a high-performance communication mechanism for commodity SMP clusters
Push-Pull Messaging is a novel messaging mechanism for high-speed interprocess communication in a cluster of symmetric multiprocessor (SMP) machines. This messaging mechanism exploits the parallelism in SMP nodes by allowing the communication stages of a messaging event to execute on different processors to achieve maximum performance. Push-Pull Messaging further improves communication performance by employing three optimizing techniques in our design: (1) Cross-Space Zero Buffer provides a unified buffer management mechanism to achieve copy-less data transfer among processes within an SMP node. (2) Address Translation Overhead Masking removes the address translation overhead from the critical path in internode communication. (3) Push-and-Acknowledge Overlapping overlaps the push and acknowledge phases to hide the acknowledge latency. Overall, Push-Pull Messaging effectively utilizes the system resources and improves the communication speed. It has been implemented to support high-speed communication for connecting quad Pentium Pro SMPs with 100 Mbit/s Fast Ethernet.
Document distribution algorithm for load balancing on an extensible Web server architecture
Access latency and load balancing are the two main issues in the design of a clustered Web server architecture for achieving high performance. We propose a novel document distribution algorithm for load balancing on a cluster of distributed Web servers. We group Web pages that are likely to be accessed during a request session into a migrating unit, which is used as the basic unit of document placement. A modified binning algorithm is developed to distribute the migrating units among the Web servers to achieve load balancing. We also present a redirection mechanism, which makes use of a migrating unit's property, to reduce the cost of request redirections. The distribution of Web documents is recomputed periodically to adapt to changes in client request patterns and system configuration. Simulation results show that our solution can reduce the amount of request redirection and document migration, and that it can distribute workload properly among Web servers.
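The abstract above does not spell out its modified binning algorithm, so the following is only a minimal sketch of the general idea of placing migrating units on servers to even out load. It swaps in a plain greedy heuristic (heaviest unit first, onto the least-loaded server), and the function name `distribute_units`, the unit names, and the load values are all hypothetical.

```python
import heapq


def distribute_units(unit_loads, num_servers):
    """Greedily assign migrating units to servers (illustrative only).

    unit_loads: dict mapping a hypothetical unit name to its expected load.
    Returns (assignment, server_loads), where assignment maps each unit
    to a server index.
    """
    # Min-heap of (current load, server index); the root is the least-loaded server.
    heap = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(heap)

    assignment = {}
    # Place the heaviest units first; ties are broken by unit name.
    for unit, load in sorted(unit_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        current, server = heapq.heappop(heap)
        assignment[unit] = server
        heapq.heappush(heap, (current + load, server))

    # Recompute the final per-server load from the assignment.
    server_loads = [0.0] * num_servers
    for unit, server in assignment.items():
        server_loads[server] += unit_loads[unit]
    return assignment, server_loads


if __name__ == "__main__":
    units = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}
    print(distribute_units(units, 2))
```

A periodic recomputation, as the abstract describes, would amount to calling such a function again with updated load estimates; how the paper limits the resulting document migration is not covered by this sketch.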
GPS calibrated ad-hoc localization for geosocial networking
LNCS v. 6406 is the conference proceedings of UIC 2010. Cost-effective localization for large-scale geosocial networking services is a challenging issue in urban environments. This paper studies an ad-hoc localization technique that takes advantage of location information exchanged over short range to calibrate the location of mobile users carrying non-GPS mobile phones. We demonstrate by simulation that a small percentage of GPS-enabled mobile phones can greatly aid the localization of other non-GPS pedestrians in the urban environment. Based on the proposed localization technique, we implement a location-aware social networking tool called Mobile Twitter, similar to the microblogging service of Twitter, for fast propagation of social events happening in the surroundings. Evaluation shows that our localization algorithm achieves better location-estimation accuracy and wider coverage than the Amorphous algorithm and the Monte Carlo Localization (MCL) method. Moreover, we show that Mobile Twitter implemented on an Android mobile phone is power-efficient in real-life usage scenarios. © 2010 Springer-Verlag. The 7th International Conference on Ubiquitous Intelligence and Computing (UIC) 2010, Xi'an, China, 26-29 October 2010. In Lecture Notes in Computer Science, 2010, v. 6406, p. 52-6
Scheduling parallel machines with inclusive processing set restrictions and job release times
Efficient reliable broadcast for commodity clusters
High-speed collective communication is key to achieving high performance in parallel computing. In the past, collective operations were usually implemented using unicast operations. We propose a new architecture, EQA (Enhanced Queue Architecture), for implementing high-speed collective operations in a cluster. By incorporating EQA and the hardware broadcast facility in network switches, an efficient reliable broadcast operation is implemented in a DP-SMP communication subsystem. With EQA, the computation, memory and network resources can be utilized efficiently. We evaluated the performance of the broadcast operation in a commodity cluster with a Fast Ethernet connection. We found that the hardware-based broadcast from DP-SMP with EQA outperforms the software-based broadcast operation. The use of EQA in the broadcast operation could reduce memory consumption by almost 40%. DP-SMP with EQA has proven to be an efficient communication mechanism for coupling commodity clusters.
Contention-Free Complete Exchange Algorithms on Clusters
To construct a large commodity cluster, a hierarchical network is generally adopted for connecting the host machines, where a Gigabit backbone switch connects a few commodity switches with uplinks to achieve scaled bisectional bandwidth. This type of interconnection usually results in link contention and causes congestion to develop at the uplink ports. Moreover, the non-deterministic delays in scheduling communication events in clusters accelerate the build-up of congestion among these uplink ports, which leads to severe packet drops and hinders the overall performance. In this paper, we focus on the practical design of a high-speed complete exchange algorithm on a commodity cluster interconnected by a hierarchical Ethernet-based network. By exploiting some architectural characteristics of the interconnection in optimizing the performance of a complete exchange algorithm, we introduce a congestion control mechanism, global windowing, that monitors and regulates the traffic load, together with a permutation scheme, the reorder scheme, that effectively alleviates the congestion problem. We evaluate our algorithm and compare its performance with other algorithms in a PC cluster connected by various types of switches, including Gigabit Ethernet, input-buffered and shared-memory Fast Ethernet switches.
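For readers unfamiliar with complete exchange (all-to-all personalized communication), the sketch below shows a textbook permutation schedule, not the paper's global windowing or reorder scheme. In step k, node i sends to node (i + k) mod N, so every node sends and receives exactly one message per step. The function name `exchange_schedule` is hypothetical.

```python
def exchange_schedule(num_nodes):
    """Return the steps of a simple permutation-based complete exchange.

    Each step is a list of (sender, receiver) pairs. In step k, node i
    sends to node (i + k) % num_nodes, so each step is a permutation:
    no node is the target of two messages at once.
    """
    steps = []
    for k in range(1, num_nodes):
        steps.append([(i, (i + k) % num_nodes) for i in range(num_nodes)])
    return steps


if __name__ == "__main__":
    for number, step in enumerate(exchange_schedule(4), start=1):
        print(number, step)
```

Such a schedule avoids contention at the end nodes only. On a hierarchical network several pairs in the same step may still share an uplink, which is the congestion problem the abstract targets.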
Cache affinity optimization techniques for scaling software transactional memory systems on multi-CMP architectures
Software transactional memory (STM) enhances both ease of use and concurrency, and is considered one of the next-generation paradigms for parallel programming. Application programs may see hotspots where data conflicts are intensive and seriously degrade performance. Advanced STM systems therefore employ dynamic concurrency control techniques to curb the conflict rate by properly throttling the rate of spawning transactions. High-end computers may have two or more multicore processors, so that data sharing among cores goes through a non-uniform cache memory hierarchy. This poses challenges to concurrency control designs, as improper metadata placement and sharing will introduce scalability issues to the system. Poor thread-to-core mappings that induce excessive cache invalidation are also detrimental to the overall performance. In this paper, we share our experience in designing and implementing a new dynamic concurrency controller for Tiny STM, which helps keep the system concurrency at a near-optimal level. By decoupling unfavourable metadata sharing, our controller design avoids costly inter-processor communications. It also features an affinity-aware thread migration technique that fine-tunes thread placements by observing inter-thread transactional conflicts. We evaluate our implementation using the STAMP benchmark suite and show that the controller can bring around 21% average speedup over the baseline execution. © 2015 IEEE.
Conditional Image-Text Embedding Networks
This paper presents an approach for grounding phrases in images which jointly
learns multiple text-conditioned embeddings in a single end-to-end model. In
order to differentiate text phrases into semantically distinct subspaces, we
propose a concept weight branch that automatically assigns phrases to
embeddings, whereas prior works predefine such assignments. Our proposed
solution simplifies the representation requirements for individual embeddings
and allows the underrepresented concepts to take advantage of the shared
representations before feeding them into concept-specific layers. Comprehensive
experiments verify the effectiveness of our approach across three phrase
grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where
we obtain a (resp.) 4%, 3%, and 4% improvement in grounding performance over a
strong region-phrase embedding baseline. Comment: ECCV 2018 accepted paper
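As a rough illustration of how a concept weight branch might combine several text-conditioned embeddings, the sketch below assumes that the branch produces one logit per embedding, that the logits are normalized with a softmax, and that the final region-phrase score is the weighted sum of the per-embedding scores. These are assumptions for illustration rather than the paper's stated formulation, and the names `softmax` and `combined_score` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_score(concept_logits, per_embedding_scores):
    """Weight each embedding's score by its (assumed) softmax concept weight."""
    weights = softmax(concept_logits)
    return sum(w * s for w, s in zip(weights, per_embedding_scores))


if __name__ == "__main__":
    # Two hypothetical concept embeddings with equal weight.
    print(combined_score([0.0, 0.0], [1.0, 3.0]))
```

With equal logits the result is the plain average of the scores. A strongly peaked logit vector effectively selects a single embedding, which resembles a predefined hard assignment of a phrase to one embedding.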
Defeating network jitter for virtual machines
Virtualization-based cloud computing hosts networked applications in virtual machines (VMs), and provides each VM with the desired degree of performance isolation using resource isolation mechanisms. Existing isolation solutions focus heavily on resource proportionality, such as CPU, memory and I/O bandwidth, but seldom on resource provisioning rate. Even if a VM is allocated adequate resources, when they cannot be provided in a timely manner, problems such as network jitter become serious and significantly affect the performance of cloud applications like internet audio/video streaming. This paper systematically analyzes and illustrates the causes of unpredictable network latency in virtualized execution environments. We decouple the design goals of resource proportionality from resource provisioning rate, and adopt a divide-and-conquer strategy to defeat network jitter for VMs: (1) in VMM CPU scheduling, we differentiate self-initiated I/O from event-triggered I/O, and individually map them to periodic and aperiodic real-time domains to schedule them together; (2) in network traffic shaping of VMs, we introduce the concept of a smooth window to smooth network latency and apply closed-loop feedback control to maintain network resource consumption. We implement our solutions in Xen 4.1.0 and Linux 2.6.32.13. The experimental results with both real-life applications and low-level benchmarks show that our solutions can significantly reduce network jitter while effectively maintaining resource proportionality. The 4th IEEE International Conference on Utility and Cloud Computing (UCC 2011), Victoria, NSW, 5-8 December 2011. In Proceedings of the 4th IEEE-UCC, 2011, p. 65-7
