Improving latency tolerance of multithreading through decoupling
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the ability of this organization to scale and make efficient use of future increases in transistor budget. SMT processors, which are built on a superscalar core, are therefore directly affected by this problem. The article presents and evaluates a novel processor microarchitecture that combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is better suited to future growth in issue width and clock speed. We investigate how the two techniques complement each other. Since decoupling hides memory latency very efficiently, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious reduction in hardware complexity, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from dynamic scheduling. Decoupling, on the other hand, is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
Peer Reviewed. Postprint (published version)
Service-Oriented Architecture for Space Exploration Robotic Rover Systems
Industrial sectors are currently transforming their business processes into
e-services and component-based architectures to build flexible, robust, and
scalable systems and to reduce integration-related maintenance and development
costs. Robotics is another promising, fast-growing industry; it deals
with the creation of machines that operate autonomously and serve
various applications, including space exploration, weaponry, laboratory
research, and manufacturing. In space exploration, the most common
type of robot is the planetary rover, which moves across the surface of a
planet and conducts a thorough geological study of that surface. This
type of rover system is still ad hoc in that it incorporates its software into
its core hardware, making the whole system cohesive, tightly coupled, more
susceptible to shortcomings, less flexible, hard to scale and maintain,
and impossible to adapt to other purposes. This paper proposes a
service-oriented architecture for space exploration robotic rover systems made
up of loosely coupled, distributed web services. The proposed architecture
consists of three elementary tiers: the client tier, which corresponds to the
actual rover; the server tier, which corresponds to the web services; and the
middleware tier, which corresponds to an Enterprise Service Bus that promotes
interoperability between the interconnected entities. The distinguishing feature
of this architecture is that the rover's software components are decoupled and
isolated from the rover's body and possibly deployed at a distant location. A
service-oriented architecture promotes integrability, scalability,
reusability, maintainability, and interoperability for client-to-server
communication.
Comment: LACSC - Lebanese Association for Computational Sciences,
http://www.lacsc.org/; International Journal of Science & Emerging
Technologies (IJSET), Vol. 3, No. 2, February 201
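The three tiers described in this abstract (rover client, service bus, web services) can be illustrated with a small in-process sketch. It is only a toy model under stated assumptions: the names `ServiceBus`, `plan_path`, `Rover`, and the service key `navigation.plan_path` are hypothetical and do not come from the paper, and a plain function call stands in for the network transport that a real Enterprise Service Bus would provide.

```python
class ServiceBus:
    """Middleware tier (hypothetical): routes requests to services by name."""

    def __init__(self):
        self._services = {}

    def register(self, name, handler):
        # A service is any callable that maps a request dict to a reply dict.
        self._services[name] = handler

    def call(self, name, request):
        if name not in self._services:
            raise LookupError(f"no service registered as {name!r}")
        return self._services[name](request)


def plan_path(request):
    """Server tier (hypothetical): grid path that moves along x, then y."""
    x, y = request["start"]
    goal_x, goal_y = request["goal"]
    waypoints = [(x, y)]
    while x != goal_x:
        x += 1 if goal_x > x else -1
        waypoints.append((x, y))
    while y != goal_y:
        y += 1 if goal_y > y else -1
        waypoints.append((x, y))
    return {"waypoints": waypoints}


class Rover:
    """Client tier (hypothetical): holds state, delegates planning to the bus."""

    def __init__(self, bus):
        self.bus = bus
        self.position = (0, 0)

    def drive_to(self, goal):
        reply = self.bus.call(
            "navigation.plan_path", {"start": self.position, "goal": goal}
        )
        # The rover only follows waypoints; the planning logic lives elsewhere.
        for waypoint in reply["waypoints"][1:]:
            self.position = waypoint
        return reply["waypoints"]


bus = ServiceBus()
bus.register("navigation.plan_path", plan_path)
rover = Rover(bus)
print(rover.drive_to((2, 1)))
```

Because the rover knows only the service name, the planner could in principle be replaced or redeployed without changing the client, which is the kind of decoupling the abstract attributes to the architecture.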
An ultra low-power hardware accelerator for automatic speech recognition
Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that the tiny power budget of mobile devices cannot afford. Hardware acceleration can reduce the power consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth-saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves a 1.7x speedup over a highly optimized CUDA implementation running on a high-end GeForce GTX 980 GPU, while reducing the energy required to convert the speech into text by two orders of magnitude (287x).
Peer Reviewed. Postprint (author's final draft)
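For readers unfamiliar with the Viterbi search named in this abstract, the following is a minimal textbook sketch over a small hidden Markov model, computed in log space. It is not the accelerator's design or the search used in the paper's ASR system; the function names and the toy model in the usage lines are assumptions chosen only for illustration.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        key: to_log(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for the most likely state sequence."""
    # best[s] is the log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev = best
        best = {}
        pointers = {}
        for s in states:
            # Choose the predecessor that maximizes the score of reaching s.
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][obs]
            pointers[s] = p
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy model (illustrative values only).
states = ("Healthy", "Fever")
start = to_log({"Healthy": 0.6, "Fever": 0.4})
trans = to_log({
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
})
emit = to_log({
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
})
log_prob, path = viterbi(["normal", "cold", "dizzy"], states, start, trans, emit)
print(path, math.exp(log_prob))
```

Each time step reads the scores of every predecessor state, which gives some intuition for why a search of this shape over a very large state space can become memory-bound, as the abstract reports for its accelerator.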
The 3DMA Middleware for Mobile Applications
Mobile devices have received much research interest in recent years. Mobility raises new issues such as more dynamic context, limited computing resources, and frequent disconnections. To handle these issues, we propose a middleware, called 3DMA, which introduces three requirements: 1) distribution, 2) decoupling, and 3) decomposition. 3DMA uses a space-based middleware approach combined with a set of workers that are able to act on the user's behalf, either to reduce load on the mobile device or to support disconnected behavior. In order to demonstrate aspects of the middleware architecture, we consider the development of a commonly used mobile application
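The space-based approach with workers mentioned in this abstract can be pictured with a minimal in-memory tuple space. This is a generic sketch of the tuple-space idea, not the 3DMA implementation: the `TupleSpace` class, its `write` and `take` methods, the `uppercase_worker` function, and the tuple layout are all hypothetical.

```python
import threading


class TupleSpace:
    """Minimal in-memory tuple space (hypothetical) with a blocking take."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def write(self, tup):
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _find(self, pattern):
        # None in a pattern position acts as a wildcard.
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup)
            ):
                return tup
        return None

    def take(self, pattern, timeout=None):
        """Remove and return the first matching tuple, or None on timeout."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout
            )
            if not found:
                return None
            tup = self._find(pattern)
            self._tuples.remove(tup)
            return tup


def uppercase_worker(space):
    """Hypothetical worker: serves one request on the device's behalf."""
    _, job_id, payload = space.take(("request", None, None))
    space.write(("result", job_id, payload.upper()))


space = TupleSpace()
worker = threading.Thread(target=uppercase_worker, args=(space,))
worker.start()
# The "device" posts a request and later collects the result; it never
# addresses the worker directly, so either side may come and go.
space.write(("request", 1, "hello"))
print(space.take(("result", 1, None), timeout=2))
worker.join()
```

Since producer and consumer communicate only through tuples left in the space, a result can wait there while the device is disconnected, which is one way to read the abstract's "disconnected behavior".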