A Lightweight Continuous Jobs Mechanism for MapReduce Frameworks
MapReduce is a programming model that allows vast amounts of data to be processed in parallel on a large number of machines. It is particularly well suited to static or slowly changing data sets, since the execution time of a job is usually high. In practice, however, data centers collect data at fast rates, which makes it very difficult to keep results up to date. To address this challenge, we propose in this paper a generic mechanism for dealing with dynamic data in MapReduce frameworks. Long-standing MapReduce jobs, called continuous jobs, are automatically re-executed to process new incoming data at minimum cost. We present a simple and clean API that integrates with the standard MapReduce model. Furthermore, we describe cHadoop, an implementation of our approach based on Hadoop that requires no modifications to the source code of the original framework; cHadoop can therefore be ported quickly to any new version of Hadoop. We evaluate our proposal with two standard MapReduce applications (WordCount and WordCount-N-Count) and one real-world application (RDF Query) on real datasets. Our evaluations on clusters ranging from 5 to 40 nodes demonstrate the benefit of our approach in terms of execution time and ease of use.
A Generic API for Load Balancing in Structured P2P Systems
Real-world datasets are known to be highly skewed, which often causes a significant load imbalance in the distributed systems that manage them. To address this issue, there exist almost as many load balancing strategies as there are systems. When designing a scalable distributed system geared towards handling large amounts of information, it is often hard to anticipate which strategy will be the most efficient at maintaining adequate response time, scalability and reliability at all times. Based on this observation, we describe the methodology behind a generic API for implementing and experimenting with any strategy independently of the rest of the code, for instance before making a definitive choice. We then show how this API is compatible with well-known existing systems and their load balancing schemes. We also present results from our own distributed system, which targets the continuous storage of events structured according to the Semantic Web standards and their later retrieval by interested parties. As such, our system constitutes a typical example of a Big Data environment.
Reinforcement adaptation of an attention-based neural natural language generator for spoken dialogue systems
Following recent proposals to handle natural language generation (NLG) in spoken dialogue systems with long short-term memory recurrent neural network models \citep{Wen2016a}, we first investigate a variant of these models that aims at a better integration of the attention subnetwork. Our next objective is to propose and evaluate a framework for adapting the NLG module online through direct interactions with users. The basic way to do so is to ask the user to utter an alternative sentence expressing a particular dialogue act. The system then has to decide between using an automatic transcription and asking for a manual one. For this decision, a reinforcement learning approach based on an adversarial bandit scheme is adopted. We show that, by appropriately defining the rewards as a linear combination of the expected payoffs and the costs of acquiring the new data provided by the user, a system design can balance improving the system's performance towards a better match with the user's preferences against the burden associated with it. The actual benefits of this system are then assessed with a human evaluation, which shows that adding more diverse utterances makes it possible to produce sentences that are more satisfying for the user.
Virtual Cloud: Rent Out the Rented Resources
With the advent of cloud computing technologies, the use of cloud computing infrastructure is increasing day by day, and many enterprises are shifting their computing from in-house infrastructure to the cloud. Within a short period of time, the cloud has proved to be an attractive choice for enterprises, especially for those that want minimal upfront cost for their technology infrastructure. This aspect of cloud computing makes it particularly suitable for new enterprises. Currently, cloud services are somewhat expensive, but a good number of enterprises and individuals could be attracted to cloud computing by low-cost cloud services. In a fast-growing cloud vendor market, providing low-cost cloud services is a difficult task for cloud vendors. In this paper, we present a model of a Virtual Cloud. The Virtual Cloud revolves around the concept "Rent Out the Rented Resources" and aims to reduce the monetary cost of cloud services. In this model, we propose to virtualize an already virtualized infrastructure: the cloud vendor offers low-cost cloud services by acquiring underutilized resources from a large third-party enterprise.
Adaptive Fault Tolerance in Real Time Cloud Computing
With the increasing demand for and benefits of cloud computing infrastructure, real-time computing can be performed on cloud infrastructure. A real-time system can take advantage of the intensive computing capabilities and scalable virtualized environment of cloud computing to execute real-time tasks. In most real-time cloud applications, processing is done on remote cloud computing nodes, so errors are more likely because of the undetermined latency and loose control over the computing nodes. On the other hand, most real-time systems are also safety critical and should be highly reliable. There is therefore an increased requirement for fault tolerance to achieve reliability for real-time computing on cloud infrastructure. In this paper, a fault tolerance model for real-time cloud computing is proposed. In the proposed model, the system tolerates faults and makes decisions on the basis of the reliability of the processing nodes, i.e. the virtual machines. The reliability of a virtual machine is adaptive and changes after every computing cycle. If a virtual machine produces a correct result within the time limit, its reliability increases; if it fails to produce a correct result or misses the time limit, its reliability decreases. A metric model is given for the reliability assessment. In the model, the decrease in reliability is larger than the increase. If a node continues to fail, it is removed and a new node is added. There is also a minimum reliability level: if no processing node achieves that level, the system performs backward recovery or safety measures. The proposed technique is based on executing design-diverse variants on multiple virtual machines and assigning a reliability to the results produced by the variants. The virtual machine instances can be of the same type or of different types. The system provides both forward and backward recovery mechanisms, but the main focus is on forward recovery. The essence of the proposed technique is the adaptive behavior of the reliability weights assigned to each processing node, and the addition and removal of nodes on the basis of reliability.
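The adaptive reliability scheme described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the names `Node`, `update_reliability` and `run_cycle`, the additive update rule, and the values of `gain`, `penalty` and `min_rel`. The only properties carried over from the abstract are that reliability rises after a correct, on-time result, falls by a larger amount after a failure, and that nodes falling below a minimum level are dropped.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A processing node (virtual machine) with an adaptive reliability weight."""
    name: str
    reliability: float = 1.0  # assumed starting value


def update_reliability(node, passed, gain=0.05, penalty=0.15, max_rel=1.0):
    """Raise reliability on success, lower it by a larger step on failure.

    `gain` and `penalty` are hypothetical; the abstract only states that
    the decrease is larger than the increase.
    """
    if passed:
        node.reliability = min(max_rel, node.reliability + gain)
    else:
        node.reliability = max(0.0, node.reliability - penalty)
    return node.reliability


def run_cycle(nodes, outcomes, min_rel=0.5):
    """Run one computing cycle.

    `outcomes` maps node name -> True if the node produced a correct result
    within the time limit. Returns (surviving_nodes, best_node). A best_node
    of None means no acceptable result exists, so the caller would fall back
    to backward recovery or safety measures.
    """
    for node in nodes:
        update_reliability(node, outcomes[node.name])
    # Drop nodes whose reliability fell below the assumed minimum level.
    survivors = [n for n in nodes if n.reliability >= min_rel]
    # Among nodes that passed this cycle, prefer the most reliable one.
    passed = [n for n in survivors if outcomes[n.name]]
    best = max(passed, key=lambda n: n.reliability, default=None)
    return survivors, best
```

With these assumed values, a node that fails four cycles in a row drops from 1.0 to about 0.4 and is removed, at which point a real system would add a replacement node.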
Dynamic TTL-Based Search In Unstructured Peer-to-Peer Networks
Resource discovery is a challenging issue in unstructured peer-to-peer networks. Blind search approaches, namely flooding and random walks, are the two typical algorithms used in such systems. Blind flooding is not scalable because of its high communication cost. On the other hand, the performance of random-walk approaches largely depends on the random choice of walks. Some informed mechanisms use additional information, usually obtained from previous queries, for routing. Such approaches can reduce the traffic overhead, but they limit the query coverage. Furthermore, they usually rely on complex protocols to maintain information at each peer. In this paper, we propose two schemes that can be used to improve search performance in unstructured peer-to-peer networks. The first is a simple caching mechanism based on resource descriptions. Peers that offer resources send periodic advertisement messages. These messages are stored in a cache and are used for routing requests. The second scheme is a dynamic Time-To-Live (TTL) that enables messages to break their horizon. Instead of decreasing the query TTL by 1 at each hop, it is decreased by a value v such as
Modular P2P-Based Approach for RDF Data Storage and Retrieval
One of the key elements of the Semantic Web is the Resource Description Framework (RDF). Efficient storage and retrieval of RDF data in large-scale settings is still challenging, and existing solutions are monolithic and thus not very flexible from a software engineering point of view. In this paper, we propose a modular system, based on the scalable Content-Addressable Network (CAN), that makes it possible to store and retrieve RDF data in large-scale settings. We identified and isolated the key components forming such a system in our design architecture. We have evaluated our system on the Grid'5000 testbed with 300 peers on 75 machines, and the outcome of these micro-benchmarks shows interesting results in terms of scalability and concurrent queries.
Componentising a scientific application for the grid
CoreGRID is a Network of Excellence funded by the European Commission under the Sixth Framework Programm