Design Architecture-Based on Web Server and Application Cluster in Cloud Environment
Cloud has been a computational and storage solution for many data-centric
organizations. The problem these organizations now face is searching cloud
data efficiently. A framework is required to distribute the work of searching
and fetching across thousands of computers. Data in HDFS is scattered and
takes a long time to retrieve. The main idea is to embed a web server in the
map phase, using the Jetty web server, to give a fast and efficient way of
searching data in the MapReduce paradigm. For real-time processing on Hadoop,
a searchable mechanism is implemented in HDFS by creating a multilevel index
with multi-level index keys in the web server. The web server is used to
handle traffic throughput, and web clustering technology improves application
performance. To keep overhead down, the load balancer should automatically
distribute load to newly added server nodes.
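The multilevel index described above can be pictured as nested lookup tables that narrow the search before any data is fetched. The following is a minimal sketch of that idea only; the grouping rule (first character of the key), the function names, and the `(node, offset)` locations are all illustrative assumptions, not the paper's actual data structures.

```python
# Hypothetical two-level index: the top level narrows the search to a small
# bucket, the second level maps each key to its storage location, so a lookup
# never scans the full record set.

def build_multilevel_index(records):
    """Group record keys by their first character (top level), then map each
    key to its (node, offset) location (second level)."""
    index = {}
    for key, location in records:
        index.setdefault(key[0], {})[key] = location
    return index

def lookup(index, key):
    """Descend the two index levels instead of scanning every record."""
    return index.get(key[0], {}).get(key)

locations = [("alpha", ("node1", 0)), ("apple", ("node1", 64)), ("beta", ("node2", 0))]
idx = build_multilevel_index(locations)
print(lookup(idx, "beta"))  # ('node2', 0)
```

In a real deployment the second-level tables would live on different cluster nodes behind the web server, but the two-step descent is the same.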
On data skewness, stragglers, and MapReduce progress indicators
We tackle the problem of predicting the performance of MapReduce
applications, designing accurate progress indicators that keep programmers
informed on the percentage of completed computation time during the execution
of a job. Through extensive experiments, we show that state-of-the-art progress
indicators (including the one provided by Hadoop) can be seriously harmed by
data skewness, load unbalancing, and straggling tasks. This is mainly due to
their implicit assumption that the running time depends linearly on the input
size. We thus design a novel profile-guided progress indicator, called
NearestFit, that operates without the linear hypothesis assumption and exploits
a careful combination of nearest neighbor regression and statistical curve
fitting techniques. Our theoretical progress model requires fine-grained
profile data, that can be very difficult to manage in practice. To overcome
this issue, we resort to computing accurate approximations for some of the
quantities used in our model through space- and time-efficient data streaming
algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive
empirical assessment over the Amazon EC2 platform on a variety of real-world
benchmarks shows that NearestFit is practical w.r.t. space and time overheads
and that its accuracy is generally very good, even in scenarios where
competitors incur non-negligible errors and wide prediction fluctuations.
Overall, NearestFit significantly improves the current state of the art in
progress analysis for MapReduce.
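The core departure from linear progress indicators can be sketched in a few lines: predict a task's total running time from the observed times of the most similar completed tasks, rather than extrapolating linearly from input size. This is only a toy illustration of the nearest-neighbor component; the function names, the choice of k=3, and the profile data are assumptions, and NearestFit itself additionally combines this with statistical curve fitting and streaming approximations.

```python
# Toy profile-guided estimate: the k completed tasks whose input sizes are
# closest to the running task's size vote (by mean) on its total time.

def knn_predict_time(profiles, input_size, k=3):
    """profiles: list of (input_size, running_time) pairs from completed
    tasks. Returns the mean running time of the k nearest neighbors."""
    nearest = sorted(profiles, key=lambda p: abs(p[0] - input_size))[:k]
    return sum(t for _, t in nearest) / len(nearest)

def progress(elapsed, predicted_total):
    """Fraction of completed computation time for one task."""
    return min(1.0, elapsed / predicted_total)

# Completed tasks whose running time grows super-linearly with input size:
# a linear model fitted on the small tasks would badly underestimate the
# large task and report inflated progress.
completed = [(10, 5.0), (20, 21.0), (40, 80.0), (80, 330.0)]
estimate = knn_predict_time(completed, 75)
print(round(progress(72.0, estimate), 2))
```

On skewed data this matters exactly as the abstract argues: under the linear hypothesis a straggler processing an unusually large partition looks almost finished long before it is.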
A horizontally-scalable multiprocessing platform based on Node.js
This paper presents a scalable web-based platform called Node Scala which
splits and handles requests on a parallel distributed system according to
pre-defined use cases. We applied this platform to a client application that
visualizes climate data stored in a NoSQL database, MongoDB. The design of
Node Scala leads to efficient usage of available computing resources, in
addition to allowing the system to scale simply by adding new workers.
Performance evaluation of Node Scala demonstrated a gain of up to 74%
compared to state-of-the-art techniques.
Comment: 8 pages, 7 figures. Accepted for publication as a conference paper
for the 13th IEEE International Symposium on Parallel and Distributed
Processing with Applications (IEEE ISPA-15).
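The scaling property the abstract highlights, that the system grows simply by adding new workers, can be illustrated with a round-robin dispatcher whose rotation picks up new workers immediately. This is a sketch of the general pattern only; the class and method names are assumptions and do not reflect Node Scala's real API.

```python
# Minimal round-robin request splitting: each incoming request goes to the
# next worker in rotation, and adding a worker enlarges the rotation with no
# other reconfiguration.

class WorkerPool:
    def __init__(self, workers):
        self.workers = list(workers)
        self._next = 0  # monotonically increasing dispatch counter

    def add_worker(self, worker):
        """Scaling out: the new worker joins the rotation immediately."""
        self.workers.append(worker)

    def dispatch(self, request):
        """Send the request to the next worker in round-robin order."""
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        return worker, request

pool = WorkerPool(["w1", "w2"])
print([pool.dispatch(r)[0] for r in range(4)])  # ['w1', 'w2', 'w1', 'w2']
pool.add_worker("w3")
print([pool.dispatch(r)[0] for r in range(3)])  # ['w2', 'w3', 'w1']
```

A production platform would also track worker health and queue depth, but the load-balancing requirement stated in the first abstract, automatically including newly added nodes, is exactly the `add_worker` behavior shown here.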
A grid-based approach for processing group activity log files
The information collected regarding group activity in a collaborative learning environment requires classifying, structuring and processing. The aim is to process this information in order to extract, reveal and provide students and tutors with valuable knowledge, awareness and feedback so that the collaborative learning activity can be performed successfully. However, the large amount of information generated during online group activity may be time-consuming to process and can therefore hinder real-time delivery of the information. In this study we show how a Grid-based paradigm can be used to effectively process and present the information regarding group activity gathered in the log files of a collaborative environment. The computational power of the Grid makes it possible to process a huge amount of event information, compute statistical results and present them, when needed, to the members of the online group and the tutors, who are geographically distributed.
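The Grid-based processing the abstract describes follows a split/compute/merge pattern: partition the group-activity log, compute partial statistics per partition (each partition running on a different Grid node), then merge the partials into the summary shown to students and tutors. The sketch below runs the partitions locally for illustration; the event schema, function names, and partition count are assumptions, not the study's actual log format.

```python
# Split/compute/merge over an activity log. On a real Grid, each call to
# count_events would execute on a separate node; only merge runs centrally.
from collections import Counter

def partition(events, n):
    """Split the event log into n roughly equal chunks."""
    size = (len(events) + n - 1) // n
    return [events[i:i + size] for i in range(0, len(events), size)]

def count_events(chunk):
    """Per-node work: count event types within one chunk of the log."""
    return Counter(e["type"] for e in chunk)

def merge(partials):
    """Combine per-node partial counts into the global statistics."""
    total = Counter()
    for counts in partials:
        total += counts
    return total

log = [{"type": "post"}, {"type": "read"}, {"type": "post"}, {"type": "reply"}]
stats = merge(count_events(chunk) for chunk in partition(log, 2))
print(dict(stats))  # {'post': 2, 'read': 1, 'reply': 1}
```

Because the per-chunk counts are independent and the merge is associative, adding Grid nodes shortens processing time without changing the result, which is what makes near-real-time feedback feasible.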