Risk Intelligence: Making Profit from Uncertainty in Data Processing System
In extreme-scale data processing systems, fault tolerance is indispensable. Proactive fault tolerance schemes, such as speculative execution in the MapReduce framework, are introduced to dramatically improve job response time when failure becomes the norm rather than the exception. Efficient proactive fault tolerance requires precise knowledge of task executions, which has been an open challenge for decades. To address this issue, we design and implement RiskI, a profile-based prediction algorithm combined with a risk-aware task assignment algorithm, which accelerates task executions by taking the uncertain nature of tasks into account. Our design demonstrates that this inherent uncertainty brings not only great challenges but also new opportunities: with a careful design, we can benefit from such uncertainties. We implement the idea in Hadoop 0.21.0, and the experimental results show that, compared with the traditional LATE algorithm, response time improves by 46% at the same system throughput.
Giving Users the Steering Wheel for Guiding Resource-Adaptive Systems
This material is based upon work supported by the National Science Foundation (NSF) unde
Design and analysis of a 3-dimensional cluster multicomputer architecture using optical interconnection for petaFLOP computing
In this dissertation, the design and analyses of an extremely scalable distributed
multicomputer architecture using optical interconnects, with the potential to
deliver on the order of petaFLOP performance, are presented in detail. The design
takes advantage of optical technologies, harnessing the features inherent in optics,
to produce a 3D stack that efficiently implements a large, fully connected system of
nodes forming a true 3D architecture. To adopt optics in large-scale multiprocessor
cluster systems, efficient routing and scheduling techniques are needed. To this
end, novel self-routing strategies for all-optical packet-switched networks and on-line
scheduling methods that provide collision-free communication and achieve real-time
operation in high-speed multiprocessor systems are proposed. The system is designed
to allow failed/faulty nodes to stay in place without appreciable performance
degradation. The approach is to develop a dynamic communication environment that
will be able to effectively adapt and evolve with a high density of missing units or
nodes. A joint CPU/bandwidth controller that maximizes the resource allocation in
this dynamic computing environment is introduced with an objective to optimize the
distributed cluster architecture, preventing performance/system degradation in the
presence of failed/faulty nodes. A thorough analysis, feasibility study, and description
of the characteristics of a 3-dimensional multicomputer system capable of achieving
100 teraFLOP performance are presented in detail. Included in this dissertation are a
throughput analysis of the routing schemes, using methods from discrete-time queuing
systems and computer simulation results for the different proposed algorithms. A
prototype of the proposed 3D architecture is built and a test bed developed to obtain
experimental results that further demonstrate the feasibility of the design and validate
the initial assumptions, algorithms, simulations, and the optimized distributed resource
allocation scheme. Finally, as a prelude to further research, an efficient data routing
strategy for highly scalable distributed mobile multiprocessor networks is introduced.