2 research outputs found

    Virtual Cluster Management for Analysis of Geographically Distributed and Immovable Data

    Get PDF
    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2015Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large, and thus unable to move. A large dataset in Malaysia could be ecologically sensitive, for instance, and unable to be moved outside the country boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: with setup and control, with managing behavior of the virtual cluster, and with interoperability issues across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so that it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs. Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, though once the data has been processed, the aggregated or statistical result can be shared across the boundaries. This dissertation addresses management and control of a widely distributed virtual cluster having sensitive or otherwise immovable data sets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms. In recognition of sensitive data, it can establish a single network overlay over widely distributed clusters. We define a novel class of data, notably immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where needed. We draw from our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which are pinned data. The applications implemented in HMR use extended MapReduce model where computations are expressed as three functions: Map, Reduce, and GlobalReduce. Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimum. The HMR outperforms traditional MapReduce model while processing a particular class of applications. The evaluations also show that information sharing between resources and application through the VCC shortens the hierarchical data processing time, as well satisfying the constraints on the pinned data

    Grid application meta-repository system

    Get PDF
    As one of the most popular forms of distributed computing technology, Grid brings together different scientific communities that are able to deploy, access, and run complex applications with the help of the enormous computational and storage power offered by the Grid infrastructure. However as the number of Grid applications has been growing steadily in recent years, they are now stored on a multitude of different repositories, which remain specific to each Grid. At the time this research was carried out there were no two well-known Grid application repositories sharing the same structure, same implementation, same access technology and methods, same communication protocols, same security system or same application description language used for application descriptions. This remained a great limitation for Grid users, who were bound to work on only one specific repository, and also presented a significant limitation in terms of interoperability and inter-repository access. The research presented in this thesis provides a solution to this problem, as well as to several other related issues that have been identified while investigating these areas of Grid. Following a comprehensive review of existing Grid repository capabilities, I defined the main challenges that need to be addressed in order to make Grid repositories more versatile and I proposed a solution that addresses these challenges. To this end, I designed a new Grid repository (GAMRS – Grid Application Meta-Repository System), which includes a novel model and architecture, an improved application description language and a matchmaking system. After implementing and testing this solution, I have proved that GAMRS marks an improvement in Grid repository systems. Its new features allow for the inter-connection of different Grid repositories; make applications stored on these repositories visible on the web; allow for the discovery of similar or identical applications stored in different Grid repositories; permit the exchange and re-usage of application and applicationrelated objects between different repositories; and extend the use of applications stored on Grid repositories to other distributed environments, such as virtualized cluster-on-demand and cloud computing
    corecore