
    Distributed data mining in grid computing environments

    The computing-intensive mining of inherently Internet-wide distributed data, referred to as Distributed Data Mining (DDM), calls for the support of a powerful Grid with an effective scheduling framework. DDM often follows the computing paradigm of local processing and global synthesizing. It involves every phase of the Data Mining (DM) process, which makes the DDM workflow very complex: it can be modelled only by a Directed Acyclic Graph (DAG) with multiple data entries. Motivated by the need for a practical solution to the Grid scheduling problem for the DDM workflow, this paper proposes a novel two-phase scheduling framework, comprising External Scheduling and Internal Scheduling, on a two-level Grid architecture (InterGrid, IntraGrid). Currently a DM IntraGrid, named DMGCE (Data Mining Grid Computing Environment), has been developed with a dynamic scheduling framework for competitive DAGs in a heterogeneous computing environment. This system is implemented in an established Multi-Agent System (MAS) environment, in which existing DM algorithms are reused by encapsulating them into agents. Practical classification problems from oil well logging analysis are used to measure the system performance. The detailed experimental procedure and an analysis of the results are also discussed in this paper.
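To make the idea of a DAG workflow with multiple data entries concrete, here is a minimal sketch that groups tasks into waves that could run concurrently. The task names and the `schedule_in_waves` helper are hypothetical illustrations; this is plain topological ordering with Python's standard `graphlib`, not the two-phase External/Internal scheduling framework the paper proposes.

```python
from graphlib import TopologicalSorter

# Hypothetical DDM workflow with two data entries (one per site).
# Each task maps to the set of tasks it depends on.
workflow = {
    "preprocess_site_a": {"load_site_a"},
    "preprocess_site_b": {"load_site_b"},
    "mine_site_a": {"preprocess_site_a"},
    "mine_site_b": {"preprocess_site_b"},
    "synthesize_global_model": {"mine_site_a", "mine_site_b"},
}


def schedule_in_waves(dag):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = schedule_in_waves(workflow)
for number, tasks in enumerate(waves, start=1):
    print(number, tasks)
```

The local steps of both sites land in the same wave, while the global synthesis step waits for every local model, which mirrors the local-processing and global-synthesizing paradigm described above.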

    Grid data mining for outcome prediction in intensive care medicine

    This paper introduces a distributed data mining approach suited to grid computing environments, based on a supervised learning classifier system. The Specific Classifier and Majority Voting methods for Distributed Data Mining (DDM) are explored and compared with the Centralized Data Mining (CDM) approach. Experimental tests were conducted on a real-world data set from intensive care medicine in order to predict patient outcomes. The results demonstrate that the DDM methods perform better than the CDM method. Fundação para a Ciência e a Tecnologia (FCT)

    Grid data mining by means of learning classifier systems and distributed model induction

    This paper introduces a distributed data mining approach suited to grid computing environments, based on a supervised learning classifier system. Different methods of merging data mining models generated at different distributed sites are explored. Centralized Data Mining (CDM) is the conventional method of mining distributed data. In CDM, data stored in distributed locations has to be collected and stored in a central repository before the data mining algorithm is executed. The CDM method is reliable; however, it is expensive (computational, communication, and implementation costs are high). Alternatively, the Distributed Data Mining (DDM) approach is economical, but it has limitations in combining local models. In DDM, the data mining algorithm has to be executed at each of the sites to induce a local model. The induced local models are then collected and combined to form a global data mining model. In this work six different tactics are used for constructing the global model in DDM: Generalized Classifier Method (GCM); Specific Classifier Method (SCM); Weighed Classifier Method (WCM); Majority Voting Method (MVM); Model Sampling Method (MSM); and Centralized Training Method (CTM). Preliminary experimental tests were conducted with two synthetic data sets (eleven multiplexer and monks3) and a real-world data set (intensive care medicine). The initial results demonstrate that the performance of the DDM methods is competitive with that of the CDM methods. Fundação para a Ciência e a Tecnologia (FCT)
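The DDM procedure described above (induce a local model at each site, then combine the local models) can be sketched with majority voting as the combination step. This is only an illustration under stated assumptions: the paper uses a supervised learning classifier system, whereas `train_local_model` below is a hypothetical toy learner, and the site data are invented for the example.

```python
from collections import Counter


def train_local_model(records):
    """Toy local learner (an assumption, not a learning classifier system):
    predicts the most frequent label seen for each feature value."""
    by_feature = {}
    for feature, label in records:
        by_feature.setdefault(feature, Counter())[label] += 1
    table = {f: counts.most_common(1)[0][0] for f, counts in by_feature.items()}
    default = Counter(label for _, label in records).most_common(1)[0][0]
    return lambda feature: table.get(feature, default)


def majority_vote(models, feature):
    """Global prediction: the label predicted by most local models."""
    votes = Counter(model(feature) for model in models)
    return votes.most_common(1)[0][0]


# Invented data held at three sites; the raw records never leave their site.
site_1 = [("high", "risk"), ("high", "risk"), ("low", "safe")]
site_2 = [("high", "risk"), ("low", "safe"), ("low", "safe")]
site_3 = [("high", "safe"), ("low", "safe")]  # a site with atypical data

models = [train_local_model(site) for site in (site_1, site_2, site_3)]
print(majority_vote(models, "high"))
print(majority_vote(models, "low"))
```

Only the trained models are exchanged, not the records, which is the source of the communication savings that DDM offers over CDM.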

    A job response time prediction method for production Grid computing environments

    A major obstacle to the widespread adoption of Grid computing in both the scientific community and the industry sector is the difficulty of knowing in advance the running cost of a job submission, which is needed to plan a correct allocation of resources. Traditional distributed computing solutions take advantage of homogeneous and open environments to propose prediction methods that rely on a detailed analysis of the hardware and software components. However, production Grid computing environments, which are large and use a complex and dynamic set of resources, present a different challenge. In Grid computing the source code of applications, programme libraries, and third-party software is not always available. In addition, Grid security policies may not permit running hardware or software analysis tools to generate models of Grid components. The objective of this research is the prediction of a job's response time in production Grid computing environments. The solution is inspired by the concept of predicting future Grid behaviour based on previous experience learned from heterogeneous Grid workload trace data. The research objective was selected with the aim of improving the usability of Grid resources and the administration of Grid environments. The predicted data can be used to allocate resources in advance and to report forecast finishing times and running costs before submission. The proposed Grid Computing Response Time Prediction (GRTP) method implements several internal stages in which the workload traces are mined to produce a response time prediction for a given job. In addition, the GRTP method assesses the predicted result against the actual response time of the target job to infer information that is used to tune the method's parameters. The GRTP method was implemented and tested using a cross-validation technique to assess how the proposed solution generalises to independent data sets. The training set was taken from the Grid environment DAS (Distributed ASCI Supercomputer). The two testing sets were taken from the AuverGrid and Grid5000 Grid environments. Three consecutive tests were carried out: assuming stable jobs, assuming unstable jobs, and using a job type method to select the most appropriate prediction function. The tests showed a significant increase in prediction performance for data mining based methods applied in Grid computing environments. For instance, in Grid5000 the GRTP method answered 77 percent of job prediction requests with an error of less than 10 percent. In the same environment, the most effective and accurate method using workload traces was only able to predict 32 percent of the cases within the same range of error. The GRTP method was able to handle unexpected changes in resources and services that affect job response time trends, and was able to adapt to new scenarios. The tests showed that the proposed GRTP method is capable of predicting job response times and that it improves the prediction quality when compared with other current solutions.
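The evaluation figures above (the share of prediction requests answered with an error below 10 percent) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the median-of-similar-jobs predictor is a simple stand-in rather than the GRTP method, the trace values are invented, and the error is taken to be the relative error |predicted - actual| / actual, which the abstract does not define.

```python
from statistics import median


def predict_response_time(history, job_type):
    """Stand-in predictor (not GRTP): median response time of past jobs
    of the same type, falling back to all jobs if the type is unseen."""
    similar = [seconds for kind, seconds in history if kind == job_type]
    if not similar:
        similar = [seconds for _, seconds in history]
    return median(similar)


def fraction_within(predictions, actuals, tolerance=0.10):
    """Share of predictions whose relative error is below the tolerance."""
    hits = sum(
        1 for p, a in zip(predictions, actuals) if abs(p - a) / a < tolerance
    )
    return hits / len(actuals)


# Invented workload trace: (job type, response time in seconds).
history = [
    ("short", 100.0), ("short", 110.0), ("short", 90.0),
    ("long", 1000.0), ("long", 1200.0),
]
# Invented test jobs with their actual response times.
test_jobs = [("short", 105.0), ("short", 150.0), ("long", 1080.0), ("long", 2000.0)]

predictions = [predict_response_time(history, kind) for kind, _ in test_jobs]
actuals = [seconds for _, seconds in test_jobs]
score = fraction_within(predictions, actuals)
print(score)
```

With these invented numbers, two of the four predictions fall within the 10 percent band, so the score is 0.5.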

    Grid data mining strategies for outcome prediction in distributed intensive care units

    Previous work on predicting patient outcomes in intensive care units brought to light some requirements, such as the need to deal with distributed data sources. Those data sources can be used to induce local prediction models, and those models can in turn be used to induce global models that are more accurate and more general than the local ones. This paper introduces a distributed data mining approach suited to grid computing environments, based on a supervised learning classifier system. Five different tactics are explored for constructing the global model in a Distributed Data Mining (DDM) approach: Generalized Classifier Method (GCM); Specific Classifier Method (SCM); Weighed Classifier Method (WCM); Majority Voting Method (MVM); and Model Sampling Method (MSM). Experimental tests were conducted with a real-world data set from intensive care medicine. The results demonstrate that the performance of the DDM methods is very competitive with that of the centralized methods. Fundação para a Ciência e a Tecnologia (FCT)

    A Parallel Implementation of the K Nearest Neighbours Classifier in Three Levels: Threads MPI Processes and the Grid

    The work described in this paper tackles the problem of data mining and classification of large amounts of data using the K nearest neighbours classifier (KNN) [1]. The large computing demand of this process is met with a parallel computing implementation specially designed to work in Grid environments of multiprocessor computer farms. The different parallel computing approaches (intra-node, inter-node and inter-organisation) are not sufficient by themselves to meet the computing demand of such a big problem. Instead of using parallel techniques separately, we propose to combine the three of them, taking into account the parallelism grain of the different parts of the problem. The main purpose is to complete a job requiring one month of CPU time in a few hours. The technologies used are the EGEE Grid Computing Infrastructure running the Large Hadron Collider Computing Grid (LCG 2.6) middleware [3], MPI [4] [5] and POSIX threads [6]. Finally, we compare the results obtained with the most popular and widely used tools to understand the importance of this strategy.

    Aparicio Pla, G.; Blanquer Espert, I.; Hernández García, V. (2007). A Parallel Implementation of the K Nearest Neighbours Classifier in Three Levels: Threads MPI Processes and the Grid. In High Performance Computing for Computational Science - VECPAR 2006. Springer Verlag (Germany). 225-235. doi:10.1007/978-3-540-71351-7_18

    [1] Cover, T.M., Hart, P.E.: Nearest neighbour pattern recognition. IEEE Trans. on Information Theory 13(1), 21-27 (1967)
    [2] Foster, I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International J. Supercomputer Applications 15(3) (2001), http://www.globus.org/research/papers/anatomy.pdf
    [3] LCG: World Wide Web Computing Grid. Distributed Production Environment of Physics Data Processing. http://lcg.web.cern.ch/LCG
    [4] Message Passing Interface Forum: MPI: A message-passing interface standard (2003), http://www.mpi-forum.org/
    [5] Gropp, W., et al.: MPI: The Complete Reference. MIT Press, Cambridge (1998)
    [6] Drepper, U., Molnar, I.: The Native POSIX Thread Library for Linux (2003), http://people.redhat.com/drepper/nptl-design.pdf
    [7] Frank, E., Hall, M., L.T.: Weka 3: Data Mining Software in Java (2005), http://www.cs.waikato.ac.nz/ml/wek
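The reason KNN lends itself to multi-level parallelism can be shown with a small sketch: the training data is split into partitions, each partition reports its own k nearest candidates, and the candidates are merged into the global k nearest. This is a standard decomposition written sequentially in plain Python, with invented data and hypothetical function names; it is not the authors' threads/MPI/Grid implementation, in which each partition would be handled by a separate thread, process or Grid job.

```python
import heapq
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance (ordering is the same as for the distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_nearest(partition, query, k):
    """The k nearest (distance, label) pairs within one data partition."""
    scored = ((squared_distance(point, query), label) for point, label in partition)
    return heapq.nsmallest(k, scored)


def knn_classify(partitions, query, k):
    """Merge per-partition candidates, then vote among the global k nearest."""
    candidates = []
    for partition in partitions:  # each iteration is independent of the others
        candidates.extend(local_nearest(partition, query, k))
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented training data, split into two partitions.
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((4.8, 5.1), "b"), ((5.2, 4.9), "b")],
]

print(knn_classify(partitions, (0.0, 0.1), k=3))
print(knn_classify(partitions, (5.0, 5.0), k=3))
```

Because the global k nearest neighbours must all appear among the per-partition k nearest, the merge step gives the same answer as a search over the whole data set, while only k candidates per partition need to be communicated.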

    An Approach to Model Resources Rationalisation in Hybrid Clouds through Users Activity Characterisation

    In recent years, some strategies (e.g., server consolidation by means of virtualisation techniques) have helped the managers of large Information Technology (IT) infrastructures to limit, where possible, the use of hardware resources in order to provide reliable services and to reduce the Total Cost of Ownership (TCO) of such infrastructures. Moreover, with the advent of Cloud computing, resource usage can also be rationalised for users' applications, provided this is compatible with the Quality of Service (QoS) that must be guaranteed. From this perspective, modern datacenters are “elastic”, i.e., able to shrink or enlarge the number of local physical or virtual resources drawn from private/public Clouds. Moreover, many large computing environments are integrated into distributed computing environments such as grid and cloud infrastructures. In this document, we report some advances in the realisation of a utility, which we named the Adaptive Scheduling Controller (ASC), that interacts with the datacenter resource manager to allow an effective and efficient usage of resources, in part by classifying users' jobs. Here, we focus both on some data mining algorithms that classify user activity and on the mathematical formalisation of the functional used by ASC to find the most suitable configuration for the datacenter’s resource manager. The presented case study concerns the SCoPE infrastructure, which has a twofold role: local computing resources provider for the University of Naples Federico II and remote resources provider for both the Italian Grid Infrastructure (IGI) and the European Grid Infrastructure (EGI) Federated Cloud.