
    Legacy code support for production grids

    In order to improve reliability and to deal with the high complexity of existing middleware solutions, today's production Grid systems restrict the services that can be deployed on their resources. On the other hand, end-users require a wide range of value-added services to fully utilize these resources. This paper describes how legacy code support can be offered as a third-party service for production Grids. The introduced solution, based on the Grid Execution Management for Legacy Code Architecture (GEMLCA), does not require the deployment of additional applications on the Grid resources, nor any extra effort from Grid system administrators. The implemented solution was successfully connected to and demonstrated on the UK National Grid Service. © 2005 IEEE
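    As a purely illustrative sketch of the idea of offering legacy code as a third-party service (the class and method names below are hypothetical and not part of GEMLCA), a thin service layer that invokes a pre-deployed legacy binary on behalf of a user might look as follows:

```python
# Hypothetical sketch: exposing a pre-deployed legacy executable as a service,
# without installing anything new on the Grid resource itself.
import subprocess
from dataclasses import dataclass


@dataclass
class LegacyJob:
    """Describes one invocation of a legacy binary already present on the resource."""
    executable: str        # path of the pre-deployed legacy code
    arguments: list[str]   # command-line arguments supplied by the end user
    working_dir: str       # scratch directory staged for this job


class LegacyCodeService:
    """Third-party service facade: accepts job descriptions and runs the legacy code."""

    def submit(self, job: LegacyJob) -> int:
        # In a real deployment this call would be handed to the Grid resource's
        # own job manager; it is run locally here only to illustrate the idea.
        completed = subprocess.run(
            [job.executable, *job.arguments],
            cwd=job.working_dir,
            capture_output=True,
            text=True,
        )
        return completed.returncode
```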

    Integration of Data Mining into Scientific Data Analysis Processes

    In recent years, the use of advanced semi-interactive data analysis algorithms, such as those from the field of data mining, has gained more and more importance in the life sciences in general and in bioinformatics, genetics, medicine and biodiversity in particular. Today, there is a trend away from collecting and evaluating data only in the context of a specific problem or study, and towards extensively collecting data from different sources in repositories that are potentially useful for subsequent analysis, e.g. in the Gene Expression Omnibus (GEO) repository of high-throughput gene expression data. At the time the data are collected, they are analysed in a specific context which influences the experimental design. However, the type of analyses that the data will be used for after they have been deposited is not known. Content and data format are tailored to the initial experiment only, not to future re-use. Thus, complex process chains are needed for the analysis of the data, and such process chains need to be supported by the environments that are used to set up analysis solutions. Building specialized software for each individual problem is not a solution, as this effort can only be carried out for huge projects running for several years. Hence, data mining functionality has been developed into toolkits, which provide data mining functionality in the form of a collection of different components. Depending on the research questions of the users, solutions consist of distinct compositions of these components.

    Today, existing solutions for data mining processes comprise different components that represent different steps in the analysis process, and there are graphical or script-based toolkits for combining such components. However, the data mining tools that can serve as components in analysis processes are based on single-computer environments, local data sources and single users, whereas analysis scenarios in medical informatics and bioinformatics have to deal with multi-computer environments, distributed data sources and multiple users that have to cooperate. Users need support for integrating data mining into analysis processes in the context of such scenarios, and such support is lacking today. Typically, analysts working with single-computer environments face the problem of large data volumes, since tools do not address scalability and access to distributed data sources. Distributed environments such as grid environments provide scalability and access to distributed data sources, but the integration of existing components into such environments is complex, and new components often cannot be developed directly in distributed environments. Moreover, in scenarios involving multiple computers, multiple distributed data sources and multiple users, the reuse of components, scripts and analysis processes becomes more important, as more steps and more configuration are necessary and thus much greater effort is needed to develop and set up a solution.

    In this thesis we introduce an approach for supporting interactive and distributed data mining for multiple users, based on infrastructure principles that allow building on data mining components and processes that are already available, instead of designing a completely new infrastructure, so that users can keep working with their familiar tools. In order to achieve the integration of data mining into scientific data analysis processes, this thesis proposes a stepwise approach to supporting the user in the development of analysis solutions that include data mining.

    We see our major contributions as the following. First, we propose an approach to integrate data mining components developed for single-processor environments into grid environments; in this way, we support users in reusing standard data mining components with little effort. The approach is based on a metadata schema definition which is used to grid-enable existing data mining components. Second, we describe an approach for interactively developing data mining scripts in grid environments. The approach efficiently supports users when it is necessary to enhance available components, to develop new data mining components, and to compose these components. Third, building on that, an approach for facilitating the reuse of existing data mining processes based on process patterns is presented. It supports users in scenarios that cover different steps of the data mining process involving several components or scripts. The data mining process patterns support the description of data mining processes at different levels of abstraction, between the CRISP model as the most general and executable workflows as the most concrete representation.
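    A minimal sketch of the kind of component metadata such a grid-enabling approach can rely on (the field names below are illustrative assumptions, not the schema defined in the thesis):

```python
# Hypothetical component descriptor: metadata of this kind can be used to
# generate a grid-service wrapper around an existing data mining component.
from dataclasses import dataclass, field


@dataclass
class ComponentDescriptor:
    name: str                    # unique component name
    entry_point: str             # class or script implementing the algorithm
    inputs: dict[str, str] = field(default_factory=dict)   # parameter name -> type
    outputs: dict[str, str] = field(default_factory=dict)  # result name -> type
    requirements: list[str] = field(default_factory=list)  # libraries the component needs


# Example descriptor for a clustering component (names are placeholders).
kmeans = ComponentDescriptor(
    name="kmeans-clustering",
    entry_point="mining.clustering.KMeans",
    inputs={"data": "table", "k": "int"},
    outputs={"clusters": "table"},
    requirements=["numpy"],
)
```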

    Federated Query Processing for the Semantic Web

    Recent years have witnessed a constant growth in the amount of RDF data available on the Web. This growth is largely driven by the increasing rate of data publication on the Web by different actors such as governments, life science researchers or geographical institutes. RDF data generation is mainly done by converting already existing legacy data resources into RDF (e.g. converting data stored in relational databases into RDF), but also by creating RDF data directly (e.g. from sensors). These RDF data are normally exposed by means of Linked Data-enabled URIs and SPARQL endpoints. Given the sustained growth in the number of SPARQL endpoints available, the need to be able to send federated SPARQL queries across them has also grown. Tools for accessing sets of RDF data repositories are starting to appear, differing in the way in which they allow users to access these data (allowing users to specify directly which RDF data sets they want to query, or making this process transparent to them). To overcome this heterogeneity in federated query processing solutions, the W3C SPARQL working group is defining a federation extension for SPARQL 1.1 which allows combining, in a single query, graph patterns that can be evaluated at several endpoints. In this PhD thesis, we describe the syntax of that SPARQL extension for providing access to distributed RDF data sets and formalise its semantics. We adapt existing techniques for distributed data access in relational databases in order to deal with SPARQL endpoints, and we have implemented them in our federated query evaluation system (SPARQL-DQP). We describe the static optimisation techniques implemented in our system and carry out a series of experiments showing that our optimisations significantly speed up query evaluation in the presence of large query results and optional operators.
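    For illustration, a federated query of the kind enabled by the SPARQL 1.1 federation extension can be issued from Python with the SPARQLWrapper library; the endpoint URLs and the graph pattern below are placeholders rather than data sets used in the thesis:

```python
# Illustrative federated SPARQL query: the SERVICE clause forwards part of the
# graph pattern to a second endpoint, so one query spans two RDF repositories.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {
  ?person a foaf:Person .
  SERVICE <http://example.org/sparql-b> {   # evaluated at the remote endpoint
    ?person foaf:name ?name .
  }
}
"""

endpoint = SPARQLWrapper("http://example.org/sparql-a")  # placeholder endpoint
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["person"]["value"], binding["name"]["value"])
```

    The SERVICE clause is what lets a single query combine graph patterns evaluated at different endpoints, which is the heterogeneity the federation extension addresses.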

    Interoperability of heterogeneous large-scale scientific workflows and data resources

    Workflows allow e-Scientists to express their experimental processes in a structured way and provide the glue to integrate remote applications. Since the Grid provides an enormous amount of data and computational resources, executing workflows on the Grid results in significant performance improvements. Several workflow management systems, which are widely used by different scientific communities, were developed for various purposes and therefore differ in several aspects. This thesis outlines two major problems of existing workflow systems: workflow interoperability and data access.

    On the one hand, existing workflow systems are based on different technologies, so achieving interoperability between their workflows at any level is a challenging task. Although there is a clear demand for interoperable workflows, for example to enable scientists to share workflows, to leverage the existing work of others, and to create multi-disciplinary workflows, currently only limited, ad-hoc workflow interoperability solutions are available to scientists. Existing solutions only realise workflow interoperability between a small set of workflow systems and do not consider the performance issues that arise in the case of large-scale (computational and/or data-intensive) scientific workflows. Scientific workflows are typically computation and/or data intensive and are executed in a distributed environment to speed up their execution; their performance is therefore a key issue. Existing interoperability solutions bottleneck the communication between workflows and, in most scenarios, dramatically increase execution time.

    On the other hand, many scientific computational experiments are based on data that reside in data resources of different types and from different vendors. Many workflow systems support access only to limited subsets of such data resources, preventing data-level workflow interoperation between different systems. Therefore, there is a demand for a general solution that provides access to a wide range of data resources of different types and vendors. If such a solution is general, in the sense that it can be adopted by several workflow systems, then it also enables workflows of different systems to access the same data resources and therefore to interoperate at the data level. Note that data semantics are out of the scope of this work. For the same reasons as described above, the performance characteristics of such a solution are inevitably important. Although there are solutions which, in terms of functionality, could be adopted by workflow systems for this purpose, they provide poor performance and for that reason did not gain wide acceptance in the scientific workflow community.

    Addressing these issues, a set of architectures is proposed to realise heterogeneous data access and heterogeneous workflow execution solutions. The primary goal was to investigate how such solutions can be implemented and integrated with workflow systems. The secondary aim was to analyse how such solutions can be implemented and utilised by single applications.
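    As a hedged sketch of the kind of uniform data-access layer such architectures aim at (the interface and class names below are invented for illustration and do not correspond to the proposed architectures):

```python
# Hypothetical uniform data-access interface: workflow systems program against
# the abstract class, while concrete adapters hide the type and vendor of the
# underlying data resource.
from abc import ABC, abstractmethod
from typing import Iterable
import sqlite3


class DataResource(ABC):
    """Common interface that any data resource adapter must implement."""

    @abstractmethod
    def query(self, expression: str) -> Iterable[tuple]:
        """Evaluate a resource-specific query and stream back result rows."""


class RelationalResource(DataResource):
    """Adapter for a relational database (SQLite here, purely for illustration)."""

    def __init__(self, path: str):
        self._conn = sqlite3.connect(path)

    def query(self, expression: str) -> Iterable[tuple]:
        # Rows are yielded one by one, so large result sets stream to the caller.
        yield from self._conn.execute(expression)
```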

    Design of a Workflow-Based Grid Framework

    This paper presents the design of the Grid Collaborative Framework proposed in our previous work. Grid infrastructure for resource sharing has become reasonably stable with the wide acceptance of the Open Grid Services Architecture (OGSA) and the Web Services Resource Framework (WSRF), but Grid frameworks for collaboration are far from mature. Current Grid Collaborative Frameworks (GCFs) are domain-specific and lack support for work plans. These limitations make them less useful and narrow their scope of application; our Grid collaborative framework aims to overcome them. With a theoretical foundation based on activity theory and workflow languages, and designed on top of the existing OGSA infrastructure, the proposed framework aims at accelerating the development of Grid collaborative systems in which work plans play a central role.

    3rd EGEE User Forum

    We have organized this book as a sequence of chapters, each associated with an application or technical theme and introduced by an overview of its contents and a summary of the main conclusions from the Forum on that topic. The first chapter gathers all the plenary session keynote addresses; following this there is a sequence of chapters covering the application-flavoured sessions, which are in turn followed by chapters with the flavour of Computer Science and Grid Technology. The final chapter covers the large number of practical demonstrations and posters exhibited at the Forum. Much of the work presented has a direct link to specific areas of Science, and so we have created a Science Index, presented below. In addition, at the end of this book, we provide a complete list of the institutes and countries involved in the User Forum.

    Optimisation of the enactment of fine-grained distributed data-intensive work flows

    The emergence of data-intensive science as the fourth science paradigm has posed a data deluge challenge for enacting scientific work-flows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, besides having to deal with the heterogeneity and complexity of data, applications and execution environments. New scientific work-flows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such work-flows not only requires larger storage space and faster machines, but also the capability to support the scalability and diversity of the users, applications, data, computing resources and enactment technologies. We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a work-flow language. The work-flow language should be both human-readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation: data-flow between computational elements in the scientific work-flow is implemented as streams. To cope with the exploratory nature of scientific work-flows, the architecture should support fast work-flow prototyping and the re-use of work-flows and work-flow components. Above all, the enactment process should be easily repeated and automated. In this thesis, we present a candidate data-intensive architecture that includes an intermediate work-flow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise these data systematically. We propose a new enactment strategy to demonstrate that the optimisation of data-streaming work-flows can be automated by exploiting performance data gathered during previous enactments.
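    A minimal illustration of the data-streaming idea, with plain Python generators standing in for DISPEL processing elements (this is not the actual enactment platform):

```python
# Data-flow between computational elements expressed as streams: each processing
# element consumes an upstream iterator and yields results downstream, so large
# data sets are never materialised in full.
from typing import Iterable, Iterator


def read_source(values: Iterable[float]) -> Iterator[float]:
    """Source element: emits raw measurements one at a time."""
    yield from values


def smooth(stream: Iterator[float], window: int = 3) -> Iterator[float]:
    """Transformation element: sliding-window mean over the incoming stream."""
    buffer: list[float] = []
    for value in stream:
        buffer.append(value)
        if len(buffer) > window:
            buffer.pop(0)
        yield sum(buffer) / len(buffer)


# Compose the pipeline: data flows through without intermediate storage.
pipeline = smooth(read_source([1.0, 2.0, 6.0, 3.0, 5.0]))
print(list(pipeline))
```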

    Understanding Semantic Aware Grid Middleware for e-Science

    In this paper we analyze several semantic-aware Grid middleware services used in e-Science applications. We describe them according to a common analysis framework, so as to identify their commonalities and distinguishing features. As a result of this analysis we categorize these services into three groups: information services, data access services and decision support services. We make comparisons and provide additional conclusions that are useful for better understanding how these services have been developed and deployed, and how similar services could be developed in the future, mainly in the context of e-Science applications.