
    Integration of Data Mining into Scientific Data Analysis Processes

    In recent years, the use of advanced semi-interactive data analysis algorithms, such as those from the field of data mining, has gained more and more importance in the life sciences in general, and in bioinformatics, genetics, medicine and biodiversity research in particular. Today, there is a trend away from collecting and evaluating data only in the context of a specific problem or study, towards extensively collecting data from different sources in repositories where they are potentially useful for subsequent analysis, e.g. the Gene Expression Omnibus (GEO) repository of high-throughput gene expression data. When the data are collected, they are analysed in a specific context which influences the experimental design. However, the types of analyses that the data will be used for after they have been deposited are not known. Content and data format are tailored only to the first experiment, not to future re-use. Thus, complex process chains are needed for the analysis of the data, and such process chains need to be supported by the environments that are used to set up analysis solutions. Building specialized software for each individual problem is not a solution, as this effort can only be justified for large projects running for several years. Hence, data mining functionality has been bundled into toolkits that provide it as a collection of different components. Depending on the users' research questions, solutions consist of distinct compositions of these components. Today, existing solutions for data mining processes comprise different components that represent different steps in the analysis process, and there are graphical or script-based toolkits for combining such components. However, the data mining tools that can serve as components in analysis processes are based on single-computer environments, local data sources and single users, whereas analysis scenarios in medical informatics and bioinformatics have to deal with multi-computer environments, distributed data sources and multiple users who have to cooperate. Support for integrating data mining into analysis processes in such scenarios is lacking today. Typically, analysts working in single-computer environments face the problem of large data volumes, since their tools address neither scalability nor access to distributed data sources. Distributed environments such as grid environments provide scalability and access to distributed data sources, but integrating existing components into such environments is complex, and new components often cannot be developed directly in distributed environments. Moreover, in scenarios involving multiple computers, multiple distributed data sources and multiple users, the reuse of components, scripts and analysis processes becomes more important, as more steps and more configuration are necessary, and thus much greater effort is needed to develop and set up a solution. In this thesis we introduce an approach for supporting interactive and distributed data mining for multiple users, based on infrastructure principles that allow building on data mining components and processes that are already available instead of designing a completely new infrastructure, so that users can keep working with their familiar tools. In order to achieve the integration of data mining into scientific data analysis processes, this thesis proposes a stepwise approach to supporting the user in the development of analysis solutions that include data mining.
We see our major contributions as the following. First, we propose an approach for integrating data mining components developed for single-processor environments into grid environments; this supports users in reusing standard data mining components with little effort. The approach is based on a metadata schema definition which is used to grid-enable existing data mining components. Second, we describe an approach for interactively developing data mining scripts in grid environments. It efficiently supports users when it is necessary to enhance available components, to develop new data mining components, and to compose these components. Third, building on that, we present an approach for facilitating the reuse of existing data mining processes based on process patterns. It supports users in scenarios that cover different steps of the data mining process, including several components or scripts. The data mining process patterns support the description of data mining processes at different levels of abstraction, between the CRISP model as the most general and executable workflows as the most concrete representation.
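The metadata schema itself is not reproduced in this abstract. As a minimal illustrative sketch (all field and function names below are assumptions, not the schema from the thesis), a descriptor used to grid-enable an existing single-computer data mining component might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class ComponentDescriptor:
    """Hypothetical metadata for grid-enabling a data mining component."""
    name: str                  # component identifier, e.g. "kmeans"
    executable: str            # command or entry point on the worker node
    input_formats: list = field(default_factory=list)   # accepted formats, e.g. ["csv"]
    output_formats: list = field(default_factory=list)  # produced formats
    parameters: dict = field(default_factory=dict)      # tunable settings with defaults

def to_job_spec(c: ComponentDescriptor, input_uri: str, output_uri: str) -> dict:
    """Translate the descriptor into a generic grid job specification."""
    return {
        "executable": c.executable,
        "arguments": [f"--{k}={v}" for k, v in c.parameters.items()],
        "stage_in": [input_uri],
        "stage_out": [output_uri],
    }

# Example: wrap a local clustering tool as a grid job.
kmeans = ComponentDescriptor(
    name="kmeans",
    executable="run_kmeans.sh",
    input_formats=["csv"],
    output_formats=["csv"],
    parameters={"k": 3, "max_iter": 100},
)
print(to_job_spec(kmeans, "grid://data/in.csv", "grid://data/out.csv"))
```

The point of such a descriptor is that the wrapped component itself stays unchanged; only the metadata needed to stage data in and out and to launch the component on a grid node is added around it.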

    Dependable workflow management system for smart farms

    Smart farming is a new and emerging domain representing the application of modern technologies to agriculture, leading to a revolution of this classic domain. CLUeFARM is a web platform in the domain of smart farming whose main purpose is to help farmers easily manage and supervise their farms from any device connected to the Internet, offering a set of useful services. Cloud technologies have evolved considerably in recent years and, based on this growth, microservices are used more and more. While on the server side scalability and reusability are largely addressed by microservices, on the client side of web applications there was no equivalent independent solution until the recent emergence of web components, which can be seen as the microservices of the front end. Microservices and web components are usually used in isolation from each other. This paper proposes and presents the functionality and implementation of a dependable workflow management service using an end-to-end microservices approach.
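The paper's implementation is not shown in the abstract. The following is a minimal sketch of the end-to-end idea, exposing workflow state through a small HTTP microservice (the Flask framework, the endpoint paths, and the in-memory store are illustrative assumptions, not CLUeFARM's actual API):

```python
from uuid import uuid4
from flask import Flask, jsonify, request

app = Flask(__name__)
workflows = {}  # in-memory store; a dependable service would persist and replicate this

@app.post("/workflows")
def create_workflow():
    """Register a new workflow as an ordered list of task names."""
    wf_id = str(uuid4())
    workflows[wf_id] = {"tasks": request.json["tasks"], "done": []}
    return jsonify({"id": wf_id}), 201

@app.post("/workflows/<wf_id>/advance")
def advance(wf_id):
    """Mark the next pending task as completed and report progress."""
    wf = workflows[wf_id]
    pending = [t for t in wf["tasks"] if t not in wf["done"]]
    if pending:
        wf["done"].append(pending[0])
    return jsonify({"completed": wf["done"], "pending": pending[1:]})

if __name__ == "__main__":
    app.run(port=8080)
```

Because each such service owns a single concern and a small API, it can be deployed, scaled and replaced independently, which is what makes the microservice decomposition attractive for dependability.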

    Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World

    This report documents the program and the outcomes of GI-Dagstuhl Seminar 16394, "Software Performance Engineering in the DevOps World". The seminar addressed the problem of performance-aware DevOps. Both DevOps and performance engineering have been growing trends over the past one to two years, in no small part due to the rising importance of identifying performance anomalies in the operations (Ops) of cloud and big data systems and feeding these back to development (Dev). However, so far the research community has treated software engineering, performance engineering, and cloud computing mostly as individual research areas. We aimed to identify opportunities for cross-community collaboration and to set the path for long-lasting collaborations towards performance-aware DevOps. The main goal of the seminar was to bring together young researchers (PhD students in a later stage of their PhD, as well as postdocs or junior professors) in the areas of (i) software engineering, (ii) performance engineering, and (iii) cloud computing and big data to present their current research projects, exchange experience and expertise, discuss research challenges, and develop ideas for future collaborations.
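As a concrete illustration of the feedback loop the seminar targeted (detecting performance anomalies in operations and feeding them back to development), here is a minimal sketch of threshold-based anomaly detection over response-time samples; the window size and threshold are arbitrary assumptions, and real Ops pipelines use far more sophisticated detectors:

```python
import statistics

def detect_anomalies(latencies_ms, window=20, threshold=3.0):
    """Flag response-time samples that deviate strongly from the recent baseline.

    A sample is anomalous if it lies more than `threshold` standard
    deviations above the mean of the preceding `window` samples.
    """
    anomalies = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev > 0 and latencies_ms[i] > mean + threshold * stdev:
            anomalies.append((i, latencies_ms[i]))
    return anomalies

# Example: a latency spike at index 25 stands out against a stable baseline.
samples = [100 + (i % 5) for i in range(25)] + [480] + [100] * 10
print(detect_anomalies(samples))  # [(25, 480)]
```

Each flagged sample would then be reported back to the development side, e.g. annotated with the deployment version that was running when the anomaly occurred.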

    The Open Science Commons for the European Research Area

    Nowadays, research practice in all scientific disciplines is increasingly, and in many cases exclusively, data driven. Knowledge of how to use tools to manipulate research data, and the availability of e-Infrastructures supporting data storage, processing, analysis and preservation, are fundamental. In parallel, new types of communities are forming around shared interests in digital tools, computing facilities and data repositories. By making infrastructure services, community engagement and training inseparable, existing communities can be empowered by new ways of doing research, and new communities can be created around tools and data. Europe is ideally positioned to become a world leader as a provider of research data for the benefit of research communities and the wider economy and society, and would benefit from an integrated infrastructure where data and computing services for big data can be easily shared and reused. This is particularly challenging in Earth Observation (EO), given the volume and variety of the data, which make scalable access difficult, if not impossible, for individual researchers and small groups (i.e. for the so-called long tail of science). To overcome this limitation, as part of the European Commission's Digital Single Market strategy, the European Open Science Cloud (EOSC) initiative was launched in April 2016, with the final aim of realising the European Research Area (ERA) and raising research to the next level. It promotes not only scientific excellence and data reuse, but also job growth and increased competitiveness in Europe, and results in Europe-wide cost efficiencies in scientific infrastructure through the promotion of interoperability on an unprecedented scale. This chapter analyses existing barriers to achieving this aim and proposes the Open Science Commons as the fundamental principle for creating an EOSC able to offer an integrated infrastructure for the depositing, sharing and reuse of big data, including EO data, leveraging and enhancing the current e-Infrastructure landscape through standardization, interoperability, policy and governance. Finally, it is shown how an EOSC built on e-Infrastructures can improve the discovery, retrieval and processing of EO data, offering virtualised access to geographically distributed data and to the computing capacity necessary to manipulate and manage large volumes. Well-established e-Infrastructure services could provide a set of reusable components that accelerate the development of exploitation platforms for satellite data by solving common problems such as user authentication and authorisation, monitoring and accounting.
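To make the notion of a reusable e-Infrastructure component concrete, the sketch below shows a toy token-issuing and token-checking routine of the kind an exploitation platform might delegate authentication to. The HMAC scheme and all names are illustrative assumptions only; a production platform would rely on a federated AAI service, not a shared secret in code:

```python
import hashlib
import hmac

SECRET = b"shared-secret"  # illustrative; real platforms use a federated identity service

def sign_token(user_id: str) -> str:
    """Issue a signed token binding a user identity (toy HMAC scheme)."""
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{sig}"

def verify_token(token: str) -> bool:
    """Check that the token was issued with the shared secret."""
    user_id, _, sig = token.partition(".")
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

tok = sign_token("alice")
print(verify_token(tok))          # True
print(verify_token("alice.bad"))  # False
```

The value of offering such a component centrally is that every exploitation platform built on the e-Infrastructure can reuse the same vetted authentication logic instead of reimplementing it.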

    Innovations with Smart Service Systems: Analytics, Big Data, Cognitive Assistance, and the Internet of Everything

    Service innovations, enabled by the confluence of big data, mobile solutions, cloud, social, and cognitive computing, and the Internet of Things, have gained a lot of attention among enterprises in the past few years because they represent promising ways for companies to deliver new services effectively and rapidly. But one of today's most pervasive and bedeviling challenges is how to start this journey and stay on course. In this paper, we review some of the important developments in this area and report the views voiced by five industry leaders from IBM, Cisco, HP, and ISSIP at a panel session at the 24th Annual Compete Through Service Symposium in 2013. Panelists provided an extensive list of recommendations for academics and professionals. The biggest conclusion is that all information and communications technology (ICT)-enabled service innovations need to be human-centered and focused on co-creating value.

    A Selection of Research Data Management Tools Throughout the Data Lifecycle

    This document lists and describes several useful research data management tools for each step of the research data lifecycle.

    Metadata and provenance management

    Scientists today collect, analyze, and generate terabytes and petabytes of data. These data are often shared, further processed, and analyzed among collaborators. To facilitate sharing and interpretation, data need to be accompanied by metadata about how they were collected or generated, and by provenance information about how they were processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes.
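As a minimal sketch of what such provenance might look like in practice (the record fields and the lineage walk are illustrative assumptions, not a scheme from the chapter), each processing step can be logged as a record linking inputs to outputs, so that a dataset's origins can be traced back:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal provenance for one processing step (fields are illustrative)."""
    activity: str   # what was done, e.g. "clean"
    inputs: list    # identifiers of consumed datasets
    outputs: list   # identifiers of produced datasets
    agent: str      # who or what ran the step
    started: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def lineage(records, dataset_id):
    """Walk provenance records backwards to find a dataset's raw sources."""
    for r in records:
        if dataset_id in r.outputs:
            return [src for inp in r.inputs for src in lineage(records, inp)] or r.inputs
    return [dataset_id]  # no producing activity recorded: treat as raw input

# Example: raw.csv -> clean.csv -> model.pkl; the model's lineage is the raw file.
log = [
    ProvenanceRecord("clean", ["raw.csv"], ["clean.csv"], "alice"),
    ProvenanceRecord("train", ["clean.csv"], ["model.pkl"], "pipeline-v2"),
]
print(lineage(log, "model.pkl"))  # ['raw.csv']
```

Capturing records like these at every step is what lets collaborators later answer "where did this result come from and how was it produced?" without re-running the analysis.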