    Cookery: A Framework for Creating Data Processing Pipeline Using Online Services

    With the increasing amount of data, the importance of data analysis in various scientific domains has grown. A large share of scientific data has shifted to cloud-based storage, and the cloud offers both storage and computation power. The Cookery framework is a tool developed to build scientific applications using cloud services. In this paper we present the Cookery system and show how it can be used to authenticate against and use standard online third-party services to easily create data analytics pipelines. The Cookery framework is not limited to standard web services; it can also integrate with the emerging AWS Lambda, which is part of a new computing paradigm collectively known as serverless computing. The combination of AWS Lambda and Cookery makes it possible for users in many scientific domains, who do not have any programming experience, to create data processing pipelines using cloud services in a short time.
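    To make the serverless building block concrete, the Python sketch below shows what a single pipeline step backed by AWS Lambda can look like when invoked directly through boto3. It is a minimal illustration only: the function name cookery-demo-step, the S3-style payload fields, and the region are hypothetical, and Cookery itself hides such calls behind its own higher-level abstractions.

        # Illustrative sketch only: one serverless data-processing step invoked via boto3.
        # The function name "cookery-demo-step" and the payload fields are hypothetical.
        import json
        import boto3

        lambda_client = boto3.client("lambda", region_name="eu-west-1")

        def run_step(bucket, key):
            """Invoke a (hypothetical) Lambda function that processes one object stored in S3."""
            response = lambda_client.invoke(
                FunctionName="cookery-demo-step",      # hypothetical function name
                InvocationType="RequestResponse",      # synchronous invocation
                Payload=json.dumps({"bucket": bucket, "key": key}),
            )
            return json.load(response["Payload"])

        if __name__ == "__main__":
            print(run_step("my-input-bucket", "data/sample.csv"))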

    Reference Exascale Architecture (Extended Version)

    While political commitments for building exascale systems have been made, turning these systems into platforms for a wide range of exascale applications faces several technical, organisational, and skills-related challenges. The key technical challenges are related to the availability of data. While the first exascale machines are likely to be built within a single site, the input data is in many cases impossible to store within a single site. Alongside handling extremely large amounts of data, an exascale system has to process data from different sources, support accelerated computing, handle a high volume of requests per day, minimize the size of data flows, and be extensible in terms of continuously increasing data volumes as well as a growing number of parallel requests. These technical challenges are addressed by the general reference exascale architecture. It is divided into three main blocks: a virtualization layer, a distributed virtual file system, and a manager of computing resources. Its main property is modularity, which is achieved by containerization at two levels: 1) application containers - containerization of scientific workflows, and 2) micro-infrastructure - containerization of the service-oriented infrastructure for extremely large data. The paper also presents an instantiation of the reference architecture - the architecture of the PROCESS project (PROviding Computing solutions for ExaScale ChallengeS) - and discusses its relation to the reference exascale architecture. The PROCESS architecture has been used as an exascale platform within various exascale pilot applications. This paper also presents performance modelling of the exascale platform together with its validation.

    PROCESS Data Infrastructure and Data Services

    Due to energy limitations and high operational costs, it is likely that exascale computing will not be achieved by one or two datacentres but will require many more. A simple calculation aggregating the computation power of the 2017 Top500 supercomputers reaches only 418 petaflops. Rescale, a company that claims 1.4 exaflops of peak computing power, describes its infrastructure as composed of 8 million servers spread across 30 datacentres. Any proposed solution to address exascale computing challenges has to take these facts into consideration and should, by design, aim to support the use of geographically distributed and likely independent datacentres. It should also consider, whenever possible, the co-allocation of storage with computation, as it would take 3 years to transfer 1 exabyte over a dedicated 100 Gb Ethernet connection. This means we have to be smart about managing data that is increasingly geographically dispersed and spread across different administrative domains. As the natural setting of the PROCESS project is to operate within the European Research Infrastructure and serve the European research communities facing exascale challenges, it is important that the PROCESS architecture and solutions are well positioned within the European computing and data management landscape, namely PRACE, EGI, and EUDAT. In this paper we propose a scalable and programmable data infrastructure that is easy to deploy and can be tuned to support various data-intensive scientific applications.
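    The transfer-time argument above is easy to verify with a back-of-the-envelope calculation. The short Python check below assumes 1 exabyte = 10^18 bytes and a fully saturated, overhead-free 100 Gb/s link, which gives roughly 2.5 years, consistent with the quoted figure of about 3 years once protocol overhead and imperfect link utilisation are taken into account.

        # Back-of-the-envelope check: transferring 1 exabyte over a dedicated 100 Gb/s link.
        EXABYTE_BITS = 10**18 * 8      # 1 EB = 10**18 bytes = 8 * 10**18 bits
        LINK_RATE_BPS = 100 * 10**9    # 100 Gb/s, assuming full utilisation and no overhead

        seconds = EXABYTE_BITS / LINK_RATE_BPS
        years = seconds / (365 * 24 * 3600)
        print(f"{seconds:.2e} s ~ {years:.1f} years")   # ~2.5 years, i.e. roughly 3 in practice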

    Toward Executable Scientific Publications

    Reproducibility of experiments is considered one of the main principles of the scientific method. Recent developments in data- and computation-intensive science, i.e. e-Science, and the state of the art in Cloud computing provide the necessary components to preserve data sets and to re-run the code and software that create research data. The Executable Paper (EP) concept uses state-of-the-art technology to include data sets, code, and software in the electronic publication such that readers can validate the presented results. In this paper we present how to advance the current state of the art to preserve the data sets, code, and software that create research data, describe the basic components of an execution platform that preserves the long-term compatibility of EPs, and identify a number of issues and challenges in the realization of EPs.

    matchms - processing and similarity evaluation of mass spectrometry data

    Mass spectrometry data is at the heart of numerous applications in the biomedical and life sciences. With the growing use of high-throughput techniques, researchers need to analyze larger and more complex datasets. In particular, through joint efforts in the research community, fragmentation mass spectrometry datasets are growing in size and number. Platforms such as MassBank (Horai et al., 2010), GNPS (Wang et al., 2016) or MetaboLights (Haug et al., 2020) serve as open-access hubs for sharing raw, processed, or annotated fragmentation mass spectrometry data. Without suitable tools, however, exploitation of such datasets remains overly challenging. In particular, large collected datasets contain data acquired using different instruments and measurement conditions, and can further contain a significant fraction of inconsistent, wrongly labeled, or incorrect metadata (annotations).
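    For readers unfamiliar with the library, the sketch below shows a typical matchms workflow: loading spectra from an MGF file, applying the default metadata and intensity filters, and computing pairwise cosine similarities. The input file name spectra.mgf is a placeholder, and the exact return types of the scoring helpers may differ slightly between matchms releases.

        # Minimal matchms sketch: load spectra, clean them, and score pairwise similarity.
        # "spectra.mgf" is a placeholder input file.
        from matchms import calculate_scores
        from matchms.importing import load_from_mgf
        from matchms.filtering import default_filters, normalize_intensities
        from matchms.similarity import CosineGreedy

        # Harmonize metadata and normalize peak intensities for every spectrum.
        spectra = [normalize_intensities(default_filters(s))
                   for s in load_from_mgf("spectra.mgf")]

        # All-vs-all cosine similarity between the cleaned spectra.
        scores = calculate_scores(spectra, spectra, CosineGreedy(), is_symmetric=True)

        # Best matches for the first spectrum; the exact result structure varies by version.
        print(scores.scores_by_query(spectra[0], sort=True)[:5])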

    Scientific Workflows


    EDISON Data Science Framework: Part 4. Data Science Professional profiles (DSP profiles) Release 2

    This document presents the results of the research and development in the EDISON project to define the Data Science Professional profiles (DSPP), which are important for defining Data Scientist roles in an organisation and their alignment with organisational goals and mission. The Data Science Professional profiles are defined in the context of the whole EDISON Data Science Framework. The proposed DSP profiles are defined as an extension to the current ESCO (European Skills, Competences, Qualifications and Occupations) taxonomy and are intended to be proposed for formal inclusion of the new Data Science professions family in a future ESCO taxonomy edition. The proposed DSP profiles, when adopted by the community, will have multiple uses. First of all, they will help organisations to plan their staffing for data-related functions when migrating to an agile, data-driven organisational model. Human Resource (HR) departments can effectively use the DSP profiles for constructing vacancy descriptions and assessing candidates. When used together with CF-DS, the DSP profiles can provide a basis for building an interactive, web-based tool for benchmarking individual competences against selected (or desired) professional profiles, as well as for advising practitioners on their (up/re-)skilling path.

    EDISON Data Science Framework: Part 3. Data Science Model Curriculum (MC-DS) Release 2

    The Data Science Model Curriculum (MC-DS) is part of the EDISON Data Science Framework (EDSF) and a product of the EDISON Project. The MC-DS is built on CF-DS and DS-BoK, where Learning Outcomes are defined based on CF-DS competences and Learning Units are mapped to Knowledge Units in DS-BoK. In turn, Learning Units are defined based on the ACM Computing Classification System (CCS2012) and reflect typical course naming used by universities in their current programmes. The suggested Learning Units are assigned identifying labels marking their relevance to the core Data Science knowledge areas in the form of Tier 1, Tier 2, or Elective courses. Further refinement of the MC-DS will be based on consultation with the university community and with experts in both Data Science and scientific or industry domains. The proposed MC-DS is intended to guide universities and training organisations in constructing Data Science programmes and selecting individual courses that are balanced according to requirements elicited from research and industry domains. The MC-DS can be used for assessing and improving existing Data Science programmes with respect to the knowledge areas and competence groups associated with specific professional profiles. When coupled with individual or group competence benchmarking, the MC-DS can also be used for building individual training curricula and for professional (self-/up-)skilling for effective career management.