40 research outputs found

    Bringing scientific workflows to Amazon SWF

    Get PDF
    In response to the ever-increasing needs of scientific applications for resources, Cloud computing emerged as an alternative on-demand and cost-effective resource provisioning approach. In this context, Cloud providers have recognised the importance of workflow applications to science and provide their own native solutions, such as the Amazon Simple Workflow Service (SWF). Nevertheless, an important downside of SWF is its incompatibility with existing workflow systems, and lack of means for reusing scientific legacy code. Similarly, existing workflow middlewares and applications require non-trivial extensions to take advantage of Cloud resources. We present in this paper a software engineering solution that allows the scientific workflow community access the Amazon Cloud through one single front-end converter, and propose a legacy wrapper service for executing legacy code using SWF. Empirical results using a real-world scientific workflow demonstrate that our automatically generated SWF application performs almost as fast as a native manually-optimised version, and outperforms other workflow middleware systems using the Amazon Cloud.(VLID)2199150Accepted versio

    Semantic scheduling of virtualized infrastructures for scientific workflows

    Get PDF
    Virtualized Infrastructures are a promising way for providing flexible and dynamic computing solutions for resourceconsuming tasks. Scientific Workflows are one of these kind of tasks, as they need a large amount of computational resources during certain periods of time. To provide the best infrastructure configuration for a workflow it is necessary to explore as many providers as possible taking into account different criteria like Quality of Service, pricing, response time, network latency, etc. Moreover, each one of these new resources must be tuned to provide the tools and dependencies required by each of the steps of the workflow. Working with different infrastructure providers, either public or private using their own concepts and terms, and with a set of heterogeneous applications requires a framework for integrating all the information about these elements. This work proposes semantic technologies for describing and integrating all the information about the different components of the overall system and a set of policies created by the user. Based on this information a scheduling process will be performed to generate an infrastructure configuration defining the set of virtual machines that must be run and the tools that must be deployed on them

    Interacting with scientific workflows

    Get PDF

    BlogForever: D3.1 Preservation Strategy Report

    Get PDF
    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what it is exactly that we are trying to preserve. We further present a review of past and present work and highlight why current practices in web archiving do not address the needs of weblog preservation adequately. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards in developing a weblog repository, b) we provide an automated approach to identifying significant properties of weblog content that uses the notion of communities and how this affects previous strategies, c) we propose a sustainability plan that draws upon community knowledge through innovative repository design

    Contribution à la convergence d'infrastructure entre le calcul haute performance et le traitement de données à large échelle

    Get PDF
    The amount of produced data, either in the scientific community or the commercialworld, is constantly growing. The field of Big Data has emerged to handle largeamounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of computeintensive workloads. However, the HPC community is also facing an increasingneed to process large amounts of data derived from high definition sensors andlarge physics apparati. The convergence of the two fields -HPC and Big Data- iscurrently taking place. In fact, the HPC community already uses Big Data tools,which are not always integrated correctly, especially at the level of the file systemand the Resource and Job Management System (RJMS).In order to understand how we can leverage HPC clusters for Big Data usage, andwhat are the challenges for the HPC infrastructures, we have studied multipleaspects of the convergence: We initially provide a survey on the software provisioning methods, with a focus on data-intensive applications. We contribute a newRJMS collaboration technique called BeBiDa which is based on 50 lines of codewhereas similar solutions use at least 1000 times more. We evaluate this mechanism on real conditions and in simulated environment with our simulator Batsim.Furthermore, we provide extensions to Batsim to support I/O, and showcase thedevelopments of a generic file system model along with a Big Data applicationmodel. This allows us to complement BeBiDa real conditions experiments withsimulations while enabling us to study file system dimensioning and trade-offs.All the experiments and analysis of this work have been done with reproducibilityin mind. Based on this experience, we propose to integrate the developmentworkflow and data analysis in the reproducibility mindset, and give feedback onour experiences with a list of best practices.RésuméLa quantité de données produites, que ce soit dans la communauté scientifiqueou commerciale, est en croissance constante. Le domaine du Big Data a émergéface au traitement de grandes quantités de données sur les infrastructures informatiques distribuées. Les infrastructures de calcul haute performance (HPC) sont traditionnellement utilisées pour l’exécution de charges de travail intensives en calcul. Cependant, la communauté HPC fait également face à un nombre croissant debesoin de traitement de grandes quantités de données dérivées de capteurs hautedéfinition et de grands appareils physique. La convergence des deux domaines-HPC et Big Data- est en cours. En fait, la communauté HPC utilise déjà des outilsBig Data, qui ne sont pas toujours correctement intégrés, en particulier au niveaudu système de fichiers ainsi que du système de gestion des ressources (RJMS).Afin de comprendre comment nous pouvons tirer parti des clusters HPC pourl’utilisation du Big Data, et quels sont les défis pour les infrastructures HPC, nousavons étudié plusieurs aspects de la convergence: nous avons d’abord proposé uneétude sur les méthodes de provisionnement logiciel, en mettant l’accent sur lesapplications utilisant beaucoup de données. Nous contribuons a l’état de l’art avecune nouvelle technique de collaboration entre RJMS appelée BeBiDa basée sur 50lignes de code alors que des solutions similaires en utilisent au moins 1000 fois plus.Nous évaluons ce mécanisme en conditions réelles et en environnement simuléavec notre simulateur Batsim. En outre, nous fournissons des extensions à Batsimpour prendre en charge les entrées/sorties et présentons le développements d’unmodèle de système de fichiers générique accompagné d’un modèle d’applicationBig Data. Cela nous permet de compléter les expériences en conditions réellesde BeBiDa en simulation tout en étudiant le dimensionnement et les différentscompromis autours des systèmes de fichiers.Toutes les expériences et analyses de ce travail ont été effectuées avec la reproductibilité à l’esprit. Sur la base de cette expérience, nous proposons d’intégrerle flux de travail du développement et de l’analyse des données dans l’esprit dela reproductibilité, et de donner un retour sur nos expériences avec une liste debonnes pratiques

    Crowd-supervised training of spoken language systems

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.Cataloged from PDF version of thesis.Includes bibliographical references (p. 155-166).Spoken language systems are often deployed with static speech recognizers. Only rarely are parameters in the underlying language, lexical, or acoustic models updated on-the-fly. In the few instances where parameters are learned in an online fashion, developers traditionally resort to unsupervised training techniques, which are known to be inferior to their supervised counterparts. These realities make the development of spoken language interfaces a difficult and somewhat ad-hoc engineering task, since models for each new domain must be built from scratch or adapted from a previous domain. This thesis explores an alternative approach that makes use of human computation to provide crowd-supervised training for spoken language systems. We explore human-in-the-loop algorithms that leverage the collective intelligence of crowds of non-expert individuals to provide valuable training data at a very low cost for actively deployed spoken language systems. We also show that in some domains the crowd can be incentivized to provide training data for free, as a byproduct of interacting with the system itself. Through the automation of crowdsourcing tasks, we construct and demonstrate organic spoken language systems that grow and improve without the aid of an expert. Techniques that rely on collecting data remotely from non-expert users, however, are subject to the problem of noise. This noise can sometimes be heard in audio collected from poor microphones or muddled acoustic environments. Alternatively, noise can take the form of corrupt data from a worker trying to game the system - for example, a paid worker tasked with transcribing audio may leave transcripts blank in hopes of receiving a speedy payment. We develop strategies to mitigate the effects of noise in crowd-collected data and analyze their efficacy. This research spans a number of different application domains of widely-deployed spoken language interfaces, but maintains the common thread of improving the speech recognizer's underlying models with crowd-supervised training algorithms. We experiment with three central components of a speech recognizer: the language model, the lexicon, and the acoustic model. For each component, we demonstrate the utility of a crowd-supervised training framework. For the language model and lexicon, we explicitly show that this framework can be used hands-free, in two organic spoken language systems.by Ian C. McGraw.Ph.D
    corecore