144 research outputs found

    How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface

    Scientific workflow management systems (SWMSs) and resource managers together ensure that tasks are scheduled on provisioned resources so that all dependencies are obeyed and some optimization goal, such as makespan minimization, is fulfilled. In practice, however, there is no clear separation of scheduling responsibilities between an SWMS and a resource manager, because there exists no agreed-upon separation of concerns between their different components. This has two consequences. First, the lack of a standardized API to exchange scheduling information between SWMSs and resource managers hinders portability. It incurs costly adaptations whenever one component is replaced by another (e.g., one SWMS by another SWMS on the same resource manager). Second, due to overlapping functionalities, current installations often actually have two schedulers, both making partial scheduling decisions under incomplete information, leading to suboptimal workflow scheduling. In this paper, we propose a simple REST interface between SWMSs and resource managers, which allows any SWMS to pass dynamic workflow information to a resource manager, enabling maximally informed scheduling decisions. We provide an exemplary implementation of this API for Nextflow as an SWMS and Kubernetes as a resource manager. Our experiments with nine real-world workflows show that this strategy reduces makespan by up to 25.1%, and by 10.8% on average, compared to the standard Nextflow/Kubernetes configuration. Furthermore, a more widespread implementation of this API would enable leaner code bases, a simpler exchange of components of workflow systems, and a unified place to implement new scheduling algorithms.

    Comment: Paper accepted at the 2023 23rd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid).
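
    The interface proposed in the paper is not reproduced here, but the core idea, an SWMS pushing its dynamic task graph and resource estimates to a scheduling endpoint of the resource manager, can be sketched as follows. The endpoint path, the JSON field names and the task attributes below are illustrative assumptions, not the API defined in the paper.

        # Minimal sketch: an SWMS-side client that POSTs a workflow description
        # (tasks, dependencies, resource estimates) to a hypothetical scheduling
        # endpoint exposed by the resource manager.
        import json
        import urllib.request
        from urllib.error import URLError

        def submit_workflow(base_url: str, workflow: dict) -> None:
            """POST the workflow DAG to the (assumed) /v1/workflows endpoint."""
            req = urllib.request.Request(
                url=f"{base_url}/v1/workflows",
                data=json.dumps(workflow).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            try:
                with urllib.request.urlopen(req, timeout=5) as resp:
                    print("scheduler replied:", resp.status)
            except URLError as exc:
                print("no scheduler endpoint reachable:", exc)

        if __name__ == "__main__":
            # Toy DAG: two preprocessing tasks feeding one aggregation task.
            workflow = {
                "workflow_id": "demo-wf-1",
                "tasks": [
                    {"name": "split_a", "cpus": 2, "mem_mb": 2048, "depends_on": []},
                    {"name": "split_b", "cpus": 2, "mem_mb": 2048, "depends_on": []},
                    {"name": "merge", "cpus": 4, "mem_mb": 8192,
                     "depends_on": ["split_a", "split_b"]},
                ],
            }
            print(json.dumps(workflow, indent=2))
            # submit_workflow("http://localhost:8080", workflow)  # needs a live endpoint

    A scheduler receiving such a payload would know the full dependency structure and resource estimates up front, rather than discovering tasks one by one as they are submitted.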

    High performance computing in the cloud

    In recent years, the interest in both scientific and business workflows has increased. A workflow is composed of a series of tools which should be executed in a predefined order to perform an analysis. Traditionally, these workflows were executed manually, sending the output of one tool to the next one in the analysis process. Many applications for executing workflows automatically have appeared recently; they ease the work of users running their analyses. In addition, from a computational point of view, some workflows require a significant amount of resources. Consequently, workflow execution has moved from single workstations to distributed environments such as Grids or Clouds. Data management and task scheduling are required to execute workflows efficiently in such environments. In this thesis, we propose a cloud-based HPC environment, focusing on task scheduling, resource auto-scaling, data management and simplifying access to the resources with software clients. First, the cloud computing infrastructure is devised, which includes the base software (i.e. OpenStack) plus several additional modules aimed at improving authentication (i.e. LDAP) and data management (i.e. GridFTP, Globus Online and CloudFuse). Second, built on top of the mentioned infrastructure, the TORQUE distributed resource manager and the Maui scheduler have been configured to schedule and distribute tasks to the cloud-based workers. To reduce the number of idle nodes and the incurred cost of the active cloud resources, we also propose a configurable auto-scaling technique, which is able to scale the execution cluster depending on the workload. Additionally, in order to simplify task submission to the TORQUE execution cluster, we have interconnected the Galaxy workflow management system with it, so users benefit from a simple way to execute their tasks. Finally, we conducted an experimental evaluation, composed of a number of different studies with synthetic and real-world applications, to show the behaviour of the auto-scaled execution cluster managed by TORQUE and Maui. All experiments were performed using an OpenStack cloud computing environment, and the benchmarked applications come from a benchmarking suite specially designed for workflow scheduling in cloud computing environments. Cybershake, Ligo and Montage were the synthetic applications selected from the benchmarking suite; GECKO and a GWAS pipeline represent the real-world test use cases, both having a diverse and heterogeneous set of tasks.

    The numerous technological advances in data acquisition techniques allow the massive production of enormous amounts of data in diverse fields such as astronomy, health and social networks. Nowadays, only a small part of this data can be analysed because of the lack of computational resources. High Performance Computing (HPC) strategies represent the only viable choice to analyse such an overwhelming amount of data. However, in general, HPC techniques require the use of big and expensive computing and storage infrastructures, usually not affordable by or available to most users. Cloud computing, where users pay for the resources they need and when they actually need them, appears as an interesting alternative. Besides the savings in hardware infrastructure, cloud computing offers further advantages such as the removal of installation, administration and supply requirements. In addition, it enables users to access better hardware than they could usually afford, to scale resources depending on their needs, and to benefit from greater fault tolerance, among others. The efficient utilisation of HPC resources becomes a fundamental task, particularly in cloud computing. We need to consider the cost of using HPC resources, especially in the case of cloud-based infrastructures, where users have to pay for storing, transferring and analysing data. Therefore, generic task-scheduling and auto-scaling techniques are essential to exploit the computational resources efficiently. It is equally important to make these tasks user-friendly through the development of tools/applications (software clients), which act as an interface between the user and the infrastructure.
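
    A minimal sketch of the kind of queue-driven scaling decision described above: grow the cluster when jobs queue up with no idle workers, shrink it when paid-for nodes sit idle. The thresholds and the stubbed inputs are illustrative assumptions; the thesis integrates this logic with TORQUE/Maui queue state and OpenStack instance management, which are not reproduced here.

        # Minimal sketch of an auto-scaling decision for a cloud execution cluster.
        def decide_scaling(queued_jobs: int, idle_nodes: int, busy_nodes: int,
                           max_nodes: int, min_nodes: int = 1) -> int:
            """Return the number of worker nodes to add (positive) or remove (negative)."""
            total = idle_nodes + busy_nodes
            if queued_jobs > 0 and idle_nodes == 0 and total < max_nodes:
                # Backlog with no free workers: grow, but never past the budget cap.
                return min(queued_jobs, max_nodes - total)
            if queued_jobs == 0 and idle_nodes > 0 and total > min_nodes:
                # Idle, billed instances and an empty queue: shrink.
                return -min(idle_nodes, total - min_nodes)
            return 0

        if __name__ == "__main__":
            # Example: 5 jobs waiting, all 3 running nodes busy, budget allows 8 nodes.
            delta = decide_scaling(queued_jobs=5, idle_nodes=0, busy_nodes=3, max_nodes=8)
            print(f"scale by {delta:+d} node(s)")  # -> scale by +5 node(s)

    In a real deployment the inputs would come from the resource manager's queue status and the scaling action would boot or delete cloud instances, with cool-down periods to avoid oscillation.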

    Visualization and exploration of next-generation proteomics data


    Enriching information extraction pipelines in clinical decision support systems

    Multicentre health studies are important to increase the impact of medical research findings due to the number of subjects that they are able to engage. To simplify the execution of these studies, the data-sharing process should be effortless, for instance, through the use of interoperable databases. However, achieving this interoperability is still an ongoing research topic, namely due to data governance and privacy issues. In the first stage of this work, we propose several methodologies to optimise the harmonisation pipelines of health databases. This work was focused on harmonising heterogeneous data sources into a standard data schema, namely the OMOP CDM, which has been developed and promoted by the OHDSI community. We validated our proposal using data sets of Alzheimer’s disease patients from distinct institutions. In the following stage, aiming to enrich the information stored in OMOP CDM databases, we investigated solutions to extract clinical concepts from unstructured narratives, using information retrieval and natural language processing techniques. The validation was performed through datasets provided in scientific challenges, namely in the National NLP Clinical Challenges (n2c2). In the final stage, we aimed to simplify the protocol execution of multicentre studies by proposing novel solutions for profiling, publishing and facilitating the discovery of databases. Some of the developed solutions are currently being used in three European projects aiming to create federated networks of health databases across Europe.
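
    A minimal sketch of the harmonisation step described above: turning one record from a heterogeneous source into an OMOP-CDM-style condition_occurrence row. The source field names, the tiny local concept lookup and the assumed concept id are illustrative only; a real ETL maps source codes through the OHDSI standardised vocabularies rather than a hand-written dictionary.

        # Minimal sketch of mapping a heterogeneous source record to an OMOP-like row.
        from datetime import date

        # Illustrative, hand-picked lookup; not a real vocabulary export.
        LOCAL_TO_STANDARD_CONCEPT = {
            "ICD10:G30.9": 378419,  # assumed standard concept id for Alzheimer's disease
        }

        def to_condition_occurrence(src: dict) -> dict:
            """Map one source diagnosis record to an OMOP-style condition_occurrence dict."""
            concept_id = LOCAL_TO_STANDARD_CONCEPT.get(src["diagnosis_code"], 0)  # 0 = unmapped
            return {
                "person_id": src["patient_id"],
                "condition_concept_id": concept_id,
                "condition_start_date": date.fromisoformat(src["diag_date"]),
                "condition_source_value": src["diagnosis_code"],
            }

        if __name__ == "__main__":
            source_record = {"patient_id": 42, "diagnosis_code": "ICD10:G30.9",
                             "diag_date": "2019-03-14"}
            print(to_condition_occurrence(source_record))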

    High-Performance Modelling and Simulation for Big Data Applications

    This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications” (cHiPSet) project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. As their level of abstraction rises to give a better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems, to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction between High Performance Computing and Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    Technologies and Applications for Big Data Value

    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications in the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions in major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to the technology frameworks that are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is arranged in two parts. The first part, “Technologies and Methods”, contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part, “Processes and Applications”, details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the nucleus of the European data community, bringing together businesses and leading researchers to harness the value of data for the benefit of society, business, science, and industry. The book is of interest to two primary audiences: first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI; and second, practitioners and industry experts engaged in data-driven systems and software design and deployment projects who are interested in employing these advanced methods to address real-world problems.

    Towards cognitive in-operation network planning

    Next-generation internet services such as live TV and video on demand require high bandwidth and ultra-low latency. The ever-increasing volume, dynamicity and stringent requirements of these services’ demands are generating new challenges for today’s telecom networks. To decrease expenses, service-layer content providers are delivering their content near the end users, thus allowing low-latency, tailored content delivery. As a consequence, previously unseen metro and even core traffic dynamicity is arising, with changes in the volume and direction of the traffic throughout the day. A tremendous effort to efficiently manage networks is currently ongoing towards the realisation of 5G networks. This translates into looking for network architectures that support dynamic resource allocation, fulfil strict service requirements and minimise the total cost of ownership (TCO). In this regard, in-operation network planning was recently proven to successfully support various network reconfiguration use cases in prospective scenarios. Nevertheless, additional research is required to extend in-operation planning capabilities from typical reactive optimization schemes to proactive and predictive schemes based on the analysis of network monitoring data. A hot topic attracting increasing attention is cognitive networking, where elevated knowledge about the network can be obtained by introducing data analytics into the telecom operator’s infrastructure. By using predictive knowledge about the network traffic, in-operation network planning mechanisms could be enhanced to efficiently adapt the network by means of future traffic prediction, thus achieving cognitive in-operation network planning.

    In this thesis, we focus on studying mechanisms to enable cognitive in-operation network planning in core networks. In particular, we focus on dynamically reconfiguring virtual network topologies (VNT) at the MPLS layer, covering a number of detailed objectives. First, we study mechanisms for network traffic flow modelling, from monitoring and data transformation to the estimation of predictive traffic models based on these data. By means of these traffic models, we then tackle a cognitive approach to periodically adapt the core VNT to current and future traffic, using predicted traffic matrices based on origin-destination (OD) predictive models. This optimization approach, named VENTURE, is efficiently solved using dedicated heuristic algorithms and its feasibility is demonstrated in an experimental in-operation network planning environment. We then extend VENTURE to consider core flow dynamicity resulting from metro flow re-routing, which represents a meaningful dynamic traffic scenario. This extension, which entails enhancements to coordinate metro and core network controllers with the aim of allowing fast adaptation of core OD traffic models, is evaluated and validated in terms of traffic model accuracy and experimental feasibility. Finally, we propose two network architectures needed to apply the above mechanisms in experimental environments, using state-of-the-art protocols such as OpenFlow and IPFIX. The methodology used to evaluate this work consists of a first numerical evaluation using a network simulator entirely designed and developed for this thesis; after this simulation-based validation, the experimental feasibility of the proposed network architectures is assessed in a distributed testbed.
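
    A minimal sketch of the predictive side of this approach: fit a simple per-OD-pair trend from monitored samples, produce a one-period-ahead traffic estimate, and flag OD pairs whose predicted demand no longer fits the provisioned virtual-link capacity. The toy figures and the threshold rule are assumptions for illustration; VENTURE’s actual heuristics and the MPLS/VNT reconfiguration itself are not reproduced here.

        # Minimal sketch: one-step-ahead OD traffic prediction and a reconfiguration trigger.
        def linear_forecast(samples: list[float]) -> float:
            """Least-squares line through (0..n-1, samples), evaluated one step ahead."""
            n = len(samples)
            xs = range(n)
            mean_x = sum(xs) / n
            mean_y = sum(samples) / n
            denom = sum((x - mean_x) ** 2 for x in xs)
            slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / denom if denom else 0.0
            return mean_y + slope * (n - mean_x)

        if __name__ == "__main__":
            # Monitored Gb/s per OD pair over the last 5 periods (toy numbers).
            od_history = {("A", "B"): [2.0, 2.4, 2.9, 3.3, 3.8],
                          ("A", "C"): [1.0, 1.0, 0.9, 1.1, 1.0]}
            provisioned = {("A", "B"): 4.0, ("A", "C"): 2.0}  # current virtual-link capacity

            for od, history in od_history.items():
                predicted = linear_forecast(history)
                if predicted > provisioned[od]:
                    print(f"{od}: predicted {predicted:.1f} Gb/s > {provisioned[od]} Gb/s "
                          "-> schedule VNT reconfiguration")
                else:
                    print(f"{od}: predicted {predicted:.1f} Gb/s fits provisioned capacity")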

    Optimisation of the enactment of fine-grained distributed data-intensive work flows

    The emergence of data-intensive science as the fourth science paradigm has posed a data-deluge challenge for enacting scientific work-flows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, besides dealing with the heterogeneity and complexity of data, applications and execution environments. New scientific work-flows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such work-flows not only requires larger storage space and faster machines, but also the capability to support scalability and diversity of the users, applications, data, computing resources and enactment technologies. We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a work-flow language. The work-flow language should be both human readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation: data-flow between computational elements in the scientific work-flow is implemented as streams. To cope with the exploratory nature of scientific work-flows, the architecture should support fast work-flow prototyping and the re-use of work-flows and work-flow components. Above all, the enactment process should be easily repeated and automated.

    In this thesis, we present a candidate data-intensive architecture that includes an intermediate work-flow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise them systematically. We propose a new enactment strategy to demonstrate that optimisation of data-streaming work-flows can be automated by exploiting performance data gathered during previous enactments.
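
    A minimal sketch of the two ideas combined above, the streaming model and fine-grained measurement of enactments, using plain Python generators. DISPEL itself is a work-flow language and is not reproduced here; the processing-element names, the wrapper and the in-memory “performance database” are illustrative assumptions.

        # Minimal sketch: processing elements connected as streams, with per-element
        # counts and timings recorded into a small in-memory "performance database".
        import time

        performance_db = []  # one record per processing element per enactment

        def instrument(name, element):
            """Wrap a generator-based processing element to record items and elapsed time."""
            def wrapped(upstream):
                start, count = time.perf_counter(), 0
                for item in element(upstream):
                    count += 1
                    yield item
                performance_db.append({"element": name, "items": count,
                                       "seconds": time.perf_counter() - start})
            return wrapped

        def source(_):
            yield from range(10)          # stand-in for a live data stream

        def square(upstream):
            for x in upstream:
                yield x * x

        def keep_even(upstream):
            for x in upstream:
                if x % 2 == 0:
                    yield x

        if __name__ == "__main__":
            # Compose the work-flow as a pipeline of streams and drain it.
            pipeline = instrument("keep_even", keep_even)(
                instrument("square", square)(
                    instrument("source", source)(None)))
            print("results:", list(pipeline))
            for record in performance_db:
                print(record)

    Records like these, gathered over repeated enactments, are the kind of performance data an optimiser could exploit when choosing how to deploy each element in later runs.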