24 research outputs found

    Dynamic Transfer Scheduling in MapReduce under Bandwidth Constraints

    National audience. Many scientific fields now face a deluge of data. One approach proposed to process such volumes is the MapReduce programming paradigm introduced by Google. This very simple execution scheme consists of two phases, map and reduce, between which a massive data exchange takes place among the machines running the application. In this article, we propose a linear system that defines a partitioning of the data to be processed, together with a dynamic transfer-scheduling algorithm, in order to optimize this intermediate phase. We compare this approach to one based on a linear program and a static, phase-by-phase schedule. Our experiments show that our approach produces more compact schedules in far less time.
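    The two-phase execution scheme described above can be sketched in a few lines. The following is a generic toy word-count illustration (not the paper's linear-system formulation), with the intermediate exchange made explicit as a partitioning step:

```python
from collections import defaultdict

# Toy illustration of the MapReduce execution scheme: map, an
# intermediate exchange, then reduce. On a real cluster the shuffle
# step is a massive all-to-all data transfer between machines.

def map_phase(documents):
    # Each mapper emits (key, value) pairs.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs, n_reducers):
    # Route every pair to a reducer by key hash.
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        partitions[hash(key) % n_reducers][key].append(value)
    return partitions

def reduce_phase(partitions):
    # Each reducer aggregates the values of the keys it owns.
    result = {}
    for part in partitions:
        for key, values in part.items():
            result[key] = sum(values)
    return result

docs = ["map reduce map", "reduce shuffle"]
counts = reduce_phase(shuffle(map_phase(docs), n_reducers=2))
print(counts)  # {'map': 2, 'reduce': 2, 'shuffle': 1}
```

    The partitioning chosen in `shuffle` determines which machine receives which data, which is exactly the degree of freedom the article's linear system exploits.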

    Dynamic Scheduling of MapReduce Shuffle under Bandwidth Constraints

    Whether it is for e-science or for business, the amount of data produced every year is growing at a high rate. Managing and processing those data raises new challenges. MapReduce is one answer to the need for scalable tools able to handle such volumes. It imposes a general structure on the computation and lets the implementation perform its own optimizations. During the computation there is a phase, called Shuffle, in which every node sends a possibly large amount of data to every other node. This report proposes and evaluates six algorithms to improve data transfers during the Shuffle phase under bandwidth constraints.
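    To make the Shuffle scheduling problem concrete, here is a classic round-robin scheme (an illustrative sketch, not necessarily one of the report's six algorithms): under a one-port model, where each node sends to at most one peer and receives from at most one peer at a time, an all-to-all exchange among n nodes fits in n-1 steps:

```python
# Illustrative one-port Shuffle schedule: at each step, node i sends to
# node (i + shift) mod n, so every step is a permutation and the full
# all-to-all exchange completes in n - 1 steps.

def round_robin_shuffle(n):
    """Return a list of steps; each step is a list of (sender, receiver) pairs."""
    steps = []
    for shift in range(1, n):
        steps.append([(i, (i + shift) % n) for i in range(n)])
    return steps

schedule = round_robin_shuffle(4)
for step in schedule:
    senders = [s for s, _ in step]
    receivers = [r for _, r in step]
    # One-port constraint: no node appears twice as sender or receiver.
    assert len(set(senders)) == len(senders)
    assert len(set(receivers)) == len(receivers)
print(len(schedule))  # 3 steps for 4 nodes
```

    Such a static phase-based schedule is only optimal when all transfers have equal size; with heterogeneous volumes or bandwidth constraints, smarter schedules become worthwhile, which is the setting this report studies.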

    Scheduling/Data Management Heuristics

    Deliverable D3.1 of the MapReduce ANR project. The volume of data produced by scientific applications increases at a high speed; some applications are expected to produce several petabytes per year. To process this amount of data, the computing power of several hundreds or thousands of machines has to be used at the same time. In this regard, one of the biggest challenges is: how do we program these machines so that they collaborate on the same computation? One answer, brought by Google, is the MapReduce paradigm. MapReduce has the advantage of being quite simple for the user to program, and it handles on its own the repetitive or complex tasks such as data transfers between nodes, task scheduling, and node failures. These automatic tasks have to be handled in an optimized way to make the framework fast and scalable. This report presents our first studies towards an efficient scheduling of MapReduce operations. More specifically, we focus on scheduling the data transfers together with the tasks. We present related work on this topic and our algorithm, which improves on its results.
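    As a small, hypothetical illustration of the kind of data-management heuristic this deliverable concerns (the function and policy below are assumptions for illustration, not the report's algorithm): a scheduler can co-locate map tasks with the nodes that already store their input blocks, avoiding transfers entirely:

```python
# Hypothetical locality-aware task placement: prefer a free node that
# already holds the task's input block; otherwise fall back to any free
# node (one task per node, for simplicity).

def schedule_tasks(tasks, block_locations, free_nodes):
    """tasks: list of block ids; block_locations: block id -> set of node ids."""
    assignment = {}
    free = set(free_nodes)
    for block in tasks:
        local = block_locations.get(block, set()) & free
        node = min(local) if local else min(free)
        assignment[block] = node
        free.remove(node)
    return assignment

locations = {"b1": {0, 1}, "b2": {2}, "b3": {1}}
print(schedule_tasks(["b1", "b2", "b3"], locations, [0, 1, 2]))
# {'b1': 0, 'b2': 2, 'b3': 1} -- every task runs where its data lives
```

    Scheduling transfers *together with* tasks, as the report proposes, goes further: when locality cannot be achieved, the placement decision should also account for when the required transfer can be scheduled on the network.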

    Enabling near-atomic-scale analysis of frozen water

    Transmission electron microscopy has undergone a revolution in recent years with the possibility to perform routine cryo-imaging of biological materials and (bio)chemical systems, as well as the possibility to image liquids via dedicated reaction cells or graphene-sandwiching. These approaches, however, typically require imaging a large number of specimens and reconstructing an average representation, and they often lack analytical capabilities. Here, using atom probe tomography, we provide atom-by-atom analyses of frozen liquids and analytical sub-nanometre three-dimensional reconstructions. The analyzed ice is in contact with, and embedded within, nanoporous gold (NPG). We report the first such data on 2-3-micron-thick layers of ice formed from both high-purity deuterated water and a solution of 50 mM NaCl in high-purity deuterated water. We present a specimen preparation strategy that uses an NPG film and, additionally, we report on an analysis of the interface between nanoporous gold and the frozen salt-water solution, with an apparent trend in the Na and Cl concentrations across the interface. We explore a range of experimental parameters to show that atom probe analyses of bulk aqueous specimens come with their own special challenges, and we discuss physical processes that may produce the observed phenomena. Our study demonstrates the viability of using frozen water as a carrier for near-atomic-scale analysis of objects in solution by atom probe tomography.

    COVID-19 symptoms at hospital admission vary with age and sex: results from the ISARIC prospective multinational observational study

    Background: The ISARIC prospective multinational observational study is the largest cohort of hospitalized patients with COVID-19. We present relationships of age, sex, and nationality to presenting symptoms. Methods: International, prospective observational study of 60 109 hospitalized symptomatic patients with laboratory-confirmed COVID-19 recruited from 43 countries between 30 January and 3 August 2020. Logistic regression was performed to evaluate relationships of age and sex to published COVID-19 case definitions and the most commonly reported symptoms. Results: ‘Typical’ symptoms of fever (69%), cough (68%) and shortness of breath (66%) were the most commonly reported; 92% of patients experienced at least one of these. Prevalence of typical symptoms was greatest in 30- to 60-year-olds (respectively 80, 79, 69%; at least one 95%). They were reported less frequently in children (≀ 18 years: 69, 48, 23; 85%), older adults (≄ 70 years: 61, 62, 65; 90%), and women (66, 66, 64; 90%; vs. men 71, 70, 67; 93%, each P < 0.001). The most common atypical presentations under 60 years of age were nausea and vomiting and abdominal pain; over 60 years, the most common was confusion. Regression models showed significant differences in symptoms with sex, age and country. Interpretation: This international collaboration has allowed us to report reliable symptom data from the largest cohort of patients admitted to hospital with COVID-19. Adults over 60 and children admitted to hospital with COVID-19 are less likely to present with typical symptoms. Nausea and vomiting are common atypical presentations under 30 years. Confusion is a frequent atypical presentation of COVID-19 in adults over 60 years. Women are less likely to experience typical symptoms than men.

    Improving MapReduce Performance on Compute Clusters

    Nowadays, more and more scientific fields rely on data mining to produce new results. These raw data are produced at an increasing rate by instruments such as DNA sequencers in biology, the Large Hadron Collider (LHC) in physics, which produced 25 petabytes per year as of 2012, or the Large Synoptic Survey Telescope (LSST), which should produce 30 petabytes of data per night. High-resolution scanners in medical imaging and social networks also produce huge amounts of data. This data deluge raises several challenges in terms of storage and computer processing. Google proposed in 2004 to use the MapReduce model in order to distribute the computation across several computers. This thesis focuses mainly on improving the performance of a MapReduce environment. In order to easily replace the software parts needed to improve performance, a modular and adaptable design of the MapReduce environment is necessary; this is why a component-based approach is studied for designing such a programming environment. To study the performance of a MapReduce application, the platform, the application, and their performance must be modeled. These models should be precise enough for the algorithms using them to produce meaningful results, but also simple enough to be analyzed. A state of the art of existing models is presented and a new model adapted to our needs is defined. To optimize a MapReduce environment, the first approach studied is a global optimization, which results in computation times reduced by up to 47%. The second approach focuses on the shuffle phase of MapReduce, in which every node may send data to every other node. Several algorithms are defined and studied for the case where the network is the bottleneck of the data transfers.
These algorithms are tested on the Grid'5000 experimental platform and usually exhibit behavior close to the lower bound, while the trivial approach is far from it.
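    The lower bound mentioned above can be sketched under a simple assumed model (uniform per-node bandwidth B; not necessarily the thesis's exact formulation): no shuffle schedule can finish before the most loaded node has pushed out, or pulled in, all of its data.

```python
# Illustrative shuffle lower bound under an assumed uniform-bandwidth
# model: T_lb = max_i max(bytes sent by i, bytes received by i) / B.

def shuffle_lower_bound(volumes, bandwidth):
    """volumes[i][j] = bytes node i must send to node j."""
    n = len(volumes)
    sent = [sum(volumes[i]) for i in range(n)]
    received = [sum(volumes[i][j] for i in range(n)) for j in range(n)]
    return max(max(sent), max(received)) / bandwidth

# 3 nodes; node 0 must send 10 bytes to each peer and is the bottleneck.
v = [[0, 10, 10],
     [5, 0, 5],
     [5, 5, 0]]
print(shuffle_lower_bound(v, bandwidth=10.0))  # 2.0
```

    A scheduling algorithm whose measured completion time stays close to this quantity, as reported for the experiments on Grid'5000, leaves little room for further improvement under that model.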

    Towards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures

    International audience. As Map-Reduce emerges as a leading programming paradigm for data-intensive computing, today's frameworks that support it still have substantial shortcomings that limit its potential scalability. In this paper we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing Map-Reduce frameworks, in order to achieve scalable, concurrency-optimized, fault-tolerant Map-Reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.