
20 research outputs found

PlantRT : a Distributed Recommendation Tool for Citizen Science

Author: Champ Julien
Joly Alexis
Liroz-Gistau Miguel
Pacitti Esther
Servajean Maximilien
Publication venue: HAL CCSD
Publication date: 14/10/2014

Web 2.0 users are heavy producers of diverse data, which they store in a wide variety of systems. In this work, we focus on the particular case of botanists. Establishing accurate knowledge of the identity, geographic distribution, and evolution of living species is essential for the sustainability of this biodiversity, as well as for the human species. The emergence of citizen science and social networks provides additional tools that foster the creation of large communities of nature observers, which have begun to produce enormous collections of multimedia data. However, the inherent complexity of building these collections causes some distrust among users, who do not wish to store their data on a central server. In this work, we built a multi-site prototype, in which each site can represent 1 to n users, that supports the search and recommendation of diversified plant observations at large scale.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Database replication protocols evaluation: study, implementation and execution of a set of database benchmarks

Author: Liroz Gistau Miguel
Publication date: 01/01/2009

Computer Engineering (Ingeniería en Informática / Informatika Ingeniaritz)

Academica-e

Partitionnement dans les systèmes de gestion de données parallèles

Author: Liroz-Gistau Miguel
Publication venue: HAL CCSD
Publication date: 17/12/2013

In recent years, the volume of data that is captured and generated has exploded. Advances in computer technologies, which provide cheap storage and increased computing capabilities, have allowed organizations to perform complex analyses on this data and to extract valuable knowledge from it. This trend has been very important not only for industry, but has also had a significant impact on science, where enhanced instruments and more complex simulations call for efficient management of huge quantities of data. Parallel computing is a fundamental technique in the management of large quantities of data, as it leverages the concurrent use of multiple computing resources. To take advantage of parallel computing, we need efficient data partitioning techniques, which are in charge of dividing the whole data set and assigning the partitions to the processing nodes. Data partitioning is a complex problem, as it has to consider different and often conflicting issues, such as data locality, load balancing, and maximizing parallelism. In this thesis, we study the problem of data partitioning, particularly in scientific parallel databases that are continuously growing and in the MapReduce framework. In the case of scientific databases, we consider data partitioning in very large databases to which new data is appended continuously, e.g. astronomical applications. Existing approaches are limited, since the complexity of the workload and the continuous appends restrict the applicability of traditional approaches. We propose two partitioning algorithms that dynamically partition new data elements using a technique based on data affinity. Our algorithms enable us to obtain very good data partitions in a low execution time compared to traditional approaches. We also study how to improve the performance of the MapReduce framework using data partitioning techniques. In particular, we are interested in efficient data partitioning of the input datasets to reduce the amount of data that has to be transferred in the shuffle phase. We design and implement a strategy which, by capturing the relationships between input tuples and intermediate keys, obtains an efficient partitioning that can be used to significantly reduce MapReduce's communication overhead.
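
The abstract above mentions dynamic partitioning of newly appended data based on data affinity, without giving the algorithms themselves. The following is only a minimal sketch of that general idea, under stated assumptions: affinity is taken to be the number of already-placed items in a partition that share a query with the new item, and `assign_item`, `partition_stream`, and the `capacity` bound are hypothetical names introduced for illustration, not the thesis's actual method.

```python
def assign_item(item, item_queries, partitions, query_hits, capacity):
    """Place `item` in the partition whose past queries overlap most with it.

    item_queries: set of query ids that access `item` (assumed to be known).
    partitions:   list of lists, the current contents of each partition.
    query_hits:   list of dicts, query id -> number of items in that
                  partition accessed by the query.
    capacity:     maximum items per partition (hypothetical load bound).
    """
    best, best_score = None, None
    for p, contents in enumerate(partitions):
        if len(contents) >= capacity:
            continue  # partition is full; skipping it preserves load balance
        affinity = sum(query_hits[p].get(q, 0) for q in item_queries)
        # Prefer higher affinity; break ties in favour of the emptier partition.
        score = (affinity, -len(contents))
        if best_score is None or score > best_score:
            best, best_score = p, score
    if best is None:
        raise ValueError("all partitions are full")
    partitions[best].append(item)
    for q in item_queries:
        query_hits[best][q] = query_hits[best].get(q, 0) + 1
    return best


def partition_stream(stream, num_partitions, capacity):
    """Assign each (item, queries) pair of an append-only stream in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    query_hits = [dict() for _ in range(num_partitions)]
    for item, queries in stream:
        assign_item(item, set(queries), partitions, query_hits, capacity)
    return partitions
```

Because each new item is placed using only per-partition counters, existing data never has to be repartitioned when new data arrives, which is the property the abstract emphasizes for continuously growing databases.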

Thèses en Ligne

INRIA a CCSD electronic archive server

Data Partitioning in Parallel Data Management Systems

Author: Liroz Gistau Miguel
Publication date: 17/12/2013

In recent years, the volume of data that is captured and generated has exploded. Advances in computer technologies, which provide cheap storage and increased computing capabilities, have allowed organizations to perform complex analyses on this data and to extract valuable knowledge from it. This trend has been very important not only for industry, but has also had a significant impact on science, where enhanced instruments and more complex simulations call for efficient management of huge quantities of data. Parallel computing is a fundamental technique in the management of large quantities of data, as it leverages the concurrent use of multiple computing resources. To take advantage of parallel computing, we need efficient data partitioning techniques, which are in charge of dividing the whole data set and assigning the partitions to the processing nodes. Data partitioning is a complex problem, as it has to consider different and often conflicting issues, such as data locality, load balancing, and maximizing parallelism. In this thesis, we study the problem of data partitioning, particularly in scientific parallel databases that are continuously growing and in the MapReduce framework. In the case of scientific databases, we consider data partitioning in very large databases to which new data is appended continuously, e.g. astronomical applications. Existing approaches are limited, since the complexity of the workload and the continuous appends restrict the applicability of traditional approaches. We propose two partitioning algorithms that dynamically partition new data elements using a technique based on data affinity. Our algorithms enable us to obtain very good data partitions in a low execution time compared to traditional approaches. We also study how to improve the performance of the MapReduce framework using data partitioning techniques. In particular, we are interested in efficient data partitioning of the input datasets to reduce the amount of data that has to be transferred in the shuffle phase. We design and implement a strategy which, by capturing the relationships between input tuples and intermediate keys, obtains an efficient partitioning that can be used to significantly reduce MapReduce's communication overhead.
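
To illustrate why grouping input tuples by the intermediate keys they produce can shrink the shuffle, here is a small sketch under stated assumptions. It is not the strategy implemented in the thesis: `co_locate` is a hypothetical greedy heuristic, `shuffled_pairs` is a hypothetical cost measure that assumes each key is reduced on the chunk holding most of its pairs, and `map_fn` stands for any user map function returning `(key, value)` pairs.

```python
from collections import Counter, defaultdict


def co_locate(tuples, map_fn, num_chunks, chunk_size):
    """Greedily group input tuples that emit the same intermediate keys."""
    chunks = [[] for _ in range(num_chunks)]
    key_counts = [Counter() for _ in range(num_chunks)]
    for t in tuples:
        keys = [k for k, _ in map_fn(t)]
        open_chunks = [c for c in range(num_chunks) if len(chunks[c]) < chunk_size]
        if not open_chunks:
            raise ValueError("not enough chunk capacity")
        # Pick the chunk that already holds the most pairs for these keys;
        # ties go to the emptier chunk, then to the lowest chunk index.
        best = max(
            open_chunks,
            key=lambda c: (sum(key_counts[c][k] for k in keys), -len(chunks[c])),
        )
        chunks[best].append(t)
        key_counts[best].update(keys)
    return chunks


def shuffled_pairs(chunks, map_fn):
    """Count intermediate pairs that would have to leave their chunk."""
    per_key = defaultdict(Counter)
    for c, chunk in enumerate(chunks):
        for t in chunk:
            for k, _ in map_fn(t):
                per_key[k][c] += 1
    # For every key, all pairs outside its majority chunk must be transferred.
    return sum(sum(cnt.values()) - max(cnt.values()) for cnt in per_key.values())
```

Comparing `shuffled_pairs` for a naive chunking and for the output of `co_locate` on the same input gives a rough feel for the communication saved by a key-aware input partitioning.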

Theses.fr

Implementación de TCP-W para evaluar el rendimiento de protocolos de replicación en una arquitectura Middleware

Author: Liroz Gistau Miguel
Publication date: 01/01/2006

Industrial Technical Engineering (Ingeniería Técnica Industrial / Industria Ingeniaritza Tekniko)

Academica-e

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

Author: Akbarinia Reza
Liroz-Gistau Miguel
Valduriez Patrick
Publication venue: 'VLDB Endowment'
Publication date: 01/01/2015

Big data parallel frameworks, such as MapReduce or Spark, have been praised for their high scalability and performance, but perform poorly in the case of data skew. There are important cases where a high percentage of the processing on the reduce side ends up being done by only one node. In this demonstration, we illustrate the use of FP-Hadoop, a system that efficiently deals with data skew in MapReduce jobs. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel, using a scheduling strategy. Within the IR phase, even if all intermediate values belong to only one key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieve excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time. During our demonstration, we give users the possibility to execute and compare job executions in FP-Hadoop and Hadoop. They can retrieve general information about the job and its tasks, as well as a summary of the phases. They can also visually compare different configurations to explore the differences between the approaches.

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

Partitionnement dans les systèmes de gestion de données parallèles

Author: LIROZ GISTAU Miguel
PACITTI-VALDURIEZ Esther
VALDURIEZ Patrick
Publication date: 01/01/2013

OpenGrey Repository

FP-Hadoop

Author: Akbarinia Reza
Liroz-Gistau Miguel
Valduriez Patrick
Publication venue: HAL CCSD
Publication date: 08/04/2019

FP-Hadoop makes the reduce side of Hadoop MapReduce more parallel and efficiently deals with the problem of data skew on the reduce side. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel. Our experiments with FP-Hadoop on synthetic and real benchmarks have shown excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.

INRIA a CCSD electronic archive server

HAL-Rennes 1

An Efficient Solution for Processing Skewed MapReduce Jobs

Author: Agrawal Divyakant
Akbarinia Reza
Liroz-Gistau Miguel
Valduriez Patrick
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized on some points, in particular its poor performance in the case of data skew. There are important cases where a high percentage of the processing on the reduce side is done by a few nodes, or even one node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, there is no proposed solution for the cases where most of the intermediate values correspond to a single key, or where the number of keys is less than the number of reduce workers. In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and efficiently deals with the problem of data skew on the reduce side. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel, using a scheduling strategy. With the IR phase, even if all intermediate values belong to only one key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.
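
The intermediate reduce idea described above can be pictured with a small sketch. This is not FP-Hadoop's code or API: `make_blocks`, `intermediate_reduce`, and `final_reduce` are hypothetical names, a thread pool merely stands in for the intermediate reduce workers, and the sketch assumes a non-empty value list and a `combine` function that is associative, so that partial results can be merged afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def make_blocks(values, block_size):
    """Split the intermediate values of one key into fixed-size blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def intermediate_reduce(key, values, combine, block_size, workers):
    """Reduce each block independently, so one hot key can use several workers."""
    blocks = make_blocks(values, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda block: reduce(combine, block), blocks))
    return key, partials


def final_reduce(key, partials, combine):
    """Merge the partial results into the final value for the key."""
    return key, reduce(combine, partials)


if __name__ == "__main__":
    # A single skewed key holding all values, summed in four blocks of 25.
    key, partials = intermediate_reduce(
        "hot", list(range(1, 101)), add, block_size=25, workers=4
    )
    print(final_reduce(key, partials, add))
```

In a plain reduce phase, all values of the key `"hot"` would be handled by one reducer; here only the small list of partial results is left for the final step.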

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

Isolation Levels for Data Sharing in Large-Scale Scientific Workflows

Author: Bouziane Hinde Lilia
Liroz-Gistau Miguel
Pacitti Esther
Publication venue: HAL CCSD
Publication date: 25/05/2011

Scientists can benefit from Grid and Cloud infrastructures to meet the increasing need to share scientific data and execute data-intensive workflows at a large scale. However, these workflows pose increasingly challenging problems for the automation of data management during execution. Existing workflow management systems focus on how data is stored and transferred, and on data provenance. However, they do not manage isolation during the execution of tasks, from the same or different workflows, that read or update shared data. In this scope, we propose three isolation levels that take into account data provenance and multiversioning. To the best of our knowledge, this is the first proposal in such a context.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot