
    E-Biothon: an experimental platform for BioInformatics

    The E-Biothon platform is an experimental Cloud platform designed to help speed up and advance research in biology, health, and the environment. It is based on a Blue Gene/P system and a web portal that allows members of the bioinformatics community to easily launch their scientific applications. In this paper, we describe the technical capacities of the platform, the applications it supports, and a set of user experiences on the platform.

    Synchronized navigation and comparative analyses across Ensembl complete bacterial genomes with INSYGHT.

    Motivation: High-throughput sequencing technologies provide access to an increasing number of bacterial genomes. Today, many analyses involve comparing biological properties among many strains of a given species, or among species of a particular genus. Tools that can help the microbiologist with these tasks are becoming increasingly important. Results: Insyght is a comparative visualization tool whose core features combine synchronized navigation across the genomic data of multiple organisms with versatile interoperability between complementary views. In this work, we have greatly increased the scope of the Insyght public dataset by including the 2,688 complete bacterial genomes available in Ensembl, thus vastly improving its phylogenetic coverage. We also report the development of a virtual machine that allows users to easily set up and customize their own local Insyght server.

    A geometric view of Biodiversity: scaling to metagenomics

    We have designed a new, efficient dimensionality reduction algorithm in order to investigate new ways of accurately characterizing biodiversity, namely from a geometric point of view, scaling to the large environmental datasets produced by NGS (currently ~10^5 reads). The approach is based on Multidimensional Scaling (MDS), which maps a set of n items into a low-dimensional Euclidean space given their pairwise distances. We compute all pairwise distances between the reads of a given sample, run MDS on the distance matrix, and analyze the projections on the first axes with visualization tools. We have addressed the quadratic complexity of computing the pairwise distances by implementing the computation on a hyperparallel machine at a national computing center (Turing, an IBM Blue Gene/Q), and the cubic complexity of the spectral decomposition within MDS by implementing an algorithm based on dense random projections. We have applied this data analysis scheme to a set of ~10^5 reads, which are amplicons of a diatom environmental sample from Lake Geneva. Analyzing the shape of the resulting point cloud paves the way for a geometric analysis of biodiversity, and for rigorously building OTUs (Operational Taxonomic Units) when the dataset is too large for unsupervised, hierarchical, high-dimensional clustering.
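
    The pipeline above has two bottlenecks: the quadratic cost of the pairwise distances and the cubic cost of the spectral step of MDS. The sketch below illustrates classical (Torgerson) MDS on a toy distance matrix; it is a minimal illustration of the technique under stated assumptions, not the authors' code, and it uses the dense O(n^3) eigendecomposition that the abstract says is replaced by a random-projection-based algorithm at scale. The function name classical_mds is illustrative.

```python
# Minimal sketch of classical (Torgerson) MDS, as used in the pipeline above.
# The dense eigendecomposition below is the O(n^3) step that the paper
# replaces with a random-projection-based algorithm for large n.
import numpy as np

def classical_mds(D, k=3):
    """Embed n items into k dimensions from an n x n pairwise-distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # dense spectral decomposition
    idx = np.argsort(eigvals)[::-1][:k]      # keep the top-k eigenvalues
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))
    return eigvecs[:, idx] * scale           # n x k coordinates

# Toy usage: 4 points on a line, D[i, j] = |i - j|; MDS recovers
# centered 1-D coordinates (up to sign): [-1.5, -0.5, 0.5, 1.5].
D = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)
print(np.round(classical_mds(D, k=1), 2))
```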

    diagno-syst: a tool for accurate inventories in metabarcoding

    Metabarcoding on amplicons is rapidly expanding as a method to produce molecular-based inventories of microbial communities. Here, we work on freshwater diatoms, microalgae that can be inventoried both on a morphological and on a molecular basis. We have developed an algorithm, implemented in a program called diagno-syst and based on the notion of informative read, which carries out supervised clustering of reads by mapping them exactly, one by one, onto all reads of a well-curated and taxonomically annotated reference database. The program has been run on an HPC (and HTC) infrastructure to handle the computational load. We compare optical and molecular-based inventories on 10 samples from Lake Léman and 30 from Swedish rivers. We track all possible mismatches between the two approaches, and compare the results with standard heuristic pipelines such as Mothur. We find that the comparison with optics is more accurate when using exact calculations, at the price of a heavier computational load. This is crucial when studying the long tail of biodiversity, which may be overestimated by pipelines or algorithms that use heuristics instead (more false positives). This work supports the view that these methods will benefit from progress in, first, building an agreement between molecular-based and morphology-based systematics and, second, having reference databases that are as complete as possible and publicly available.
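
    To make the idea of supervised clustering by exact comparison concrete, here is a minimal sketch: each read is compared, with an exact edit-distance computation rather than a heuristic, against every read of an annotated reference database, and it is assigned a taxon only when all of its close reference matches agree. The function names, the toy reference, and the distance threshold are illustrative assumptions, not diagno-syst's actual interface.

```python
# Minimal sketch of supervised, exact-comparison taxon assignment in the
# spirit of the "informative read" notion described above. Names and the
# max_dist threshold are hypothetical, not diagno-syst's real interface.

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance (exact, no heuristics)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def assign(read, reference, max_dist=2):
    """Assign a taxon only if every reference read within max_dist of
    `read` carries the same annotation (the read is "informative");
    otherwise report the read as ambiguous or unassigned."""
    taxa = {taxon for seq, taxon in reference
            if edit_distance(read, seq) <= max_dist}
    if not taxa:
        return "unassigned"
    return taxa.pop() if len(taxa) == 1 else "ambiguous"

# Toy reference: (sequence, taxon) pairs from a curated, annotated database.
reference = [("ACGTACGT", "Navicula"),
             ("ACGTACGA", "Navicula"),
             ("TTTTACGT", "Nitzschia")]
print(assign("ACGTACGT", reference))   # -> Navicula (all close hits agree)
```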