28 research outputs found

    QTL mapping in CC mice using qtl2-derived tools (ccqtl)

    CCQTL is a wrapper around the R/qtl2 functions (Broman et al., Genetics 2019, doi:10.1534/genetics.118.301595), with hard-coded parameters tailored for QTL mapping in the Collaborative Cross.
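    The record does not include code; as a rough illustration of the kind of computation that qtl2-derived tools wrap, the sketch below performs a single-phenotype genome scan by Haley-Knott-style regression of a phenotype on founder-haplotype dosages at each marker and reports LOD scores. This is a minimal, hypothetical Python sketch, not the ccqtl or R/qtl2 API; the array shapes, the `dosages` input, and the simulated example are illustrative assumptions.

```python
import numpy as np

def lod_scan(phenotype, dosages):
    """Minimal Haley-Knott-style genome scan (illustrative sketch, not ccqtl/qtl2).

    phenotype: (n_individuals,) trait values
    dosages:   (n_markers, n_individuals, n_founders) expected founder-haplotype dosages
    Returns one LOD score per marker.
    """
    y = np.asarray(phenotype, dtype=float)
    n = y.size
    rss0 = np.sum((y - y.mean()) ** 2)          # null model: intercept only
    lod = np.empty(dosages.shape[0])
    for m in range(dosages.shape[0]):
        # Design matrix at this marker: intercept + founder dosages
        X = np.column_stack([np.ones(n), dosages[m]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss1 = np.sum((y - X @ beta) ** 2)
        # LOD = (n/2) * log10(RSS_null / RSS_alt)
        lod[m] = (n / 2) * np.log10(rss0 / rss1)
    return lod

# Purely simulated example: 200 markers, 50 mice, 8 CC founder haplotypes
rng = np.random.default_rng(0)
dosages = rng.dirichlet(np.ones(8), size=(200, 50))
phenotype = 10.0 * dosages[120, :, 0] + rng.normal(size=50)   # simulated QTL at marker 120
scan = lod_scan(phenotype, dosages)
print(int(np.argmax(scan)))   # the peak LOD should fall at or near the simulated QTL
```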

    On-target activity predictions enable improved CRISPR-dCas9 screens in bacteria

    The ability to block gene expression in bacteria with the catalytically inactive mutant of Cas9, known as dCas9, is quickly becoming a standard methodology to probe gene function, perform high-throughput screens, and engineer cells for desired purposes. Yet, we still lack a good understanding of the design rules that determine on-target activity for dCas9. Taking advantage of high-throughput screening data, we fit a model to predict the ability of dCas9 to block the RNA polymerase based on the target sequence, and validate its performance on independently generated datasets. We further design a novel genome-wide guide RNA library for E. coli MG1655, EcoWG1, using our model to choose guides with high activity while avoiding guides that might be toxic or have off-target effects. A screen performed with the EcoWG1 library during growth in rich medium improved upon previously published screens, demonstrating that very good performance can be attained using only a small number of well-designed guides. Being able to design effective, smaller libraries will help make CRISPRi screens even easier to perform and more cost-effective. Our model and materials are available to the community through crispr.pasteur.fr and Addgene.
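    The published model itself is not reproduced in this abstract; the sketch below only illustrates the general approach of fitting a sequence-based activity predictor from screening data: one-hot-encode guide target sequences, fit a regularized linear model against measured activity scores, and evaluate on held-out guides. The function names, the 20-nt length, and the choice of ridge regression are assumptions made for illustration, not the authors' method.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA sequence into a flat position-by-base vector."""
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, BASES.index(b)] = 1.0
    return x.ravel()

# Hypothetical training data: 20-nt target sequences with activity scores from a screen
rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list(BASES), size=20)) for _ in range(500)]
activity = rng.normal(size=500)   # placeholder for screen-derived activity measurements

X = np.array([one_hot(s) for s in seqs])
X_tr, X_te, y_tr, y_te = train_test_split(X, activity, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_tr, y_tr)          # regularized linear sequence model
print("held-out R^2:", model.score(X_te, y_te))   # meaningless here: labels are random placeholders

# Guides for a new library could then be ranked by model.predict(...) on candidate targets.
```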

    PPanGGOLiN: Depicting microbial diversity via a Partitioned Pangenome Graph

    Motivations: By collecting and comparing all the genomic sequences of a species, pangenomics studies focus on overall genomic content to understand genome evolution in terms of both core and accessory parts (Tettelin et al. 2005). The core genome is defined as the set of genes shared by all the organisms of a taxonomic unit (generally a species), whereas the accessory part (also named variable or peripheral regions) is crucial to understand the adaptive potential of bacteria and contains genomic regions that can be exchanged between strains by horizontal transfer (i.e. the mobilome, Frost et al. 2005). Core genes are most often defined as the set of ubiquitous genes in a clade (Tettelin et al. 2005; Vieira et al. 2011). However, this definition has two major flaws: (i) it is not robust against poorly sampled data, because it is highly reliant on the presence or absence of a single organism; (ii) it misses many core genes (false negatives) because of the high probability of losing at least one core gene to sequencing, assembly or annotation artifacts. The potential presence in the dataset of variants missing a gene because the associated function is provided by other members of the community (see the Black Queen Hypothesis, Morris et al. 2012) can also shrink the core genome. As pointed out by Acevedo-Rocha et al. (2013), "functional ubiquity cannot be equated to sequence/structural ubiquity"; the core genome definition has thus been criticized as too conservative to be useful (Tonder et al. 2014). As a consequence, Acevedo-Rocha et al. (2013) propose to focus instead on "persistent" genes, namely genes that are conserved in a majority of genomes. Equivalent terms to 'persistent' have also been introduced, such as 'soft core' (Contreras-Moreira et al. 2013), 'extended core' (Lapierre et al. 2009, Bolotin et al. 2017) and 'stabilome' (Vesth et al. 2010). This definition advocates the use of a threshold on the frequency of occurrence of a gene within a clade, above which the gene is declared persistent (generally gene families present in at least 90% to 95% of genomes). This approach gives an attractive answer to the issues raised by the original definition of the core genome, but has its own disadvantage: choosing an appropriate threshold. Moreover, the usual dichotomy between core and accessory genome does not faithfully reflect the diverse range of gene frequencies in a pangenome. The gene frequency distribution in pangenomes is extensively documented (Lapierre et al. 2009, Collins et al. 2012, Lobkovsky et al. 2013, Bolotin et al. 2015, Bolotin et al. 2017). These studies argue for the existence of an equilibrium between gene acquisition and gene loss, leading to an asymmetric U-shaped distribution of gene frequencies regardless of the phylogenetic level and the clade considered (with the exception of non-homogeneous species (Moldovan et al. 2018)). The left, bottom and right sides of the U correspond respectively to rare, moderately present and highly frequent gene families. Thereby, as proposed by Koonin (2008) and formally modeled by Collins et al. (2012), the pangenome can be split into three groups. This choice helps shed light on genes putatively associated with positive environmental adaptations while avoiding confounding them with potentially randomly acquired ones. For that purpose, the partitioning approach that we propose here divides the pangenome into (1) the persistent genome, equivalent to a relaxed core genome (genes conserved in almost all genomes); (2) the shell genome, genes with intermediate frequencies corresponding to moderately conserved genes potentially associated with environmental adaptation capabilities; and (3) the cloud genome, genes found at very low frequency. We tackle this challenge in the present work by first proposing a method to select this threshold automatically. Beyond the partitioning approach, technological shifts in sequencing methods have made thousands of strain genomes available in databases for numerous bacterial species. Processing so many genomes poses a critical computational problem, because it is no longer possible to handle comparative genomics studies as in the 1990s, even with modern computing facilities. For instance, studying patterns of gene gain and loss in the evolution of a lineage is a basic question in comparative genomics, but this task becomes tremendously harder when thousands of genomes have to be analyzed. Nevertheless, the information encoded in these genomes is highly redundant, making it possible to design new, compact ways of representing and manipulating it. As suggested by Chan et al. (2015) and Marshall et al. (2016), a consensus representation of multiple genomes would provide a better analytical framework than using individual reference genomes. This proposition leads to a paradigm shift from the usual linear representation of reference genomes to a representation as pangenome graphs, bringing together all the different known variations as multiple alternative paths. Some approaches have been developed aiming at factorizing pangenomes at the sequence level (PanCake: Ernst et al. 2013; SplitMEM: Marcus et al. 2014). However, these approaches lack direct information about genes, complicating functional analyses based on the graph. Here, we introduce an extension of the concept of pangenome graph, giving it a formal mathematical representation using a graph model in which nodes and edges represent gene families and chromosomal neighborhood information, respectively. The method introduced here can be considered as the missing link between the usual pangenomics approach (a set of unlinked gene families) and the pangenome graph at the sequence level. A detailed comparison of these two approaches is reviewed in Zekic et al. (2018). Coupled with our partitioning method, this representation could become a new standard to depict all the genomic combinations of a bacterial species in a single figure.
    Overview of the method: First, the genomes of the same species (or species cluster) are annotated, and homologous genes are then clustered into gene families via an all-vs-all protein alignment. From these data, the PPanGGOLiN method merges the chromosomal links between neighboring genes to build a graph of the neighborhood between gene families, weighted by the number of genomes covering each edge. In parallel, the pangenome is modeled as a binary presence/absence matrix in which rows correspond to gene families and columns to organisms (1 if at least one gene belonging to the gene family is present, 0 otherwise). The pangenome is then partitioned into the persistent, shell and cloud partitions by evaluating, through an Expectation-Maximisation algorithm, the best parameters of a Bernoulli Mixture Model (BMM) smoothed using a Markov Random Field (MRF) (Ambroise et al. 1997). For each partition, the BMM is composed of one mean presence/absence vector (expected to be (11...11) for the persistent, (00...00) for the cloud and diversified for the shell) associated with a dispersion vector around that mean (low dispersion for the persistent and the cloud; high dispersion for the shell). Once the parameters are estimated, each gene family is assigned to its closest partition according to its mean vector. As core gene families are known to share conserved genomic organizations along genomes (Fang et al. 2008), the MRF imposes that two neighboring gene families are more likely to belong to the same partition. The MRF therefore penalizes partition assignments that disagree with the partitions of a family's neighbors in the graph (edge weights are taken into account in the process). The algorithm iterates between the BMM and the MRF until the overall likelihood is maximized. The strength of the topological smoothing is managed via a parameter called β (if β = 0, the smoothing is disabled and the partitioning relies only on the presence/absence matrix). At the end, the partitions are overlaid on the neighborhood graph in order to obtain what we call the Partitioned Pangenome Graph. Thanks to this graphical structure and the associated statistical model, the pangenome is resilient to randomly distributed errors (e.g. an assembly gap in one genome can be offset by information from other genomes, thus maintaining the link in the graph).
    Conclusion: Due to the significant decrease in sequencing costs, recent years have seen an explosion of whole-genome sequencing (WGS) projects, most notably for pathogenic bacteria. Using portable sequencers such as the ONT MinION, it will soon be conceivable to obtain thousands of strains for each species, since bacteria can easily be sequenced directly in the field. The capture of all the genomic variations of a species is therefore no longer wishful thinking. Before the emergence of pangenomics, the emphasis had been on identifying polymorphism information to draw a kind of epidemiological map of the lineage(s) of interest. While this has resulted in remarkably detailed information on epidemic strains, it is rapidly showing a major weakness: the analysis of core genes actually provides very little information on adaptive changes, because most of them arise in the shell and cloud genomes. The approach presented here sheds light on these variations in order to focus on the gene gains and losses associated with adaptive changes in a species. In the context of comparative genomics, drawing genomes on rails like a subway map may help biologists compare genomes of interest against the overall pangenomic diversity. This graph-based approach to representing and manipulating pangenomes provides an efficient basis for very large scale comparative genomics. The method is available as a standalone tool (https://github.com/ggautreau/PPanGGOLiN) and, as mentioned in Vallenet et al. (2017), we are currently working on its integration into the MicroScope platform.
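    As a rough illustration of the statistical core described above, the sketch below runs Expectation-Maximisation for a three-component Bernoulli Mixture Model on a gene-family presence/absence matrix and assigns each family to the partition (persistent, shell or cloud) with the highest posterior. It deliberately omits the Markov Random Field smoothing (i.e. the β = 0 case) and is a simplified, hypothetical sketch, not the PPanGGOLiN implementation.

```python
import numpy as np

def partition_pangenome(P, n_iter=100, eps=1e-9):
    """EM for a 3-component Bernoulli mixture over a presence/absence matrix.

    P: (n_families, n_organisms) binary matrix (1 = family present in that organism).
    Returns partition labels (0=persistent, 1=shell, 2=cloud) per gene family.
    Simplified sketch without MRF smoothing (beta = 0).
    """
    n_fam, n_org = P.shape
    # Initialise mean vectors near all-ones (persistent), 0.5 (shell), all-zeros (cloud)
    mu = np.vstack([np.full(n_org, 0.99), np.full(n_org, 0.5), np.full(n_org, 0.01)])
    pi = np.full(3, 1 / 3)   # mixing proportions

    for _ in range(n_iter):
        # E-step: posterior responsibility of each partition for each family
        log_lik = (P[:, None, :] * np.log(mu + eps)
                   + (1 - P[:, None, :]) * np.log(1 - mu + eps)).sum(axis=2)
        log_post = np.log(pi + eps) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixing proportions and per-organism Bernoulli means
        pi = resp.mean(axis=0)
        mu = (resp.T @ P) / (resp.sum(axis=0)[:, None] + eps)

    return resp.argmax(axis=1)

# Toy example: 60 persistent, 30 shell and 30 cloud families across 40 genomes
rng = np.random.default_rng(2)
P = np.vstack([rng.random((60, 40)) < 0.97,
               rng.random((30, 40)) < 0.5,
               rng.random((30, 40)) < 0.05]).astype(int)
labels = partition_pangenome(P)
print(np.bincount(labels, minlength=3))   # rough sizes of persistent / shell / cloud
```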

    Web development, expertise of the Web INTEgRation group in the Bioinformatic and Biostatistics Hub

    The expertise group named WINTER, for Web INTEgRation, is a software development team whose aim is to build websites and applications using Web technologies. Like all members of the Bioinformatics and Biostatistics Hub, the team provides support to Institut Pasteur’s research units and platforms for publishing and sharing analyses, data, scientific tools, and workflows. Several competencies and programming languages are necessary to create a complete, user-friendly web tool. The web development process follows a few essential steps:
    - Design and create the interface model, considering user experience (UX) and user interface (UI) aspects according to user needs.
    - Develop the front-end, the visible part of the website, by implementing the interface code from the previously designed model.
    - Develop the back-end, the hidden part of the website, to carry out the actions enabled by the interface and to store and manage the data in a database.
    - Maintain and deploy the website through continuous integration and continuous delivery/deployment, as well as versioning.
    The WINTER team consists of several members who are experts in web development. We provide our expertise to the scientists on campus, covering a broad range of services to design, develop, deploy, and maintain web interfaces and databases, creating fully functional tools dedicated to scientific topics. Depending on the project’s needs and size, WINTER team members can work alone as full-stack DevOps members or contribute to a specific stage as domain experts (such as UX/UI Designer, Front-end Developer or Back-end Developer), in close collaboration with the other team members. Thanks to projects covering a wide variety of scientific topics such as Structural Bioinformatics, Transcriptomics, Statistical Genetics, Antibiotic Resistance, Phylogeny, and Oncology, we collaborate with several research units and teams on campus, as well as with other groups within the Hub, with the support of the IT department. Over the past years, the team has created more than 15 web applications and databases and has participated in various projects, including:
    - ABSD
    - AMR Spread
    - Bioflow-Insight
    - DefenseFinder
    - InDeep-net
    - JASS
    Additionally, our group is involved in external collaborations with national bioinformatics partners such as the “French Institute of Bioinformatics” (e.g. participation in and leadership of WP6 of the ABRomics project) and European partners (e.g. development of the oncodash software in the DECIDER project). We also oversee the Galaxy server of the Institut Pasteur, an integration platform to publish and use bioinformatics tools and workflows in a web interface.
