178 research outputs found
Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields
Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences make it difficult to identify the true underlying sequence. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.
Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.
Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available under the GNU AGPL v3.0 license.
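To make the link between coverage and multiplicity concrete, the following minimal sketch shows the kind of per-node estimate that the abstract describes as the existing approach: each k-mer is judged on its own coverage only. It is not the CRF model of the paper. The function names, the toy reads and the assumed mean single-copy coverage of 10 are all hypothetical.

```python
from collections import Counter


def kmer_coverage(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def naive_multiplicity(coverage, mean_single_copy_coverage):
    """Estimate multiplicity from one node's coverage alone (assumed baseline).

    Rounds coverage / mean coverage to the nearest integer; an estimate
    of 0 marks the k-mer as a likely sequencing error.
    """
    return int(coverage / mean_single_copy_coverage + 0.5)


# Toy reads (hypothetical): every 3-mer below is seen three times.
reads = ["ACGTAC", "CGTACG", "ACGTAC"]
counts = kmer_coverage(reads, k=3)

# Hypothetical coverages, assuming a mean single-copy coverage of 10.
estimates = [naive_multiplicity(c, 10.0) for c in (2, 19, 31)]
```

Because each estimate ignores neighbouring nodes and arcs, a coverage value close to a rounding boundary is easily misclassified; this is the ambiguity that the abstract says is reduced by combining information from surrounding nodes and arcs.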
Jabba: hybrid error correction for long sequencing reads using maximal exact matches
Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented.
Speeding up Martins' algorithm for multiple objective shortest path problems
The latest transportation systems require the best routes in a large network with respect to multiple objectives simultaneously to be calculated in a very short time. The label setting algorithm of Martins efficiently finds this set of Pareto optimal paths, but sometimes tends to be slow, especially for large networks such as transportation networks. In this article we investigate a number of speedup measures, resulting in new algorithms. It is shown that the calculation time to find the Pareto optimal set can be reduced considerably. Moreover, it is mathematically proven that these algorithms still produce the Pareto optimal set of paths
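For orientation, a plain multi-objective label-setting search can be sketched as follows. This is a minimal illustration in the spirit of Martins' algorithm, assuming non-negative arc costs, and it contains none of the speedup measures studied in the article. The graph layout, the function names and the example network are hypothetical.

```python
import heapq


def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and differs from b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def pareto_labels(graph, source, n_objectives):
    """Return {node: [cost_tuple, ...]} with the Pareto optimal path costs.

    graph maps each node to a list of (neighbour, cost_tuple) arcs.
    All arc costs are assumed to be non-negative.
    """
    labels = {}  # permanent (non-dominated) labels per node
    heap = [((0,) * n_objectives, source)]
    while heap:
        # The lexicographically smallest temporary label is examined next.
        cost, node = heapq.heappop(heap)
        permanent = labels.setdefault(node, [])
        if any(p == cost or dominates(p, cost) for p in permanent):
            continue  # duplicate or dominated label: discard it
        permanent.append(cost)
        for neighbour, arc_cost in graph.get(node, []):
            new_cost = tuple(c + w for c, w in zip(cost, arc_cost))
            known = labels.get(neighbour, [])
            if not any(p == new_cost or dominates(p, new_cost) for p in known):
                heapq.heappush(heap, (new_cost, neighbour))
    return labels


# Hypothetical network with two objectives, e.g. (travel time, toll).
graph = {
    "s": [("a", (1, 4)), ("b", (3, 1)), ("t", (5, 9))],
    "a": [("t", (1, 4))],
    "b": [("t", (1, 1))],
    "t": [],
}
result = pareto_labels(graph, "s", n_objectives=2)
```

In this example the direct arc from `s` to `t` is discarded because the path through `a` is better in both objectives, while the paths through `a` and `b` are both kept since neither dominates the other.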
OMSim: a simulator for optical map data
Motivation: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, it can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available.
Results: We have developed a simulator, OMSim, that produces synthetic optical map data that mimics real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements
Jabba: hybrid error correction for long sequencing reads
Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned.
Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented.
Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph
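As an illustration of what a maximal exact match is, the sketch below finds all MEMs between two plain strings by brute force. It is an assumption-laden toy: Jabba maps reads onto a de Bruijn graph and would need an indexed search rather than this quadratic scan, and the function name and example sequences are hypothetical.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (start_in_a, start_in_b, length) for every maximal exact match.

    A match is maximal when it cannot be extended to the left or to the
    right without introducing a mismatch.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# Hypothetical read and reference sharing the substring "ACGTTG".
read = "ACGTTGCA"
reference = "TTACGTTG"
mems = maximal_exact_matches(read, reference, min_len=4)
```

The `min_len` threshold matters for noisy long reads: a high error rate breaks matches into short pieces, so the threshold trades the number of usable seeds against the risk of spurious ones.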
Full-scale modelling of an ozone reactor for drinking water treatment
In 2003, the Flemish Water Supply Company (VMW) extended its drinking water production site in Kluizen (near Ghent, Belgium) with a combined ozonation and biological granular activated carbon (BGAC) filtration process. Due to this upgrade, biostability increased, less chlorination was needed and drinking water quality improved significantly. The aim of this study was to describe the full-scale reactor with a limited set of equations. In order to describe the ozonation process, a model including key processes such as ozone decomposition, organic carbon removal, disinfection and bromate formation was developed. Kinetics were implemented in WEST® and simulation results were compared to real data. The predictive performance was verified with a goodness-of-fit test and key parameters were determined through a local sensitivity analysis. Parameters involving optical density (both rate constants and stoichiometric coefficients) strongly affect model output. Some parameters related to bromate and bacteria strongly affected their associated concentrations but no other outputs. A scenario analysis was performed to study the system's behavior under different operational conditions. It was demonstrated that the model is able to describe the operation of the full-scale ozone reactor; however, further data collection for model validation is necessary.
Congolese rhizospheric soils as a rich source of new plant growth-promoting endophytic Piriformospora isolates
In the last decade, there has been an increasing focus on the implementation of plant growth-promoting (PGP) organisms as a sustainable option to compensate for poor soil fertility conditions in developing countries. Trap systems were used in an effort to isolate PGP fungi from rhizospheric soil samples collected in the region around Kisangani in the Democratic Republic of Congo. With sudangrass as a host, a highly conducive environment was created for sebacinalean chlamydospore formation inside the plant roots resulting in a collection of 51 axenically cultured isolates of the elusive genus Piriformospora (recently transferred to the genus Serendipita). Based on morphological data, ISSR fingerprinting profiles and marker gene sequences, we propose that these isolates together with Piriformospora williamsii constitute a species complex designated Piriformospora (= Serendipita) 'williamsii.' A selection of isolates strongly promoted plant growth of in vitro inoculated Arabidopsis seedlings, which was evidenced by an increase in shoot fresh weight and a strong stimulation of lateral root formation. This isolate collection provides unprecedented opportunities for fundamental as well as translational research on the Serendipitaceae, a family of fungal endophytes in full expansion
- …