
    SOPRA: Scaffolding algorithm for paired reads via statistical optimization

    Background: High throughput sequencing (HTS) platforms produce gigabases of short-read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate-size genomes from such reads remains a significant challenge. These limitations can be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome. Results: We have developed SOPRA, a tool designed to exploit mate pair/paired-end information for the assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is posed as an optimization problem over variables associated with the vertices and edges of the contig connectivity graph, whose vertices are individual contigs and whose edges connect contigs linked by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for previous generations of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations imposed by the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to produce poor-quality results in the current context. SOPRA circumvents this problem by treating all constraints on an equal footing when solving the optimization problem, with the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. This process of solving and removing constraints is iterated until a core set of consistent constraints is reached. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. For assessing the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors. Conclusions: Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies from any type of mate pair sequencing data.
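    A hedged sketch of the iterate-solve-and-prune idea described above (a Python illustration, not the SOPRA code; the constraint representation, the least-squares placement and the max_residual threshold are assumptions): contigs are placed on a line so the mate-pair gap constraints are satisfied on average, the worst-violated constraints are dropped, and the process repeats until a consistent core set remains.

    from dataclasses import dataclass

    import numpy as np

    @dataclass
    class Constraint:
        a: int        # index of the first contig
        b: int        # index of the second contig
        gap: float    # separation implied by the mate-pair library

    def solve_positions(n_contigs, constraints):
        """Least-squares positions of contigs given pairwise gap constraints."""
        rows, rhs = [], []
        for c in constraints:
            row = np.zeros(n_contigs)
            row[c.a], row[c.b] = -1.0, 1.0          # encodes x_b - x_a ~= gap
            rows.append(row)
            rhs.append(c.gap)
        x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        return x

    def iterate_scaffold(n_contigs, constraints, max_residual=500.0):
        """Solve, discard constraints whose residual exceeds the threshold, repeat."""
        while constraints:
            x = solve_positions(n_contigs, constraints)
            residuals = [abs((x[c.b] - x[c.a]) - c.gap) for c in constraints]
            keep = [c for c, r in zip(constraints, residuals) if r <= max_residual]
            if len(keep) == len(constraints):       # consistent core set reached
                return x, constraints
            constraints = keep
        return np.zeros(n_contigs), []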

    Leveraging Identity-by-Descent for Accurate Genotype Inference in Family Sequencing Data

    Sequencing family DNA samples provides an attractive alternative to population-based designs for identifying rare variants associated with human disease, due to the enrichment of causal variants in pedigrees. Previous studies showed that genotype calling accuracy can be improved by modeling family relatedness compared with standard calling algorithms. Current family-based variant calling methods use sequencing data at single variants and ignore the identity-by-descent (IBD) sharing along the genome. In this study we describe a new computational framework to accurately estimate IBD sharing from sequencing data and to utilize the inferred IBD among family members to jointly call genotypes in pedigrees. Through simulations and application to real data, we show that IBD can be reliably estimated across the genome, even at very low coverage (e.g. 2X), and that genotype accuracy can be dramatically improved. Moreover, the improvement is more pronounced for variants with low frequencies, especially at low to intermediate coverage (e.g. 10X to 20X), making our approach effective for studying rare variants in cost-effective whole genome sequencing of pedigrees. We hope that our tool will be useful to the research community for identifying rare variants underlying human disease through family-based sequencing.
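    A hedged sketch of the underlying idea (not the authors' software; the likelihood values, the Hardy-Weinberg prior and the allele-sharing check are illustrative assumptions): in a region where two relatives are known to share one chromosome IBD, genotype pairs that cannot share an allele can be ruled out, which can change an otherwise ambiguous low-coverage call.

    import itertools

    GENOTYPES = (0, 1, 2)   # copies of the alternate allele

    def genotype_prior(g, alt_freq):
        """Hardy-Weinberg prior for a single genotype."""
        p, q = 1.0 - alt_freq, alt_freq
        return (p * p, 2 * p * q, q * q)[g]

    def shares_an_allele(g1, g2):
        """In an IBD=1 segment the pair must share an allele; 0/0 with 2/2 is impossible."""
        return not ((g1 == 0 and g2 == 2) or (g1 == 2 and g2 == 0))

    def joint_call(lik1, lik2, alt_freq, ibd1=True):
        """Most probable genotype pair given per-sample genotype likelihoods."""
        best, best_post = None, -1.0
        for g1, g2 in itertools.product(GENOTYPES, repeat=2):
            post = (lik1[g1] * genotype_prior(g1, alt_freq)
                    * lik2[g2] * genotype_prior(g2, alt_freq))
            if ibd1 and not shares_an_allele(g1, g2):
                post = 0.0                       # IBD sharing rules this pair out
            if post > best_post:
                best, best_post = (g1, g2), post
        return best

    # Ambiguous low-coverage likelihoods: the IBD constraint shifts the call from (0, 2) to (0, 1).
    print(joint_call([0.5, 0.4, 0.1], [0.05, 0.15, 0.8], alt_freq=0.3))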

    Pool-seq analysis for the identification of polymorphisms in bacterial strains and utilization of the variants for protein database creation

    Pooled sequencing (Pool-seq) is the sequencing of a single library that contains DNA pooled from different samples. It is a cost-effective alternative to individual whole genome sequencing. In this study, we utilized Pool-seq to sequence 100 Streptococcus pyogenes strains in two pools to identify polymorphisms and create variant protein databases for shotgun proteomics analysis. We investigated the efficacy of the pooling strategy and of the four tools used for variant calling by using individual sequence data from six of the strains in the pools as well as 3407 publicly available strains from the European Nucleotide Archive. Besides the raw sequence data from the public repository, we also extracted polymorphisms from 19 publicly available S. pyogenes complete genomes and compared the variations against our pools. In total, 78,955 variants (76,981 SNPs and 1,725 indels) were identified from the two pools. Of these, ~60.5% and 95.7% were also discovered in the complete genomes and the European Nucleotide Archive data, respectively. Collectively, the four variant calling tools were able to recover the majority of the variants, ~96.5%, found in the six individual strains, suggesting Pool-seq is a robust approach for variation discovery. Variants from the pools that fell in coding regions and had nonsynonymous effects constituted 24% of the total and were used to create variant protein databases for shotgun proteomics analysis. These variant databases improved protein identification in mass spectrometry analysis.
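    A hedged sketch of the final step described above (not the study's actual pipeline; the record format, identifiers and file name are assumptions): nonsynonymous substitutions called from the pools can be applied to reference protein sequences, emitting one FASTA entry per variant for the shotgun proteomics search database.

    def apply_substitution(protein_seq, pos, ref_aa, alt_aa):
        """Apply a single amino-acid substitution (1-based position); sanity-check the reference."""
        if protein_seq[pos - 1] != ref_aa:
            raise ValueError(f"reference mismatch at position {pos}")
        return protein_seq[:pos - 1] + alt_aa + protein_seq[pos:]

    def write_variant_fasta(proteins, substitutions, out_path):
        """proteins: {protein_id: sequence}; substitutions: [(protein_id, pos, ref_aa, alt_aa)]."""
        with open(out_path, "w") as out:
            for prot_id, pos, ref_aa, alt_aa in substitutions:
                variant = apply_substitution(proteins[prot_id], pos, ref_aa, alt_aa)
                out.write(f">{prot_id}_{ref_aa}{pos}{alt_aa}\n{variant}\n")

    # Toy example: one protein, one nonsynonymous substitution called from the pooled data.
    write_variant_fasta({"spy0001": "MKTAYIAKQR"}, [("spy0001", 4, "A", "T")], "variants.fasta")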

    Environmental DNA metabarcoding: Transforming how we survey animal and plant communities

    The genomic revolution has fundamentally changed how we survey biodiversity on earth. High-throughput sequencing ("HTS") platforms now enable the rapid sequencing of DNA from diverse kinds of environmental samples (termed "environmental DNA" or "eDNA"). Coupling HTS with our ability to associate sequences from eDNA with a taxonomic name is called "eDNA metabarcoding" and offers a powerful molecular tool capable of noninvasively surveying species richness from many ecosystems. Here, we review the use of eDNA metabarcoding for surveying animal and plant richness, and the challenges in using eDNA approaches to estimate relative abundance. We highlight eDNA applications in freshwater, marine and terrestrial environments, and in this broad context, we distill what is known about the ability of different eDNA sample types to approximate richness in space and across time. We provide guiding questions for study design and discuss the eDNA metabarcoding workflow with a focus on primers and library preparation methods. We additionally discuss important criteria for consideration of bioinformatic filtering of data sets, with recommendations for increasing transparency. Finally, looking to the future, we discuss emerging applications of eDNA metabarcoding in ecology, conservation, invasion biology, biomonitoring, and how eDNA metabarcoding can empower citizen science and biodiversity education.
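    One concrete example of the bioinformatic filtering the review discusses (a hedged sketch with illustrative thresholds, not the review's recommendation): detections can be dropped when their read count falls below a relative-abundance cutoff or below the count observed in a negative control, with both criteria kept explicit so they can be reported transparently.

    def filter_detections(counts, negative_control, min_relative_abundance=0.001):
        """counts: {sample: {taxon: reads}}; negative_control: {taxon: reads}."""
        filtered = {}
        for sample, taxa in counts.items():
            total = sum(taxa.values())
            kept = {}
            for taxon, reads in taxa.items():
                if reads < negative_control.get(taxon, 0):
                    continue                          # likely contamination
                if total and reads / total < min_relative_abundance:
                    continue                          # below relative-abundance threshold
                kept[taxon] = reads
            filtered[sample] = kept
        return filtered

    # Toy sample: the human reads fall below the negative-control count and are removed.
    print(filter_detections(
        {"pond_1": {"Esox lucius": 1200, "Homo sapiens": 3, "Rana temporaria": 40}},
        negative_control={"Homo sapiens": 10},
    ))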

    Optimization of marine sediment characterization via statistical analysis

    The task of geotechnical site characterization includes defining the layout of ground units and establishing their relevant engineering properties. This is an activity in which uncertainties of different kinds (inherent, experimental, interpretive) are always present and in which the amount and characteristics of the available data are highly variable. Probabilistic methodologies are applied to assess and manage these uncertainties. A Bayesian perspective on probability, which grounds probability in belief, is well suited to geotechnical characterization problems, as it has the flexibility to handle different kinds of uncertainty and datasets that are highly variable in quality and quantity. This thesis addresses different topics of geotechnical site characterization from a probabilistic perspective, with emphasis on offshore investigation, on the Cone Penetration Test (CPTu) and on Bayesian methodologies. The first topic addresses soil layer delineation based on CPT(u) data. The starting point is the recognition that layer delineation is problem-oriented and has a strong subjective component. We propose a novel CPTu record analysis methodology that aims to: (a) elicit the heuristics that intervene in layer delineation, facilitating communication and coherence in interpretation; (b) facilitate probabilistic characterization of the identified layers; and (c) remain simple and intuitive to use. The method is based on sequential distribution fitting in conventionally accepted classification spaces (Soil Behavior Type charts). The proposed technique is applied at different sites, illustrating how it can be related to borehole observations, how it compares with alternative methodologies and how it can be extended to create cross-site profiles. The second topic addresses strain-rate corrections of dynamic CPTu data. Dynamic CPTu tests impact the seafloor and are very agile characterization tools; however, they require transformation to equivalent quasi-static results that can be interpreted conventionally. Up to now, the necessary corrections have been either too vague or have required the acquisition of paired dynamic and quasi-static CPTu records (i.e., acquired at the same location). A Bayesian methodology is applied to derive strain-rate coefficients in a more general setting, one in which some quasi-static CPTu records are available in the study area but need not be paired with any converted dynamic CPTu. Application to a case study offshore Nice shows that the results match those obtained using paired tests. Furthermore, strain-rate correction coefficients and transformed quasi-static profiles are expressed in probabilistic terms. The third topic addressed is the optimization of soil unit weight prediction from CPTu readings. A Bayesian Mixture Analysis is applied to a global database to identify hidden soil classes within it. The goal is to improve the accuracy of regressions between geotechnical parameters obtained by exploiting the database. The method is applied to predict soil unit weight from CPTu data, a problem of intrinsic practical interest that is also representative of the difficulties faced by a larger class of geotechnical regression problems. Results highlight a decrease in systematic transformation uncertainty and an improvement in the accuracy of soil unit weight prediction from CPTu at new sites. In a final application we present a probabilistic earthquake-induced landslide susceptibility map of the South-West Iberian margin.
A simplified geotechnical pixel-based slope stability model is applied across the whole study area, within which the key stability model parameters are treated as random variables. Site characterization at the regional scale combines a global database with the available geotechnical data through a Bayesian scheme. Outputs (landslide susceptibility maps) are derived from a reliability-based procedure (Monte Carlo simulations), providing a robust landslide susceptibility prediction for the site as assessed by the Receiver Operating Characteristic (ROC) curve.
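    A hedged sketch of the reliability-based step behind such a map (assumed parameter distributions and an assumed pseudo-static infinite-slope model, not the thesis implementation): for each pixel, uncertain soil properties are sampled, a factor of safety is evaluated, and the probability of failure is the fraction of Monte Carlo samples with FS < 1.

    import numpy as np

    def infinite_slope_fs(cohesion, friction_angle_deg, unit_weight, depth, slope_deg, k_seismic):
        """Simplified infinite-slope factor of safety with a horizontal seismic coefficient."""
        beta = np.radians(slope_deg)
        phi = np.radians(friction_angle_deg)
        normal = unit_weight * depth * (np.cos(beta) ** 2 - k_seismic * np.sin(beta) * np.cos(beta))
        driving = unit_weight * depth * (np.sin(beta) * np.cos(beta) + k_seismic * np.cos(beta) ** 2)
        return (cohesion + normal * np.tan(phi)) / driving

    def pixel_failure_probability(slope_deg, n_samples=10_000, seed=0):
        """Probability of failure for one pixel; parameter distributions are illustrative."""
        rng = np.random.default_rng(seed)
        cohesion = rng.lognormal(mean=np.log(5.0), sigma=0.4, size=n_samples)   # kPa
        friction = rng.normal(loc=28.0, scale=3.0, size=n_samples)              # degrees
        unit_weight = rng.normal(loc=17.0, scale=1.0, size=n_samples)           # kN/m3
        fs = infinite_slope_fs(cohesion, friction, unit_weight, depth=5.0,
                               slope_deg=slope_deg, k_seismic=0.1)
        return float(np.mean(fs < 1.0))

    print(pixel_failure_probability(slope_deg=20.0))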

    Monte-Carlo event generation for the LHC

    This thesis discusses recent developments in the simulation of particle physics in light of the start-up of the Large Hadron Collider. Simulation programs for fully exclusive events, dubbed Monte-Carlo event generators, are improved in areas related to both the perturbative and non-perturbative regimes of the strong interaction. A short introduction to the main principles of event generation is given to serve as a basis for the following discussion. An existing algorithm for the correction of parton-shower emissions with the help of exact tree-level matrix elements is revisited and significantly improved, as attested by first results. In a next step, an automated implementation of the POWHEG method is presented. It allows for the combination of parton showers with full next-to-leading-order QCD calculations and has been tested on several processes. These two methods are then combined into a more powerful framework that makes it possible to correct a parton shower with full next-to-leading-order matrix elements and higher-order tree-level matrix elements at the same time. Turning to the non-perturbative aspects of event generation, a tuning of the Pythia event generator within the Monte-Carlo working group of the ATLAS experiment is presented. It is based on early ATLAS minimum bias measurements obtained with minimal model dependence. The parts of the detector relevant for these measurements are briefly explained. Throughout the thesis, results obtained with the improvements are compared to experimental measurements.
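    A hedged sketch of the hit-or-miss sampling at the heart of event generation (a toy differential cross section, not a real generator): phase-space points are proposed uniformly and accepted with probability dsigma/dx divided by an overestimate, so the accepted, unweighted events follow dsigma/dx.

    import random

    def dsigma_dx(x):
        """Toy differential cross section, peaked at small x (arbitrary units)."""
        return 1.0 / (x + 0.05)

    def generate_events(n_events, x_min=0.0, x_max=1.0, seed=42):
        """Unweighted events drawn from dsigma_dx by hit-or-miss (rejection) sampling."""
        rng = random.Random(seed)
        overestimate = dsigma_dx(x_min)          # dsigma/dx is monotonically falling here
        events = []
        while len(events) < n_events:
            x = rng.uniform(x_min, x_max)        # propose a phase-space point
            if rng.random() < dsigma_dx(x) / overestimate:
                events.append(x)                 # accept: one unweighted event
        return events

    events = generate_events(1000)
    print(sum(events) / len(events))             # mean x, pulled toward the peak at small x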

    Human genetics and genomics a decade after the release of the draft sequence of the human genome

    Human Genomics 5(6): 577-62. DOI: 10.1186/1479-7364-5-6-577