A FAST IMPLEMENTATION FOR CORRECTING ERRORS IN HIGH THROUGHPUT SEQUENCING DATA
ABSTRACT
The impact of next-generation DNA sequencing (NGS) technologies has produced a revolution in biological research. New computational tools are needed to deal with the huge amounts of data they output. The significantly shorter read lengths and higher per-base error rates compared with Sanger technology make analysis more difficult, and critical problems such as genome assembly are still not satisfactorily solved. Significant effort has recently been spent on software programs aimed at increasing the quality of NGS data by correcting errors. The most accurate program to date is HiTEC, and our contribution is a completely new implementation, HiTEC2. The new program is many times faster and uses much less space, while correcting more errors in the same number of iterations. We have eliminated the need for the suffix array data structure as well as for installing complicated statistical libraries, thus making HiTEC2 not only more efficient but also friendlier.
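Error correctors in this family typically rely on counting short substrings (k-mers) across the read set: k-mers seen many times are trusted, rare ones are suspected errors. The following is a minimal k-mer-spectrum sketch of that general idea, not HiTEC2's actual algorithm; the function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    # Count every k-mer across the read set.
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, threshold=2):
    # Greedy one-pass correction: for each rare k-mer, substitute the
    # single base that yields the most frequent alternative k-mer.
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = "".join(read[i:i + k])
        if counts[kmer] >= threshold:
            continue
        best, best_count = None, counts[kmer]
        for j in range(k):
            for b in "ACGT":
                if b == read[i + j]:
                    continue
                cand = kmer[:j] + b + kmer[j + 1:]
                if counts[cand] > best_count:
                    best, best_count = (i + j, b), counts[cand]
        if best:
            read[best[0]] = best[1]
    return "".join(read)
```

For example, a single-base error in one read out of six is repaired because every k-mer covering it is rare while the corrected alternative is abundant. Real tools refine this with statistically chosen thresholds, which is precisely where HiTEC's accuracy comes from.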
HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state of the art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5. Funding: National Science Foundation (U.S.) (NSF/NIH BIGDATA Grant R01GM108348-01); National Science Foundation (U.S.) (Graduate Research Fellowship); Simons Foundation
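The minimum error correction (MEC) criterion mentioned above can be stated concretely: assign each read to its closest haplotype and count the total number of base flips needed for consistency. A minimal sketch (the helper name is hypothetical; HapTree itself optimizes a Bayesian likelihood rather than MEC directly):

```python
def mec_score(haplotypes, reads):
    # Minimum error correction: each read is assigned to the haplotype
    # it disagrees with least; MEC is the total count of mismatches.
    # `haplotypes` is a list of allele strings (any ploidy), and each
    # read is a dict mapping variant-site index -> observed allele.
    total = 0
    for read in reads:
        total += min(
            sum(1 for pos, allele in read.items() if hap[pos] != allele)
            for hap in haplotypes
        )
    return total
```

Because `haplotypes` can contain any number of entries, the same score applies unchanged to triploid and higher-ploidy phasings, which is what makes MEC a convenient cross-method comparison metric.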
Methods for Viral Intra-Host and Inter-Host Data Analysis for Next-Generation Sequencing Technologies
The deep coverage offered by next-generation sequencing (NGS) technology has facilitated the reconstruction of intra-host RNA viral populations at an unprecedented level of detail. However, NGS data requires sophisticated analysis dealing with millions of error-prone short reads. This dissertation will first review the challenges and methods for viral NGS genomic data analysis in the NGS era. Second, it presents a software tool CliqueSNV for inferring viral quasispecies based on extracting pairs of statistically linked mutations from noisy reads, which effectively reduces sequencing noise and enables identifying minority haplotypes with a frequency below the sequencing error rate. Finally, the dissertation describes algorithms VOICE and MinDistB for inference of relatedness between viral samples, identification of transmission clusters, and sources of infection
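The idea of extracting statistically linked mutation pairs can be illustrated with a toy co-occurrence count: pairs of variants that repeatedly appear on the same read are unlikely to both be sequencing noise. This is only a rough sketch of the principle, not CliqueSNV's actual statistical test, and the function name and `min_count` threshold are hypothetical.

```python
from collections import Counter

def linked_pairs(reads, min_count=2):
    # Count co-occurrences of (position, base) variant pairs on the
    # same read; pairs seen repeatedly are candidates for true linkage,
    # since independent random errors rarely recur together.
    pair_counts = Counter()
    for read in reads:  # read: {position: base}
        items = sorted(read.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                pair_counts[(items[i], items[j])] += 1
    return {pair for pair, c in pair_counts.items() if c >= min_count}
```

Linked pairs can then be assembled into cliques of mutually consistent variants, which is the step that lets minority haplotypes below the raw error rate be recovered.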
Algorithms for analysis of next-generation viral sequencing data
RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host's immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat to public health. Progress in sequencing technologies has made it possible to identify and sample intra-host viral populations at great depth. Consequently, the contribution of sequencing technologies to molecular surveillance of viral outbreaks is becoming more and more substantial. Genome sequencing of viral populations reveals similarities between samples, allows viral genetic distance to be measured, and facilitates outbreak identification and isolation. Computational methods can be used to infer transmission characteristics from sequencing data. However, due to the specifics of next-generation sequencing (NGS) approaches and the limited availability of viral data, existing methods lack accuracy and efficiency. In this dissertation, I present novel, flexible methods that allow tackling crucial epidemiological problems, such as identification of transmission clusters, sources of infection, and transmission direction.
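A common baseline for identifying transmission clusters is to threshold pairwise genetic distance and take connected components. The sketch below uses simple Hamming distance on aligned consensus sequences with single-linkage clustering via union-find; it is a hedged illustration of the general approach, not the dissertation's methods, and the names and threshold are hypothetical.

```python
def hamming(a, b):
    # Pairwise genetic distance between two aligned sequences.
    return sum(x != y for x, y in zip(a, b))

def transmission_clusters(samples, threshold):
    # Single-linkage clustering: samples whose distance is at or below
    # the threshold fall into the same putative transmission cluster.
    parent = list(range(len(samples)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if hamming(samples[i], samples[j]) <= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(samples)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Real surveillance methods replace Hamming distance with distances that account for intra-host diversity, which is exactly the gap tools like VOICE and MinDistB aim to fill.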
Viral quasispecies diversity and evolution : a Bioinformatics molecular approach
Over the last 10 years, the viral hepatitis group of the Vall d'Hebron Institut de Recerca (VHIR) in Barcelona has been developing experimental and computational methodological solutions for the study of complex virus populations (quasispecies) through the application of next-generation sequencing (NGS) techniques. This book consists of a selection of empirical works on viral quasispecies. By offering this open publication format, the aim is, on the one hand, to provide a useful tool for all researchers interested in this field and, on the other, to disseminate this area of knowledge to the whole scientific community that, without necessarily being expert, wishes to learn in more detail about the evolution and diversity of viruses. The first three works examine in depth the use, interpretation, and utility of biodiversity indices, some specific to genetic populations and others imported from the field of ecology. The second part highlights some limitations of these diversity indices and addresses the development of integrative tools that provide a more direct interpretation in biological and clinical terms. The sections preceding the six works mentioned place the reader in the context in which the developments were carried out and explain their necessity and utility. The book closes with a section gathering the general observations and conclusions of the works, and another reflecting on the limitations involved in studying complex and dynamic systems such as viral quasispecies.
Technology dictates algorithms: Recent developments in read alignment
Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to today's diverse array of bioinformatics tools. Our review provides a survey of algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of the underlying algorithms on the speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations for read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.
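One algorithmic foundation shared by many short-read aligners surveyed in such reviews is seed-and-extend: index the reference by k-mers, look up a seed from the read, then verify each candidate location. A minimal sketch under those assumptions (hash index, exact seed, mismatch-only verification; all names are illustrative, and production aligners use far more compact indexes such as FM-indexes or minimizers):

```python
def build_index(reference, k):
    # Hash index: k-mer -> list of reference positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def align(read, reference, index, k, max_mismatches=2):
    # Seed with the read's first k-mer, then verify the full read at
    # each candidate position (the "extend" step, mismatches only).
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                return pos, mismatches
    return None
```

The trade-offs this sketch glosses over, such as index size versus lookup speed, and gapped extension for long error-prone reads, are precisely the axes along which the 107 surveyed methods differ.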
Toward Early Detection Of Pancreatic Cancer: An Evidence-Based Approach
This study examines how an evidential reasoning approach can be used as a diagnostic tool for early detection of pancreatic cancer. The evidential reasoning model combines the output of a linear Support Vector Classifier (SVC) with factors such as smoking history, health history, biopsy location, and the NGS technology used to predict the likelihood of the disease. The SVC was trained using genomic data of pancreatic cancer patients derived from the National Cancer Institute (NCI) Genomic Data Commons (GDC). To test the evidential reasoning model, a variety of synthetic data was compiled to assess the impact of different combinations of factors. Through experimentation, we monitored how the evidential interval for pancreatic cancer fluctuated based on the inputs provided. The pancreatic cancer evidential interval increased, supporting the machine learning prediction, when the input changed from a non-smoker and non-drinker to an individual with a highly active smoking and drinking history. Similarly, the evidential interval increased significantly when the machine learning prediction was held high and the sequencing-read quality input changed from a high quantity of guanine-cytosine content and homopolymer regions to a moderate quantity of guanine-cytosine content and few homopolymer regions, indicating that the reads initially carried a higher likelihood of sequencing error and therefore produced a less accurate machine learning output. This experiment shows that an evidence-based approach has the potential to contribute as a diagnostic tool for screening high-risk groups. Future work should focus on improving the machine learning model by using a larger pancreatic cancer genomic database.
Next steps will involve programmatically analyzing real sequencing reads for irregular guanine-cytosine content and high homopolymer regions.
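The two read-quality signals described above are straightforward to compute programmatically. A minimal sketch (function names are illustrative, not from the study):

```python
def gc_content(read):
    # Fraction of guanine and cytosine bases in the read.
    return (read.count("G") + read.count("C")) / len(read)

def longest_homopolymer(read):
    # Length of the longest run of a single repeated base;
    # long runs are a known source of indel errors on some platforms.
    best = run = 1
    for prev, cur in zip(read, read[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best
```

Flagging reads with extreme GC fractions or long homopolymer runs is then a matter of thresholding these two values before they feed the evidential model.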
Gene Set Enrichment and Projection: A Computational Tool for Knowledge Discovery in Transcriptomes
Explaining the mechanism behind a genetic disease involves two phases: collecting and analyzing data associated with the disease, and then interpreting those data in the context of biological systems. The objective of this dissertation was to develop a method for integrating complementary datasets surrounding any single biological process, with the goal of presenting the response to a signal in terms of a set of downstream biological effects. This dissertation specifically tests the hypothesis that computational projection methods overlaid with domain expertise can direct research towards relevant systems-level signals underlying complex genetic disease. To this end, I developed a software algorithm named Geneset Enrichment and Projection Displays (GSEPD) that can visualize multidimensional gene expression to identify the biologically relevant gene sets that are altered in response to a biological process. This dissertation highlights a problem of data interpretation facing the medical research community and shows how the computational sciences can help. By bringing annotation and expression datasets together, a new analytical and software method was produced that helps unravel complicated experimental and biological data. The dissertation presents four coauthored studies in which experts in their fields sought to annotate functional significance in a gene-centric experiment. Using GSEPD to show inherently high-dimensional data as a simple colored graph, a subspace vector projection directly calculates how closely each sample behaves like the test conditions. The end-user medical researcher understands their data as a series of somewhat-independent subsystems, and GSEPD provides a dimensionality reduction for high-throughput experiments of limited sample size. Gene Ontology analyses are accessible at a sample-to-sample level, and this work highlights not just the expected biological systems but also many annotated results available in vast online databases.
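The subspace vector projection idea, scoring how much each sample behaves like one test condition versus another, can be sketched as projecting a sample's expression vector onto the axis between two condition centroids. This is a hedged illustration of the geometric principle only, not GSEPD's implementation; all names are hypothetical.

```python
def projection_score(sample, centroid_a, centroid_b):
    # Project an expression vector onto the axis from centroid A to
    # centroid B: a score near 0.0 means "behaves like condition A",
    # near 1.0 means "behaves like condition B".
    axis = [b - a for a, b in zip(centroid_a, centroid_b)]
    rel = [s - a for s, a in zip(sample, centroid_a)]
    dot = sum(x * y for x, y in zip(rel, axis))
    norm_sq = sum(x * x for x in axis)
    return dot / norm_sq
```

Computed per gene set rather than over the whole transcriptome, such a scalar gives exactly the kind of per-sample, per-subsystem summary that can be rendered as a simple colored graph.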
Design and characterization of novel immunogens for AIDS vaccine development and evaluation of a sample inference method for NGS Illumina amplicon data
Since the beginning of the AIDS pandemic, an estimated 78 million people have become infected and 35 million people have died from AIDS-related illnesses. Despite the existence of effective antiretroviral therapy, 1.1 million people died of AIDS-related causes in 2015. A vaccine that could induce broadly neutralizing antibodies (bnAbs) is hypothesized to be the most efficient way to halt the AIDS pandemic. However, the majority of attempts to elicit bnAbs with HIV-1 vaccine candidates have failed due to the extensive variability and complex immune-evasion strategies of HIV-1. Recent advances in the isolation of bnAbs from HIV-1 infected individuals have revived interest in vaccine development. The membrane proximal external region (MPER) of gp41 and the CD4 binding site (CD4bs) of gp120 have become attractive targets for vaccine development because they contain highly conserved epitopes recognized by some of the broadest neutralizing antibodies. Here, we have designed and characterized multiple immunogens and vaccine strategies to induce bnAbs targeted to the MPER or CD4bs. Our findings indicate that 1) neighboring domains influence the immunogenicity of the gp41 MPER, and 2) priming with a small gp41 or gp120 immunogen, then subsequently boosting with larger and more native immunogens, may have the potential to elicit antibodies towards the appropriate neutralizing epitopes.
Illumina amplicon sequencing is an important tool for the identification and quantification of species or variants in metagenomics studies, but sequencing errors make it challenging to correctly identify the authentic differences. Many denoising algorithms have been developed, but most ignore the quality scores or compress those data. We developed ampliclust, an error-modeling approach that uses uncompressed sequences and quality scores to infer the samples in Illumina amplicon data. Our approach showed better accuracy than the popular denoising tool DADA2 when data are not well separated.
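The advantage of keeping uncompressed quality scores can be illustrated with the standard Phred relationship: a quality score Q corresponds to an error probability of 10^(-Q/10), so each base can contribute its own error rate to a read's likelihood instead of a pooled average. A minimal sketch of this per-base scoring (the model here is a generic illustration, not ampliclust's actual error model):

```python
import math

def phred_to_error_prob(q):
    # Phred quality Q corresponds to error probability 10^(-Q/10),
    # e.g. Q=20 means a 1-in-100 chance the base call is wrong.
    return 10 ** (-q / 10)

def read_log_likelihood(read, quals, template):
    # Log-likelihood that `read` was generated from `template`, using
    # each base's own quality score rather than a single pooled rate.
    ll = 0.0
    for base, q, t in zip(read, quals, template):
        e = phred_to_error_prob(q)
        # A match occurs with prob 1 - e; assume each of the three
        # possible miscalls is equally likely, prob e / 3.
        ll += math.log(1 - e) if base == t else math.log(e / 3)
    return ll
```

Comparing such likelihoods across candidate templates is what lets a denoiser distinguish a genuine low-frequency variant (mismatches at high-quality positions) from mere noise (mismatches at low-quality positions).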