
    A Fast Implementation for Correcting Errors in High Throughput Sequencing Data

    The advent of next-generation DNA sequencing (NGS) technologies has produced a revolution in biological research. New computational tools are needed to deal with the huge amounts of data they output. The significantly shorter read lengths and higher per-base error rates compared with Sanger technology make analysis more difficult, and critical problems such as genome assembly are still not satisfactorily solved. Significant effort has recently been devoted to software programs aimed at increasing the quality of NGS data by correcting errors. The most accurate program to date is HiTEC, and our contribution is a completely new implementation, HiTEC2. The new program is many times faster and uses much less space, while correcting more errors in the same number of iterations. We have eliminated the need for the suffix array data structure as well as the need to install complicated statistical libraries, making HiTEC2 not only more efficient but also friendlier to use
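    The abstract does not spell out HiTEC2's algorithm beyond the removal of the suffix array. As a rough illustration of the general idea behind spectrum-based read error correction (a sketch of the technique family, not HiTEC2's actual method), the Python below counts k-mers over a read set and rewrites the last base of any rare k-mer to the substitution that makes it frequent; the choice of k, the min_count threshold, and the function names are all illustrative assumptions.

```python
from collections import Counter

BASES = "ACGT"

def kmer_counts(reads, k):
    """Count every k-mer occurring in the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count=3):
    """Greedy single-substitution correction: if a k-mer is rare,
    replace its last base with the alternative that yields the
    most frequent k-mer in the spectrum."""
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = "".join(read[i:i + k])
        if counts[kmer] >= min_count:
            continue
        pos = i + k - 1  # position of the k-mer's last base
        best_base, best_count = read[pos], counts[kmer]
        for b in BASES:
            cand = "".join(read[i:pos]) + b
            if counts[cand] > best_count:
                best_base, best_count = b, counts[cand]
        read[pos] = best_base
    return "".join(read)

if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC", "ACGTACGTAC"]
    counts = kmer_counts(reads, k=5)
    print([correct_read(r, counts, k=5) for r in reads])
```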

    HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data

    As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5. National Science Foundation (U.S.) (NSF/NIH BIGDATA Grant R01GM108348-01); National Science Foundation (U.S.) (Graduate Research Fellowship); Simons Foundation
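    The abstract judges phasings partly by their minimum error correction (MEC) values. As a small, self-contained illustration of how that metric is computed (a generic sketch, not HapTree's implementation), the function below scores a candidate set of haplotypes against aligned read fragments by summing, for each fragment, the mismatches to its closest haplotype; the dict-based data layout is an assumption made for the example.

```python
def mec_score(haplotypes, fragments):
    """MEC: for each fragment, count mismatches against its closest
    haplotype, then sum over all fragments.

    haplotypes: list of dicts mapping SNP position -> allele (0/1)
    fragments:  list of dicts mapping SNP position -> observed allele
    """
    total = 0
    for frag in fragments:
        best = min(
            sum(1 for pos, allele in frag.items()
                if pos in hap and hap[pos] != allele)
            for hap in haplotypes
        )
        total += best
    return total

if __name__ == "__main__":
    # A triploid toy example over four SNP sites (alleles coded 0/1).
    haps = [
        {0: 0, 1: 0, 2: 0, 3: 0},
        {0: 1, 1: 1, 2: 0, 3: 1},
        {0: 1, 1: 0, 2: 1, 3: 1},
    ]
    frags = [
        {0: 0, 1: 0},          # matches haplotype 0 exactly
        {1: 1, 2: 0, 3: 1},    # matches haplotype 1 exactly
        {0: 1, 2: 0, 3: 0},    # one mismatch to its best haplotype
    ]
    print(mec_score(haps, frags))  # -> 1
```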

    Methods for Viral Intra-Host and Inter-Host Data Analysis for Next-Generation Sequencing Technologies

    The deep coverage offered by next-generation sequencing (NGS) technology has facilitated the reconstruction of intra-host RNA viral populations at an unprecedented level of detail. However, NGS data require sophisticated analysis to deal with millions of error-prone short reads. This dissertation first reviews the challenges and methods of viral genomic data analysis in the NGS era. Second, it presents CliqueSNV, a software tool that infers viral quasispecies by extracting pairs of statistically linked mutations from noisy reads, which effectively reduces sequencing noise and enables identification of minority haplotypes with frequencies below the sequencing error rate. Finally, the dissertation describes the algorithms VOICE and MinDistB for inferring relatedness between viral samples, identifying transmission clusters, and determining sources of infection
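    CliqueSNV is described as extracting pairs of statistically linked mutations from noisy reads. As a simplified illustration of linkage testing between two variant positions (not the tool's actual statistic or clique machinery), the sketch below builds a 2x2 co-occurrence table over reads covering both positions and computes a Pearson chi-square statistic; the positions, alleles, and the interpretation of a "high" statistic are assumptions for the example.

```python
def linkage_table(reads, i, j, alt_i, alt_j):
    """2x2 co-occurrence counts of (alt/ref at i) x (alt/ref at j)
    over reads that cover both positions."""
    table = [[0, 0], [0, 0]]
    for read in reads:
        if len(read) <= max(i, j):
            continue
        a = 0 if read[i] == alt_i else 1
        b = 0 if read[j] == alt_j else 1
        table[a][b] += 1
    return table

def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    row = [sum(r) for r in table]
    col = [table[0][c] + table[1][c] for c in range(2)]
    n = sum(row)
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row[r] * col[c] / n
            if expected > 0:
                stat += (table[r][c] - expected) ** 2 / expected
    return stat

if __name__ == "__main__":
    # Two minority variants (position 2 = 'T', position 7 = 'G') that
    # always co-occur, suggesting they belong to the same haplotype.
    reads = ["ACGTACGTAC"] * 8 + ["ACTTACGGAC"] * 2
    table = linkage_table(reads, 2, 7, "T", "G")
    print(table, chi_square(table))  # high statistic -> linked pair
```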

    Algorithms for analysis of next-generation viral sequencing data

    RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host's immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat to public health. Progress in sequencing technologies has made it possible to identify and sample intra-host viral populations at great depth. Consequently, the contribution of sequencing technologies to molecular surveillance of viral outbreaks is becoming increasingly substantial. Genome sequencing of viral populations reveals similarities between samples, allows viral genetic distances to be measured, and facilitates outbreak identification and isolation. Computational methods can be used to infer transmission characteristics from sequencing data. However, due to the specifics of next-generation sequencing (NGS) approaches and the limited availability of viral data, existing methods lack accuracy and efficiency. In this dissertation, I present novel, flexible methods that tackle crucial epidemiological problems, such as identification of transmission clusters, sources of infection, and transmission direction
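    As a minimal sketch of one way transmission clusters can be identified from sequencing data (not the dissertation's actual methods), the code below applies single-linkage clustering to samples whose pairwise normalized Hamming distance between aligned consensus sequences falls below a threshold; the distance threshold and the sample names are illustrative assumptions.

```python
from itertools import combinations

def hamming(a, b):
    """Normalized Hamming distance between two aligned sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def transmission_clusters(samples, threshold=0.02):
    """Single-linkage clustering via union-find: samples whose distance
    falls below the threshold end up in the same cluster."""
    parent = {name: name for name in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (n1, s1), (n2, s2) in combinations(samples.items(), 2):
        if hamming(s1, s2) <= threshold:
            parent[find(n1)] = find(n2)

    clusters = {}
    for name in samples:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())

if __name__ == "__main__":
    samples = {
        "patientA": "ACGTACGTACGTACGTACGT",
        "patientB": "ACGTACGTACGTACGTACGA",  # 1 difference from A
        "patientC": "TTTTACGTACGTACGTACGT",  # distant outgroup
    }
    print(transmission_clusters(samples, threshold=0.05))
```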

    Viral quasispecies diversity and evolution: a bioinformatics molecular approach

    Over the past 10 years, the viral hepatitis group of the Vall d'Hebron Institut de Recerca (VHIR) in Barcelona has been developing experimental and computational methodological solutions for studying complex virus populations (quasispecies) through the application of next-generation sequencing (NGS) techniques. This book consists of a selection of empirical works on viral quasispecies. By offering this open publication format, the aim is, on the one hand, to provide a useful tool for researchers interested in this field and, on the other, to make this area of knowledge accessible to the wider scientific community that, without necessarily being expert, wishes to understand the evolution and diversity of viruses in more detail. The first three works examine in depth the use, interpretation and utility of biodiversity indices, some specific to genetic populations and others imported from the field of ecology. The second part highlights some limitations of these diversity indices and addresses the development of integrative tools that provide a more direct interpretation in biological and clinical terms. The sections preceding the six works mentioned place the reader in the context in which the developments were carried out and explain their necessity and utility. The book closes with a section that gathers the general observations and conclusions of the works, and another that reflects on the limitations involved in studying complex and dynamic systems such as viral quasispecies
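    As a small illustration of the kind of diversity indices discussed, some imported from ecology (a generic sketch, not the group's actual pipeline), the Python below computes the Shannon entropy of a quasispecies sample and the corresponding Hill numbers, which express diversity as an effective number of haplotypes; the example read counts are invented.

```python
import math

def frequencies(counts):
    """Relative frequencies of haplotypes with nonzero counts."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def shannon_entropy(counts):
    """Shannon entropy (natural log) of the haplotype distribution."""
    return -sum(p * math.log(p) for p in frequencies(counts))

def hill_number(counts, q):
    """Hill number of order q: effective number of haplotypes.
    q=0 -> richness, q=1 -> exp(Shannon entropy), q=2 -> inverse Simpson."""
    p = frequencies(counts)
    if q == 1:
        return math.exp(shannon_entropy(counts))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

if __name__ == "__main__":
    # Read counts per haplotype in a hypothetical quasispecies sample.
    counts = [900, 50, 30, 15, 5]
    print(shannon_entropy(counts))
    print([hill_number(counts, q) for q in (0, 1, 2)])
```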

    Technology dictates algorithms: Recent developments in read alignment

    Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to today's diverse array of bioinformatics tools. Our review provides a survey of algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on their speed and efficiency. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies
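    As a toy illustration of the seed-and-extend strategy that underlies many of the surveyed aligners, the sketch below hashes reference k-mers, anchors read k-mers to candidate positions, and verifies each candidate by counting mismatches. Real aligners use compressed indexes (e.g., FM-indices or minimizers) and full dynamic programming for the extension step; the sequences, k, and mismatch budget here are illustrative assumptions.

```python
from collections import defaultdict

def index_reference(ref, k):
    """Hash every k-mer of the reference to its positions (seed index)."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def align_read(read, ref, index, k, max_mismatches=2):
    """Seed-and-extend: anchor each read k-mer, then verify the implied
    reference position by counting mismatches over the full read."""
    best = None
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start < 0 or start + len(read) > len(ref):
                continue
            mism = sum(a != b for a, b in zip(read, ref[start:start + len(read)]))
            if mism <= max_mismatches and (best is None or mism < best[1]):
                best = (start, mism)
    return best  # (reference position, mismatch count) or None

if __name__ == "__main__":
    ref = "TTGACCTAGGACGTTACGGATCCATGAAC"
    read = "GACGTTACGGATC"      # exact substring of ref
    noisy = "GACGTAACGGATC"      # one substitution
    idx = index_reference(ref, k=6)
    print(align_read(read, ref, idx, k=6))
    print(align_read(noisy, ref, idx, k=6))
```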

    Toward Early Detection of Pancreatic Cancer: An Evidence-Based Approach

    This study examines how an evidential reasoning approach can be used as a diagnostic tool for early detection of pancreatic cancer. The evidential reasoning model combines the output of a linear Support Vector Classifier (SVC) with factors such as smoking history, health history, biopsy location, NGS technology used, and more to predict the likelihood of the disease. The SVC was trained using genomic data of pancreatic cancer patients derived from the National Cancer Institute (NCI) Genomic Data Commons (GDC). To test the evidential reasoning model, a variety of synthetic data was compiled to assess the impact of combinations of different factors. Through experimentation, we monitored how the evidential interval for pancreatic cancer fluctuated based on the inputs that were provided. We observed that the pancreatic cancer evidential interval increased, and the machine learning prediction of pancreatic cancer was supported, when the input changed from a non-smoker and non-drinker to an individual with a highly active smoking and drinking history. Similarly, the evidential interval for pancreatic cancer increased significantly when the machine learning prediction for pancreatic cancer was kept high and the sequencing-read quality input was changed from a high quantity of guanine-cytosine content and homopolymer regions to a moderate quantity of guanine-cytosine content and low homopolymer regions, indicating that the sequencing reads initially had a higher likelihood of error, resulting in a less accurate machine learning output. This experiment shows that an evidence-based approach has the potential to contribute as a diagnostic tool for screening high-risk groups. Future work should focus on improving the machine learning model by using a larger pancreatic cancer genomic database. Next steps will involve programmatically analyzing real sequencing reads for irregular guanine-cytosine content and high homopolymer regions
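    The abstract does not name the exact evidential calculus used. As one hedged way such an evidential interval could be produced, the sketch below combines two Dempster-Shafer mass functions over the frame {cancer, no_cancer} with Dempster's rule and reports the belief/plausibility interval for the cancer hypothesis; the mass assignments for the classifier score and the risk factors are invented for the example.

```python
from itertools import product

CANCER, NO_CANCER = "cancer", "no_cancer"
THETA = frozenset({CANCER, NO_CANCER})  # full frame = total uncertainty

def combine(m1, m2):
    """Dempster's rule of combination for mass functions over subsets
    of the frame (keys are frozensets, values sum to 1)."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    scale = 1.0 - conflict
    return {k: v / scale for k, v in combined.items()}

def interval(m, hypothesis):
    """Belief / plausibility interval for a single hypothesis."""
    h = frozenset({hypothesis})
    belief = sum(v for k, v in m.items() if k <= h)
    plausibility = sum(v for k, v in m.items() if k & h)
    return belief, plausibility

if __name__ == "__main__":
    # Mass from a classifier score (mostly supports cancer, some uncertainty).
    m_classifier = {frozenset({CANCER}): 0.6, THETA: 0.4}
    # Mass from risk factors such as smoking history (weaker evidence).
    m_risk = {frozenset({CANCER}): 0.3, frozenset({NO_CANCER}): 0.1, THETA: 0.6}
    m = combine(m_classifier, m_risk)
    print(interval(m, CANCER))  # (belief, plausibility) for 'cancer'
```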

    Gene Set Enrichment and Projection: A Computational Tool for Knowledge Discovery in Transcriptomes

    Explaining the mechanism behind a genetic disease involves two phases: collecting and analyzing data associated with the disease, then interpreting those data in the context of biological systems. The objective of this dissertation was to develop a method of integrating complementary datasets surrounding any single biological process, with the goal of presenting the response to a signal in terms of a set of downstream biological effects. This dissertation specifically tests the hypothesis that computational projection methods overlaid with domain expertise can direct research towards relevant systems-level signals underlying complex genetic disease. To this end, I developed a software tool named Gene Set Enrichment and Projection Displays (GSEPD) that visualizes multidimensional gene expression to identify the biologically relevant gene sets that are altered in response to a biological process. This dissertation highlights a problem of data interpretation facing the medical research community and shows how the computational sciences can help. By bringing annotation and expression datasets together, a new analytical and software method was produced that helps unravel complicated experimental and biological data. The dissertation presents four coauthored studies in which domain experts sought to assign functional significance to gene-centric experiments. GSEPD displays inherently high-dimensional data as a simple colored graph, in which a subspace vector projection directly quantifies how closely each sample resembles the test conditions. The end-user medical researcher understands their data as a series of somewhat-independent subsystems, and GSEPD provides a dimensionality reduction for high-throughput experiments of limited sample size. Gene Ontology analyses are accessible on a sample-to-sample level, and this work highlights not just the expected biological systems but also the many annotated results available in vast online databases
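    As a hedged sketch of what a subspace vector projection onto test conditions might look like (an assumption about the mechanics, not GSEPD's implementation), the Python below scores each sample along the axis between two condition centroids within a chosen gene set, so that scores near 0 resemble one condition and scores near 1 the other; the gene names, sample names, and expression values are invented.

```python
import numpy as np

def projection_scores(expr, group_a, group_b, gene_set):
    """Project each sample onto the axis between the two condition
    centroids, restricted to the genes in gene_set.
    Returns a score per sample: ~0 means 'like group A', ~1 'like group B'.

    expr: dict sample -> dict gene -> expression value
    """
    genes = sorted(gene_set)
    vec = lambda s: np.array([expr[s][g] for g in genes], dtype=float)
    centroid_a = np.mean([vec(s) for s in group_a], axis=0)
    centroid_b = np.mean([vec(s) for s in group_b], axis=0)
    axis = centroid_b - centroid_a
    denom = float(axis @ axis)
    return {s: float((vec(s) - centroid_a) @ axis) / denom for s in expr}

if __name__ == "__main__":
    expr = {
        "ctrl1": {"geneX": 1.0, "geneY": 2.0},
        "ctrl2": {"geneX": 1.2, "geneY": 1.8},
        "case1": {"geneX": 5.0, "geneY": 6.0},
        "case2": {"geneX": 4.8, "geneY": 6.2},
        "new":   {"geneX": 3.0, "geneY": 4.0},   # intermediate sample
    }
    scores = projection_scores(expr, ["ctrl1", "ctrl2"], ["case1", "case2"],
                               {"geneX", "geneY"})
    print(scores)  # controls near 0, cases near 1, 'new' in between
```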

    Design and characterization of novel immunogens for AIDS vaccine development and evaluation of a sample inference method for NGS Illumina amplicon data

    Since the beginning of the AIDS pandemic, an estimated 78 million people have become infected and 35 million people have died from AIDS-related illnesses. Despite the existence of effective antiretroviral therapy, 1.1 million people died of AIDS-related causes in 2015. A vaccine that could induce broadly neutralizing antibodies (bnAbs) is hypothesized to be the most efficient way to halt the AIDS pandemic. However, the majority of attempts to elicit bnAbs with HIV-1 vaccine candidates have failed due to the extensive variability and complex immune-evasion strategies of HIV-1. Recent advances in the isolation of bnAbs from HIV-1 infected individuals have revived interest in vaccine development. The membrane proximal external region (MPER) of gp41 and the CD4 binding site (CD4bs) of gp120 have become attractive targets for vaccine development because they contain highly conserved epitopes recognized by some of the broadest neutralizing antibodies. Here, we have designed and characterized multiple immunogens and vaccine strategies to induce bnAbs targeted to the MPER or CD4bs. Our findings indicate that 1) neighboring domains influence the immunogenicity of the gp41 MPER, and 2) priming with a small gp41 or gp120 immunogen and subsequently boosting with larger and more native immunogens may have the potential to elicit antibodies towards the appropriate neutralizing epitopes. Illumina amplicon sequencing is an important tool for the identification and quantification of species or variants in metagenomics studies, but sequencing errors make it challenging to correctly identify the authentic differences. Many denoising algorithms have been developed, but most ignore the quality scores or compress those data. We developed ampliclust, an error modeling approach that uses uncompressed sequences and quality scores to infer samples in Illumina amplicon data. Our approach showed better accuracy than the popular denoising tool DADA2 when data are not well separated
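    As a minimal sketch of the kind of quality-aware likelihood calculation that error-model-based denoisers build on (not ampliclust's actual model), the code below converts Phred scores to error probabilities, computes the log-likelihood of a read given a candidate true sequence, and assigns the read to its most likely candidate; the candidate sequences and quality values are invented.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)

def read_log_likelihood(read, quals, candidate):
    """log P(read | candidate): each base matches with prob 1-e and is
    miscalled to a specific other base with prob e/3 (uniform errors)."""
    ll = 0.0
    for base, q, true_base in zip(read, quals, candidate):
        e = phred_to_error(q)
        ll += math.log(1.0 - e) if base == true_base else math.log(e / 3.0)
    return ll

def assign_read(read, quals, candidates):
    """Assign the read to the candidate sequence with the highest likelihood."""
    return max(candidates, key=lambda c: read_log_likelihood(read, quals, c))

if __name__ == "__main__":
    candidates = ["ACGTACGTAC", "ACGTACGAAC"]   # two candidate true variants
    read = "ACGTACGAAC"
    high_q = [35] * 10                          # confident bases
    low_q = [35] * 7 + [5, 35, 35]              # the distinguishing base is dubious
    print(assign_read(read, high_q, candidates))  # -> ACGTACGAAC
    print(read_log_likelihood(read, low_q, candidates[0]),
          read_log_likelihood(read, low_q, candidates[1]))
```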