
    Manycore Algorithms for Genetic Linkage Analysis

    Exact algorithms for linkage analysis scale exponentially with the size of the input. Beyond a critical point, the amount of work to be done exceeds both the available time and memory, and we are forced either to abbreviate the input in some manner or to use an approximation. Approximate methods such as Markov chain Monte Carlo (MCMC) make the problem tractable, but can take an immense amount of time to converge. The problem of long convergence times is compounded by software that is single-threaded and, as computer processors are manufactured with increasing numbers of physical cores, is not designed to take advantage of the available processing power. In this thesis we describe our program SwiftLink, which embodies our work adapting existing Gibbs samplers to modern processor architectures. The architectures we target are multicore processors, which currently feature between 4 and 8 cores, and computer graphics cards (GPUs), which already feature hundreds of cores. We implemented parallel versions of the meiosis sampler, which mixes well with tightly linked markers but suffers from irreducibility issues, and the locus sampler, which is guaranteed to be irreducible but mixes slowly with tightly linked markers. We evaluate SwiftLink's performance on real-world datasets of large consanguineous families. Using four processor cores for a single analysis is 3–3.2x faster than the single-threaded implementation of SwiftLink. With respect to existing MCMC-based programs, it achieves a 6.6–8.7x speedup over Morgan and a 66.4–72.3x speedup over Simwalk. Utilising both a multicore processor and a GPU is 7–7.9x faster than the single-threaded implementation, a 17.6–19x speedup over Morgan and a 145.5–192.3x speedup over Simwalk.
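    The alternation between two single-site samplers described above can be pictured with a toy blockwise Gibbs sampler. This is an illustration of the general scheme only, not SwiftLink's pedigree samplers; the bivariate normal target and all names below are assumptions for the sketch.

```python
import random

def gibbs_bivariate_normal(rho, n_iter=10000, seed=1):
    """Toy Gibbs sampler for a standard bivariate normal with correlation rho.
    Each coordinate is resampled from its conditional given the other, in the
    same spirit as alternating samplers that each update part of the state."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    sd = (1 - rho ** 2) ** 0.5  # conditional standard deviation
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)  # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)  # y | x ~ N(rho*x, 1 - rho^2)
        samples.append((x, y))
    return samples
```

    Just as with the meiosis/locus samplers, each conditional update is cheap but successive samples are correlated, which is why convergence can be slow and parallel hardware pays off.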

    Phylogeny-Aware Placement and Alignment Methods for Short Reads

    In recent years, bioinformatics has entered a new phase: new sequencing methods, generally referred to as Next Generation Sequencing (NGS), have become widely available. This thesis introduces algorithms for phylogeny-aware analysis of short sequence reads, as generated by NGS methods in the context of metagenomic studies. A considerable part of this work focuses on the technical (i.e. performance) challenges of these new algorithms, which have been developed specifically to exploit parallelism.

    High-Order Epistasis Detection in High Performance Computing Systems

    In recent years, Genome-Wide Association Studies (GWAS) have become more and more popular with the intent of finding a genetic explanation for the presence or absence of particular diseases in human studies. There is consensus about the presence of genetic interactions during the expression of complex diseases, a phenomenon called epistasis. This thesis focuses on the study of this phenomenon, employing High-Performance Computing (HPC) and working from a statistical definition of the problem: the deviation of the expression of a phenotype from the sum of the individual contributions of genetic variants. For this purpose, we first developed MPI3SNP, a program that identifies interactions of three variants from an input dataset. MPI3SNP implements an exhaustive search for epistasis using an association test based on Mutual Information, and exploits the resources of clusters of CPUs or GPUs to speed up the search. Then, we evaluated the state-of-the-art methods with the help of MPI3SNP in a study that compares the performance of twenty-seven tools. The most important conclusion of this study is the inability of non-exhaustive approaches to locate epistasis in the absence of marginal effects (small association effects of the individual variants that partake in an epistatic interaction). For this reason, this thesis then focused on optimizing the exhaustive search. First, we improved the efficiency of the association test through a vector implementation of the procedure. Then, we developed a distributed algorithm implementing an exhaustive search capable of locating epistatic interactions of any order. These two milestones were achieved in Fiuncho, a program that incorporates all the research carried out and obtains the best performance in CPU clusters out of all the state-of-the-art alternatives. In addition, we developed a library, called Toxo, to simulate biological scenarios with epistasis. This library allows for the simulation of epistasis following existing genetic interaction models extended to high order.
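    The statistic at the heart of the exhaustive search can be sketched serially as follows. MPI3SNP's actual vectorised, MPI/GPU implementation and its genotype encoding are not reproduced here; the function names and toy data layout are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(genotypes, phenotype):
    """I(G; P) between a tuple of SNP columns and a binary phenotype,
    estimated from empirical joint frequencies."""
    n = len(phenotype)
    joint = Counter(zip(*genotypes, phenotype))
    g_marg = Counter(zip(*genotypes))
    p_marg = Counter(phenotype)
    mi = 0.0
    for key, count in joint.items():
        g, p = key[:-1], key[-1]
        pxy = count / n
        mi += pxy * log2(pxy / ((g_marg[g] / n) * (p_marg[p] / n)))
    return mi

def exhaustive_scan(snps, phenotype, order=3):
    """Score every `order`-way SNP combination exhaustively, in the spirit of
    an MPI3SNP-style search (without the parallelism); returns the best combo."""
    return max(
        ((combo, mutual_information([snps[i] for i in combo], phenotype))
         for combo in combinations(range(len(snps)), order)),
        key=lambda t: t[1],
    )
```

    Because every combination is tested, the search finds purely epistatic signals with no marginal effects, at the cost of a combinatorial number of tests, which is what motivates the vectorisation and cluster distribution described above.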

    Exploring Genetic Interactions: from Tools Development with Massive Parallelization on GPGPU to Multi-Phenotype Studies on Dyslexia

    Over the past decade, genome-wide association studies (GWASs) have provided insightful information about the genetic architecture of complex traits. However, the variants found by GWASs explain just a small portion of heritability. Meanwhile, as large-scale GWASs and meta-analyses of multiple phenotypes become increasingly common, there is a need to develop computationally efficient models and tools for multi-locus and multi-phenotype studies. We were therefore motivated to focus on developing tools for epistatic studies and to seek a strategy for jointly analyzing multiple phenotypes. Exploiting recent technical and methodological progress, we developed three R packages. SimPhe, built on the Cockerham epistasis model, simulates (multiple correlated) phenotypes with epistatic effects. Two further packages, episcan and gpuEpiScan, simplify the calculation of EPIBLASTER and epiHSIC and are implemented for high performance, especially the package based on the Graphics Processing Unit (GPU). Both packages can be employed for epistasis detection in case-control as well as quantitative-trait studies, and may help drive down the cost of computation and increase innovation in epistatic studies. Moreover, we explored gene-gene interactions in developmental dyslexia, which is mainly characterized by reading problems in children. A multivariate meta-analysis was performed on genome-wide interaction studies (GWIS) of reading-related phenotypes in the dyslexia dataset, which contains nine cohorts from different locations. We identified one genome-wide significant epistatic interaction, between rs1442415 and rs8013684, associated with word reading, as well as suggestive genetic interactions that might affect reading abilities. Except for rs1442415, which has been reported to influence educational attainment, the genetic variants involved in the suggestive interactions have shown associations with psychiatric disorders in previous GWASs, particularly bipolar disorder. Our findings suggest investigating not just the genetic interactions but also multiple correlated psychiatric disorders.
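    The screening statistic behind EPIBLASTER-style methods scores a SNP pair by the difference of its genotype correlation between cases and controls. The following is a minimal serial sketch of that idea, not episcan's or gpuEpiScan's implementation; the function names and toy scorer are assumptions.

```python
def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def correlation_difference(snp_a, snp_b, status):
    """SNP-SNP genotype correlation in cases minus that in controls.
    A large absolute value flags the pair as a candidate epistatic
    interaction, even when each SNP alone shows no marginal effect."""
    cases = [i for i, s in enumerate(status) if s == 1]
    ctrls = [i for i, s in enumerate(status) if s == 0]
    take = lambda xs, idx: [xs[i] for i in idx]
    return (pearson(take(snp_a, cases), take(snp_b, cases))
            - pearson(take(snp_a, ctrls), take(snp_b, ctrls)))
```

    Because the statistic reduces to correlations, it vectorises into matrix products over many SNP pairs at once, which is what makes the GPU implementation in gpuEpiScan effective.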

    Dissecting genetic interactions in complex traits

    Of central importance in dissecting the components that govern complex traits is understanding the architecture of natural genetic variation. Genetic interaction, or epistasis, constitutes one aspect of this, but epistatic analysis has been largely avoided in genome-wide association studies because of statistical and computational difficulties. This thesis explores both issues in the context of two-locus interactions. Initially, through simulation and deterministic calculations, it was demonstrated that not only can epistasis maintain deleterious mutations at intermediate frequencies when under selection, but that it may also have a role in the maintenance of additive variance. Based on the epistatic patterns that are evolutionarily persistent, and the frequencies at which they are maintained, it was shown that exhaustive two-dimensional search strategies are the most powerful approaches for uncovering both additive variance and the other genetic variance components that are co-precipitated. However, while these simulations demonstrate encouraging statistical benefits, two-dimensional searches are often computationally prohibitive, particularly with the marker densities and sample sizes typical of genome-wide association studies. To address this issue, different software implementations were developed to parallelise the two-dimensional triangular search grid across various types of high-performance computing hardware. Particularly effective was the massively multi-core architecture of consumer-level graphics cards: while performance will continue to improve as hardware improves, at the time of testing the speed was 2–3 orders of magnitude faster than CPU-based software solutions in current use. Not only does this software enable epistatic scans to be performed routinely at minimal cost, it is now also feasible to empirically explore the false discovery rates introduced by the high dimensionality of multiple testing. Through permutation analysis it was shown that the significance threshold for epistatic searches is a function of both marker density and population sample size, and that, because of the correlation structure that exists between tests, the threshold estimates currently used are overly stringent. Although the relaxed threshold estimates improve the power of two-dimensional searches, detection is still most likely limited to relatively large genetic effects. Through direct calculation it was shown that, in contrast to the additive case, where the decay of estimated genetic variance is proportional to falling linkage disequilibrium between causal variants and observed markers, for epistasis this decay is exponential. One way to rescue poorly captured causal variants is to parameterise association tests using haplotypes rather than single markers. A novel statistical method that uses a regularised parameter selection procedure on two-locus haplotypes was developed, and extensive simulations show that it delivers a substantial gain in power over single-marker tests. Ultimately, this thesis seeks to demonstrate that many of the obstacles to epistatic analysis can be ameliorated, and that with the current abundance of genomic data gathered by the scientific community, direct search may be a viable method to qualify the importance of epistasis.
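    The two-dimensional triangular search grid mentioned above can be partitioned across workers like this. This is a minimal sketch of the load-balancing concern only, not the thesis's GPU implementation; the function names are assumptions.

```python
def triangle_pairs(n):
    """All i < j locus pairs of the two-locus upper-triangular search grid."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def partition(pairs, n_workers):
    """Round-robin split of the grid across workers. Row i contributes
    n - i - 1 tests, so contiguous row blocks would be badly unbalanced;
    interleaving the flattened list keeps per-worker loads within one pair."""
    return [pairs[w::n_workers] for w in range(n_workers)]
```

    With n markers there are n(n-1)/2 tests, so for GWAS-scale marker densities even this simple decomposition yields billions of independent tests per worker, which is why the massively multi-core GPU mapping pays off.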

    A Field Guide to Genetic Programming

    A Field Guide to Genetic Programming (ISBN 978-1-4092-0073-4) is an introduction to genetic programming (GP). GP is a systematic, domain-independent method for getting computers to solve problems automatically, starting from a high-level statement of what needs to be done. Using ideas from natural evolution, GP starts from an ooze of random computer programs and progressively refines them through processes of mutation and sexual recombination until solutions emerge, all without the user having to know or specify the form or structure of solutions in advance. GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions. Contents: Introduction -- Representation, initialisation and operators in tree-based GP -- Getting ready to run genetic programming -- Example genetic programming run -- Alternative initialisations and operators in tree-based GP -- Modular, grammatical and developmental tree-based GP -- Linear and graph genetic programming -- Probabilistic genetic programming -- Multi-objective genetic programming -- Fast and distributed genetic programming -- GP theory and its applications -- Applications -- Troubleshooting GP -- Conclusions -- Appendices: Resources; TinyGP.
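    To give a flavour of the evolutionary loop the book describes (and far smaller than its TinyGP appendix), here is a minimal mutation-only symbolic-regression sketch. The function set, selection scheme, parameters and names are assumptions for illustration, not the book's code.

```python
import random

OPS = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}

def random_tree(rng, depth=3):
    # Grow a random expression tree over {x, constants, +, *}.
    if depth == 0 or rng.random() < 0.3:
        return 'x' if rng.random() < 0.5 else rng.uniform(-2, 2)
    return (rng.choice(list(OPS)), random_tree(rng, depth - 1),
            random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, cases):
    # Sum of absolute errors over the fitness cases (lower is better).
    return sum(abs(evaluate(tree, x) - y) for x, y in cases)

def mutate(rng, tree):
    # Subtree mutation: replace a random subtree with a fresh random one.
    if rng.random() < 0.3 or not isinstance(tree, tuple):
        return random_tree(rng, depth=2)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(rng, left), right)
    return (op, left, mutate(rng, right))

def evolve(cases, pop_size=60, gens=30, seed=2):
    rng = random.Random(seed)
    pop = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda t: fitness(t, cases))
        # Truncation selection: keep the best half, refill with mutants.
        pop = pop[:pop_size // 2]
        pop += [mutate(rng, rng.choice(pop)) for _ in range(pop_size - len(pop))]
    return min(pop, key=lambda t: fitness(t, cases))
```

    A full GP system would add crossover, tournament selection and bloat control, all of which the book treats in detail.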

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences, genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Energy-Performance Optimization for the Cloud


    Benchtop sequencing on benchtop computers

    Next Generation Sequencing (NGS) is a powerful tool for gaining new insights in molecular biology. With the introduction of the first benchtop NGS sequencing machines (e.g. Ion Torrent, MiSeq), the technology became even more versatile in its applications, and the amount of data produced in a short time is ever increasing. The demand for new and more efficient sequence analysis tools grows at the same rate as the throughput of sequencing technologies. New methods and algorithms not only need to be more efficient, but also need to account for higher genetic variability between the sequenced and annotated data. To obtain reliable results, the errors and limitations of NGS technologies must also be investigated, and methods need to cope with contamination in the data. In this thesis we present methods and algorithms for NGS analysis. First, we present a fast and precise method, called NextGenMap, to align NGS reads to a reference genome. NextGenMap was designed to work with data from Illumina, 454 and Ion Torrent technologies, and is easily extendable to upcoming technologies. We use pairwise sequence alignment in combination with an exact-match filter approach to maximize the number of correctly mapped reads. To reduce runtime (mapping a 16x coverage human genome dataset within hours) we developed an optimized banded pairwise alignment algorithm for NGS data. We implemented this algorithm using high-performance programming interfaces for central processing units, using SSE (Streaming SIMD Extensions) and OpenCL, as well as for graphics processing units, using OpenCL and CUDA. Thus, NextGenMap can make maximal use of all existing hardware, whether it is a high-end compute cluster, a standard desktop computer or even a laptop. We demonstrated the advantages of NextGenMap over other mapping methods on real and simulated data, and showed that NextGenMap outperforms current methods with respect to the number of correctly mapped reads. The second part of the thesis is an analysis of the limitations and errors of Ion Torrent and MiSeq. Sequencing errors were defined as the percentage of mismatches, insertions and deletions per position, given a semi-global alignment between read and reference sequence. We measured a mean error rate of 0.8% for MiSeq and 1.5% for Ion Torrent. Moreover, for both technologies we identified a non-uniform distribution of the errors and, even more severely, of the corresponding nucleotide frequencies given a difference in the alignment. This is an important result, since it reveals that some differences (e.g. mismatches) are more likely to occur than others and thus lead to a biased analysis. When looking at the distribution of the reads across the sample carrier of the sequencing machine, we discovered a clustering of reads that have a high difference (>30%) to the reference sequence. This is unexpected, since reads with a high difference are believed to originate either from contamination or from errors in library preparation, and should therefore be uniformly distributed across the sample carrier. Finally, we present a method called DeFenSe (Detection of Falsely Aligned Sequences) to detect and reduce contamination in NGS data. DeFenSe computes a pairwise alignment score threshold based on the alignment of randomly sampled reads to the reference genome; this threshold is then used to filter the mapped reads. Applied in combination with two widely used mapping programs to real data, it reduced contamination by up to 99.8%. In contrast to previous methods, DeFenSe works independently of the number of differences between the reference and the targeted genome; moreover, it neither relies on ad hoc decisions such as identity or mapping-quality thresholds, nor requires prior knowledge of the sequenced organism. Together, these methods may make it possible to transfer knowledge from model organisms to non-model organisms using NGS, and enable the study of biological mechanisms even in highly polymorphic regions.
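    The DeFenSe idea of calibrating a score threshold from randomly generated reads can be sketched as follows. The ungapped toy scorer, parameter values and names are assumptions; the thesis's method works with the mapper's own alignment scores.

```python
import random

def alignment_score(read, ref):
    """Best ungapped local score of `read` against `ref` (+1 match,
    -1 mismatch, score clipped at zero) - a stand-in for a real aligner."""
    best = 0
    for offset in range(len(ref) - len(read) + 1):
        score = run = 0
        for a, b in zip(read, ref[offset:]):
            run = max(run + (1 if a == b else -1), 0)
            score = max(score, run)
        best = max(best, score)
    return best

def random_read_threshold(ref, read_len, n_random=200, seed=3):
    """Align randomly generated reads to the reference and return the best
    score observed: scores this high arise by chance alone, so a real
    mapping must exceed the threshold to be kept."""
    rng = random.Random(seed)
    return max(
        alignment_score(''.join(rng.choice('ACGT') for _ in range(read_len)), ref)
        for _ in range(n_random)
    )
```

    Because the threshold is derived from the reference itself rather than from identity cutoffs, the filter adapts to the divergence between sequenced and reference genome, which matches the independence property claimed above.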