
78 research outputs found

Profile-driven parallelisation of sequential programs

Author: Tournavitis Georgios
Publication venue: The University of Edinburgh
Publication date: 30/06/2011

Traditional parallelism detection in compilers is performed by means of static analysis, more specifically data and control dependence analysis. The information that is available at compile time, however, is inherently limited and therefore restricts the parallelisation opportunities. Furthermore, applications written in C, which represent the majority of today's scientific, embedded and system software, utilise many low-level features and an intricate programming style that forces the compiler to make even more conservative assumptions. Despite the numerous proposals to handle this uncertainty at compile time using speculative optimisation and parallelisation, the software industry still lacks a pragmatic approach that extracts coarse-grain parallelism to exploit the multiple processing units of modern commodity hardware. This thesis introduces a novel approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C. We utilise profiling information to overcome the limitations of static data and control-flow analysis, enabling more aggressive parallelisation. Profiling is performed using an instrumentation scheme operating at the Intermediate Representation (IR) level of the compiler. In contrast to existing approaches that depend on low-level binary tools and debugging information, IR-profiling provides precise and direct correlation of profiling information back to the IR structures of the compiler. Additionally, our approach is orthogonal to existing automatic parallelisation approaches, and additional fine-grain parallelism may be exploited. We demonstrate the applicability and versatility of the proposed methodology using two studies that target different forms of parallelism. First, we focus on the exploitation of loop-level parallelism that is abundant in many scientific and embedded applications.
We evaluate our parallelisation strategy against the NAS and SPEC FP benchmarks and two different multi-core platforms (a shared-memory Intel Xeon SMP and a heterogeneous distributed-memory IBM Cell blade). Empirical evaluation shows that our approach not only yields significant improvements when compared with state-of-the-art parallelising compilers, but comes close to and sometimes exceeds the performance of manually parallelised codes. On average, our methodology achieves 96% of the performance of the hand-tuned parallel benchmarks on the Intel Xeon platform, and a significant speedup for the Cell platform. The second study addresses the problem of partially sequential loops, typically found in implementations of multimedia codecs. We develop a more powerful whole-program representation based on the Program Dependence Graph (PDG) that supports profiling, partitioning and code generation for pipeline parallelism. In addition, we demonstrate how this enhances conventional pipeline parallelisation by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. Experimental results using a set of complex multimedia and stream processing benchmarks confirm the effectiveness of the proposed methodology, which yields speedups of up to 4.7 on an eight-core Intel Xeon machine.
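
The idea of using profiling to find parallelisable loops can be illustrated with a small sketch. It is not the thesis's IR-level instrumentation scheme, which the abstract does not describe in detail. It assumes a hypothetical trace format in which each loop iteration is recorded as a list of `("r" | "w", address)` pairs, and the function name `has_cross_iteration_dependence` is invented for this illustration.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical trace record: ("r" or "w", memory address).
Access = Tuple[str, int]


def has_cross_iteration_dependence(trace: Iterable[List[Access]]) -> bool:
    """Return True if the profiled run shows a dependence between iterations.

    `trace` yields one list of accesses per loop iteration, in execution order.
    """
    written: Set[int] = set()  # addresses written by earlier iterations
    read: Set[int] = set()     # addresses read by earlier iterations
    for accesses in trace:
        iter_reads = {addr for kind, addr in accesses if kind == "r"}
        iter_writes = {addr for kind, addr in accesses if kind == "w"}
        # Flow (read-after-write) or output (write-after-write) dependence
        # on an earlier iteration.
        if (iter_reads | iter_writes) & written:
            return True
        # Anti (write-after-read) dependence on an earlier iteration.
        if iter_writes & read:
            return True
        written |= iter_writes
        read |= iter_reads
    return False


# a[i] = b[i] + 1: every iteration touches its own addresses.
independent = [[("r", 100 + i), ("w", i)] for i in range(4)]
# a[i] = a[i - 1] + 1: each iteration reads what the previous one wrote.
carried = [[("r", i - 1), ("w", i)] for i in range(1, 5)]

print(has_cross_iteration_dependence(independent))
print(has_cross_iteration_dependence(carried))
```

A profile only covers the inputs that were actually run, so the absence of an observed dependence is evidence rather than proof that a loop is safe to parallelise.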

Edinburgh Research Archive

Computational statistics using the Bayesian Inference Engine

Author: Martin D. Weinberg
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2012

This paper introduces the Bayesian Inference Engine (BIE), a general, parallel, optimised software package for parameter inference and model selection. This package is motivated by the analysis needs of modern astronomical surveys and the need to organise and reuse expensive derived data. The BIE is the first platform for computational statistics designed explicitly to enable Bayesian update and model comparison for astronomical problems. Bayesian update is based on the representation of high-dimensional posterior distributions using metric-ball-tree based kernel density estimation. Among its algorithmic offerings, the BIE emphasises hybrid tempered MCMC schemes that robustly sample multimodal posterior distributions in high-dimensional parameter spaces. Moreover, the BIE implements a full persistence or serialisation system that stores the full byte-level image of the running inference and previously characterised posterior distributions for later use. Two new algorithms to compute the marginal likelihood from the posterior distribution, developed for and implemented in the BIE, enable model comparison for complex models and data sets. Finally, the BIE was designed to be a collaborative platform for applying Bayesian methodology to astronomy. It includes an extensible object-oriented framework that implements every aspect of Bayesian inference. By providing a variety of statistical algorithms for all phases of the inference problem, it lets a scientist explore a variety of approaches with a single model and data implementation. Additional technical details and download details are available from http://www.astro.umass.edu/bie. The BIE is distributed under the GNU GPL.

arXiv.org e-Print Archive

CiteSeerX

Crossref

New Approaches to Long-Read Assembly under High Error Rates

Author: Bongartz Philipp
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2020

The field of genome assembly is concerned with developing algorithms that reconstruct genomes computationally from sequencing data. It first came to public attention in the 1990s with the Human Genome Project. Because only short segments of the human genome could be read, longer genome sequences had to be reconstructed afterwards, on the computer, from the segments that had been read. Almost 20 years after the publication of the human genome sequences, genome assembly remains an essential processing step for sequencing data. Only the data throughput, length and error profile of the sequenced segments have changed, and with them the algorithmic requirements. The research field of genome assembly thus complements the sequencing technologies, which have advanced at enormous speed. Together they allow the genomes of a rapidly growing number of organisms to be deciphered and thereby form the basis for much of the research across many areas of biology and medicine. Despite the impressive technological and algorithmic developments of the past decades, the complete genome sequence has so far been reconstructed only for bacterial genomes. The assembly of the much larger eukaryotic genomes still faces several unsolved algorithmic problems. These problems are connected to various repetitive structures that occur in almost all genomes of higher organisms. As a result, eukaryotic genomes are always published as considerably more disconnected sequences than the respective organisms have chromosomes. The repetitive structures responsible for the gaps in the genome sequences can be roughly divided into three classes.
Microsatellites and minisatellites are very short sequences that can repeat thousands or tens of thousands of times in direct succession. This pattern is typical of the centromeres and telomeres located in the middle and at the ends of many chromosomes. Interspersed repeats, often also called transposons, are longer sequences that frequently occur in almost identical form at different locations in the genome. Tandem repeats, by contrast, are longer sequences that can occur several times in direct succession in a genome. Tandem repeats are often gene complexes, that is, clusters of almost identical protein-coding segments that allow the cell to produce the encoded proteins particularly quickly. Each of these repetitive structures places specific demands on assembly algorithms. In this doctoral thesis we make several contributions towards solving the latter two problems, the assembly of interspersed repeats and of tandem repeats. In Part 1 of the thesis we present several data processing procedures that prepare sequencing data in order to identify the rare differences between genome sequences that occur multiple times. These include software programs for computing and optimising multiple sequence alignments (MSAs) by dynamic programming and for statistically modelling and analysing the differences as presented by the MSA. In Part 2 we build on this analysis and present a software program for assembling interspersed repeats. This program builds on several algorithmic innovations and is able to effectively assemble transposon families with very long sequences and very many different copies. It is the first program of its kind capable of assembling transposon families with dozens of copies.
We are able to show that, even for smaller transposon families, it is more accurate and faster than the only existing competing program specialised for this assembly problem. In Part 3 we describe an analysis pipeline that enables us to assemble gene complexes consisting of dozens of tandem repeats. This pipeline includes clustering and graph drawing algorithms. Its centrepiece is an error correction algorithm based on neural networks. We demonstrate the practical value of this pipeline by assembling the Drosophila histone complex. In closing, we discuss the possibility of assembling micro- and minisatellites and propose research directions for further improvements in interspersed repeat and gene complex assembly.

KITopen

Fractal Dimensions in Classical and Quantum Mechanical Open Chaotic Systems

Author: Schönwetter Moritz
Publication date: 17/01/2017

Fractals have long been recognized to be a characteristic feature arising from chaotic dynamics, be it in the form of strange attractors, of fractal boundaries around basins of attraction, or of fractal and multifractal distributions of asymptotic measures in open systems. In this thesis we study fractal and multifractal measure distributions in leaky Hamiltonian systems. Leaky systems are created by introducing a fully or partially transparent hole in an otherwise closed system, allowing trajectories to escape or lose some of their intensity. These dynamics result in intricate (multi)fractal distributions of the surviving trajectories. These systems are suitable models for experimental setups such as optical microcavities or microwave resonators. In this thesis we perform an improved investigation of the fractality in these systems using the concept of effective dimensions. They are defined as the dimensions far from the usually considered asymptotics of infinite evolution time $t$, infinite sample size $S$, and infinite resolution (infinitesimal box size $\varepsilon$). Yet, as we show, effective dimensions can be considered as intrinsic to the dynamics of the system. We present a detailed discussion of the behaviour of the numerically observed dimension $D_\mathrm{obs}(S,t,\varepsilon)$. We show that the three parameters can be expressed in terms of limiting length scales that define the parameter ranges in which $D_\mathrm{obs}(S,t,\varepsilon)$ is an effective dimension of the system. We provide dynamical and statistical arguments for the dependence of these scales on $S$, $t$, and $\varepsilon$ in strongly chaotic systems and show that the knowledge of the scales allows us to define meaningful effective dimensions. We apply our results to three main fields. In the context of numerical algorithms to calculate dimensions, we show that our findings help to numerically find the range of box sizes leading to accurate results. We further show that they allow us to minimize the computational cost by providing estimates of the required sample size and iteration time. A second application field of our results is systems exhibiting non-trivial dependencies of the effective dimension $D_\mathrm{eff}$ on $t$ and $\varepsilon$. We numerically explore this in weakly chaotic leaky systems. There, our findings provide insight into the dynamics of the systems, since deviations from our predictions based on strongly chaotic systems at a given parameter range are a sign that the stickiness inherent to such systems needs to be taken into account in that range. Lastly, we show that in quantum analogues of chaotic maps with a partial leak, a related effective dimension can be used to explain the numerically observed deviation from the predictions provided by the fractal Weyl law for systems with fully absorbing leaks. Here, we provide an analytical description of the expected scaling based on the classical dynamics of the system and compare it with numerical results obtained in the studied quantum maps.
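
To make the notion of a numerically observed dimension at finite box size concrete, here is a minimal box-counting sketch. It is an illustration under assumptions: the abstract does not state which estimator the thesis uses, so plain box counting between two box sizes is assumed here, and the function names `box_count` and `observed_dimension` are invented for this example.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]


def box_count(points: Iterable[Point], eps: float) -> int:
    """Number of eps-sized boxes of a regular grid that contain a point."""
    return len({(math.floor(x / eps), math.floor(y / eps)) for x, y in points})


def observed_dimension(points, eps_small: float, eps_large: float) -> float:
    """Slope of log N(eps) against log(1/eps) between two box sizes."""
    pts = list(points)
    n_small = box_count(pts, eps_small)
    n_large = box_count(pts, eps_large)
    return math.log(n_small / n_large) / math.log(eps_large / eps_small)


# A densely sampled diagonal line segment in the unit square.
segment = [(i / 10000, i / 10000) for i in range(10000)]
print(observed_dimension(segment, 1 / 64, 1 / 8))
```

With a finite sample, boxes much smaller than the typical point spacing each hold at most one point, so the count saturates at the sample size and the estimate drifts towards zero. This is one simple way in which the observed value depends on the sample size and the box size together.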

Technische Universität Dresden: Qucosa

Hypersweeps, Convective Clouds and Reeb Spaces

Author: Hristov Petar Georgiev
Publication date: 01/06/2022

Isosurfaces are one of the most prominent tools in scientific data visualisation. An isosurface is a surface that defines the boundary of a feature of interest in space for a given threshold. This is integral to analysing data from the physical sciences, which observe and simulate three- or four-dimensional phenomena. However, it is time-consuming and impractical to discover surfaces of interest by manually selecting different thresholds. The systematic way to discover significant isosurfaces in data is with a topological data structure called the contour tree. The contour tree encodes the connectivity and shape of each isosurface at all possible thresholds. The first part of this work has been devoted to developing algorithms that use the contour tree to discover significant features in data using high performance computing systems. Those algorithms provided a clear speedup over previous methods and were used to visualise physical plasma simulations. A major limitation of isosurfaces and contour trees is that they are only applicable when a single property is associated with data points. However, scientific data sets often take multiple properties into account. A recent breakthrough generalised isosurfaces to fiber surfaces. Fiber surfaces define the boundary of a feature where the threshold is defined in terms of multiple parameters, instead of just one. In this work we used fiber surfaces together with isosurfaces and the contour tree to create a novel application that helps atmosphere scientists visualise convective cloud formation. Using this application, they were able, for the first time, to visualise the physical properties of certain structures that trigger cloud formation. Contour trees can also be generalised to handle multiple parameters. The natural extension of the contour tree is called the Reeb space, and it comes from the pure mathematical field of fiber topology.
The Reeb space is not yet fully understood mathematically, and algorithms for computing it have significant practical limitations. A key difficulty is that while the contour tree is a traditional one-dimensional data structure made up of points and lines between them, the Reeb space is far more complex. The Reeb space is made up of two-dimensional sheets, attached to each other in intricate ways. The last part of this work focuses on understanding the structure of Reeb spaces and the rules that are followed when sheets are combined. This theory builds towards developing robust combinatorial algorithms to compute and use Reeb spaces for practical data analysis.

White Rose E-theses Online

Automated Analysis of Abdominal Aortic Calcification in Vertebral Fracture Assessment Images

Author: Chaplin Luke
Publication date: 31/12/2020

The University of Manchester - Institutional Repository

Discrete Wavelet Transforms

Publication venue: IntechOpen
Publication date: 20/04/2021

Discrete wavelet transform (DWT) algorithms have a firm position in signal processing in several areas of research and industry. As the DWT provides both octave-scale frequency and spatial timing of the analyzed signal, it is constantly used to solve and treat more and more advanced problems. The present book, Discrete Wavelet Transforms: Algorithms and Applications, reviews the recent progress in discrete wavelet transform algorithms and applications. The book covers a wide range of methods (e.g. lifting, shift invariance, multi-scale analysis) for constructing DWTs. The book chapters are organized into four major parts. Part I describes the progress in hardware implementations of the DWT algorithms. Applications include multitone modulation for ADSL and equalization techniques, a scalable architecture for FPGA implementation, a lifting-based algorithm for VLSI implementation, a comparison between DWT- and FFT-based OFDM, and a modified SPIHT codec. Part II addresses image processing algorithms such as a multiresolution approach for edge detection, low bit rate image compression, low complexity implementation of CQF wavelets and compression of multi-component images. Part III focuses on watermarking DWT algorithms. Finally, Part IV describes shift-invariant DWTs, the DC lossless property, DWT-based analysis and estimation of colored noise, and an application of the wavelet Galerkin method. The chapters of the present book consist of both tutorial and highly advanced material. Therefore, the book is intended to be a reference text for graduate students and researchers to obtain state-of-the-art knowledge on specific applications.
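
As a small illustration of what a single DWT level computes, the sketch below uses the orthonormal Haar wavelet, the simplest case. It is not taken from the book, whose chapters cover many other constructions, and the function names `haar_dwt` and `haar_idwt` are chosen for this example only.

```python
import math
from typing import List, Sequence, Tuple

_S = 1.0 / math.sqrt(2.0)  # normalisation that keeps the transform orthonormal


def haar_dwt(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the Haar DWT: pairwise scaled sums and differences."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx: Sequence[float], detail: Sequence[float]) -> List[float]:
    """Invert haar_dwt, recovering the original samples."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


approx, detail = haar_dwt([4.0, 2.0, 5.0, 5.0])
print(approx, detail)
print(haar_idwt(approx, detail))
```

The approximation coefficients carry the coarse, low-frequency content and the detail coefficients carry local differences. Applying `haar_dwt` again to `approx` gives the next, coarser scale, which is where the octave-scale structure comes from.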

Directory of Open Access Books (DOAB)

A Survey on Text Classification Algorithms: From Text to Predictions

Author: Albarelli A
Gasparetto A
Marcuzzo M
Zangari A
Publication venue: MDPI AG
Publication date: 01/01/2022

In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, of which the description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods in both their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as supplying instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models

Directory of Open Access Journals

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

3D exemplar-based image inpainting in electron microscopy

Author: Trampert Patrick
Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2019

In electron microscopy (EM) a common problem is the non-availability of data, which causes artefacts in reconstructions. In this thesis the goal is to generate artificial data where it is missing in EM by using exemplar-based inpainting (EBI). We implement an accelerated 3D version tailored to applications in EM, which reduces reconstruction times from days to minutes. We develop intelligent sampling strategies to find optimal data as input for reconstruction methods. Further, we investigate approaches to reduce electron dose and acquisition time. Sparse sampling followed by inpainting is the most promising approach. As common evaluation measures may lead to misinterpretation of results in EM and falsify a subsequent analysis, we propose to use application-driven metrics and demonstrate this in a segmentation task. A further application of our technique is the artificial generation of projections in tilt-based EM. EBI is used to generate missing projections, such that the full angular range is covered. Subsequent reconstructions are significantly enhanced in terms of resolution, which facilitates further analysis of samples. In conclusion, EBI proves promising when used as an additional data generation step to tackle the non-availability of data in EM, which is evaluated in selected applications. Enhancing adaptive sampling methods and refining EBI, especially considering their mutual influence, promotes higher throughput in EM using less electron dose without lessening quality.

Universaar


Computational approaches for metagenomic analysis of high-throughput sequencing data

Author: Ainsworth David
Publication venue: Life Sciences, Imperial College London
Publication date: 01/01/2017

High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. This ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This “data deluge” has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample, as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to the reliance on faster algorithms which only produce simple taxonomic classification or abundance estimation, losing the valuable information given by full alignments against annotated genomes. This thesis describes k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer-based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment-based approaches, making full taxonomic classification and gene identification tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications on species which have high sequence identity within their genus. This provides a significant (up to 40%) increase in accuracy on these species. Also described is a re-analysis of a Shiga-toxin producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin producing genes.
k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups.
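
The general idea behind k-mer-based matching can be shown with a toy sketch. This is not k-SLAM's algorithm, which the abstract does not spell out. It assumes a simple scheme in which a read is assigned to the reference sharing the most k-mers with it, and the names `kmers`, `build_index` and `classify` as well as the example sequences are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the names of the references that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def classify(read: str, index: Dict[str, Set[str]], k: int) -> Optional[str]:
    """Return the reference sharing the most k-mers with the read, if any."""
    votes: Dict[str, int] = defaultdict(int)
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    return max(votes, key=votes.get) if votes else None


# Made-up toy sequences, not real genomes.
references = {"A": "ACGTACGTGG", "B": "TTTTGGGGCC"}
index = build_index(references, k=4)
print(classify("GGGGC", index, k=4))
print(classify("AAAAA", index, k=4))
```

Looking up exact k-mers in a hash table avoids comparing each read against every reference position. A real tool would go on to verify and extend such hits into alignments, which this sketch does not attempt.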

Spiral - Imperial College Digital Repository