Efficient analysis and storage of large-scale genomic data
The impending advent of population-scale sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance at a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using hardware-accelerated vectorization, parallel and distributed algorithms, and heterogeneous computing.
First, I describe a novel bit-counting operation termed the positional population count, which can be combined with succinct representations and standard efficient operations to accelerate many genetic calculations. To enable the use of this new operator, and of the canonical population count, on any target machine, I developed a unified low-level library that uses CPU dispatching to select the optimal method at run time, contingent on the available instruction set architecture and the input size. As a proof-of-principle application, I apply the positional population count to computing quality-control summary statistics for terabyte-scale sequencing readsets, with >3,800-fold speed improvements. As a further application, I describe a framework for efficiently computing the cardinality of set intersections using these operators, and apply it to compute genome-wide linkage disequilibrium in datasets with up to 67 million samples, achieving >60-fold speed improvements for dense genotypic vectors, and >250,000-fold savings in memory together with >100,000-fold speed improvements for sparse genotypic vectors. I next describe a framework for handling the resulting terabytes of compressed output, along with graphical routines for visualizing the long-range linkage-disequilibrium blocks observed over many human centromeres. Finally, I describe efficient algorithms for storing and querying very large genetic datasets, including specialized algorithms for their genotype component that achieve >10,000-fold savings in memory compared to the current interchange format.
Wellcome Trust
Grid and high performance computing applied to bioinformatics
Recent advances in genome sequencing technologies and in the data-analysis methods used in bioinformatics have led to a fast and continuous increase in biological data. The difficulty of managing the huge amounts of data now available to researchers, and the need to obtain results within a reasonable time, have led to the use of distributed and parallel computing infrastructures for their analysis. In this context, Grid computing has been used successfully. Grid computing is based on a distributed system that interconnects several computers and/or clusters to access global-scale resources. This infrastructure is flexible, highly scalable, and can achieve high performance with data- and compute-intensive algorithms.
More recently, bioinformatics has been exploring new approaches based on hardware accelerators such as Graphics Processing Units (GPUs). Initially developed as graphics cards, GPUs have been adopted for scientific computing because of their performance per watt and the better cost/performance ratio they achieve, in terms of throughput and response time, compared to other high-performance computing solutions. Although developers must have in-depth knowledge of GPU programming and hardware to use them effectively, GPU accelerators have produced many impressive results.
The use of high-performance computing infrastructures raises the question of how to parallelize algorithms while limiting data-dependency issues, so that computations can be accelerated on massively parallel hardware. In this context, the research activity in this dissertation focused on assessing and testing the impact of these high-performance computing technologies on computational biology. To achieve high levels of parallelism and, ultimately, high performance, several bioinformatics algorithms applicable to genome data analysis were selected, analyzed, and implemented. These algorithms were heavily parallelized and optimized to make full use of the GPU hardware resources. The overall results show that the proposed parallel algorithms perform well, justifying the use of this technology.
In addition, a software infrastructure for workflow management has been devised to support CPU and GPU computation on a distributed GPU-based infrastructure. This software infrastructure also allows a further coarse-grained, data-parallel decomposition across multiple GPUs; a minimal sketch of this pattern is given below. Results show that the speed-up of the proposed applications increases with the number of GPUs.
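As a minimal illustration of the coarse-grained data parallelism described above, the following Python sketch splits an input dataset into chunks and processes them in parallel workers, each of which would drive one GPU in the setting the abstract describes. The chunking scheme and worker function are assumptions for illustration, not the dissertation's actual workflow manager.

```python
from multiprocessing import Pool

def process_chunk(args):
    device_id, chunk = args
    # In the GPU setting, device_id would select the GPU this worker
    # drives; here a trivial CPU computation stands in for the kernel.
    return sum(x * x for x in chunk)

def run_data_parallel(data, n_workers=4):
    # Coarse-grained decomposition: one contiguous chunk per worker.
    step = (len(data) + n_workers - 1) // n_workers
    chunks = [(i, data[i * step:(i + 1) * step]) for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(process_chunk, chunks)
    return sum(partials)  # combine per-device partial results

if __name__ == "__main__":
    print(run_data_parallel(list(range(1_000_000)), n_workers=4))
```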
Structural variation and the evolution of the mouse genome
Genetic variation in populations is governed by four basic forces: mutation, recombination, natural selection, and genetic drift. Mutation is the source of new alleles, which are assorted among chromosomes by recombination. Selection and drift dictate the magnitude and direction of changes in allele frequency over time. These forces are intimately linked to meiosis, and asymmetries in meiosis create the opportunity for intragenomic conflict: competition between selfish alleles at the same locus for transmission to progeny. Such conflicts manifest as selection at the population level but subvert the Darwinian concept of fitness.

The aim of this thesis is to characterize three of the four basic forces (recombination, mutation, and intragenomic conflict) using the house mouse as a model system. I focus on the role of large segmental duplications, long tracts of repeated sequence that make up approximately 10% of mammalian genomes and harbour most of the structural variation between individuals.

First, I use two laboratory populations, the Collaborative Cross and the Diversity Outbred, to analyze the effects of sex and genetic background on the rate of recombination. I find that (crossover) recombination is strongly suppressed in both sexes near large multiallelic copy-number variants. Second, I reconstruct in detail the evolution of one such variant, R2d2. I show that R2d2 represents an ancient duplication that has been amplified to more than 100 copies in some lineages of European mice. Alleles with high copy number (R2d2HC) are associated with suppressed recombination but have an extremely high mutation rate. They are also selfish, having risen to high frequency in wild and laboratory populations by meiotic drive, in spite of their deleterious effect on reproductive fitness. Finally, I perform the first comprehensive survey of sequence and structural variation on the mouse Y chromosome. I show that Y chromosomes have reduced sequence diversity relative to neutral expectations, owing to a strong population bottleneck in the recent past, and that they vary dramatically in copy number of testis-specific genes, possibly as a side effect of past intragenomic conflict with the X chromosome.
Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth
Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary) that has convergently evolved only a few times in Eukarya: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. I therefore investigated the correlation between genomic and organismal complexity across 461 eukaryotes under a phylogenetically controlled framework.

To that end, I introduce the first formal definitions and criteria to distinguish ‘unicellularity’, ‘simple’ (SM) and ‘complex’ multicellularity. Rather than relying on the limited available estimates of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycles and body-plan development in the literature. I then investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats, and intergenic regions) describing the coding and non-coding complexity of the 461 genomes. To support this analysis, I developed ‘GenomeContent’, a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R scripts coupled with parallel computing were created to calculate more than 260,000 phylogenetically controlled pairwise correlations (a minimal sketch of such a correlation follows the table of contents below).

As previously reported, both repetitive and non-repetitive DNA are found to scale strongly and positively with genome size across most eukaryotic lineages. In contrast to previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not associated, or only very weakly associated, with changes in genome size. Our evolutionary correlations are robust to different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome-size estimates, and randomly reduced datasets.

I then investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic principal component analyses. Our results support a genomic distinction between SM and CM in Archaeplastida and Metazoa, but less clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron-richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators, rather than of novel protein regulators, can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel ‘ceRNA-motif pipeline’ for the prediction of “competing endogenous” ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNA motifs (MIM166, MIM171 and MIM159/319) that are conserved across land plants and potentially involved in diverse developmental processes and stress responses. Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.

Table of contents:
Cover page
Abstract
Acknowledgements
Index
1. The structure of this thesis
1.1. Structure of this PhD dissertation
1.2. Publications of this PhD dissertation
1.3. Computational infrastructure and resources
1.4. Disclosure of financial support and information use
1.5. Acknowledgements
1.6. Author contributions and use of impersonal and personal pronouns
2. Biological background
2.1. The complexity of the eukaryotic genome
2.2. The problem of counting and defining “genes” in eukaryotes
2.3. The “function” concept for genes and “dark matter”
2.4. Increases of organismal complexity on Earth through multicellularity
2.5. Multicellularity is a “fitness transition” in individuality
2.6. The complexity of cell differentiation in multicellularity
3. Technical background
3.1. The Phylogenetic Comparative Method (PCM)
3.2. RNA secondary structure prediction
3.3. Some standards for genome and gene annotation
4. What is in a eukaryotic genome? GenomeContent provides a good answer
4.1. Background
4.2. Motivation: an interoperable tool for data retrieval of gene annotations
4.3. Methods
4.4. Results
4.5. Discussion
5. The evolutionary correlation between genome size and ncDNA
5.1. Background
5.2. Motivation: estimating the relationship between genome size and ncDNA
5.3. Methods
5.4. Results
5.5. Discussion
6. The relationship between non-coding DNA and Complex Multicellularity
6.1. Background
6.2. Motivation: How to define and measure complex multicellularity across eukaryotes?
6.3. Methods
6.4. Results
6.5. Discussion
7. The ceRNA motif pipeline: regulation of microRNAs by target mimics
7.1. Background
7.2. A revisited protocol for the computational analysis of Target Mimics
7.3. Motivation: a novel pipeline for ceRNA motif discovery
7.4. Methods
7.5. Results
7.6. Discussion
8. Conclusions and outlook
8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses
8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya
8.3. “Complex multicellularity” is a major evolutionary transition
8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth
9. Supplementary Data
Bibliography
Curriculum Scientiae
Selbständigkeitserklärung (declaration of authorship)
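As referenced in the abstract above, here is a minimal Python sketch of a phylogenetically controlled correlation between two traits, using generalized least squares under a Brownian-motion covariance matrix derived from the tree. It illustrates the general technique only; the dissertation's analyses used R scripts, and the function names and toy data here are assumptions.

```python
import numpy as np

def phylo_gls_correlation(x, y, C):
    # Correlation between trait vectors x and y across species,
    # controlling for phylogeny via the Brownian-motion covariance
    # matrix C (C[i, j] = shared branch length of species i and j).
    Ci = np.linalg.inv(C)
    ones = np.ones(len(x))
    denom = ones @ Ci @ ones

    def gls_mean(v):
        return (ones @ Ci @ v) / denom

    xr = x - gls_mean(x)
    yr = y - gls_mean(y)
    num = xr @ Ci @ yr
    return num / np.sqrt((xr @ Ci @ xr) * (yr @ Ci @ yr))

# Toy example: three species, two of which are sister taxa.
C = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 1.2, 3.0])   # e.g. log genome size
y = np.array([0.9, 1.1, 2.8])   # e.g. log intron length
print(phylo_gls_correlation(x, y, C))
```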
Proceedings of the 7th Sound and Music Computing Conference
Proceedings of SMC2010, the 7th Sound and Music Computing Conference, July 21–24, 2010.
Tracking Biped Motion in Pervasive Environment
Textiles have been ubiquitous to humans for ages, and silicon transistors have made a deep impact on modern industry. The research field of wearable electronics integrates both worlds to provide intelligent new services. Building on modern textile-manufacturing technologies, a carpet is embedded with a network of computing devices. One application is to sense when someone walks over it. This carpet was used to track the path a person took while walking: when a person steps on the carpet, embedded sensors are activated, and these activations are stored as a log file on a monitoring PC. The log data are processed by data mining algorithms to identify hidden patterns that reveal the subject's trail across the carpet. Different methods for validating the data mining algorithms are presented; these methods produce an ideal reference in a format that can be compared directly with the algorithms' estimates. The evaluation results show better performance for the new approach than for state-of-the-art techniques. Thorough testing, discussions, suggestions, and their impact after implementation are discussed in detail. The concepts used in the data mining algorithms are structurally sound and maintainable. The footsteps of the person walking on the carpet are identified and the trajectory of the walk is traced; a simple sketch of this reconstruction step is given below. The carpet can be used in a variety of domains, and suggestions are given for further work on the system as a whole.
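The following Python sketch illustrates one simple form of the reconstruction step: grouping time-stamped sensor activations and taking the centroid of the simultaneously active cells at each time step as the walker's estimated position. The log format and grid layout are hypothetical illustrations, not the thesis's actual data format or algorithm.

```python
from collections import defaultdict

# Hypothetical log entries: (timestamp_s, sensor_x, sensor_y)
log = [
    (0.0, 1, 1), (0.0, 1, 2),
    (0.7, 2, 2),
    (1.4, 3, 2), (1.4, 3, 3),
]

def estimate_trajectory(entries):
    # Group activations by timestamp, then use the centroid of the
    # simultaneously active sensors as the estimated foot position.
    by_time = defaultdict(list)
    for t, x, y in entries:
        by_time[t].append((x, y))
    trajectory = []
    for t in sorted(by_time):
        cells = by_time[t]
        cx = sum(x for x, _ in cells) / len(cells)
        cy = sum(y for _, y in cells) / len(cells)
        trajectory.append((t, cx, cy))
    return trajectory

print(estimate_trajectory(log))
# [(0.0, 1.0, 1.5), (0.7, 2.0, 2.0), (1.4, 3.0, 2.5)]
```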
Interferometric Synthetic Aperture Radar for remote satellite monitoring of bridges
The structural health of critical infrastructure is difficult to assess and monitor with existing methods of evaluation, which rely predominantly on visual inspection and/or the installation of sensors to measure the in-situ performance of structures. There are vast numbers of critical structures that need to be monitored, often located in diverse geographical locations that are difficult and costly to access. Recent advances in satellite technologies provide the opportunity for global coverage of assets and the measurement of displacement to sub-centimetre accuracy. Such measurements could supplement existing monitoring techniques and provide asset owners with additional insights to inform operational and maintenance decisions.
Most past research within the field of Interferometric Synthetic Aperture Radar (InSAR) monitoring using satellite radar imagery focusses on widespread measurement of land areas, although there have been some case studies using InSAR to assess movements of individual structures such as dams. However, there is limited published research into the use of these techniques for accurately monitoring the displacements of individual civil engineering structures over time and relating these measurements to structural performance. This research focusses on bridges as a specific example of critical infrastructure to establish whether remote satellite monitoring can be used to measure displacements at a resolution that is sufficiently accurate for use in monitoring of performance, and examines the relevance and limitations of satellite monitoring to civil engineering applications in general.
To assess the millimetre-scale performance of InSAR, an initial evaluation was undertaken under controlled conditions on a purpose-built test bed fitted with satellite reflectors at the National Physical Laboratory in Teddington, validating InSAR displacement measurements against traditional terrestrial in-situ displacement measurements. Subsequently, traditional sensor and surveying measurements of displacements were compared with InSAR displacement measurements at key points of interest on Waterloo Bridge and the Hammersmith Flyover. A further case study on Tadcaster Bridge demonstrated the potential applicability of InSAR displacement measurement for monitoring bridges at risk of scour failure. Scour is the most common cause of bridge collapse around the world, and to date no cost-effective and widely applicable method for providing advance warning of impending scour failure has been developed. Methodologies were developed for integrating digital, structural, and signal-processing models to identify and map InSAR measurement points on bridge structures from SAR imagery, as well as for combining satellite data with traditional surveying methods; the phase-to-displacement conversion underlying all such measurements is sketched below.
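For readers unfamiliar with the basic InSAR measurement, here is a minimal Python sketch of the standard conversion from interferometric phase change to line-of-sight displacement. The wavelength value is the approximate X-band wavelength of sensors such as TerraSAR-X and is an assumption for illustration, not a parameter taken from this research.

```python
import math

def los_displacement(delta_phase_rad: float,
                     wavelength_m: float = 0.031) -> float:
    # Standard InSAR relation: a phase change of delta_phase radians
    # between two acquisitions corresponds to a line-of-sight
    # displacement of -wavelength * delta_phase / (4 * pi).
    # At an X-band wavelength of ~3.1 cm, one full fringe (2*pi)
    # spans ~15.5 mm of displacement.
    return -wavelength_m * delta_phase_rad / (4 * math.pi)

# Example: a quarter-fringe phase change (pi/2 rad) corresponds to
# roughly -3.9 mm of line-of-sight motion at X-band.
print(f"{los_displacement(math.pi / 2) * 1000:.2f} mm")
```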
An important outcome of this research was to show, through comparison with independent measurements, that InSAR measurements are of a scale applicable to bridge monitoring. Remote sensing can therefore provide global coverage, with unsupervised readings at intervals of days, and as such can supplement traditional inspection regimes. This outcome must, however, be qualified by several limitations: practical implications of applying InSAR to real bridges are discussed, including imaging effects and the suitability of monitoring different forms of bridge deformation.
The key to successful implementation of InSAR monitoring of bridges lies in understanding its limitations and opportunities, and in making a clear case to satellite data providers on which specifications (resolution, frequency, processing assumptions) would unlock such datasets for wider use in infrastructure monitoring. InSAR can provide measurements and useful insights for bridge monitoring, but it is limited to specific cases and, at this stage of technological development, should be considered a tool for specific bridges and failure mechanisms rather than a complete bridge-monitoring solution.

This PhD was funded by the Engineering and Physical Sciences Research Council (EPSRC), U.K., under Award 1636878, with iCASE sponsorship by the National Physical Laboratory. Further funding contributions were provided by Laing O'Rourke. Projects within the PhD received funding from Innovate UK, and some of the data was provided by the German Aerospace Centre (DLR) under proposal MTH3513.
Machine Learning Approaches for Natural Resource Data
Abstract
Real-life applications involving efficient management of natural resources depend on accurate geographical information. This information is usually obtained by manual on-site data collection, by automatic remote sensing methods, or by a mixture of the two. Besides accurate data collection, natural resource management also requires detailed analysis of the data, which in the era of the data flood can be a cumbersome process. With the rising trend in both computational power and storage capacity, together with falling hardware prices, data-driven decision analysis plays an ever greater role.

In this thesis, we examine the predictability of terrain trafficability conditions and forest attributes using a machine learning approach with geographic information system data. Quantitative measures of the prediction performance for terrain conditions using natural resource data sets are given for five distinct research areas located around Finland. Furthermore, the estimation capability for key forest attributes is inspected with a multitude of modeling and feature-selection techniques. The research results provide empirical evidence on whether the natural resource data used are sufficiently accurate for practical applications, or whether further refinement of the data is needed. The results are especially important to the forest industry, since even slight improvements to the natural resource data sets used in practice can yield large savings in operation time and costs.
Model evaluation is also addressed in this thesis by proposing a novel method for estimating the prediction performance of spatial models. Classical goodness-of-fit measures usually rely on the assumption of independently and identically distributed data samples, a condition that normally does not hold for spatial data sets. Spatio-temporal data sets exhibit an intrinsic property called spatial autocorrelation, which is partly responsible for breaking these assumptions. The proposed cross-validation-based evaluation method estimates model performance while decreasing the optimistic bias due to spatial autocorrelation by partitioning the data sets in a suitable way; a minimal sketch of this idea follows the keywords below.
Keywords: Open natural resource data, machine learning, model evaluation
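As a minimal illustration of the general idea behind such an evaluation method, the following Python sketch performs spatially blocked cross-validation: samples are partitioned into contiguous spatial blocks rather than at random, so that test points are not immediately adjacent to training points. This is a generic sketch of the technique's family, with assumed data and block sizes, not the thesis's specific partitioning method.

```python
import numpy as np

def spatial_block_folds(coords: np.ndarray, block_size: float) -> np.ndarray:
    # Assign each sample to a fold based on the spatial block its
    # (x, y) coordinate falls into, so nearby samples share a fold and
    # spatial autocorrelation leaks less between train and test sets.
    blocks = np.floor(coords / block_size).astype(int)
    # Hash the 2-D block index into a single fold label per block.
    labels = blocks[:, 0] * 100003 + blocks[:, 1]
    _, folds = np.unique(labels, return_inverse=True)
    return folds

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1000, size=(200, 2))   # sample locations in metres
folds = spatial_block_folds(coords, block_size=250.0)

for k in np.unique(folds):
    test = folds == k
    train = ~test
    # fit the model on coords[train], evaluate on coords[test] ...
    print(f"fold {k}: {test.sum()} test / {train.sum()} train samples")
```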