116 research outputs found
Efficient, Dependable Storage of Human Genome Sequencing Data
Understanding the human genome impacts several areas of life. The data derived from the human genome are enormous: millions of samples await sequencing, and each sequenced human genome can occupy hundreds of gigabytes of storage. Human genomes are critical because they are extremely valuable for research and because they can reveal sensitive information about the health status of individuals, identify their donors, or even disclose information about the donors' relatives. The size and criticality of these genomes, together with the amount of data produced by medical and life-sciences institutions, demand computing systems that are scalable while also being secure, dependable, auditable, and affordable. Existing storage infrastructures are so expensive that cost efficiency cannot be ignored when storing human genomes, and they generally lack the knowledge and mechanisms needed to protect the privacy of biological-sample donors. This thesis proposes an efficient, secure, and auditable human genome storage system for medical and life-sciences institutions. It enhances traditional storage ecosystems with privacy, data-size reduction, and auditability techniques to enable the efficient and dependable use of public cloud computing infrastructures for storing human genomes.
The contributions of this thesis include (1) a study on the privacy sensitivity of human genomes; (2) a method to systematically detect the privacy-sensitive portions of genomes; (3) data-size reduction algorithms specialized for sequenced genome data; (4) an independent auditing scheme for secure, dispersed data storage; and (5) a complete storage pipeline that achieves reasonable privacy, security, and dependability guarantees at modest costs (e.g., less than 1/Genome/Year) by integrating the proposed mechanisms with appropriate storage configurations.
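The dispersed, secure storage this abstract mentions typically builds on splitting data across independent providers so that no single provider learns anything. As a minimal illustrative sketch (a standard XOR secret-sharing building block, not the thesis's actual scheme; all names are hypothetical):

```python
import os

def split_shares(data: bytes) -> tuple[bytes, bytes]:
    """Split data into two XOR shares; neither share alone reveals anything."""
    pad = os.urandom(len(data))
    share = bytes(a ^ b for a, b in zip(data, pad))
    return pad, share

def join_shares(s1: bytes, s2: bytes) -> bytes:
    """Recombine the two shares to recover the original data."""
    return bytes(a ^ b for a, b in zip(s1, s2))

genome_chunk = b"ACGTACGTNNGATTACA"
s1, s2 = split_shares(genome_chunk)
assert join_shares(s1, s2) == genome_chunk
```

Each share could be stored at a different cloud provider; recovering the data requires both.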
Compression models and tools for omics data
The rapid development of high-throughput sequencing
technologies and, as a consequence, the generation of huge volumes of data
have revolutionized biological research and discovery. Motivated by this,
in this thesis we investigate methods capable of providing an
efficient representation of omics data in compressed or encrypted form,
and then employ them to analyze omics data.
First and foremost, we describe a number of measures for
quantifying information in and between omics sequences. Then, we
present finite-context models (FCMs), substitution-tolerant Markov models
(STMMs) and a combination of the two, specialized in modeling
biological data, for data compression and analysis.
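A finite-context model of the kind named above can be sketched in a few lines. The following toy order-k model (Laplace smoothing and all names are illustrative assumptions, not the thesis's implementation) estimates the number of bits needed to adaptively encode a DNA sequence:

```python
from collections import defaultdict
from math import log2

ALPHABET = "ACGT"

def fcm_bits(seq: str, k: int = 2, alpha: float = 1.0) -> float:
    """Adaptively encode seq with an order-k finite-context model.
    Returns the total code length in bits (Laplace-smoothed counts)."""
    counts = defaultdict(lambda: defaultdict(int))
    total = 0.0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        c = counts[ctx]
        p = (c[sym] + alpha) / (sum(c.values()) + alpha * len(ALPHABET))
        total += -log2(p)  # ideal code length of this symbol
        c[sym] += 1        # update the model after coding
    return total

repetitive = "ACGT" * 60
irregular = "ACGTTGCAGATCCGTAACGGTTAC" * 10  # same length, less regular
# A more regular sequence costs fewer bits under the model:
assert fcm_bits(repetitive) < fcm_bits(irregular)
```

The same code length can double as an information measure between sequences, which is what makes such models useful for analysis as well as compression.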
To ease the storage of the aforementioned data deluge, we design two lossless
data compressors for genomic data and one for proteomic data. The methods
work on the basis of (a) a combination of FCMs and STMMs or (b) that
combination along with repeat models and a competitive prediction
model. Tests on various synthetic and real data showed that they outperform
previously proposed methods in terms of compression ratio.
The privacy of genomic data is a topic that has recently come into focus
with developments in the field of personalized medicine. We propose a tool that
can represent genomic data in a securely encrypted fashion and, at the
same time, can compact FASTA and FASTQ sequences by a factor
of three. It employs AES encryption accompanied by a shuffling mechanism
to improve data security. The results show it is faster than both
general-purpose and special-purpose algorithms.
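The shuffling mechanism can be illustrated with a keyed permutation. The sketch below is illustrative only (it omits the AES layer the tool pairs it with, and all names are assumptions): a deterministic permutation is derived from a key and inverted with the same key.

```python
import hashlib, random

def keyed_shuffle(symbols: list, key: bytes) -> list:
    """Permute symbols with a PRNG seeded from the key."""
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    perm = list(range(len(symbols)))
    random.Random(seed).shuffle(perm)
    return [symbols[i] for i in perm]

def keyed_unshuffle(shuffled: list, key: bytes) -> list:
    """Rebuild the same permutation from the key and invert it."""
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    perm = list(range(len(shuffled)))
    random.Random(seed).shuffle(perm)
    out = [None] * len(shuffled)
    for dst, src in enumerate(perm):
        out[src] = shuffled[dst]
    return out

data = list("ACGTACGTNNGATTACA")
key = b"secret key"
assert keyed_unshuffle(keyed_shuffle(data, key), key) == data
```

On its own a shuffle is not encryption; in a real tool it serves as an extra scrambling layer on top of AES.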
Compression techniques can also be employed for the analysis of omics data.
With this in mind, we investigate the identification of regions unique to a
species with respect to close species, which can give us insight into evolutionary
traits. For this purpose, we design two alignment-free tools that can accurately
find and visualize distinct regions between two collections of DNA or
protein sequences. Applying them to modern humans with respect to Neanderthals,
we found a number of regions absent in Neanderthals that may express new
functionalities associated with the evolution of modern humans.
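One common alignment-free way to find such distinct regions is to flag stretches whose k-mers never occur in the other collection. The toy scan below illustrates that general idea only (it is not the thesis's tools; parameters and names are assumptions):

```python
def absent_regions(target: str, others: list[str], k: int = 5) -> list[tuple[int, int]]:
    """Report half-open (start, end) runs in `target` whose k-mers never
    occur in any sequence of `others` (a toy alignment-free scan)."""
    ref_kmers = {s[i:i + k] for s in others for i in range(len(s) - k + 1)}
    flags = [target[i:i + k] not in ref_kmers
             for i in range(len(target) - k + 1)]
    regions, start = [], None
    for i, absent in enumerate(flags + [False]):  # sentinel closes a trailing run
        if absent and start is None:
            start = i
        elif not absent and start is not None:
            regions.append((start, i + k - 1))  # extend end to cover the last k-mer
            start = None
    return regions

# The GGGG-containing stretch is unique to the first sequence:
regions = absent_regions("AAAATTTTGGGGCCCC", ["AAAATTTTCCCC"])
assert regions == [(4, 16)]
```

Real tools additionally smooth and visualize these runs, but the absence signal itself comes from exactly this kind of k-mer lookup.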
Finally, we investigate the identification of genomic rearrangements, which
have important roles in genetic disorders and cancer, by employing a compression
technique. For this purpose, we design a tool that can accurately
localize and visualize small- and large-scale rearrangements between
two genomic sequences. On several synthetic and real datasets, the results
of the proposed tool conformed to results partially reported by
wet laboratory approaches, e.g., FISH analysis.
Programa Doutoral em Engenharia Informática
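A rearrangement between two sequences can be localized, in a toy alignment-free fashion, by mapping each window of one sequence to its most similar window in the other: a non-monotonic mapping indicates rearranged segments. The sketch below is an illustration of this general idea, not the thesis's tool; window size, k-mer length, and names are assumptions.

```python
def window_map(a: str, b: str, w: int = 8, k: int = 3) -> list[int]:
    """For each w-sized window of `a`, return the offset of the window of `b`
    sharing the most k-mers (a toy proxy for a rearrangement map)."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    mapping = []
    for i in range(0, len(a) - w + 1, w):
        wa = kmers(a[i:i + w])
        best = max(range(0, len(b) - w + 1, w),
                   key=lambda j: len(wa & kmers(b[j:j + w])))
        mapping.append(best)
    return mapping

left, right = "ACGTACGT", "TTGGCCAA"
normal = window_map(left + right, left + right)   # identical sequences: monotonic
swapped = window_map(left + right, right + left)  # swapped halves: crossed mapping
assert normal == [0, 8] and swapped == [8, 0]
```

Plotting the mapping as a dot plot makes the crossed (rearranged) segments visually obvious, which is essentially what a visualization tool renders.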
Reconstruction and classification of unknown DNA sequences
The continuous advances in DNA sequencing technologies and metagenomics
techniques require reliable reconstruction and accurate classification
methodologies to increase the diversity of the natural repository while
contributing to the description and organization of organisms. However, after
sequencing and de-novo assembly, one of the most complex challenges
comes from DNA sequences that do not match or resemble any biological
sequence in the literature. Three main reasons contribute to this
exception: the organism's sequence diverges strongly from the
organisms known in the literature, an irregularity was created during the
reconstruction process, or a new organism has been sequenced. The inability
to efficiently classify these unknown sequences increases the uncertainty of
the sample's constitution and becomes a wasted opportunity to discover
new species, since such sequences are often discarded.
In this context, the main objective of this thesis is the development and
validation of a tool that provides an efficient computational solution to
these three challenges based on an ensemble of experts, namely
compression-based predictors, the distribution of sequence content, and
normalized sequence lengths. The method uses both DNA and amino acid
sequences and provides efficient classification beyond standard referential
comparisons. Unusually, it classifies DNA sequences without resorting directly
to reference genomes, relying instead on features that the species'
biological sequences share. Specifically, it only uses features extracted
individually from each genome, without sequence comparisons.
RFSC was then created as a machine learning classification pipeline that
relies on an ensemble of experts to provide efficient classification in metagenomic
contexts. This pipeline was tested on synthetic and real data, in both cases
achieving precise and accurate results that, at the time of the development
of this thesis, had not been reported in the state of the art. Specifically, it
has achieved an accuracy of approximately 97% in domain/type classification. Moreover, the pipeline is fully automatic and enables reference-free reconstruction of genomes from FASTQ reads, with the additional guarantee of secure storage of sensitive information.
Mestrado em Engenharia de Computadores e Telemática
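The per-genome features feeding such an ensemble of experts can be illustrated with a toy extractor. In the sketch below, zlib stands in for the thesis's specialized compression-based predictors, and all names and thresholds are assumptions:

```python
import zlib
from collections import Counter

def sequence_features(seq: str, max_len: int = 1_000_000) -> dict:
    """Toy per-genome features: a compression-based complexity proxy
    (zlib stands in for specialized predictors), base composition,
    and a normalized sequence length."""
    raw = seq.encode()
    comp = Counter(seq)
    return {
        "compress_ratio": len(zlib.compress(raw, 9)) / len(raw),
        "gc_content": (comp["G"] + comp["C"]) / len(seq),
        "norm_length": min(len(seq) / max_len, 1.0),
    }

f = sequence_features("ACGT" * 250)
# Highly repetitive input compresses well and has balanced composition:
assert f["compress_ratio"] < 0.2
assert f["gc_content"] == 0.5
```

Feature vectors like these can then be fed to any standard classifier, which is what makes classification possible without direct reference-genome comparisons.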
Efficient Storage of Genomic Sequences in High Performance Computing Systems
ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use.
In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the Suffix Array Construction.
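The core idea of referential compression — storing a read as an alignment position plus its mismatches against a reference — can be sketched as follows. This is a toy exhaustive aligner illustrating the principle, not UdeACompress itself, and the specialized binary packing is omitted:

```python
def encode_read(read: str, ref: str) -> tuple[int, list[tuple[int, str]]]:
    """Encode a read as (best alignment position, mismatch list) against ref.
    Picks the position minimizing substitutions (toy exhaustive alignment)."""
    best_pos, best_mm = 0, None
    for pos in range(len(ref) - len(read) + 1):
        mm = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
        if best_mm is None or len(mm) < len(best_mm):
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

def decode_read(pos: int, mismatches: list, length: int, ref: str) -> str:
    """Rebuild the read from the reference slice plus recorded mismatches."""
    seq = list(ref[pos:pos + length])
    for i, b in mismatches:
        seq[i] = b
    return "".join(seq)

ref = "ACGTACGTAAGGTTCC"
read = "AAGCTT"  # matches ref[8:14] with one substitution
pos, mm = encode_read(read, ref)
assert decode_read(pos, mm, len(read), ref) == read
```

Storing only `(pos, mm)` instead of the read is what yields the high ratios: for reads closely matching the reference, the mismatch list is tiny.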
Advances in Evolutionary Algorithms
With the recent trends towards massive data sets and significant computational power, combined with evolutionary algorithmic advances, evolutionary computation is becoming much more relevant in practice. The aim of this book is to present recent improvements, innovative ideas, and concepts from a part of the huge EA field.
Machine learning for improving heuristic optimisation
Heuristics, metaheuristics and hyper-heuristics are search methodologies preferred by many researchers and practitioners for solving computationally hard combinatorial optimisation problems whenever exact methods fail to produce high quality solutions in a reasonable amount of time. In this thesis, we introduce an advanced machine learning technique, namely tensor analysis, into the field of heuristic optimisation. We show how the relevant data should be collected in tensorial form, analysed, and used during the search process. Four case studies are presented to illustrate the capability of single- and multi-episode tensor analysis, processing data at high and low abstraction levels, for improving heuristic optimisation. A single-episode tensor analysis using data at a high abstraction level is employed to improve an iterated multi-stage hyper-heuristic for cross-domain heuristic search. The empirical results across six different problem domains from a hyper-heuristic benchmark show that significant overall performance improvement is possible. A similar approach embedding a multi-episode tensor analysis is applied to the nurse rostering problem and evaluated on a benchmark of a diverse collection of instances obtained from different hospitals across the world.
The empirical results indicate the success of the tensor-based hyper-heuristic, improving upon the best-known solutions for four particular instances. The genetic algorithm is a nature-inspired metaheuristic which uses a population of multiple interacting solutions during the search. Mutation is the key variation operator in a genetic algorithm and adjusts the diversity in a population throughout the evolutionary process. Often, a fixed mutation probability is used to perturb the value at each locus, which represents a unique component of a given solution. A single-episode tensor analysis using data at a low abstraction level is applied to an online bin packing problem, generating locus-dependent mutation probabilities. The tensor approach significantly improves the performance of a standard genetic algorithm on almost all instances. A multi-episode tensor analysis using data at a low abstraction level is embedded into a multi-agent cooperative search approach. The empirical results once again show the success of the proposed approach on a benchmark of flow shop problem instances, as compared to the approach which does not make use of tensor analysis. Tensor analysis can handle data at different levels of abstraction, leading to a learning approach which can be used within different types of heuristic optimisation methods based on different underlying design philosophies, indeed improving their overall performance.
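The locus-dependent mutation described above can be sketched as follows. This is a minimal illustration with hypothetical names; in the thesis the per-locus probabilities come from tensor analysis of past search traces rather than being hand-set:

```python
import random

def mutate(solution: list[int], probs: list[float], values=(0, 1),
           rng=random) -> list[int]:
    """Perturb each locus independently with its own mutation probability
    (e.g. probabilities derived from analysis of earlier search episodes)."""
    out = solution[:]
    for i, p in enumerate(probs):
        if rng.random() < p:
            # replace the current value with a different allowed value
            out[i] = rng.choice([v for v in values if v != out[i]])
    return out

rng = random.Random(42)
sol = [0] * 10
# Loci 0-4 never mutate, loci 5-9 always mutate:
probs = [0.0] * 5 + [1.0] * 5
child = mutate(sol, probs, rng=rng)
assert child[:5] == [0] * 5 and child[5:] == [1] * 5
```

Contrast this with the standard operator, which is the special case where every entry of `probs` is the same fixed constant.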
Networks of Liveness in Singer-Songwriting: A practice-based enquiry into developing audio-visual interactive systems and creative strategies for composition and performance.
This enquiry explores the creation and use of computer-based, real-time interactive audio-visual systems for the composition and performance of popular music by solo artists. Using a practice-based methodology, research questions are identified that relate to the impact of incorporating interactive systems into the songwriting process and the liveness of the performances with them. Four approaches to the creation of interactive systems are identified: creating explorative-generative tools, multiple tools for guitar/vocal pieces, typing systems and audio-visual metaphors. A portfolio of ten pieces that use these approaches was developed for live performance. A model of the songwriting process is presented that incorporates system-building and strategies are identified for reconciling the indeterminate, electronic audio output of the system with composed popular music features and instrumental/vocal output. The four system approaches and ten pieces are compared in terms of four aspects of liveness, derived from current theories. It was found that, in terms of overall liveness, a unity to system design facilitated both technological and aesthetic connections between the composition, the system processes and the audio and visual outputs. However, there was considerable variation between the four system approaches in terms of the different aspects of liveness. The enquiry concludes by identifying strategies for maximising liveness in the different system approaches and discussing the connections between liveness and the songwriting process
Novel methods for comparing and evaluating single and metagenomic assemblies
The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments "read" by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still heavily relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. The focus of this work is to develop reference-free computational methods to accurately compare and evaluate genome assemblies.
We introduce a reference-free likelihood-based measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics.
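Under a simple error-free read model, that likelihood-based measure can be sketched as follows. This is a toy version under stated assumptions (uniform read sampling, no sequencing errors, single strand); the thesis's actual metric models the data generation process more carefully:

```python
from math import log

def assembly_log_likelihood(assembly: str, reads: list[str]) -> float:
    """Toy likelihood-based assembly score: each (error-free) read's
    probability is its number of occurrences over all possible start
    positions; the score is the average log-probability per read."""
    total = 0.0
    for r in reads:
        positions = len(assembly) - len(r) + 1
        occ = sum(assembly.startswith(r, i) for i in range(max(positions, 0)))
        if occ == 0 or positions <= 0:
            return float("-inf")  # a read the assembly cannot explain
        total += log(occ / positions)
    return total / len(reads)

truth = "ACGTACGGTTAC"
reads = [truth[i:i + 4] for i in range(0, 9, 2)]
good = assembly_log_likelihood(truth, reads)
bad = assembly_log_likelihood("ACGTACG", reads)  # missing the tail
assert good > bad
```

The key property carries over from the sketch: an assembly that fails to explain some reads is penalized, and the true genome maximizes the score under the model.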
Despite the unresolved challenges of single genome assembly, the decreasing costs of sequencing technology has led to a sharp increase in metagenomics projects over the past decade. These projects allow us to better understand the diversity and function of microbial communities found in the environment, including the ocean, Arctic regions, other living organisms, and the human body. We extend our likelihood-based framework and show that we can accurately compare assemblies of these complex bacterial communities.
After an assembly has been produced, it is not an easy task determining what parts of the underlying genome are missing, what parts are mistakes, and what parts are due to experimental artifacts from the sequencing machine. Here we introduce VALET, the first reference-free pipeline that flags regions in metagenomic assemblies that are statistically inconsistent with the data generation process. VALET detects mis-assemblies in publicly available datasets and highlights the current shortcomings in available metagenomic assemblers.
By providing the computational methods for researchers to accurately evaluate their assemblies, we decrease the chance of incorrect biological conclusions and misguided future studies.
A generic approach to behaviour-driven biochemical model construction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Modelling of biochemical systems has received considerable attention over the last decade from bioengineering, biochemistry, computer science, and mathematics. This thesis investigates the application of computational techniques to computational systems biology for the construction of biochemical models in terms of topology and kinetic rates. Due to the complexity of biochemical systems, it is natural to construct models of them incrementally, in a piecewise manner. We define the syntax and semantics of two patterns for instantiating components, which are extendable, reusable, and fundamental building blocks for model composition. We propose and implement a set of genetic operators and composition rules to tackle the issues of composing models piecewise from scratch: quantitative Petri nets are evolved by the genetic operators, and the evolutionary modelling process is guided by the composition rules. Metaheuristic algorithms are widely applied in BioModel Engineering to support intelligent, heuristic analysis of biochemical systems in terms of structure and kinetic rates. We describe the parameters of biochemical models based on Biochemical Systems Theory, and the topology and kinetic rates of the models are then manipulated by employing an evolution strategy and simulated annealing, respectively. A new hybrid modelling framework is proposed and implemented for model construction. Two heuristic algorithms operate on two embedded layers of the hybrid framework: an outer layer for topology mutation and an inner layer for rate optimization. Moreover, variants of the hybrid piecewise modelling framework are investigated; given the flexibility of these variants, various combinations of evolutionary operators, evaluation criteria, and design principles can be taken into account.
We examine the performance of five sets of variants on specific aspects of modelling. The comparison is not meant to show that one variant clearly outperforms the others, but it provides an indication of which features matter for various aspects of the modelling. Because of the very heavy computational demands, the modelling process is parallelized using a grid environment, GridGain. Applying GridGain and heuristic algorithms to analyze biological processes can support the computational modelling of biochemical systems, which can also benefit mathematical modelling in computer science and bioengineering. We apply our proposed framework to model biochemical systems in a hybrid, piecewise manner, and its modelling variants are comparatively studied on specific modelling aims. Simulation results show that our framework can compose synthetic models exhibiting similar species behaviour, generate models with alternative topologies, and obtain general knowledge about key modelling features.
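The inner layer of such a two-layer framework can be sketched as a simulated-annealing loop over kinetic rates. This is a minimal illustration under a hypothetical calibration objective (the outer topology-mutation layer is omitted, and all names are assumptions, not the thesis's framework):

```python
import math, random

def inner_anneal(rates, loss, rng, steps=200, t0=1.0):
    """Inner layer: simulated annealing over the kinetic rates of a
    fixed topology, minimizing the given loss function."""
    best = cur = list(rates)
    for s in range(steps):
        t = t0 * (1 - s / steps) + 1e-9           # linear cooling schedule
        cand = [max(1e-6, r + rng.gauss(0, 0.1))  # perturb rates, keep positive
                for r in cur]
        # accept improvements always, worse moves with Boltzmann probability
        if loss(cand) < loss(cur) or rng.random() < math.exp((loss(cur) - loss(cand)) / t):
            cur = cand
        if loss(cur) < loss(best):
            best = cur
    return best

# Hypothetical stand-in objective: calibrate two rates to targets (0.3, 0.7).
def loss(rates):
    return (rates[0] - 0.3) ** 2 + (rates[1] - 0.7) ** 2

rng = random.Random(0)
tuned = inner_anneal([1.0, 1.0], loss, rng)
assert loss(tuned) < loss([1.0, 1.0])
```

In the full framework, an outer evolutionary loop would mutate the model topology and invoke this inner loop to score each candidate topology with its best-fitting rates.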
- …