116 research outputs found

    Efficient, Dependable Storage of Human Genome Sequencing Data

    The understanding of the human genome impacts several areas of human life. Data from human genomes is massive because there are millions of samples waiting to be sequenced, and each sequenced human genome may occupy hundreds of gigabytes of storage. Human genomes are critical because they are extremely valuable to research and because they may provide hints about individuals' health status, identify their donors, or reveal information about donors' relatives. Their size and criticality, plus the amount of data being produced by medical and life-sciences institutions, require systems to scale while being secure, dependable, auditable, and affordable. Current storage infrastructures are too expensive for cost efficiency to be ignored when storing human genomes, and they generally lack the proper knowledge and mechanisms to protect the privacy of sample donors. This thesis proposes an efficient, secure, and auditable storage system for human genomes that medical and life-sciences institutions can trust and afford. It enhances traditional storage ecosystems with privacy-aware, data-reduction, and auditability techniques to enable the efficient, dependable use of multi-tenant public cloud infrastructures to store human genomes. Contributions of this thesis include (1) a study on the privacy-sensitivity of human genomes; (2) a method to systematically detect the privacy-sensitive portions of genomes; (3) specialised data-reduction algorithms for sequencing data; (4) an independent auditability scheme for secure dispersed storage; and (5) a complete storage pipeline that obtains reasonable privacy-protection, security, and dependability guarantees at modest costs (e.g., less than 1/Genome/Year) by integrating the proposed mechanisms with appropriate storage configurations.
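    To make the dispersed-storage idea behind contribution (4) concrete, here is a minimal sketch of n-of-n XOR secret sharing, in which a chunk of genomic data is split across several storage providers and no proper subset of the shares reveals anything about the data. This is an illustrative stand-in, not the thesis's actual scheme, which would more plausibly use threshold secret sharing or erasure coding so that a subset of shares suffices for recovery.

```python
import os
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(data: bytes, n: int) -> list[bytes]:
    """n-of-n XOR secret sharing: every share is needed to rebuild,
    and any n-1 shares are indistinguishable from random bytes."""
    shares = [os.urandom(len(data)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, data))  # data ^ s1 ^ ... ^ s(n-1)
    return shares

def reconstruct(shares: list[bytes]) -> bytes:
    return reduce(xor_bytes, shares)

chunk = b"ACGTACGTTTGACCGTAGGCTA"   # a block of sequenced data
shares = split(chunk, 3)            # e.g. one share per cloud provider
assert reconstruct(shares) == chunk
```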

    Compression models and tools for omics data

    The ever-increasing development of high-throughput sequencing technologies, and the huge volume of data generated as a consequence, have revolutionized biological research and discovery. Motivated by this, in this thesis we investigate methods capable of providing an efficient representation of omics data in compressed or encrypted form, and then employ them to analyze omics data. First and foremost, we describe a number of measures for quantifying the information in and between omics sequences. Then, we present finite-context models (FCMs), substitution-tolerant Markov models (STMMs) and a combination of the two, specialized in modeling biological data, for data compression and analysis. To ease the storage of the aforementioned data deluge, we design two lossless data compressors for genomic data and one for proteomic data. The methods work on the basis of (a) a combination of FCMs and STMMs or (b) the mentioned combination along with repeat models and a competitive prediction model. Tests on various synthetic and real data showed that they outperform previously proposed methods in terms of compression ratio. The privacy of genomic data is a topic that has recently come into focus with developments in the field of personalized medicine. We propose a tool that is able to represent genomic data in a securely encrypted fashion and, at the same time, is able to compact FASTA and FASTQ sequences by a factor of three. It employs AES encryption accompanied by a shuffling mechanism to improve data security. The results show it is faster than general-purpose and special-purpose algorithms. Compression techniques can also be employed for the analysis of omics data. With this in mind, we investigate the identification of unique regions in a species with respect to close species, which can give us an insight into evolutionary traits. For this purpose, we design two alignment-free tools that can accurately find and visualize distinct regions between two collections of DNA or protein sequences. Testing modern humans against Neanderthals, we found a number of regions absent in Neanderthals that may express new functionalities associated with the evolution of modern humans. Finally, we investigate the identification of genomic rearrangements, which play important roles in genetic disorders and cancer, by employing a compression technique. For this purpose, we design a tool that is able to accurately localize and visualize small- and large-scale rearrangements between two genomic sequences. The results of applying the proposed tool on several synthetic and real data sets conformed to the results partially reported by wet-laboratory approaches, e.g., FISH analysis.
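    As an illustration of the finite-context-model idea underlying these compressors, the sketch below estimates how many bits per base an adaptive order-k FCM would need to encode a DNA sequence, using add-alpha smoothing over the four-letter alphabet. The thesis's models (and their substitution-tolerant extension) are considerably more elaborate; this shows only the core principle.

```python
import math
from collections import defaultdict

def fcm_bits_per_base(seq: str, k: int = 3, alpha: float = 1.0) -> float:
    """Estimate how many bits per base an adaptive order-k finite-context
    model needs: each base is predicted from its k predecessors and costs
    -log2 P(base | context), with add-alpha smoothing over {A,C,G,T}."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits, coded = 0.0, 0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        c = counts[ctx]
        p = (c[sym] + alpha) / (sum(c.values()) + 4 * alpha)
        total_bits -= math.log2(p)
        coded += 1
        c[sym] += 1          # update after coding, as a compressor would
    return total_bits / coded

# Repetitive DNA is highly predictable, so the estimate falls well below
# the 2 bits per base of a plain 4-symbol encoding.
print(fcm_bits_per_base("ACGTACGT" * 250, k=3))
```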

    Reconstruction and classification of unknown DNA sequences

    The continuous advances in DNA sequencing technologies and techniques in metagenomics require reliable reconstruction and accurate classification methodologies, to increase the diversity of the natural repository while contributing to the description and organization of organisms. However, after sequencing and de-novo assembly, one of the most complex challenges comes from the DNA sequences that do not match or resemble any biological sequence in the literature. Three main reasons contribute to this exception: the organism's sequence diverges greatly from the known organisms in the literature, an irregularity was created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the uncertainty about the sample's constitution and wastes an opportunity to discover new species, since such sequences are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to these three challenges, based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to reference genomes, relying instead on features that the species' biological sequences share. Specifically, it only makes use of features extracted individually from each genome, without using sequence comparisons. Moreover, the pipeline is fully automatic and allows reference-free reconstruction of genomes from FASTQ reads, with the additional guarantee of secure storage of sensitive information. RFSC was then created as a machine-learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. This pipeline was tested on synthetic and real data, in both cases achieving precise and accurate results that, at the time this thesis was developed, had not been reported in the state of the art. Specifically, it achieved an accuracy of approximately 97% in domain/type classification.
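    A hedged sketch of the kind of per-sequence features such an ensemble of experts might combine: base composition (sequence content), GC fraction, a bigram-entropy score standing in here for the compression-based predictors, and the raw length (which RFSC normalizes). The actual RFSC features and classifier differ; these vectors would simply feed a standard learner.

```python
import math
from collections import Counter

def features(seq: str) -> list[float]:
    """Toy per-sequence feature vector: base composition (sequence
    content), GC fraction, a bigram-entropy score standing in for the
    compression-based predictors, and the raw length."""
    n = len(seq)
    comp = Counter(seq)
    content = [comp[b] / n for b in "ACGT"]
    gc = content[1] + content[2]                      # C + G fractions
    bigrams = Counter(seq[i:i + 2] for i in range(n - 1))
    entropy = -sum((c / (n - 1)) * math.log2(c / (n - 1))
                   for c in bigrams.values())
    return content + [gc, entropy, float(n)]

# Such vectors would feed a standard classifier, e.g. a random forest.
print(features("ACGTACGTTTGACCGTAGGCTA"))
```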

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    In this dissertation, we address the challenges of genomic data storage in high-performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw-data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the suffix array construction.
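    To illustrate the principle of referential compression described above, the toy sketch below encodes a read as an offset into the reference plus a list of substitutions. UdeACompress combines real read-to-reference alignments with a specialized binary encoding, so this captures only the core idea.

```python
def encode_read(read: str, ref: str):
    """Toy referential encoding: pick the alignment offset with the
    fewest mismatches, then store only (offset, substitutions)."""
    best = min(range(len(ref) - len(read) + 1),
               key=lambda o: sum(a != b
                                 for a, b in zip(ref[o:o + len(read)], read)))
    subs = [(i, b) for i, (a, b) in
            enumerate(zip(ref[best:best + len(read)], read)) if a != b]
    return best, subs

def decode_read(offset: int, subs, length: int, ref: str) -> str:
    out = list(ref[offset:offset + length])
    for i, b in subs:          # re-apply the stored substitutions
        out[i] = b
    return "".join(out)

ref = "ACGTACGTTTGACCGTAGGCTA"
read = "TTGACCGTAGGA"           # one substitution vs. the reference
off, subs = encode_read(read, ref)
assert decode_read(off, subs, len(read), ref) == read
```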

    Advances in Evolutionary Algorithms

    With recent trends towards massive data sets and significant computational power, combined with advances in evolutionary algorithms, evolutionary computation is becoming much more relevant to practice. The aim of this book is to present recent improvements, innovative ideas, and concepts from a part of the huge field of evolutionary algorithms.

    Machine learning for improving heuristic optimisation

    Heuristics, metaheuristics and hyper-heuristics are search methodologies that have been preferred by many researchers and practitioners for solving computationally hard combinatorial optimisation problems whenever exact methods fail to produce high-quality solutions in a reasonable amount of time. In this thesis, we introduce an advanced machine learning technique, namely tensor analysis, into the field of heuristic optimisation. We show how the relevant data should be collected in tensorial form, analysed and used during the search process. Four case studies are presented to illustrate the capability of single- and multi-episode tensor analysis, processing data with high and low abstraction levels, for improving heuristic optimisation. A single-episode tensor analysis using data at a high abstraction level is employed to improve an iterated multi-stage hyper-heuristic for cross-domain heuristic search. The empirical results across six different problem domains from a hyper-heuristic benchmark show that significant overall performance improvement is possible. A similar approach embedding a multi-episode tensor analysis is applied to the nurse rostering problem and evaluated on a benchmark of a diverse collection of instances obtained from different hospitals across the world. The empirical results indicate the success of the tensor-based hyper-heuristic, improving upon the best-known solutions for four particular instances. A genetic algorithm is a nature-inspired metaheuristic which uses a population of multiple interacting solutions during the search. Mutation is the key variation operator in a genetic algorithm and adjusts the diversity in a population throughout the evolutionary process. Often, a fixed mutation probability is used to perturb the value at each locus, representing a unique component of a given solution. A single-episode tensor analysis using data with a low abstraction level is applied to an online bin-packing problem, generating locus-dependent mutation probabilities, as sketched below. The tensor approach significantly improves the performance of a standard genetic algorithm on almost all instances. A multi-episode tensor analysis using data with a low abstraction level is embedded into a multi-agent cooperative search approach. The empirical results once again show the success of the proposed approach on a benchmark of flow-shop problem instances, as compared to the approach which does not make use of tensor analysis. Tensor analysis can handle data with different levels of abstraction, leading to a learning approach which can be used within different types of heuristic optimisation methods based on different underlying design philosophies, improving their overall performance.
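    A minimal sketch of the locus-dependent mutation described above, assuming a binary encoding: each locus flips with its own probability rather than one fixed rate. Here the per-locus probabilities are simply given; in the thesis they are derived from the tensor analysis.

```python
import random

def mutate(solution: list[int], p_locus: list[float]) -> list[int]:
    """Bit-flip mutation with a separate probability per locus instead of
    one fixed rate for the whole chromosome (binary encoding assumed)."""
    return [1 - gene if random.random() < p else gene
            for gene, p in zip(solution, p_locus)]

# Uniform baseline: every locus mutates with the same fixed probability.
baseline = mutate([0, 1, 1, 0, 1], [0.05] * 5)
# Locus-dependent rates (values invented here) bias the search toward
# loci that the analysis identified as worth perturbing.
tuned = mutate([0, 1, 1, 0, 1], [0.01, 0.20, 0.01, 0.15, 0.01])
print(baseline, tuned)
```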

    Networks of Liveness in Singer-Songwriting: A practice-based enquiry into developing audio-visual interactive systems and creative strategies for composition and performance.

    This enquiry explores the creation and use of computer-based, real-time interactive audio-visual systems for the composition and performance of popular music by solo artists. Using a practice-based methodology, research questions are identified that relate to the impact of incorporating interactive systems into the songwriting process and the liveness of the performances with them. Four approaches to the creation of interactive systems are identified: creating explorative-generative tools, multiple tools for guitar/vocal pieces, typing systems and audio-visual metaphors. A portfolio of ten pieces that use these approaches was developed for live performance. A model of the songwriting process is presented that incorporates system-building, and strategies are identified for reconciling the indeterminate, electronic audio output of the system with composed popular music features and instrumental/vocal output. The four system approaches and ten pieces are compared in terms of four aspects of liveness, derived from current theories. It was found that, in terms of overall liveness, a unity to system design facilitated both technological and aesthetic connections between the composition, the system processes and the audio and visual outputs. However, there was considerable variation between the four system approaches in terms of the different aspects of liveness. The enquiry concludes by identifying strategies for maximising liveness in the different system approaches and discussing the connections between liveness and the songwriting process.

    Novel methods for comparing and evaluating single and metagenomic assemblies

    The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still heavily relies on the availability of independently determined standards, such as manually curated genome sequences or independently produced mapping data. The focus of this work is to develop reference-free computational methods to accurately compare and evaluate genome assemblies. We introduce a reference-free, likelihood-based measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. Despite the unresolved challenges of single-genome assembly, the decreasing costs of sequencing technology have led to a sharp increase in metagenomics projects over the past decade. These projects allow us to better understand the diversity and function of microbial communities found in the environment, including the ocean, Arctic regions, other living organisms, and the human body. We extend our likelihood-based framework and show that we can accurately compare assemblies of these complex bacterial communities. After an assembly has been produced, it is not an easy task to determine what parts of the underlying genome are missing, what parts are mistakes, and what parts are due to experimental artifacts from the sequencing machine. Here we introduce VALET, the first reference-free pipeline that flags regions in metagenomic assemblies that are statistically inconsistent with the data generation process. VALET detects mis-assemblies in publicly available datasets and highlights the current shortcomings in available metagenomic assemblers. By providing the computational methods for researchers to accurately evaluate their assemblies, we decrease the chance of incorrect biological conclusions and misguided future studies.
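    A hedged sketch of the likelihood idea described above: score an assembly by the average log-probability of each read, assuming reads start uniformly at random in the assembly and each base matches independently with probability 1 - err. The actual metric rests on a fuller generative model; this toy version only shows why the true genome tends to maximize the score.

```python
import math

def avg_log_likelihood(assembly: str, reads: list[str], err: float = 0.01) -> float:
    """Average per-read log2-probability of the reads given the assembly:
    a read starts uniformly at random and each base matches independently
    with probability 1 - err (mismatches split err over the three other
    bases). A crude stand-in for the full generative model."""
    total = 0.0
    for read in reads:
        offsets = range(len(assembly) - len(read) + 1)
        p = sum(math.prod((1 - err) if a == b else err / 3
                          for a, b in zip(assembly[o:o + len(read)], read))
                for o in offsets) / max(len(offsets), 1)
        total += math.log2(p) if p > 0 else float("-inf")
    return total / len(reads)

reads = ["ACGTAC", "GTTTGA"]
print(avg_log_likelihood("ACGTACGTTTGA", reads))   # true source: higher score
print(avg_log_likelihood("ACGTACGAATGA", reads))   # perturbed: lower score
```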
    • 

    corecore