3 research outputs found

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but the analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including the use of highly parallel graphics processing units (GPUs) as high performance desktop processors, and the use of the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.
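As a rough illustration of the read alignment step described in this abstract, the following minimal Python sketch slides each short read along a reference and reports positions where it matches within a mismatch budget. The function names, the toy sequences, and the `max_mismatches` parameter are assumptions made for illustration; the dissertation itself relies on GPU and MapReduce implementations, which this brute-force scan does not attempt to reproduce.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(reference: str, read: str, max_mismatches: int = 1) -> list[tuple[int, int]]:
    """Return (position, mismatches) for every window of the reference
    that matches the read with at most max_mismatches substitutions."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(window, read)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data (hypothetical): a short reference and three reads.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "TTAG", "CGAA"]

# Each read is aligned independently of the others, which is what makes
# the problem easy to spread across many GPU threads or MapReduce workers.
alignments = {read: align_read(reference, read) for read in reads}

for read, hits in alignments.items():
    print(read, hits)
```

An exact hit (zero mismatches) corresponds to a conserved region, while a hit with a mismatch marks a candidate polymorphic site. Practical aligners replace the linear scan with an index over the reference.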

    Unlocking Large-Scale Genomics

    The dramatic progress in DNA sequencing technology over the last decade, with the revolutionary introduction of next-generation sequencing, has brought with it both opportunities and difficulties. The opportunity to study the genomes of any species at an unprecedented level of detail has been accompanied by two difficulties: scaling analysis to handle the tremendous data generation rates of the sequencing machinery, and scaling operational procedures to handle the increasing sample sizes in ever larger sequencing studies. This dissertation presents work that strives to address both problems. The first contribution, inspired by the success of data-driven industry, is the Seal suite of tools, which harnesses the scalability of the Hadoop framework to accelerate the analysis of sequencing data and keep up with the sustained throughput of the sequencing machines. The second contribution, addressing the second problem, is a system developed to automate the standard analysis procedures at a typical sequencing center. Additional work is presented to make the first two contributions compatible with each other, so as to provide a complete solution for a sequencing operation and to simplify their use. Finally, the work presented here has been integrated into the production operations at the CRS4 Sequencing Lab, helping it scale its operation while reducing personnel requirements.

    Environment for the Analysis and Quality Assessment of Big and Linked Data

    Linking and publishing data in the Linked Open Data format increases the interoperability and discoverability of resources over the Web. The process involves several design decisions based on the Linked Data principles, which on the one hand recommend using standards for representing and accessing data on the Web, and on the other hand recommend setting hyperlinks between data from different sources. Despite the efforts of the World Wide Web Consortium (W3C), the main international standards organization for the World Wide Web, there is no single tailored formula for publishing data as Linked Data. In addition, the quality of published Linked Open Data (LOD) is a fundamental issue that has yet to be thoroughly managed and considered. The main objective of this doctoral thesis is to design and implement a novel framework for selecting, analyzing, converting, interlinking, and publishing data from diverse sources, while paying close attention to quality assessment throughout all steps and modules of the framework. The goal is to examine whether and to what extent Semantic Web technologies are applicable for merging data from different sources and enabling end users to obtain additional information that was not available in the individual datasets, in addition to integration into the Semantic Web community space. The thesis also intends to validate the applicability of the process in a specific and demanding use case, namely creating and publishing an Arabic Linked Drug Dataset based on open drug datasets from selected Arab countries, and to discuss the quality issues observed in the linked data life cycle. To that end, a Semantic Data Lake was established in the pharmaceutical domain that allows further integration and the development of different business services on top of the integrated data sources.
Through data representation in an open machine-readable format, the approach offers an optimum solution for disseminating information and data, for building domain-specific applications, and for enriching and gaining value from the original dataset. This thesis showcases how the pharmaceutical domain benefits from evolving research trends for building competitive advantages. However, as elaborated in this thesis, a better understanding of the specifics of the Arabic language is required to extend the use of linked data technologies in targeted Arabic organizations.
Linking and publishing data in the Linked Open Data format increases interoperability and the ability to search for resources over the Web. The process is based on the Linked Data principles (W3C, 2006), which on the one hand elaborate standards for representing and accessing data on the Web (RDF, OWL, SPARQL), and on the other hand suggest the use of hyperlinks between data from different sources. Despite the efforts of the W3C consortium (the main international standards organization for the Web), there is no single formula for implementing the process of publishing data in the Linked Data format.
Given that the quality of published linked open data is decisive for the future development of the Web, the main objectives of this doctoral dissertation are (1) the design and implementation of an innovative framework for selecting, analyzing, converting, interlinking, and publishing data from different sources, and (2) an analysis of the application of this approach in the pharmaceutical domain. The dissertation investigates in detail the question of quality in big and linked data ecosystems (Linked Data Ecosystems), taking into account the possibility of reusing open data. The work is motivated by the need to enable researchers from Arab countries to use Semantic Web technologies to link their data with open data such as DBpedia. The goal is to examine whether open data from Arab countries enable end users to obtain additional information that is not available in the individual datasets, in addition to integration into the Semantic Web space. The dissertation proposes a methodology for developing applications that work with linked data and implements a software solution that enables searching a consolidated set of drug data from selected Arab countries.
The consolidated dataset is implemented as a Semantic Data Lake. This thesis shows how the pharmaceutical industry benefits from applying innovative technologies and research trends from the field of semantic technologies. However, as elaborated in this thesis, a better understanding of the specifics of the Arabic language is needed to implement Linked Data tools and apply them to data from Arab countries.
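The interlinking idea in this abstract, where a link between two sources lets an end user see information that neither dataset holds alone, can be sketched in plain Python. This is only an illustrative sketch: the triples, the `ex:` and `ext:` identifiers, and the `describe` helper are hypothetical and do not come from the thesis, and a real pipeline would use an RDF store queried with SPARQL rather than Python sets.

```python
# Triples are (subject, predicate, object). All identifiers below are
# hypothetical placeholders, not data from the thesis.
SAME_AS = "owl:sameAs"

local_drugs = {
    ("ex:drug/123", "ex:tradeName", "ExampleBrand"),
    ("ex:drug/123", "ex:country", "ex:country/A"),
    # The interlink: the local record is declared equivalent to an
    # entity in an external open dataset.
    ("ex:drug/123", SAME_AS, "ext:ExampleCompound"),
}

external_data = {
    ("ext:ExampleCompound", "ex:drugClass", "analgesic"),
}


def describe(subject, *graphs):
    """Collect the properties of a subject across several triple sets,
    following owl:sameAs links one hop outward."""
    merged = set().union(*graphs)
    aliases = {subject} | {
        o for s, p, o in merged if s == subject and p == SAME_AS
    }
    return sorted(
        (p, o) for s, p, o in merged if s in aliases and p != SAME_AS
    )


# With only the local dataset, the drug class is unknown.
local_view = describe("ex:drug/123", local_drugs)

# After merging with the external source, the link exposes it.
linked_view = describe("ex:drug/123", local_drugs, external_data)

print(local_view)
print(linked_view)
```

The merged view contains the `ex:drugClass` property that the local dataset lacks, which is the kind of added value the abstract attributes to interlinking.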