
    Data compression for sequencing data

    Post-Sanger sequencing methods produce enormous volumes of data, and there is broad agreement that the challenge of storing and processing them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. We then answer the questions “what” and “how” by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we return to the question “why compression” and give other, perhaps surprising, answers, demonstrating the pervasiveness of data compression techniques in computational biology.
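    As a generic illustration of one fundamental idea behind sequence compression (not an algorithm from this review): because DNA uses a four-letter alphabet, even a naive encoder can pack each base into 2 bits instead of 8-bit ASCII, a 4x reduction before any statistical modelling. A minimal sketch:

```python
# Minimal illustration (not from the review): pack A/C/G/T into 2 bits per base.

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq: str) -> bytes:
    """Pack a DNA string (A/C/G/T only) into 2 bits per base."""
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]
    # Prepend a sentinel 1-bit so leading 'A's (00) are not lost.
    bits |= 1 << (2 * len(seq))
    return bits.to_bytes((bits.bit_length() + 7) // 8, "big")

def unpack(data: bytes) -> str:
    bits = int.from_bytes(data, "big")
    out = []
    while bits > 1:          # stop when only the sentinel bit remains
        out.append(BASE[bits & 0b11])
        bits >>= 2
    return "".join(reversed(out))

if __name__ == "__main__":
    s = "ACGTACGTTTGA"
    assert unpack(pack(s)) == s
    print(len(s), "bases ->", len(pack(s)), "bytes")
```

    Real sequencing-data compressors go much further (context modelling, read reordering, quality-score quantization), but the small alphabet is what makes such large gains possible in the first place.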

    Analysis and review of the possibility of using the generative model as a compression technique in DNA data storage: review and future research agenda

    The amount of data in the world keeps growing, and existing storage technology faces severe challenges: global data is expected to reach 175 ZB by 2025. DNA data storage is an alternative technology with great potential for storing information, mainly digital data. One of the stages of storing information in DNA is synthesis, which is very costly, so compression techniques for digital data need to be integrated to minimize the costs incurred. One class of models used in compression is the generative model. This paper examines whether compression based on generative models can be integrated into DNA data storage methods. To this end, we conducted a Systematic Literature Review using the PRISMA method to select papers, drawing on four leading databases and several additional ones. Out of 2440 papers, we selected 34 primary papers for detailed analysis. This systematic literature review (SLR) presents and categorizes the findings according to the research questions: which machine learning methods are applied in DNA storage, which compression techniques exist for DNA storage, what role deep learning plays in the compression process for DNA storage, how generative models relate to deep learning, how generative models are applied in the compression process, and how a latent space can be formed. The study highlights open problems that remain to be solved and identifies directions for future research.
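    A generic sketch of the link between generative models and compression (not taken from the surveyed papers): a symbol the model predicts with probability p can be coded in about -log2(p) bits, for example by an arithmetic coder, so a better model directly yields a smaller encoding. The toy "models" and sequence below are invented for illustration:

```python
# Generic sketch: ideal code length under a probability model, -log2(p) bits/symbol.

import math
from collections import Counter

def ideal_code_length(sequence: str, model_probs: dict) -> float:
    """Total bits needed if each symbol is coded at -log2 of its model probability."""
    return sum(-math.log2(model_probs[s]) for s in sequence)

seq = "ACGTACGTACGAACGT"

# "Model" 1: uniform over the 4 bases (no learning) -> exactly 2 bits/base.
uniform = {b: 0.25 for b in "ACGT"}

# "Model" 2: empirical frequencies, a stand-in for a learned generative model.
counts = Counter(seq)
empirical = {b: counts[b] / len(seq) for b in "ACGT"}

print("uniform  :", ideal_code_length(seq, uniform), "bits")
print("empirical:", ideal_code_length(seq, empirical), "bits")
```

    The better the model captures the data distribution, the fewer bits are needed, which is why learned generative models are attractive when every synthesized nucleotide has a cost.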

    New algorithms and methods for protein and DNA sequence comparison


    Fine-Grained Provenance And Applications To Data Analytics Computation

    Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks for data types such as strings, images, etc. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance so that we can later reason about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, the provenance collected over time, or within the same query computation, tends to be partially repeated (i.e., the same operation with the same input records in an intermediate computation step), so we want provenance storage that is efficient (i.e., it compresses repeated results). We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
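    A hypothetical sketch (class and field names invented; not PROVision's actual design) of how content hashing can deliver both properties the abstract asks for: repeated provenance records deduplicate to a single stored copy, and tampering is detectable because a modified record no longer matches its hash:

```python
# Hypothetical sketch: content-addressed provenance storage with dedup and
# tamper evidence via cryptographic hashes.

import hashlib
import json

class ProvenanceStore:
    def __init__(self):
        self._records = {}   # content hash -> canonical JSON string

    def put(self, record: dict) -> str:
        """Store a provenance record; identical records are stored only once."""
        canonical = json.dumps(record, sort_keys=True)
        key = hashlib.sha256(canonical.encode()).hexdigest()
        self._records.setdefault(key, canonical)
        return key

    def verify(self, key: str) -> bool:
        """Check that the stored record still matches its hash."""
        canonical = self._records.get(key)
        if canonical is None:
            return False
        return hashlib.sha256(canonical.encode()).hexdigest() == key

store = ProvenanceStore()
k1 = store.put({"op": "join", "inputs": ["r1", "r7"], "output": "o3"})
k2 = store.put({"op": "join", "inputs": ["r1", "r7"], "output": "o3"})
assert k1 == k2 and len(store._records) == 1   # repeated provenance deduplicated
assert store.verify(k1)                        # tamper check passes
```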

    Seventh Biennial Report : June 2003 - March 2005


    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. In particular, we introduce a class of methods which use data compression techniques in a crucial way to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notions of the dictionary of a given sequence and of an Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings independently of the type of coding behind them. As a case study we consider linguistically motivated problems and present results for automatic language recognition, authorship attribution and self-consistent classification. Comment: Revised version, with major changes, of the previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures.
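    To illustrate the general idea of a compression-based distance between strings, here is a sketch using the normalized compression distance, a closely related measure rather than the paper's exact relative-entropy estimator: two strings that share information compress better together than apart, and that gain can be turned into a distance.

```python
# Illustrative sketch: normalized compression distance (NCD) with zlib.

import zlib

def c(data: bytes) -> int:
    """Compressed size of data, used as a proxy for its information content."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

english  = b"the quick brown fox jumps over the lazy dog " * 20
english2 = b"a lazy dog sleeps while the quick brown fox runs " * 20
italian  = b"la volpe veloce salta sopra il cane pigro " * 20

print("english vs english:", round(ncd(english, english2), 3))
print("english vs italian:", round(ncd(english, italian), 3))
# Texts in the same language typically compress better together,
# yielding a smaller distance -- the basis for language recognition
# and authorship attribution experiments of this kind.
```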

    Efficient compression of large repetitive strings

    When it comes to managing large volumes of data, general-purpose compressors such as gzip are ubiquitous. They are fast, practical and available on every modern platform from standard desktops to mobile devices. These tools exploit local redundancy in a text using a fixed-size sliding window. This window is usually very small relative to the text; in principle, however, it can be as large as available memory. The window acts as a dictionary, and compression is achieved by replacing substrings with pointers to previous occurrences found in the dictionary. This type of algorithm becomes problematic when dealing with collections that are larger than physical memory, as it fails to capture any non-local redundancy, that is, repetition that occurs outside of its search window. With rapid growth in the already enormous amount of data we store and process there is a pressing need to improve compression effectiveness, reducing both storage requirements and decompression costs. Yet many systems still use general-purpose compression tools on large, highly repetitive data collections. In this thesis we focus on addressing this issue. We explore compression in a variety of domains where large volumes of data need to be stored and accessed, and where general-purpose compression tools are the norm. First we discuss our work on web corpus compression, then we discuss the implementation of a practical index for repetitive texts that gives strong theoretical bounds in terms of size and access, and finally we discuss our work on compression of high-throughput sequencing reads. We show that in all cases our new methods improve on current techniques in both run time and compression effectiveness, and provide important functionality such as fast decoding and random access.
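    A toy demonstration of the window limitation described above (not an experiment from the thesis): zlib/gzip match against a window of at most 32 KiB, so a repeat that lies further back than that is compressed as if it were new data.

```python
# Toy demonstration: a repeat inside the 32 KiB window is captured,
# a repeat beyond it is not.

import os
import zlib

def csize(data: bytes) -> int:
    return len(zlib.compress(data, 9))

near = os.urandom(8 * 1024)      # second copy fits inside the 32 KiB window
far  = os.urandom(256 * 1024)    # second copy falls outside the window

print("near repeat:", csize(near), "->", csize(near + near))
print("far  repeat:", csize(far),  "->", csize(far + far))
# Doubling the near data barely grows the output, because the second copy
# is encoded as a back-reference. Doubling the far data roughly doubles the
# output, because the first copy has already slid out of the window.
# Compressors built for large repetitive collections keep such distant
# repeats reachable (e.g. via an unbounded dictionary or a global index).
```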

    A standards-based ICT framework to enable a service-oriented approach to clinical decision support

    This research provides evidence that standards-based Clinical Decision Support (CDS) at the point of care is an essential ingredient of electronic healthcare service delivery. A Service Oriented Architecture (SOA) based solution is explored that serves as a task management system, coordinating complex, distributed and disparate IT systems, processes and resources (human and computer) to provide standards-based CDS. This research offers a solution to the challenges of implementing computerised CDS, such as integration with heterogeneous legacy systems, and promotes the reuse of components and services to reduce costs and save time. The benefits of a sharable CDS service that can be reused by different healthcare practitioners to provide collaborative patient care are demonstrated. The solution provides orchestration among different services by extracting data from sources such as patient databases, clinical knowledge bases and evidence-based clinical guidelines (CGs) in order to serve multiple CDS requests coming from different healthcare settings. The architecture aims to help users at different levels of Healthcare Delivery Organizations (HCOs) maintain a CDS repository, along with monitoring and managing services, thus enabling transparency. The research employs the Design Science Research Methodology (DSRM) combined with The Open Group Architecture Framework (TOGAF), an Open Group initiative for an Enterprise Architecture Framework (EAF). DSRM's iterative capability addresses the rapidly evolving nature of workflows in healthcare. The SOA-based solution uses standards-based open source technologies and platforms together with the latest healthcare standards by HL7 and OMG, the Decision Support Service (DSS) and the Retrieve, Locate, and Update Service (RLUS) standards. Combining business process management (BPM) technologies and business rules with SOA ensures the HCO's capability to manage its processes. The architectural solution is evaluated by successfully implementing evidence-based CGs at the point of care in areas such as: a) diagnostics (Chronic Obstructive Pulmonary Disease), b) urgent referral (lung cancer), and c) genome testing and integration with CDS in screening (Lynch syndrome). In addition to medical care, the CDS solution can benefit organizational processes for collaborative care delivery by connecting patients, physicians and other associated members. This framework facilitates integration of the different types of CDS appropriate to different healthcare processes, enabling sharable CDS capabilities within and across organizations.
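    A hypothetical sketch of the core idea of a sharable CDS service (all names and the toy rule are invented; this is not the HL7 DSS/RLUS interface or the thesis's implementation): clients submit patient data plus a guideline identifier and receive a recommendation, so the decision logic lives in one reusable service rather than in each client system.

```python
# Hypothetical sketch: one reusable decision-support service evaluating a
# guideline rule against a patient record. Not clinical guidance.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Recommendation:
    guideline: str
    action: str
    rationale: str

# A guideline "rule" maps a patient record to a recommendation (or None).
GuidelineRule = Callable[[dict], Optional[Recommendation]]

def copd_diagnostics_rule(patient: dict) -> Optional[Recommendation]:
    # Toy rule for illustration only: chronic cough in a smoker -> spirometry.
    if "chronic cough" in patient.get("symptoms", []) and patient.get("smoker"):
        return Recommendation(
            guideline="copd-diagnostics",
            action="order spirometry",
            rationale="chronic cough with smoking history",
        )
    return None

class DecisionSupportService:
    """One shared service that any client (GP system, hospital EHR, ...) can call."""

    def __init__(self, rules: dict):
        self._rules = rules   # guideline id -> rule

    def evaluate(self, guideline_id: str, patient: dict) -> Optional[Recommendation]:
        return self._rules[guideline_id](patient)

dss = DecisionSupportService({"copd-diagnostics": copd_diagnostics_rule})
patient = {"symptoms": ["chronic cough"], "smoker": True}
print(dss.evaluate("copd-diagnostics", patient))
```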

    Requirements of Modern Genome Browsers

    Genome browsers are widely used tools for the visualization of a genome and related data. The demands placed on genome browsers by the size, variety, and complexity of the data produced by modern biotechnology are increasing, yet these demands are poorly understood and not documented. Our study establishes and documents a clear set of requirements for genome browsers. We reviewed all widely used genome browsers, as well as notable research prototypes, through a review of the literature, executing typical uses of the genome browsers, program comprehension, reverse engineering, and code analysis. The key outcome of the study is a clear set of requirements in the form of a requirements document that conforms to the IEEE Std 830-1998 standard for Software Requirements Specifications. It contains a domain model of concepts, the functional requirements as use cases, a definition of visualizations as metaphors, glyphs, or icons, a formal specification of the system in Z notation, and a specification of all widely used file formats. Genome browsers share a set of basic features such as display, scroll, zoom, and search; however, they differ in their performance, maturity level and implementation technologies. Our requirements also document the major non-functional requirements. The outcome of our study can be used in several ways: as a guide for future developers of genome browsers; as the basis of future enhancements of features in existing genome browsers; and as motivation for the invention of new algorithms, data structures, or file formats for implementations.
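    A minimal sketch of the shared basic features named above (display a region, scroll, zoom, search), with all names and coordinates invented for illustration; real browsers implement these operations over indexed file formats rather than in-memory lists:

```python
# Minimal sketch: genome-browser basics as transformations of a viewport
# over genomic coordinates. Names and coordinates are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Viewport:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive

    def width(self) -> int:
        return self.end - self.start

    def scroll(self, offset_bp: int) -> "Viewport":
        """Shift the visible region left or right by offset_bp bases."""
        new_start = max(0, self.start + offset_bp)
        return Viewport(self.chrom, new_start, new_start + self.width())

    def zoom(self, factor: float) -> "Viewport":
        """factor < 1 zooms in, factor > 1 zooms out, around the centre."""
        centre = (self.start + self.end) // 2
        half = max(1, int(self.width() * factor) // 2)
        return Viewport(self.chrom, max(0, centre - half), centre + half)

def search(features: list, name: str) -> Optional[Viewport]:
    """Jump to a named annotated feature (the 'search' use case)."""
    for f in features:
        if f["name"] == name:
            return Viewport(f["chrom"], f["start"], f["end"])
    return None

features = [{"name": "geneX", "chrom": "chr1", "start": 100_000, "end": 150_000}]
view = search(features, "geneX")
print(view)                  # display
print(view.zoom(2.0))        # zoom out
print(view.scroll(25_000))   # scroll right
```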