33 research outputs found

    Big tranSMART for clinical decision making

    Molecular profiling data based patient stratification plays a key role in clinical decision making, such as identification of disease subgroups and prediction of treatment responses of individual subjects. Many existing knowledge management systems, like tranSMART, enable scientists to perform such analyses. But in the big data era, molecular profiling data size increases sharply due to new biological techniques such as next-generation sequencing. None of the existing storage systems works well when considering the three "V" features of big data (Volume, Variety, and Velocity). New key-value data stores like Apache HBase and Google Bigtable can provide high-speed queries by key. These databases can be modeled as a Distributed Ordered Table (DOT), which horizontally partitions a table into regions and distributes the regions to region servers by key. However, none of the existing data models works well for a DOT. A Collaborative Genomic Data Model (CGDM) has been designed to solve all these issues. CGDM creates three Collaborative Global Clustering Index Tables to improve data query velocity. The microarray implementation of CGDM on HBase performed up to 246, 7, and 20 times faster than the relational data model on HBase, MySQL Cluster, and MongoDB, respectively. The single nucleotide polymorphism implementation of CGDM on HBase outperformed the relational model on HBase and MySQL Cluster by up to 351 and 9 times. The raw sequence implementation of CGDM on HBase gains up to 440-fold and 22-fold speedups compared to the sequence alignment map format implemented in HBase and a binary alignment map server. The integration into tranSMART shows up to a 7-fold speedup in the data export function. In addition, a popular hierarchical clustering algorithm in tranSMART has been used as an application to indicate how CGDM can influence the velocity of the algorithm.
The optimized method using CGDM performs more than 7 times faster than the same method using the relational model implemented in MySQL Cluster.
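The key-value design described above can be illustrated with a small sketch. In a distributed ordered table, rows are sorted by key, so a well-chosen composite row key turns "all values for one gene" into a single contiguous range scan. The key layout and in-memory table below are illustrative only, not the actual CGDM schema or the HBase API:

```python
# Sketch: composite row keys for an ordered key-value store (DOT-style).
# Rows sorted by key let one prefix scan answer "all values for gene G"
# without touching other regions. Key layout here is hypothetical.
from bisect import bisect_left, bisect_right

def make_key(gene: str, sample: str) -> str:
    # Fixed-width fields keep lexicographic order meaningful.
    return f"{gene:<12}|{sample:<10}"

class OrderedTable:
    def __init__(self):
        self._keys = []   # sorted row keys (stands in for region ordering)
        self._vals = {}

    def put(self, key: str, value) -> None:
        i = bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)
        self._vals[key] = value

    def scan_prefix(self, prefix: str):
        # Range scan: everything in [prefix, prefix + '\xff')
        lo = bisect_left(self._keys, prefix)
        hi = bisect_right(self._keys, prefix + "\xff")
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]

table = OrderedTable()
table.put(make_key("BRCA1", "S001"), 7.2)
table.put(make_key("BRCA1", "S002"), 5.9)
table.put(make_key("TP53", "S001"), 3.1)

# One prefix scan retrieves all expression values for BRCA1:
brca1 = table.scan_prefix("BRCA1")
```

In HBase terms, the prefix scan corresponds to a scan bounded by start and stop rows; CGDM's clustering index tables plausibly extend the same idea by maintaining additional tables keyed on other attributes, so that each common query pattern enjoys a contiguous layout.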

    From shallow sands to deep-sea trenches: Towards integrative systematics of Solenogastres (Aplacophora, Mollusca)

    The marine realm encompasses a plethora of habitats: from light-flooded tropical coral reefs down to chemosynthetic vents and seeps and on to oceanic trenches several kilometers below the ocean's surface. Habitat destruction, pollution, and the effects of climate change accelerate rates of species extinction and pose a massive threat to marine ecosystems and biodiversity. Lack of baseline knowledge on species diversity is the key shortfall of current biodiversity research and is especially prevalent among small-sized invertebrates, which constitute the larger part of global metazoan biodiversity. Solenogastres (or Neomeniomorpha), an enigmatic class of molluscs, are one of these understudied and neglected marine taxa. Instead of bearing a shell, these worm-shaped molluscs are densely covered in aragonitic spicules (the scleritome). They have been found from the tropics to the poles and occur from shallow waters down to the deep sea, with a peak in diversity along the continental shelves. Despite their circumglobal occurrence, fewer than 300 species of Solenogastres have been described in the 150 years since their first discovery. However, natural history collections alone have been estimated to contain at least ten times more undescribed species than are currently known. Taxonomy of Solenogastres is laborious, requiring a mosaic of morphological and anatomical characters even for higher classification, and is thus considered notoriously complex among zoologists. Novel approaches to characterize solenogaster diversity are urgently needed in order to catch up with discovery rates and modernize the taxonomic process. During my dissertation, I aimed to explore the diversity and evolution of Solenogastres in two understudied marine environments: the shallow-water interstitial habitat (i.e. the pore spaces between sand grains) and the deep oceans beyond the bathyal zone.
For this purpose, I developed a novel integrative taxonomic workflow combining the morphological characters of traditional taxonomy with DNA barcoding for molecular approaches to species delineation, supplemented with state-of-the-art anatomical 3D reconstructions of selected key lineages. My dissertation research is based on Solenogastres collected by colleagues and myself during sampling trips targeting marine interstitial malacofauna in Bermuda, Hawaii, the Azores, and Honshu and Okinawa (Japan). I joined two out of a series of four international deep-sea expeditions collecting benthic fauna in the Northwest Pacific, sampling across a depth range from 1,600 m down to almost 10,000 m in the Kuril-Kamchatka Trench. Overall, these expeditions covered different areas of the Northwest Pacific of varying geological age and stages of isolation. Additional material was made available through the natural history collection of the Section Mollusca, Bavarian State Collection of Zoology (SNSB-ZSM München), resulting in a total of 347 Solenogastres investigated during the course of my dissertation. Based on my work, we are now able to identify the main clades of meiofaunal Solenogastres, a first step towards elucidating the clade's global diversity in the interstitial habitat. The discovery of a putatively widely distributed mesopsammic lineage of Dondersiidae (order Pholidoskepia) at sampling sites in the Atlantic and Pacific is challenged by the presence of co-occurring morphologically cryptic species revealed through anatomical 3D reconstructions. This highlights 1) the risk of chimeric species descriptions if several individuals are used to extract all sets of taxonomically relevant characters and 2) the importance of molecular data to reliably test hypotheses on conspecificity and distribution patterns in this taxonomically challenging group. Northwest Pacific Solenogastres were delineated based on unique morphological characters (i.e.
scleritome data) and, where possible, cross-validated via molecular phylogenetic analyses. This integrative approach resulted in 60 candidate species across regions and depth zones in the Northwest Pacific (an additional 13 candidate species lack molecular data), with the majority constituting species new to science. Their diversity covers all four orders, at least nine families, and 15 genera, presenting an immense boost in regional diversity. On a global scale, the number of abyssal Solenogastres has been more than doubled by these studies, and the animals collected from the bottom of the Kuril-Kamchatka Trench provide the first evidence of this molluscan class from the hadal zone and hold its depth record at almost 10,000 meters. The established baseline dataset of alpha-diversity from adjacent areas and depth zones enabled a first glimpse into distribution patterns. While there was overall little faunal overlap between the investigated regions and depths, several unique links were revealed: 1) across depth, by a eurybathic species occurring in the Kuril Basin (3,350 m) and at the bottom of the trench (9,580 m); 2) across the Kuril-Kamchatka Trench: Kruppomenia genslerae Ostermair, Brandt, Haszprunar, Jörger & Bergmeier, 2018 was found in the Sea of Okhotsk and on the open abyssal plain, indicating that a hadal trench does not pose an insurmountable dispersal barrier for benthic invertebrates; and 3) potentially across oceans: anatomical investigations suggest that an abyssal species from the Atlantic is also present on the Northwest Pacific plain, although molecular data from the putative Atlantic conspecifics to support a pan-oceanic distribution are lacking. In order to gain insights into the feeding ecology of deep-sea Solenogastres, we sequenced their gut contents from genomic DNA extracts. This molecular approach showed that they are highly specialized micropredators with taxon-specific prey preferences.
While anthozoan and hydrozoan cnidarians have generally been assumed to be the main food source of Solenogastres, Siphonophora, Nemertea, Annelida, and Bivalvia have now been added to their menu. The molecular phylogeny used as a backbone for our integrative approach to characterizing their diversity also has several implications for solenogaster systematics. As two fast-evolving mitochondrial markers were used in its analyses without counterbalancing conservative markers, the phylogeny cannot reliably resolve deep relationships within a group that has been hypothesized to date back to the early Paleozoic. Nevertheless, as our dataset contains multiple species and genera across several families, we were able to test the validity of existing taxonomic units: several classificatory entities (i.e. the largest order, Cavibelonia, and the families Acanthomeniidae and Pruvotinidae) were retrieved as polyphyletic, which will necessitate major systematic revisions in the future. The integrative approach developed during my dissertation allows for fast and efficient species delineation. Scleritome characters were chosen as the main morphological trait, as they are comparatively easy to access and provide the necessary link to the existing classificatory system, preventing a parallel system of DNA-based taxonomy. At the same time, reducing the number of required characters presents an efficient solution when confronted with small-sized animals and high proportions of singletons, which hamper the use of single individuals for multiple lines of investigation (e.g. morphology, anatomy, DNA). Our community-curated online database AplacBase currently serves as an openly accessible repository and initial identification tool, providing supporting information and guiding researchers through the essentials of aplacophoran taxonomy.
However, in order to overcome the taxonomic deficits prevalent in Solenogastres, novel approaches need to aim beyond the characterization of their diversity and provide efficient solutions to the currently complicated process of species description and diagnosis. Based on a backbone phylogeny stabilized by mitochondrial genomes, a streamlined approach combining “deep taxonomy” with rapid, DNA-based taxonomy is proposed to tackle the emerging wealth of novel Solenogastres species.
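The DNA barcoding component of the workflow above rests on clustering specimens by pairwise sequence distance. A toy sketch of threshold-based ("barcoding gap") clustering, using made-up 20-bp sequences and a hypothetical 3% p-distance cutoff rather than real COI barcodes:

```python
# Sketch: naive p-distance clustering of barcode-style sequences.
# Sequences and the 3% threshold are illustrative, not real data.
def p_distance(a: str, b: str) -> float:
    # Proportion of differing sites over aligned, equal-length sequences.
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster(seqs: dict, threshold: float = 0.03):
    # Single-linkage: join specimens whose distance falls below threshold.
    names = list(seqs)
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) < threshold:
                parent[find(b)] = find(a)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

barcodes = {
    "specimen_A": "ACGTACGTACGTACGTACGT",
    "specimen_B": "ACGTACGTACGTACGTACGA",  # 1 of 20 sites differs (5%)
    "specimen_C": "ACGTACGTACGTACGTACGT",  # identical to A
}
print(cluster(barcodes))  # → [['specimen_A', 'specimen_C'], ['specimen_B']]
```

In practice, barcode-based delineation uses dedicated methods (e.g. distance- or tree-based species delimitation) on aligned markers; the point of the sketch is only that a distance threshold partitions specimens into candidate species, which are then cross-validated against scleritome characters.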

    Fine-Grained Provenance And Applications To Data Analytics Computation

    Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks for data types such as strings, images, etc. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, one that enables future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. On the other hand, the provenance collected over time or within the same query computation tends to be partially repeated (i.e., the same operation with the same input records in an intermediate computation step). Hence, we desire efficient provenance storage that compresses repeated results. We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluations. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
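The two storage requirements named above, tamper resistance via cryptographic hashes and compression of partially repeated provenance, can be combined in one structure: content-addressed records deduplicate repeats, while a hash chain over record ids makes the log tamper-evident. A minimal sketch with illustrative record fields, not PROVision's actual format:

```python
# Sketch: content-addressed provenance log with a tamper-evident chain.
# Identical (operation, inputs, output) records hash to the same id, so
# repeats cost one stored entry; each log link folds in the prior digest.
import hashlib
import json

class ProvenanceStore:
    def __init__(self):
        self.records = {}        # content hash -> record body (deduplicated)
        self.chain = ["0" * 64]  # rolling tamper-evidence digest

    def record(self, operation: str, inputs: list, output) -> str:
        body = json.dumps(
            {"op": operation, "in": inputs, "out": output},
            sort_keys=True,      # canonical form so equal records hash equal
        )
        rid = hashlib.sha256(body.encode()).hexdigest()
        if rid not in self.records:      # duplicates are stored once
            self.records[rid] = body
        link = hashlib.sha256((self.chain[-1] + rid).encode()).hexdigest()
        self.chain.append(link)
        return rid

    def verify(self, rids: list) -> bool:
        # Recompute the chain from the claimed sequence of record ids.
        digest = "0" * 64
        for rid in rids:
            digest = hashlib.sha256((digest + rid).encode()).hexdigest()
        return digest == self.chain[-1]

store = ProvenanceStore()
r1 = store.record("join", ["orders.csv", "users.csv"], "joined.csv")
r2 = store.record("join", ["orders.csv", "users.csv"], "joined.csv")
# The repeated step deduplicates to a single stored record,
# while the chain still witnesses both executions:
assert r1 == r2 and len(store.records) == 1
assert store.verify([r1, r2])
```

Any attempt to silently drop, alter, or reorder a recorded step changes the recomputed digest, so verification fails; this mirrors, in miniature, how hashing can protect archived provenance while deduplication compresses it.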

    Modern Systems for Large-scale Genomics Data Analysis in the Cloud

    Genomics researchers increasingly turn to cloud computing as a means of accomplishing large-scale analyses efficiently and cost-effectively. Successful operation in the cloud requires careful instrumentation and management to avoid common pitfalls, such as resource bottlenecks and low utilisation, which can both drive up costs and extend the timeline of a scientific project. We developed the Butler framework for large-scale scientific workflow management in the cloud to meet these challenges. The cornerstones of Butler's design are: the ability to support multiple clouds, declarative infrastructure configuration management, scalable and fault-tolerant operation, comprehensive resource monitoring, and automated error detection and recovery. Butler relies on industry-strength open-source components in order to deliver a framework that is robust and scalable to thousands of compute cores and millions of workflow executions. Butler's error detection and self-healing capabilities are unique among scientific workflow frameworks and ensure that analyses are carried out with minimal human intervention. Butler has been used to analyse over 725 TB of DNA sequencing data on the cloud, using 1,500 CPU cores and 6 TB of RAM, delivering results with 43% increased efficiency compared to other tools. The flexible design of this framework allows easy adoption within other fields of the Life Sciences and ensures that it will scale together with the demand for scientific analysis in the cloud for years to come. Because many bioinformatics tools have been developed in the context of small sample sizes, they often struggle to keep up with the demands of large-scale data processing required for modern research and clinical sequencing projects due to limitations in their design. The Rheos software system is designed specifically with these large data sets in mind.
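Automated error detection and recovery, as credited to Butler above, typically reduces to supervising each workflow step and re-submitting transient failures with bounded retries. A simplified sketch; the function names and retry policy are hypothetical, not Butler's actual API:

```python
# Sketch: self-healing task execution with bounded retries.
# A supervisor wraps each workflow step; transient failures trigger
# automatic re-submission instead of halting the whole analysis.
import time

def run_with_recovery(task, max_retries: int = 3, backoff: float = 0.0):
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_retries:
                raise                       # escalate after N failures
            time.sleep(backoff * attempts)  # simple linear backoff

# A flaky step that fails twice before succeeding:
state = {"calls": 0}
def flaky_alignment_step():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient node failure")
    return "aligned.bam"

result, attempts = run_with_recovery(flaky_alignment_step)
assert result == "aligned.bam" and attempts == 3
```

A production framework layers health checks, metrics, and re-provisioning of failed virtual machines on top of this core loop, but the principle, detect, retry, escalate, is the same.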
Utilising the elastic compute capacity of modern academic and commercial clouds, Rheos takes a service-oriented, containerised approach to the implementation of modern bioinformatics algorithms, which allows the software to achieve the scalability and ease of use required to succeed under the increased operational load of the massive data sets generated by projects like the International Cancer Genome Consortium (ICGC) ARGO and the All of Us initiative. Rheos algorithms are based on an innovative stream-based approach to processing genomic data, which enables Rheos to make faster decisions about the presence of genomic mutations that drive diseases such as cancer, thereby improving the tools' efficacy and relevance to clinical sequencing applications. Our testing of the novel germline single nucleotide polymorphism (SNP) and deletion variant calling algorithms developed within Rheos indicates that Rheos achieves ~98% accuracy in SNP calling and ~85% accuracy in deletion calling, which is comparable with other leading tools such as the Genome Analysis Toolkit (GATK), freebayes, and Delly. The two frameworks that we developed provide important contributions to solving the ever-growing need for large-scale genomic data analysis in the cloud: by enabling more effective use of existing tools, in the case of Butler, and by providing a new, more dynamic and real-time approach to genomic analysis, in the case of Rheos.
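The stream-based approach attributed to Rheos can be sketched in miniature: instead of waiting for a complete pileup, a caller updates per-position allele counts as read bases arrive and emits a call as soon as the evidence clears a threshold. The thresholds and data layout below are illustrative, not Rheos's actual algorithm:

```python
# Sketch: stream-based SNP detection over incrementally arriving reads.
# A call is emitted as soon as the alternate-allele fraction clears a
# (hypothetical) threshold, without waiting for the full dataset.
from collections import Counter

def stream_snp_caller(reference: str, read_stream, min_depth=4, min_frac=0.8):
    counts = {i: Counter() for i in range(len(reference))}
    calls = {}
    for pos, base in read_stream:        # (position, observed base) events
        counts[pos][base] += 1
        depth = sum(counts[pos].values())
        allele, n = counts[pos].most_common(1)[0]
        if (allele != reference[pos] and depth >= min_depth
                and n / depth >= min_frac and pos not in calls):
            calls[pos] = allele          # decide mid-stream
    return calls

ref = "ACGT"
reads = [(1, "T"), (1, "T"), (1, "T"), (1, "T"), (2, "G"), (2, "G")]
calls = stream_snp_caller(ref, iter(reads))
assert calls == {1: "T"}   # position 1: C -> T, depth 4, 100% alternate
```

Real callers weigh base and mapping qualities and emit genotype likelihoods rather than hard calls, but the streaming structure, incremental counts plus an early decision rule, is what allows results to be delivered before all data has landed.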

    Annual Report
