1,267 research outputs found
Leveraging OpenStack and Ceph for a Controlled-Access Data Cloud
While traditional HPC has and continues to satisfy most workflows, a new
generation of researchers has emerged looking for sophisticated, scalable,
on-demand, and self-service control of compute infrastructure in a cloud-like
environment. Many also seek safe harbors to operate on or store sensitive
and/or controlled-access data in a high capacity environment.
To cater to these modern users, the Minnesota Supercomputing Institute
designed and deployed Stratus, a locally-hosted cloud environment powered by
the OpenStack platform, and backed by Ceph storage. The subscription-based
service complements existing HPC systems by satisfying the following unmet
needs of our users: a) on-demand availability of compute resources, b)
long-running jobs (i.e., days), c) container-based computing with
Docker, and d) adequate security controls to comply with controlled-access data
requirements.
This document provides an in-depth look at the design of Stratus with respect
to security and compliance with the NIH's controlled-access data policy.
Emphasis is placed on lessons learned while integrating OpenStack and Ceph
features into a so-called "walled garden", and how those technologies
influenced the security design. Many features of Stratus, including tiered
secure storage with the introduction of a controlled-access data "cache",
fault-tolerant live-migrations, and fully integrated two-factor authentication,
depend on recent OpenStack and Ceph features.Comment: 7 pages, 5 figures, PEARC '18: Practice and Experience in Advanced
Research Computing, July 22--26, 2018, Pittsburgh, PA, US
Recommended from our members
Kronos: a workflow assembler for genome analytics and informatics.
BackgroundThe field of next-generation sequencing informatics has matured to a point where algorithmic advances in sequence alignment and individual feature detection methods have stabilized. Practical and robust implementation of complex analytical workflows (where such tools are structured into "best practices" for automated analysis of next-generation sequencing datasets) still requires significant programming investment and expertise.ResultsWe present Kronos, a software platform for facilitating the development and execution of modular, auditable, and distributable bioinformatics workflows. Kronos obviates the need for explicit coding of workflows by compiling a text configuration file into executable Python applications. Making analysis modules would still require programming. The framework of each workflow includes a run manager to execute the encoded workflows locally (or on a cluster or cloud), parallelize tasks, and log all runtime events. The resulting workflows are highly modular and configurable by construction, facilitating flexible and extensible meta-applications that can be modified easily through configuration file editing. The workflows are fully encoded for ease of distribution and can be instantiated on external systems, a step toward reproducible research and comparative analyses. We introduce a framework for building Kronos components that function as shareable, modular nodes in Kronos workflows.ConclusionsThe Kronos platform provides a standard framework for developers to implement custom tools, reuse existing tools, and contribute to the community at large. Kronos is shipped with both Docker and Amazon Web Services Machine Images. It is free, open source, and available through the Python Package Index and at https://github.com/jtaghiyar/kronos
VM-MAD: a cloud/cluster software for service-oriented academic environments
The availability of powerful computing hardware in IaaS clouds makes cloud
computing attractive also for computational workloads that were up to now
almost exclusively run on HPC clusters.
In this paper we present the VM-MAD Orchestrator software: an open source
framework for cloudbursting Linux-based HPC clusters into IaaS clouds but also
computational grids. The Orchestrator is completely modular, allowing flexible
configurations of cloudbursting policies. It can be used with any batch system
or cloud infrastructure, dynamically extending the cluster when needed. A
distinctive feature of our framework is that the policies can be tested and
tuned in a simulation mode based on historical or synthetic cluster accounting
data.
In the paper we also describe how the VM-MAD Orchestrator was used in a
production environment at the FGCZ to speed up the analysis of mass
spectrometry-based protein data by cloudbursting to the Amazon EC2. The
advantages of this hybrid system are shown with a large evaluation run using
about hundred large EC2 nodes.Comment: 16 pages, 5 figures. Accepted at the International Supercomputing
Conference ISC13, June 17--20 Leipzig, German
Rice Galaxy: An open resource for plant science
Background: Rice molecular genetics, breeding, genetic diversity, and allied research (such as rice-pathogen interaction) have adopted sequencing technologies and high-density genotyping platforms for genome variation analysis and gene discovery. Germplasm collections representing rice diversity, improved varieties, and elite breeding materials are accessible through rice gene banks for use in research and breeding, with many having genome sequences and high-density genotype data available. Combining phenotypic and genotypic information on these accessions enables genome-wide association analysis, which is driving quantitative trait loci discovery and molecular marker development. Comparative sequence analyses across quantitative trait loci regions facilitate the discovery of novel alleles. Analyses involving DNA sequences and large genotyping matrices for thousands of samples, however, pose a challenge to non−computer savvy rice researchers. Findings: The Rice Galaxy resource has shared datasets that include high-density genotypes from the 3,000 Rice Genomes project and sequences with corresponding annotations from 9 published rice genomes. The Rice Galaxy web server and deployment installer includes tools for designing single-nucleotide polymorphism assays, analyzing genome-wide association studies, population diversity, rice−bacterial pathogen diagnostics, and a suite of published genomic prediction methods. A prototype Rice Galaxy compliant to Open Access, Open Data, and Findable, Accessible, Interoperable, and Reproducible principles is also presented. Conclusions: Rice Galaxy is a freely available resource that empowers the plant research community to perform state-of-the-art analyses and utilize publicly available big datasets for both fundamental and applied science
Genomics Virtual Laboratory: a practical bioinformatics workbench for the cloud
Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets ; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces ; highly available, scalable computational resources ; and flexibility and versatility in the use of these resources to meet demands and expertise of a variety of users. Access to an appropriate computational platform can be a significant barrier to researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise
From Bare Metal to Virtual: Lessons Learned when a Supercomputing Institute Deploys its First Cloud
As primary provider for research computing services at the University of
Minnesota, the Minnesota Supercomputing Institute (MSI) has long been
responsible for serving the needs of a user-base numbering in the thousands.
In recent years, MSI---like many other HPC centers---has observed a growing
need for self-service, on-demand, data-intensive research, as well as the
emergence of many new controlled-access datasets for research purposes. In
light of this, MSI constructed a new on-premise cloud service, named Stratus,
which is architected from the ground up to easily satisfy data-use agreements
and fill four gaps left by traditional HPC. The resulting OpenStack cloud,
constructed from HPC-specific compute nodes and backed by Ceph storage, is
designed to fully comply with controls set forth by the NIH Genomic Data
Sharing Policy.
Herein, we present twelve lessons learned during the ambitious sprint to take
Stratus from inception and into production in less than 18 months. Important,
and often overlooked, components of this timeline included the development of
new leadership roles, staff and user training, and user support documentation.
Along the way, the lessons learned extended well beyond the technical
challenges often associated with acquiring, configuring, and maintaining
large-scale systems.Comment: 8 pages, 5 figures, PEARC '18: Practice and Experience in Advanced
Research Computing, July 22--26, 2018, Pittsburgh, PA, US
- …