4 research outputs found
A Secure Data Enclave and Analytics Platform for Social Scientists
Data-driven research is increasingly ubiquitous and data itself is a defining asset for researchers, particularly in the computational social sciences and humanities. Entire careers and research communities are built around valuable, proprietary or sensitive datasets. However, many existing computation resources fail to support secure and cost-effective storage of data while also enabling secure and flexible analysis of the data. To address these needs we present CLOUD KOTTA, a cloud-based architecture for the secure management and analysis of social science data. CLOUD KOTTA leverages reliable, secure, and scalable cloud resources to deliver capabilities to users, and removes the need for users to manage complicated infrastructure.CLOUD KOTTA implements automated, cost-aware models for efficiently provisioning tiered storage and automatically scaled compute resources.CLOUD KOTTA has been used in production for several months and currently manages approximately 10TB of data and has been used to process more than 5TB of data with over 75,000 CPU hours. It has been used for a broad variety of text analysis workflows, matrix factorization, and various machine learning algorithms, and more broadly, it supports fast, secure and cost-effective research
Enabling Interactive Analytics of Secure Data using Cloud Kotta
Research, especially in the social sciences and humanities, is increasingly
reliant on the application of data science methods to analyze large amounts of
(often private) data. Secure data enclaves provide a solution for managing and
analyzing private data. However, such enclaves do not readily support discovery
science---a form of exploratory or interactive analysis by which researchers
execute a range of (sometimes large) analyses in an iterative and collaborative
manner. The batch computing model offered by many data enclaves is well suited
to executing large compute tasks; however it is far from ideal for day-to-day
discovery science. As researchers must submit jobs to queues and wait for
results, the high latencies inherent in queue-based, batch computing systems
hinder interactive analysis. In this paper we describe how we have augmented
the Cloud Kotta secure data enclave to support collaborative and interactive
analysis of sensitive data. Our model uses Jupyter notebooks as a flexible
analysis environment and Python language constructs to support the execution of
arbitrary functions on private data within this secure framework.Comment: To appear in Proceedings of Workshop on Scientific Cloud Computing,
Washington, DC USA, June 2017 (ScienceCloud 2017), 7 page
Computing environments for reproducibility: Capturing the 'Whole Tale'
The act of sharing scientific knowledge is rapidly evolving away from traditional articles and presentations to the delivery of executable objects that integrate the data and computational details (e.g., scripts and workflows) upon which the findings rely. This envisioned coupling of data and process is essential to advancing science but faces technical and institutional barriers. The Whole Tale project aims to address these barriers by connecting computational, data-intensive research efforts with the larger research process—transforming the knowledge discovery and dissemination process into one where data products are united with research articles to create “living publications” or tales. The Whole Tale focuses on the full spectrum of science, empowering users in the long tail of science, and power users with demands for access to big data and compute resources. We report here on the design, architecture, and implementation of the Whole Tale environment