Pruning and Nonparametric Multiple Change Point Detection
Change point analysis is a statistical tool to identify homogeneity within
time series data. We propose a pruning approach for approximate nonparametric
estimation of multiple change points. This general purpose change point
detection procedure `cp3o' applies a pruning routine within a dynamic program
to greatly reduce the search space and computational costs. Existing
goodness-of-fit change point objectives can immediately be utilized within the
framework. We further propose novel change point algorithms by applying cp3o to
two popular nonparametric goodness of fit measures: `e-cp3o' uses E-statistics,
and `ks-cp3o' uses Kolmogorov-Smirnov statistics. Simulation studies highlight
the performance of these algorithms in comparison with parametric and other
nonparametric change point methods. Finally, we illustrate these approaches
with climatological and financial applications.
Comment: 9 pages. arXiv admin note: text overlap with arXiv:1505.0430
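The KS-based goodness-of-fit idea can be illustrated with a much-simplified single-change-point scan (the full ks-cp3o algorithm embeds the statistic in a pruned dynamic program over multiple change points; the function names and the minimum-segment-length parameter below are illustrative, not from the paper):

```python
import bisect

def ks_stat(x, y):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    # the empirical CDFs of the two samples.
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for t in set(xs + ys):
        fx = bisect.bisect_right(xs, t) / len(xs)
        fy = bisect.bisect_right(ys, t) / len(ys)
        d = max(d, abs(fx - fy))
    return d

def best_single_change_point(series, min_seg=5):
    # Scan candidate split points; keep the split whose two segments
    # look most different under the KS statistic.
    best = max(range(min_seg, len(series) - min_seg + 1),
               key=lambda tau: ks_stat(series[:tau], series[tau:]))
    return best, ks_stat(series[:best], series[best:])

tau, d = best_single_change_point([0.0] * 20 + [5.0] * 20)
```

A clean level shift of this kind yields the maximal statistic of 1.0 at the true split; the dynamic program generalizes this scan to several change points while the pruning step discards candidate splits that cannot improve the objective.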
A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data
Many interesting data sets available on the Internet are of a medium
size---too big to fit into a personal computer's memory, but not so large that
they won't fit comfortably on its hard disk. In the coming years, data sets of
this magnitude will inform vital research in a wide array of application
domains. However, due to a variety of constraints they are cumbersome to
ingest, wrangle, analyze, and share in a reproducible fashion. These
obstructions hamper thorough peer-review and thus disrupt the forward progress
of science. We propose a predictable and pipeable framework for R (the
state-of-the-art statistical computing environment) that leverages SQL (the
venerable database architecture and query language) to make reproducible
research on medium data a painless reality.
Comment: 30 pages, plus supplementary material
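The extract-transform-load shape the abstract describes can be sketched in Python with the standard-library sqlite3 module (the paper's framework is for R and a full SQL backend; the table name, columns, and sample data below are purely illustrative):

```python
import csv
import io
import sqlite3

RAW = """city,year,temp_c
Boston,2015,10.6
Boston,2016,11.2
Albany,2015,9.4
"""

def extract(text):
    # Extract: parse raw CSV text into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: coerce types and derive a Fahrenheit column.
    return [(r["city"], int(r["year"]), float(r["temp_c"]),
             round(float(r["temp_c"]) * 9 / 5 + 32, 1)) for r in rows]

def load(conn, rows):
    # Load: push the cleaned rows into a SQL table so later analysis
    # is a reproducible query rather than an in-memory one-off.
    conn.execute("CREATE TABLE IF NOT EXISTS temps "
                 "(city TEXT, year INT, temp_c REAL, temp_f REAL)")
    conn.executemany("INSERT INTO temps VALUES (?, ?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
avg = conn.execute(
    "SELECT AVG(temp_c) FROM temps WHERE city = 'Boston'").fetchone()[0]
```

Because each stage is a plain function over the previous stage's output, the pipeline is composable in the "pipeable" sense the abstract emphasizes, and the loaded table can be shared and re-queried without re-running the ingest.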
Deceit: A flexible distributed file system
Deceit, a distributed file system (DFS) being developed at Cornell, focuses on flexible file semantics in relation to efficiency, scalability, and reliability. Deceit servers are interchangeable and collectively provide the illusion of a single, large server machine to any clients of the Deceit service. Non-volatile replicas of each file are stored on a subset of the file servers. Users can set per-file parameters to achieve different levels of availability, performance, and one-copy serializability. Deceit also supports a file version control mechanism. In contrast with many recent DFS efforts, Deceit can behave like a plain Sun Network File System (NFS) server and can be used by any NFS client without modifying any client software. The current Deceit prototype uses the ISIS Distributed Programming Environment for all communication and process group management, an approach that reduces system complexity and increases system robustness.
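As a toy illustration of per-file replication over interchangeable servers, a deterministic placement rule can pick the subset of servers that holds a file's non-volatile replicas, with the replica count acting as the per-file availability parameter (the rendezvous-style hashing here is an illustrative choice of my own; the abstract does not specify Deceit's placement policy, and the server names are hypothetical):

```python
import hashlib

SERVERS = ["s1", "s2", "s3", "s4", "s5"]  # interchangeable file servers

def replica_set(filename, n_replicas):
    # Rank every server by a hash of (filename, server) and keep the top
    # n_replicas: any node can recompute the same subset with no directory
    # lookup, and raising n_replicas trades storage for availability.
    ranked = sorted(SERVERS,
                    key=lambda s: hashlib.sha256(
                        (filename + "/" + s).encode()).hexdigest())
    return ranked[:n_replicas]
```

A client or server asking "who holds report.txt?" simply recomputes `replica_set("report.txt", k)` and always obtains the same answer.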
Automated data processing architecture for the Gemini Planet Imager Exoplanet Survey
The Gemini Planet Imager Exoplanet Survey (GPIES) is a multi-year direct
imaging survey of 600 stars to discover and characterize young Jovian
exoplanets and their environments. We have developed an automated data
architecture to process and index all data related to the survey uniformly. An
automated and flexible data processing framework, which we term the Data
Cruncher, combines multiple data reduction pipelines together to process all
spectroscopic, polarimetric, and calibration data taken with GPIES. With no
human intervention, fully reduced and calibrated data products are available
less than an hour after the data are taken to expedite follow-up on potential
objects of interest. The Data Cruncher can run on a supercomputer to reprocess
all GPIES data in a single day as improvements are made to our data reduction
pipelines. A backend MySQL database indexes all files, which are synced to the
cloud, and a front-end web server allows for easy browsing of all files
associated with GPIES. To help observers, quicklook displays show reduced data
as they are processed in real-time, and chatbots on Slack post observing
information as well as reduced data products. Together, the GPIES automated
data processing architecture reduces our workload, provides real-time data
reduction, optimizes our observing strategy, and maintains a homogeneously
reduced dataset to study planet occurrence and instrument performance.
Comment: 21 pages, 3 figures, accepted in JATI
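The file-indexing backend can be sketched as a small relational index that the pipeline updates as each reduced product is written (GPIES uses MySQL; the sqlite3 stand-in, the schema, and the file names below are illustrative, not the survey's actual database layout):

```python
import sqlite3

def make_index(conn):
    # Minimal file index: one row per data product (schema is illustrative).
    conn.execute("""CREATE TABLE files (
        path     TEXT PRIMARY KEY,
        obs_date TEXT,
        mode     TEXT,
        reduced  INTEGER)""")

def ingest(conn, path, obs_date, mode, reduced=1):
    # Called after each product is written; INSERT OR REPLACE keeps the
    # index current when a file is reprocessed by a newer pipeline.
    conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                 (path, obs_date, mode, reduced))

conn = sqlite3.connect(":memory:")
make_index(conn)
ingest(conn, "S20180101S0001_spdc.fits", "2018-01-01", "spectroscopy")
ingest(conn, "S20180101S0002_podc.fits", "2018-01-01", "polarimetry")
night = conn.execute(
    "SELECT COUNT(*) FROM files WHERE obs_date = '2018-01-01' "
    "AND reduced = 1").fetchone()[0]
```

With an index of this shape, a front-end web server or a Slack chatbot only has to issue SQL queries to answer "what was reduced last night?"-style questions.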
Transportable Applications Environment (TAE) Plus: A NASA user interface development and management system
The Transportable Applications Environment Plus (TAE Plus), developed at the NASA Goddard Space Flight Center, is a portable, what-you-see-is-what-you-get (WYSIWYG) user interface development and management system. Its primary objective is to provide an integrated software environment that allows interactive prototyping and development of graphical user interfaces, as well as management of the user interface within the operational domain. TAE Plus is being applied to many types of applications. The discussion covers what TAE Plus provides, how its implementation utilizes state-of-the-art technologies on graphics workstations, and how it has been used both within and outside NASA.