Mark 3 interactive data analysis system
The interactive data analysis system, a major subset of the total Mark 3 very long baseline interferometry (VLBI) software system, is described. The system consists of two major programs and a number of smaller ones. These programs provide for the scientific analysis of the observed values of delay and delay rate generated by the VLBI data reduction programs and produce the geophysical and astrometric parameters which are among the ultimate products of VLBI. The two major programs are CALC and SOLVE. CALC generates the theoretical values of VLBI delay and delay rate, as well as partial derivatives, based on a priori values of the geophysical and astrometric parameters. SOLVE is a least-squares parameter estimation program which yields the geophysical and astrometric parameters using the observed values produced by the data processing system and the theoretical values and partial derivatives provided by CALC. SOLVE is a highly interactive program in which the user selects the exact form of the recovered parameters and the data to be accepted into the solution.
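For illustration only (this notation is assumed here and does not appear in the abstract): a least-squares program like SOLVE linearizes about the a priori parameters x_0, forming residuals from observed minus CALC-computed values together with the matrix of partial derivatives, then solves the weighted normal equations:

\[
A_{ij} = \frac{\partial \tau_i}{\partial x_j}\bigg|_{x_0},
\qquad
\widehat{\delta x} = \left(A^\top W A\right)^{-1} A^\top W\,(o - c),
\qquad
\hat{x} = x_0 + \widehat{\delta x},
\]

where $o$ are the observed delays and delay rates, $c$ the theoretical values from CALC, and $W$ a weight matrix derived from the measurement uncertainties.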
Preventing False Discovery in Interactive Data Analysis is Hard
We show that, under a standard hardness assumption, there is no
computationally efficient algorithm that, given samples from an unknown
distribution, can give valid answers to adaptively chosen
statistical queries. A statistical query asks for the expectation of a
predicate over the underlying distribution, and an answer to a statistical
query is valid if it is "close" to the correct expectation over the
distribution.
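In symbols (notation introduced here for illustration; the abstract states this in words): a statistical query is specified by a predicate $\phi$, its true answer on the distribution $P$ is

\[
q_\phi(P) = \operatorname*{\mathbb{E}}_{x \sim P}\left[\phi(x)\right],
\]

and an answer $a$ is valid to tolerance $\tau$ if $|a - q_\phi(P)| \le \tau$.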
Our result stands in stark contrast to the well known fact that exponentially
many statistical queries can be answered validly and efficiently if the queries
are chosen non-adaptively (no query may depend on the answers to previous
queries). Moreover, a recent work by Dwork et al. shows how to accurately
answer exponentially many adaptively chosen statistical queries via a
computationally inefficient algorithm; and how to answer a quadratic number of
adaptive queries via a computationally efficient algorithm. The latter result
implies that our result is tight up to a linear factor in the number of samples.
Conceptually, our result demonstrates that achieving statistical validity
alone can be a source of computational intractability in adaptive settings. For
example, in the modern large collaborative research environment, data analysts
typically choose a particular approach based on previous findings. False
discovery occurs if a research finding is supported by the data but not by the
underlying distribution. While the study of preventing false discovery in
Statistics is decades old, to the best of our knowledge our result is the first
to demonstrate a computational barrier. In particular, our result suggests that
the perceived difficulty of preventing false discovery in today's collaborative
research environment may be inherent.
Towards Natural Language Empowered Interactive Data Analysis
The recent advances in natural language based interaction methodologies offer promising avenues to enhance the interactive processes within the human-machine dialogue of visual analytics. We envisage Multimodal Data Analytics as a novel approach for conducting data analysis that builds on the strengths of visual analytics and natural language as an expressive interaction channel. We investigate the potential enhancements from such a multimodal approach and discuss a preliminary outline for a structured methodology to study the role of natural language in data analytics. Our approach builds on a simple model of human-machine dialogue for interactive data analysis, which we then propose to instantiate as visual analytics workflows: representations to study and operationalise interactive data analysis routines empowered by natural language interaction.
Interactive Data Analysis of Multi-Run Performance Data
Multi-dimensional performance data analysis presents challenges for developers, users, and site operators. Developers have to choose library and compiler options for each platform, analyze raw performance data, and keep up with new technologies. Users run codes on different platforms, validate results with collaborators, and analyze performance data as applications scale up. Site operators use multiple profiling tools to optimize performance, requiring the analysis of multiple sources and data types. There is currently no comprehensive tool to support the structured analysis of unstructured data, even though holistic performance data analysis can offer actionable insights and improve performance. In this work, we present Thicket, a tool designed, based on the experiences and insights of developers and users, to address these needs. Thicket is a Python-based data analysis toolkit that aims to make performance data exploration more accessible and user-friendly for application code developers, users, and site operators. It achieves this by providing a comprehensive interface that allows for the easy manipulation, modeling, and visualization of data collected from multiple tools and executions. The central element of Thicket is the "thicket object," which unifies data from multiple sources and allows for various data manipulation and modeling operations, including filtering, grouping, querying, and statistical operations. Thicket also supports the use of external libraries such as scikit-learn and Extra-P for data modeling and visualization in an intuitive call tree context. Overall, Thicket aims to help users make better decisions about their application's performance by providing actionable insights from complex and multi-dimensional performance data. Here, we present some capabilities provided by the components of Thicket and important use cases that have implications beyond the data structures that provide these capabilities.
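The abstract describes the thicket object's unify-then-analyze workflow. The sketch below is illustrative only: it uses plain pandas with made-up timings, not Thicket's actual API, to show the kind of cross-run filtering, grouping, and statistics involved.

```python
# Illustrative sketch, not Thicket's API: unify per-run profiles into one
# table, then filter/group/aggregate across runs. All data is made up.
import pandas as pd

runs = {
    "run_A_64ranks": {"main": 12.0, "solve": 9.1, "io": 1.4},
    "run_B_128ranks": {"main": 7.5, "solve": 5.2, "io": 1.9},
}

# One row per (call-tree node, run), mirroring a unified performance index.
df = pd.DataFrame(
    [(node, run, t) for run, prof in runs.items() for node, t in prof.items()],
    columns=["node", "run", "time_s"],
)

hot = df[df["time_s"] > 1.0]                                        # filtering
stats = hot.groupby("node")["time_s"].agg(["mean", "std", "max"])   # statistics
print(stats)
```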
Interactive Data Analysis with Next-step Natural Language Query Recommendation
Natural language interfaces (NLIs) provide users with a convenient way to
interactively analyze data through natural language queries. Nevertheless,
interactive data analysis is a demanding process, especially for novice data
analysts. When exploring large and complex SQL databases from different
domains, data analysts do not necessarily have sufficient knowledge about
different data tables and application domains. This makes it difficult for them to
systematically elicit a series of topically related and meaningful queries for
insight discovery in target domains. We develop an NLI with a step-wise query
recommendation module to assist users in choosing appropriate next-step
exploration actions. The system adopts a data-driven approach to suggest
semantically relevant and context-aware queries for application domains of
users' interest based on their query logs. Also, the system helps users
organize query histories and results into a dashboard to communicate the
discovered data insights. With a comparative user study, we show that our
system can facilitate a more effective and systematic data analysis process
than a baseline without the recommendation module.
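As a rough illustration of a data-driven, log-based recommender (an assumed sketch, not the paper's model), one can rank logged queries by similarity to the current query and surface the top matches as next-step suggestions:

```python
# Minimal sketch (assumed, not the paper's method): recommend next-step
# queries by TF-IDF similarity between the current query and a query log.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query_log = [  # placeholder log entries
    "show total sales by region",
    "top 10 products by revenue",
    "monthly revenue trend for 2020",
    "average order value by customer segment",
]

def recommend_next(current_query: str, k: int = 2) -> list[str]:
    vec = TfidfVectorizer()
    log_matrix = vec.fit_transform(query_log)
    sims = cosine_similarity(vec.transform([current_query]), log_matrix)[0]
    return [query_log[i] for i in sims.argsort()[::-1][:k]]

print(recommend_next("sales by region last quarter"))
```

A full system would also condition suggestions on session context and the database schema, in line with the paper's context-aware approach.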
Rosetta: a container-centric science platform for resource-intensive, interactive data analysis
Rosetta is a science platform for resource-intensive, interactive data analysis which runs user tasks as software containers. It is built on top of a novel architecture based on framing user tasks as microservices (independent and self-contained units), which allows it to fully support custom and user-defined software packages, libraries and environments. These include complete remote desktop and GUI applications, in addition to common analysis environments such as Jupyter Notebooks. Rosetta relies on Open Container Initiative containers, which allow for safe, effective and reproducible code execution; it can use a number of container engines and runtimes, and seamlessly supports several workload management systems, thus enabling containerized workloads on a wide range of computing resources. Although developed in the astronomy and astrophysics space, Rosetta can virtually support any science and technology domain where resource-intensive, interactive data analysis is required.
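A minimal sketch of the task-as-container idea (illustrative only, not Rosetta's code; the image name and port are placeholders) might launch a user's analysis environment through a container engine:

```python
# Illustrative sketch, not Rosetta's implementation: run a user task as a
# container with its own software environment, here via the Docker CLI.
import subprocess

def run_task(image: str, host_port: int = 8888) -> None:
    """Launch an interactive analysis container (e.g., a Jupyter image)."""
    subprocess.run(
        ["docker", "run", "--rm",
         "-p", f"{host_port}:8888",  # expose the in-container service
         image],
        check=True,
    )

# run_task("jupyter/scipy-notebook")  # placeholder image name
```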
Geophysical and astronomical models applied in the analysis of very long baseline interferometry
Very long baseline interferometry presents an opportunity to measure, at the centimeter level, such geodetic parameters as baseline length and instantaneous pole position. In order to achieve such precision, the geophysical and astronomical models used in data analysis must be as accurate as possible. The Mark-3 interactive data analysis system includes a number of refinements beyond conventional practice in modeling precession, nutation, diurnal polar motion, UT1, solid Earth tides, relativistic light deflection, and reduction to solar system barycentric coordinates. The algorithms and their effects on the recovered geodetic, geophysical, and astrometric parameters are discussed.
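For context, one standard form of the relativistic light-deflection (gravitational) delay used in VLBI modeling is shown below; the notation is assumed here and is not taken from the abstract. With $K$ the unit vector toward the radio source and $R_1$, $R_2$ the vectors from the Sun to the two stations,

\[
\Delta\tau_{\mathrm{grav}} \approx \frac{(1+\gamma)\,G M_\odot}{c^{3}}
\,\ln\frac{|R_1| + K \cdot R_1}{|R_2| + K \cdot R_2},
\]

where $\gamma$ is the PPN parameter, equal to 1 in general relativity.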
Visualization Techniques for Tongue Analysis in Traditional Chinese Medicine
Visual inspection of the tongue has been an important diagnostic method of Traditional Chinese Medicine (TCM). Clinical data have shown significant connections between various visceral cancers and abnormalities in the tongue and the tongue coating. Visual inspection of the tongue is simple and inexpensive, but the current practice in TCM is mainly experience-based and the quality of the visual inspection varies between individuals. The computerized inspection method provides quantitative models to evaluate color, texture and surface features on the tongue. In this paper, we investigate visualization techniques and processes to allow interactive data analysis, with the aim of merging computerized measurements with human experts' diagnostic variables based on five diagnostic conditions: Healthy (H), History of Cancers (HC), History of Polyps (HP), Polyps (P) and Colon Cancer (C).
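As a hedged illustration of the kind of quantitative color and texture features such a computerized inspection model might compute (the feature choices here are assumptions, not the paper's models):

```python
# Illustrative sketch: simple color/texture features from a tongue image.
# The random array stands in for a real image region; features are assumed.
import numpy as np

rng = np.random.default_rng(0)
tongue_rgb = rng.integers(0, 256, size=(64, 64, 3)).astype(float)

# Color features: per-channel mean and standard deviation.
color_mean = tongue_rgb.mean(axis=(0, 1))
color_std = tongue_rgb.std(axis=(0, 1))

# A crude texture feature: mean absolute intensity gradient.
gray = tongue_rgb.mean(axis=2)
texture = (np.abs(np.diff(gray, axis=0)).mean()
           + np.abs(np.diff(gray, axis=1)).mean())

print(color_mean, color_std, texture)
```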
Downdating a time-varying square root information filter
A new method to efficiently downdate an estimate and covariance generated by a discrete-time Square Root Information Filter (SRIF) is presented. The method combines the QR factor downdating algorithm of Gill and the decentralized SRIF algorithm of Bierman. Efficient removal of either measurements or a priori information is possible without loss of numerical integrity. Moreover, the method includes features for detecting potential numerical degradation. Performance on a 300-parameter system with 5800 data points shows that the method can be used in real time and hence is a promising tool for interactive data analysis. Additionally, updating a time-varying SRIF with either additional measurements or a priori information proceeds analogously.
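For intuition, the core of a Gill-style QR-factor downdate can be written as a rank-one downdate of the triangular factor via hyperbolic rotations. The sketch below is a generic version of that building block, not the paper's SRIF implementation:

```python
# Rank-one downdate of an upper-triangular factor R via hyperbolic
# rotations: returns Rd with Rd.T @ Rd == R.T @ R - outer(z, z).
import numpy as np

def chol_downdate(R, z):
    R, z = R.astype(float).copy(), z.astype(float).copy()
    n = R.shape[0]
    for k in range(n):
        r2 = R[k, k] ** 2 - z[k] ** 2
        if r2 <= 0.0:  # analogous to the degradation checks the abstract mentions
            raise ValueError("downdate would make the factor indefinite")
        r = np.sqrt(r2)
        c, s = r / R[k, k], z[k] / R[k, k]
        R[k, k] = r
        R[k, k + 1:] = (R[k, k + 1:] - s * z[k + 1:]) / c
        z[k + 1:] = c * z[k + 1:] - s * R[k, k + 1:]
    return R

# Removing the last data row of A from its QR factor:
A = np.random.default_rng(1).normal(size=(6, 3))
R = np.linalg.qr(A, mode="r")
Rd = chol_downdate(R, A[-1])
assert np.allclose(Rd.T @ Rd, A[:-1].T @ A[:-1])
```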