
    Mark 3 interactive data analysis system

    The interactive data analysis system, a major subset of the total Mark 3 very long baseline interferometry (VLBI) software system, is described. The system consists of two major programs and a number of smaller ones. These programs provide for the scientific analysis of the observed values of delay and delay rate generated by the VLBI data reduction programs and produce the geophysical and astrometric parameters which are among the ultimate products of VLBI. The two major programs are CALC and SOLVE. CALC generates the theoretical values of VLBI delay and delay rate, as well as partial derivatives, based on a priori values of the geophysical and astrometric parameters. SOLVE is a least-squares parameter estimation program which yields the geophysical and astrometric parameters from the observed values produced by the data processing system and the theoretical values and partial derivatives provided by CALC. SOLVE is a highly interactive program in which the user selects the exact form of the recovered parameters and the data to be accepted into the solution.
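
    The CALC/SOLVE pairing is, at its core, an observed-minus-computed least-squares adjustment. A minimal sketch of that estimation step (illustrative names and plain NumPy, not the actual Mark 3 code):

```python
import numpy as np

def solve_step(observed, theoretical, partials, weights=None):
    """One SOLVE-style least-squares adjustment (illustrative, not Mark 3 code).

    observed     (m,)   : delays/delay rates measured by the correlator
    theoretical  (m,)   : a priori model values from a CALC-like program
    partials     (m, p) : d(theoretical)/d(parameter), also from CALC
    weights      (m,)   : optional 1/sigma^2 measurement weights

    Returns the estimated correction to the a priori parameter values.
    """
    residuals = observed - theoretical            # observed minus computed
    if weights is not None:
        w = np.sqrt(weights)
        residuals = residuals * w
        partials = partials * w[:, None]
    # Linear least squares: partials @ dp ~= residuals
    dp, *_ = np.linalg.lstsq(partials, residuals, rcond=None)
    return dp
```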

    Preventing False Discovery in Interactive Data Analysis is Hard

    We show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to n^{3+o(1)} adaptively chosen statistical queries. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is valid if it is "close" to the correct expectation over the distribution. Our result stands in stark contrast to the well-known fact that exponentially many statistical queries can be answered validly and efficiently if the queries are chosen non-adaptively (no query may depend on the answers to previous queries). Moreover, a recent work by Dwork et al. shows how to accurately answer exponentially many adaptively chosen statistical queries via a computationally inefficient algorithm, and how to answer a quadratic number of adaptive queries via a computationally efficient algorithm. The latter result implies that our result is tight up to a linear factor in n. Conceptually, our result demonstrates that achieving statistical validity alone can be a source of computational intractability in adaptive settings. For example, in the modern large collaborative research environment, data analysts typically choose a particular approach based on previous findings. False discovery occurs if a research finding is supported by the data but not by the underlying distribution. While the study of preventing false discovery in statistics is decades old, to the best of our knowledge our result is the first to demonstrate a computational barrier. In particular, our result suggests that the perceived difficulty of preventing false discovery in today's collaborative research environment may be inherent.
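
    To make the setting concrete, here is a small simulation (ours, not the paper's) of the mechanism it studies: a curator answers statistical queries with empirical means over n samples, and an analyst who chooses queries adaptively can drive those answers away from the true expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 200                          # n samples, d binary attributes
data = rng.integers(0, 2, size=(n, d))   # true expectation of each attribute is 0.5

def answer(query):
    """Curator: empirical mean of a {0,1}-valued predicate over the sample."""
    return query(data).mean()

# Non-adaptive: each attribute queried independently -- answers stay near 0.5.
nonadaptive = [answer(lambda x, j=j: x[:, j]) for j in range(d)]

# Adaptive: use the earlier answers to build a query that overfits the sample.
# Keep attributes whose empirical mean exceeded 0.5, then query their majority vote.
biased = [j for j in range(d) if nonadaptive[j] > 0.5]
adaptive_answer = answer(lambda x: (x[:, biased].mean(axis=1) > 0.5).astype(float))

print(max(abs(a - 0.5) for a in nonadaptive))  # small sampling deviations
print(adaptive_answer - 0.5)                   # clearly biased: a "false discovery",
                                               # since the true expectation is ~0.5
```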

    Interactive Data Analysis of Multi-Run Performance Data

    Multi-dimensional performance data analysis presents challenges for programmers and users alike. Developers have to choose library and compiler options for each platform, analyze raw performance data, and keep up with new technologies. Users run codes on different platforms, validate results with collaborators, and analyze performance data as applications scale up. Site operators use multiple profiling tools to optimize performance, requiring the analysis of multiple sources and data types. There is currently no comprehensive tool to support the structured analysis of unstructured data, even though holistic performance data analysis can offer actionable insights and improve performance. In this work, we present Thicket, a tool designed around the experiences and insights of programmers and users to address these needs. Thicket is a Python-based data analysis toolkit that aims to make performance data exploration more accessible and user-friendly for application code developers, users, and site operators. It achieves this by providing a comprehensive interface that allows for the easy manipulation, modeling, and visualization of data collected from multiple tools and executions. The central element of Thicket is the "thicket object," which unifies data from multiple sources and supports various data manipulation and modeling operations, including filtering, grouping, querying, and statistical operations. Thicket also supports the use of external libraries such as scikit-learn and Extra-P for data modeling and visualization in an intuitive call-tree context. Overall, Thicket aims to help users make better decisions about their application's performance by providing actionable insights from complex and multi-dimensional performance data. Here, we present some of the capabilities provided by the components of Thicket and important use cases that have implications beyond the data structures that provide these capabilities.
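
    As a rough illustration of what such a unifying object buys you (a concept sketch in plain pandas, not Thicket's actual API), one can index per-run profiles by (node, run) and then filter, group, and aggregate across runs:

```python
import pandas as pd

# Illustrative stand-in for a "thicket object": per-run call-tree profiles
# unified into one frame indexed by (node, run), plus per-run metadata.
profiles = {
    "run_64cores":  {"main": 12.1, "solve": 9.8, "io": 1.1},
    "run_128cores": {"main": 7.4,  "solve": 5.6, "io": 1.2},
}
metadata = pd.DataFrame(
    {"cores": [64, 128], "compiler": ["gcc", "gcc"]},
    index=["run_64cores", "run_128cores"],
)

perf = pd.DataFrame(
    [(node, run, t) for run, prof in profiles.items() for node, t in prof.items()],
    columns=["node", "run", "time"],
).set_index(["node", "run"])

# Filter, group, and reduce across runs -- the style of operation described above.
solve = perf.xs("solve", level="node")                     # one node, all runs
stats = perf.groupby(level="node")["time"].agg(["mean", "std"])
annotated = perf.join(metadata, on="run")                  # attach run metadata
print(stats)
```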

    Interactive Data Analysis with Next-step Natural Language Query Recommendation

    Natural language interfaces (NLIs) provide users with a convenient way to interactively analyze data through natural language queries. Nevertheless, interactive data analysis is a demanding process, especially for novice data analysts. When exploring large and complex SQL databases from different domains, data analysts do not necessarily have sufficient knowledge about the various data tables and application domains, which makes it hard for them to systematically elicit a series of topically related and meaningful queries for insight discovery in the target domains. We develop an NLI with a step-wise query recommendation module to assist users in choosing appropriate next-step exploration actions. The system adopts a data-driven approach to suggest semantically relevant and context-aware queries for application domains of users' interest based on their query logs. The system also helps users organize query histories and results into a dashboard to communicate the discovered data insights. In a comparative user study, we show that our system facilitates a more effective and systematic data analysis process than a baseline without the recommendation module.
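
    A minimal sketch of the data-driven idea (ours, not the paper's system): mine the query logs for transitions between consecutive queries and recommend the most frequent follow-ups to the current one. The example sessions below are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical query logs: each session is a sequence of NL query templates.
sessions = [
    ["show all tables", "count orders by month", "top customers by revenue"],
    ["show all tables", "count orders by month", "orders by region"],
    ["count orders by month", "top customers by revenue"],
]

# Build a first-order transition model over the logs.
transitions = defaultdict(Counter)
for session in sessions:
    for prev, nxt in zip(session, session[1:]):
        transitions[prev][nxt] += 1

def recommend_next(query, k=2):
    """Suggest the k most frequent follow-up queries observed after `query`."""
    return [q for q, _ in transitions[query].most_common(k)]

print(recommend_next("count orders by month"))
# ['top customers by revenue', 'orders by region']
```

    A real system would generalize beyond exact template matches, e.g. by embedding queries and retrieving semantically similar log entries, but the transition-mining core is the same.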

    Rosetta: a container-centric science platform for resource-intensive, interactive data analysis

    Rosetta is a science platform for resource-intensive, interactive data analysis which runs user tasks as software containers. It is built on top of a novel architecture based on framing user tasks as microservices (independent and self-contained units), which makes it possible to fully support custom and user-defined software packages, libraries, and environments. These include complete remote desktop and GUI applications, in addition to common analysis environments such as Jupyter Notebooks. Rosetta relies on Open Container Initiative containers, which allow for safe, effective, and reproducible code execution; it can use a number of container engines and runtimes, and it seamlessly supports several workload management systems, thus enabling containerized workloads on a wide range of computing resources. Although developed in the astronomy and astrophysics space, Rosetta can support virtually any science and technology domain where resource-intensive, interactive data analysis is required.
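
    A hypothetical sketch of the container-centric pattern described here, using the plain Docker CLI rather than Rosetta's own scheduler: the user's entire environment is an image, and the platform merely launches it with the right port and data mounts:

```python
import subprocess

def run_user_task(image, host_port, container_port=8888,
                  host_dir="/scratch/user1", workdir="/data"):
    """Launch an interactive analysis task as a container (illustrative sketch).

    The names and paths here are assumptions; the point is the pattern:
    a self-contained image (e.g. a Jupyter stack) scheduled by the platform.
    """
    cmd = [
        "docker", "run", "--rm", "--detach",
        "-p", f"{host_port}:{container_port}",   # expose the web UI
        "-v", f"{host_dir}:{workdir}",           # mount the user's data
        image,
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()                 # container ID

# e.g. run_user_task("jupyter/scipy-notebook", host_port=8888)
```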

    Geophysical and astronomical models applied in the analysis of very long baseline interferometry

    Very long baseline interferometry presents an opportunity to measure, at the centimeter level, such geodetic parameters as baseline length and instantaneous pole position. In order to achieve such precision, the geophysical and astronomical models used in data analysis must be as accurate as possible. The Mark-3 interactive data analysis system includes a number of refinements beyond conventional practice in modeling precession, nutation, diurnal polar motion, UT1, solid Earth tides, relativistic light deflection, and reduction to solar system barycentric coordinates. The algorithms and their effects on the recovered geodetic, geophysical, and astrometric parameters are discussed.
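
    For a flavor of the relativistic modeling involved, here is the standard textbook form of the gravitational (Shapiro) time delay for a signal passing near the Sun. This is only illustrative: the Mark-3 models are far more detailed, and the VLBI delay observable involves the difference of such terms between the two stations:

```python
import numpy as np

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30     # solar mass, kg
C = 2.998e8          # speed of light, m/s

def shapiro_delay(r_source, r_station, r_source_station):
    """One-way gravitational (Shapiro) time delay in seconds (textbook form).

    r_source         : Sun-to-source distance (m)
    r_station        : Sun-to-station distance (m)
    r_source_station : source-to-station distance (m)
    """
    return (2 * G * M_SUN / C**3) * np.log(
        (r_source + r_station + r_source_station)
        / (r_source + r_station - r_source_station)
    )
```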

    Visualization Techniques for Tongue Analysis in Traditional Chinese Medicine

    Visual inspection of the tongue has long been an important diagnostic method in Traditional Chinese Medicine (TCM). Clinical data have shown significant connections between various visceral cancers and abnormalities in the tongue and the tongue coating. Visual inspection of the tongue is simple and inexpensive, but current practice in TCM is mainly experience-based, and the quality of the visual inspection varies between individuals. Computerized inspection methods provide quantitative models to evaluate color, texture, and surface features of the tongue. In this paper, we investigate visualization techniques and processes that allow interactive data analysis, with the aim of merging computerized measurements with human experts' diagnostic variables based on five diagnostic classes: Healthy (H), History of Cancer (HC), History of Polyps (HP), Polyps (P), and Colon Cancer (C).
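
    As an illustration of the kind of quantitative color and texture measurement involved (a generic NumPy sketch, not the paper's actual models):

```python
import numpy as np

def color_texture_features(rgb_patch):
    """Simple quantitative features for an image patch (illustrative only).

    rgb_patch : (H, W, 3) uint8 array, e.g. a segmented tongue-coating region.
    Returns mean color, color spread, and a crude texture measure
    (mean gradient magnitude of the grayscale patch).
    """
    patch = rgb_patch.astype(float)
    mean_color = patch.mean(axis=(0, 1))    # average R, G, B
    std_color = patch.std(axis=(0, 1))      # color spread
    gray = patch.mean(axis=2)
    gy, gx = np.gradient(gray)
    texture = np.mean(np.hypot(gx, gy))     # edge/texture strength
    return mean_color, std_color, texture

# Features like these could then feed a classifier over the five diagnostic classes.
```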

    Downdating a time-varying square root information filter

    A new method to efficiently downdate an estimate and covariance generated by a discrete-time Square Root Information Filter (SRIF) is presented. The method combines the QR-factor downdating algorithm of Gill and the decentralized SRIF algorithm of Bierman. Either measurements or a priori information can be removed efficiently without loss of numerical integrity. Moreover, the method includes features for detecting potential numerical degradation. Performance on a 300-parameter system with 5800 data points shows that the method can be used in real time and is hence a promising tool for interactive data analysis. Additionally, updating a time-varying SRIF with either additional measurements or a priori information proceeds analogously.
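
    A minimal NumPy sketch of rank-one downdating of an upper-triangular QR factor via hyperbolic rotations, in the spirit of the Gill algorithm cited above (not the paper's combined SRIF method, which adds the integrity-monitoring features it describes):

```python
import numpy as np

def qr_downdate(R, h):
    """Remove one measurement row h from upper-triangular R (R.T @ R = A),
    returning R1 with R1.T @ R1 = A - h h^T, via hyperbolic rotations."""
    R, h = R.astype(float).copy(), h.astype(float).copy()
    n = R.shape[0]
    for k in range(n):
        t = h[k] / R[k, k]
        if abs(t) >= 1.0:
            raise np.linalg.LinAlgError("downdate destroys positive definiteness")
        c = np.sqrt(1.0 - t * t)        # hyperbolic rotation scale
        R[k, k] *= c                    # new diagonal: sqrt(R_kk^2 - h_k^2)
        for j in range(k + 1, n):
            Rkj = R[k, j]
            R[k, j] = (Rkj - t * h[j]) / c
            h[j] = (h[j] - t * Rkj) / c
    return R

# Check against re-factorizing from scratch:
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
R = np.linalg.qr(A, mode="r")
R1 = qr_downdate(R, A[-1])                      # remove the last data row
R_ref = np.linalg.qr(A[:-1], mode="r")
print(np.allclose(np.abs(R1), np.abs(R_ref)))   # True (factors match up to row signs)
```

    The appeal for interactive analysis is that each downdate costs O(n^2) rather than the O(mn^2) of re-triangularizing all remaining data, which is what makes removing individual measurements from a 300-parameter solution feasible in real time.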