Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management

Authors: Bendre Mangesh; Chang Kevin; Parameswaran Aditya; Venkataraman Vipul; Zhou Xinyan
Publication date: 05/10/2017

Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make a first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-Hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that don't suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 20% reduction in storage, and up to 50% reduction in formula evaluation time.

arXiv.org e-Print Archive

Crossref

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Authors: Begoli Edmon; Hyde Julian; Lemire Daniel; Mior Michael J.; Rodríguez Jesús Camacho
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/02/2018

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization. Comment: SIGMOD'1

arXiv.org e-Print Archive

R-libre

EcoGIS – GIS tools for ecosystem approaches to fisheries management

Authors: Finnen Eric; Haverland Tim; Nelson David Moe
Publication venue: NOAA/National Ocean Service/Center for Coastal Monitoring and Assessment
Publication date: 01/01/2009

Executive Summary: The EcoGIS project was launched in September 2004 to investigate how Geographic Information Systems (GIS), marine data, and custom analysis tools can better enable fisheries scientists and managers to adopt Ecosystem Approaches to Fisheries Management (EAFM). EcoGIS is a collaborative effort between NOAA’s National Ocean Service (NOS) and National Marine Fisheries Service (NMFS), and four regional Fishery Management Councils. The project has focused on four priority areas: Fishing Catch and Effort Analysis, Area Characterization, Bycatch Analysis, and Habitat Interactions. Of these four functional areas, the project team first focused on developing a working prototype for catch and effort analysis: the Fishery Mapper Tool. This ArcGIS extension creates time-and-area summarized maps of fishing catch and effort from logbook, observer, or fishery-independent survey data sets. Source data may come from Oracle, Microsoft Access, or other file formats. Feedback from beta-testers of the Fishery Mapper was used to debug the prototype, enhance performance, and add features. This report describes the four priority functional areas, the development of the Fishery Mapper tool, and several themes that emerged through the parallel evolution of the EcoGIS project, the concept and implementation of the broader field of Ecosystem Approaches to Management (EAM), data management practices, and other EAM toolsets. In addition, a set of six succinct recommendations is proposed on page 29. One major conclusion from this work is that there is no single “super-tool” to enable Ecosystem Approaches to Management; as such, tools should be developed for specific purposes with attention given to interoperability and automation. Future work should be coordinated with other GIS development projects in order to provide “value added” and minimize duplication of efforts.
In addition to custom tools, the development of cross-cutting Regional Ecosystem Spatial Databases will enable access to quality data to support the analyses required by EAM. GIS tools will be useful in developing Integrated Ecosystem Assessments (IEAs) and providing pre- and post-processing capabilities for spatially-explicit ecosystem models. Continued funding will enable the EcoGIS project to develop GIS tools that are immediately applicable to today’s needs. These tools will enable simplified and efficient data query, the ability to visualize data over time, and ways to synthesize multidimensional data from diverse sources. These capabilities will provide new information for analyzing issues from an ecosystem perspective, which will ultimately result in better understanding of fisheries and better support for decision-making. (PDF file contains 45 pages.)

Aquatic Commons

Sets and indices in linear programming modelling and their integration with relational data models

Authors: Kristjansson B; Lucas CA; Mitra G; Moody S
Publication venue: Brunel University
Publication date: 01/01/1993

LP models are usually constructed using index sets and data tables which are closely related to the attributes and relations of relational database (RDB) systems. We extend the syntax of MPL, an existing LP modelling language, in order to connect it to a given RDB system. This approach reuses existing modelling and database software, provides a rich modelling environment and achieves model and data independence. This integrated software enables Mathematical Programming to be widely used as a decision support tool by unlocking the data residing in corporate databases.
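
The relationship this abstract describes, between LP index sets and the attributes of relational tables, can be illustrated with a minimal sketch. It uses Python's standard `sqlite3` module rather than MPL, whose extended syntax is not given in the abstract. The table `shipping_cost`, its columns, and the helper `load_index_sets` are hypothetical names chosen for the illustration.

```python
import sqlite3


def load_index_sets(conn):
    """Derive LP index sets and a cost table from a relational table.

    Assumes a hypothetical schema: shipping_cost(plant, market, cost).
    """
    # Each index set is the set of distinct values of one attribute.
    plants = [r[0] for r in conn.execute(
        "SELECT DISTINCT plant FROM shipping_cost ORDER BY plant")]
    markets = [r[0] for r in conn.execute(
        "SELECT DISTINCT market FROM shipping_cost ORDER BY market")]
    # The data table maps index tuples to coefficients.
    cost = {(p, m): c for p, m, c in conn.execute(
        "SELECT plant, market, cost FROM shipping_cost")}
    return plants, markets, cost


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipping_cost (plant TEXT, market TEXT, cost REAL)")
conn.executemany("INSERT INTO shipping_cost VALUES (?, ?, ?)", [
    ("P1", "M1", 4.0), ("P1", "M2", 6.0),
    ("P2", "M1", 5.0), ("P2", "M2", 3.0)])
plants, markets, cost = load_index_sets(conn)

# Objective terms of a transportation-style LP, indexed by (plant, market).
objective_terms = [f"{cost[p, m]}*x[{p},{m}]" for p in plants for m in markets]
```

Because the model text refers only to the index sets and the cost table, the same model could be re-run against a different database without change, which is one reading of the "model and data independence" the abstract mentions.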

CiteSeerX

Brunel University Research Archive

DualTable: A Hybrid Storage Model for Update Optimization in Hive

Authors: Hu Songlin; Huang Shuo; Jacobsen Hans-Arno; Liang Ying; Liu Wantao; Pei Xubin; Rabl Tilmann; Wang Jiye; Xiao Zheng
Publication date: 01/12/2014

Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations like updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID compliant extension adopts the same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations. Comment: accepted by industry session of ICDE201
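
The abstract does not describe DualTable's internal design, so the following is only a generic sketch of the idea of pairing a scan-friendly immutable store with a small random-write store and merging the two at read time. It is not the authors' implementation. The class `HybridTable` and all of its members are hypothetical, and in-memory dictionaries stand in for HDFS files and HBase.

```python
class HybridTable:
    """Toy hybrid store: immutable base rows plus a small mutable delta."""

    _DELETED = object()  # tombstone marking a deleted row

    def __init__(self, base_rows):
        # Base: bulk data that is only read sequentially
        # (an assumption standing in for files on HDFS).
        self._base = dict(base_rows)
        # Delta: random-write store holding changes
        # (an assumption standing in for HBase).
        self._delta = {}

    def update(self, key, value):
        if key not in self._base:
            raise KeyError(key)
        self._delta[key] = value  # the base data is never rewritten

    def delete(self, key):
        if key not in self._base:
            raise KeyError(key)
        self._delta[key] = self._DELETED

    def scan(self):
        """Merge on read: stream the base, overlaying delta entries."""
        for key, value in self._base.items():
            current = self._delta.get(key, value)
            if current is not self._DELETED:
                yield key, current


table = HybridTable([(1, "a"), (2, "b"), (3, "c")])
table.update(2, "B")
table.delete(3)
rows = list(table.scan())
```

In this sketch an update or delete costs one small write to the delta, while a scan still reads the base sequentially. The trade-off is extra lookup work per row at read time, which grows with the size of the delta.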

arXiv.org e-Print Archive

Crossref

vSPARQL: A View Definition Language for the Semantic Web

Authors: Brinkley James F; Detwiler Landon T; Noy N. F.; Shaw Marianne; Suciu Dan
Publication date: 01/02/2011

Translational medicine applications would like to leverage the biological and biomedical ontologies, vocabularies, and data sets available on the semantic web. We present a general solution for RDF information set reuse inspired by database views. Our view definition language, vSPARQL, allows applications to specify the exact content that they are interested in and how that content should be restructured or modified. Applications can access relevant content by querying against these view definitions. We evaluate the expressivity of our approach by defining views for practical use cases and comparing our view definition language to existing query languages.

Elsevier - Publisher Connector

University of Washington Structural Informatics Group Publications