270 research outputs found

    Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management

    Full text link
    Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make a first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-Hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that don't suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 20% reduction in storage, and up to 50% reduction in formula evaluation time

    Schema Independent Relational Learning

    Full text link
    Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de) composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms we show that the (de) composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de) compositions

    On the Practical use of Variable Elimination in Constraint Optimization Problems: 'Still-life' as a Case Study

    Full text link
    Variable elimination is a general technique for constraint processing. It is often discarded because of its high space complexity. However, it can be extremely useful when combined with other techniques. In this paper we study the applicability of variable elimination to the challenging problem of finding still-lifes. We illustrate several alternatives: variable elimination as a stand-alone algorithm, interleaved with search, and as a source of good quality lower bounds. We show that these techniques are the best known option both theoretically and empirically. In our experiments we have been able to solve the n=20 instance, which is far beyond reach with alternative approaches

    Parallel Natural Language Parsing: From Analysis to Speedup

    Get PDF
    Electrical Engineering, Mathematics and Computer Scienc

    On a Vehicle Routing Problem with Customer Costs and Multi Depots

    Get PDF
    The Vehicle Routing Problem with Customer Costs (short VRPCC) was developed for railway maintenance scheduling. In detail, corrective maintenance jobs for unexpected occurring failures are planned to a short time horizon. These jobs are geographically distributed in the railway net. Furthermore, dependent on the severity of the failure, it can be necessary to reduce the top speed on the track section in order to avoid safety risks or a too fast deterioration. For fatal failures, it can even be necessary to close the track section. The resulting limitations on railway service lead to penalty costs for the maintenance operator. These must be paid until the track is repaired and the restrictions are removed. By scheduling the maintenance tasks, these penalty costs can be reduced by proceeding corresponding maintenance tasks earlier. However, this may in return lead to increased costs for moving the maintenance machines and crews. For this scheduling problem, the VRPCC was developed. With it, for each maintenance vehicle and crew, a route is defined that describes the order to proceed maintenance tasks. Two kinds of costs are considered: Firstly, travel costs for machinery and crew; and secondly, penalty costs for an unsafe track condition that have to be paid for each day from failure detection to maintenance completion. To model the penalties, the novel customer costs are defined. In detail, for each maintenance activity a customer cost coefficient is given which incur for each day between failure detection and failure repair. The objective function of this problem is defined by the sum of travel costs and time-dependent customer costs. With it, the priority of customers can be taken into account without losing the sight on travel costs. This new vehicle routing problem was introduced in this thesis by a non-linear partition and permutation model. In this model, a feasible solution is defined by a partition of the job set into subsets that represent the allocation of jobs to vehicles and a permutation for each subset that represent the order of processing the jobs. Then, the start times of the jobs were calculated based on the order given by the permutations. It was taken into account that work can only be done in eight hour shifts during the night. Based on the start times, the customer cost value of each job is computed which equals to the paid penalty costs. Then, the costs of a schedule are calculated via the sum of travel costs and customer costs. To solve the VRPCC by a commercial linear programming solver, different formulations of the VRPCC as mixed-integer linear program were developed. In doing so, the start times became decision variables. It turned out that including customer costs led to problems harder to solve than vehicle routing problems where only travel costs are minimized. Further, in the thesis several construction heuristics for the VRPCC were designed and investigated. Also two local search algorithms, first and best improvement, were applied. The computational experiments showed that the solutions generated by the local search algorithm were much better than the solutions of the construction heuristics. The main part of this thesis was to design a Branch-and-Bound algorithm for the VRPCC. For this purpose, new lower bounds for the customer cost part of the objective function were formulated. The computational experiments showed that a lower bound computed from the LP relaxation of a specific bin packing problem had the best trade-off between computational effort and bound quality. For the travel cost part of the objective function, several known lower bounds from the TSP were compared. To design a Branch-and-Bound algorithm, beside efficient lower bound, also suitable branching strategies are necessary to split the problem space into smaller subspaces. In this thesis two branching strategies were developed which are based on the non-linear partition and permutation model to take advantage from the problem structure. To be more precise, new branches are generated by appending or including a job to an uncompleted schedule. Consequently, the start times can be computed directly from the so far planned jobs and more tight lower bounds can be computed for the so far unplanned jobs. By means of computational experiments, the developed Branch-and-Bound algorithms were compared with the classical approach, which means solving a mixed-integer linear program of the VRPCC by a commercial solver. The results showed that both Branch-and-Bound algorithms solved the small instances faster than the classical approach

    Stream Processing using Grammars and Regular Expressions

    Full text link
    In this dissertation we study regular expression based parsing and the use of grammatical specifications for the synthesis of fast, streaming string-processing programs. In the first part we develop two linear-time algorithms for regular expression based parsing with Perl-style greedy disambiguation. The first algorithm operates in two passes in a semi-streaming fashion, using a constant amount of working memory and an auxiliary tape storage which is written in the first pass and consumed by the second. The second algorithm is a single-pass and optimally streaming algorithm which outputs as much of the parse tree as is semantically possible based on the input prefix read so far, and resorts to buffering as many symbols as is required to resolve the next choice. Optimality is obtained by performing a PSPACE-complete pre-analysis on the regular expression. In the second part we present Kleenex, a language for expressing high-performance streaming string processing programs as regular grammars with embedded semantic actions, and its compilation to streaming string transducers with worst-case linear-time performance. Its underlying theory is based on transducer decomposition into oracle and action machines, and a finite-state specialization of the streaming parsing algorithm presented in the first part. In the second part we also develop a new linear-time streaming parsing algorithm for parsing expression grammars (PEG) which generalizes the regular grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm reformulated using least fixed points and evaluated using an instance of the chaotic iteration scheme by Cousot and Cousot
    • ā€¦
    corecore