270 research outputs found
Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management
Spreadsheet software is the tool of choice for interactive ad-hoc data
management, with adoption by billions of users. However, spreadsheets are not
scalable, unlike database systems. On the other hand, database systems, while
highly scalable, do not support interactivity as a first-class primitive. We
are developing DataSpread, to holistically integrate spreadsheets as a
front-end interface with databases as a back-end datastore, providing
scalability to spreadsheets, and interactivity to databases, an integration we
term presentational data management (PDM). In this paper, we make a first step
towards this vision: developing a storage engine for PDM, studying how to
flexibly represent spreadsheet data within a database and how to support and
maintain access by position. We first conduct an extensive survey of
spreadsheet use to motivate our functional requirements for a storage engine
for PDM. We develop a natural set of mechanisms for flexibly representing
spreadsheet data and demonstrate that identifying the optimal representation is
NP-Hard; however, we develop an efficient approach to identify the optimal
representation from an important and intuitive subclass of representations. We
extend our storage mechanisms with positional access methods that do not suffer
from cascading update issues, leading to constant-time access and modification
performance. We evaluate these representations on a workload of typical
spreadsheets and spreadsheet operations, achieving up to a 20% reduction in
storage and up to a 50% reduction in formula evaluation time.
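The trade-off between representations can be made concrete with a toy sketch. The two schemes below, a dense row-oriented table and row-column-value triples, are illustrative stand-ins, not DataSpread's actual schema, and the byte costs are invented for the example:

```python
# Illustrative sketch (not DataSpread's actual engine): two ways to store a
# sparse spreadsheet region in a relational back-end, with a simple size
# estimate used to pick the cheaper representation per region.

def rcv_size(cells, cell_bytes=8, key_bytes=16):
    """Row-Column-Value triples: pay a (row, col) key per filled cell."""
    return len(cells) * (key_bytes + cell_bytes)

def rom_size(cells, cell_bytes=8, key_bytes=8):
    """Row-oriented (dense) table: pay for every cell in the bounding box."""
    rows = {r for r, c in cells}
    cols = {c for r, c in cells}
    return len(rows) * (key_bytes + len(cols) * cell_bytes)

def choose(cells):
    return "ROM" if rom_size(cells) <= rcv_size(cells) else "RCV"

dense = {(r, c) for r in range(100) for c in range(10)}   # fully filled block
sparse = {(r, r % 40) for r in range(0, 1000, 50)}        # scattered cells
print(choose(dense), choose(sparse))  # ROM RCV
```

Dense regions favor the tabular layout, while scattered cells favor triples; the paper's harder question, shown NP-hard above, is choosing an optimal mix of such representations over a whole sheet.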
Schema Independent Relational Learning
Learning novel concepts and relations from relational databases is an
important problem with many applications in database systems and machine
learning. Relational learning algorithms learn the definition of a new relation
in terms of existing relations in the database. Nevertheless, the same data set
may be represented under different schemas for various reasons, such as
efficiency, data quality, and usability. Unfortunately, the output of current
relational learning algorithms tends to vary quite substantially over the
choice of schema, both in terms of learning accuracy and efficiency. This
variation complicates their off-the-shelf application. In this paper, we
introduce and formalize the property of schema independence of relational
learning algorithms, and study both the theoretical and empirical dependence of
existing algorithms on the common class of (de)composition schema
transformations. We study both sample-based learning algorithms, which learn
from sets of labeled examples, and query-based algorithms, which learn by
asking queries to an oracle. We prove that current relational learning
algorithms are generally not schema independent. For query-based learning
algorithms, we show that the (de)composition transformations influence their
query complexity. We propose Castor, a sample-based relational learning
algorithm that achieves schema independence by leveraging data dependencies. We
support the theoretical results with an empirical study that demonstrates the
schema dependence/independence of several algorithms on existing benchmark and
real-world datasets under (de)compositions.
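A vertical (de)composition, the class of transformations studied above, can be sketched in a few lines. The relation names and tuples here are hypothetical, purely to illustrate that the two schemas carry the same facts:

```python
# Hypothetical example of a vertical (de)composition schema transformation:
# the same facts stored as one wide relation or as two relations sharing a
# key. A schema-independent learner should induce equivalent definitions
# over either schema.

advised = {("alice", "bob", "db"), ("carol", "dan", "ml")}  # (student, advisor, area)

# Decompose on the shared key `student`:
advisor_of = {(s, a) for s, a, _ in advised}
works_in = {(s, ar) for s, _, ar in advised}

# The natural join reconstructs the original relation (lossless decomposition):
rejoined = {(s, a, ar)
            for s, a in advisor_of
            for s2, ar in works_in
            if s == s2}
assert rejoined == advised
```

Because the join loses no information, any concept definable over the wide schema is definable over the decomposed one; the paper's point is that existing learners nevertheless behave differently on the two.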
On the Practical use of Variable Elimination in Constraint Optimization Problems: 'Still-life' as a Case Study
Variable elimination is a general technique for constraint processing. It is
often discarded because of its high space complexity. However, it can be
extremely useful when combined with other techniques. In this paper we study
the applicability of variable elimination to the challenging problem of finding
still-lifes. We illustrate several alternatives: variable elimination as a
stand-alone algorithm, interleaved with search, and as a source of good quality
lower bounds. We show that these techniques are the best known option both
theoretically and empirically. In our experiments we have been able to solve
the n=20 instance, which is far beyond reach with alternative approaches.
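For context, the constraint each cell must satisfy in a still-life, the hard combinatorial object targeted above, can be sketched as a stability check (a hypothetical checker, not the solver studied in the paper):

```python
# A still-life is a Game-of-Life pattern that maps to itself under one
# generation step. Sketch of the stability check each cell of an n x n
# board must satisfy; cells outside the board are assumed dead.

def is_still_life(live, n):
    for r in range(-1, n + 1):          # include the dead border
        for c in range(-1, n + 1):
            nbrs = sum((r + dr, c + dc) in live
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                       if (dr, dc) != (0, 0))
            if (r, c) in live and nbrs not in (2, 3):
                return False            # a live cell would die
            if (r, c) not in live and nbrs == 3:
                return False            # a dead cell would be born
    return True

block = {(0, 0), (0, 1), (1, 0), (1, 1)}    # the 2x2 "block" pattern
print(is_still_life(block, 2))  # True
```

The maximum-density still-life problem asks for the stable pattern with the most live cells on an n x n board; each cell's constraint above couples it to its eight neighbours, which is what makes variable elimination (with its bucket-sized space cost) attractive here.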
Parallel Natural Language Parsing: From Analysis to Speedup
Electrical Engineering, Mathematics and Computer Science
On a Vehicle Routing Problem with Customer Costs and Multi Depots
The Vehicle Routing Problem with Customer Costs (VRPCC for short) was developed for railway maintenance scheduling. In this setting, corrective maintenance jobs for unexpectedly occurring failures are planned over a short time horizon. These jobs are geographically distributed across the railway network. Furthermore, depending on the severity of a failure, it can be necessary to reduce the top speed on the affected track section in order to avoid safety risks or overly rapid deterioration. For fatal failures, it can even be necessary to close the track section. The resulting limitations on railway service lead to penalty costs for the maintenance operator, which must be paid until the track is repaired and the restrictions are removed. By scheduling the maintenance tasks appropriately, these penalty costs can be reduced by performing the corresponding tasks earlier. However, this may in return lead to increased costs for moving the maintenance machines and crews.
The VRPCC was developed for this scheduling problem. In it, a route is defined for each maintenance vehicle and crew that describes the order in which maintenance tasks are performed. Two kinds of costs are considered: firstly, travel costs for machinery and crew; and secondly, penalty costs for an unsafe track condition, which have to be paid for each day from failure detection to maintenance completion. To model the penalties, novel customer costs are defined: for each maintenance activity, a customer cost coefficient is given that is incurred for each day between failure detection and failure repair. The objective function of the problem is defined as the sum of travel costs and time-dependent customer costs. In this way, the priority of customers can be taken into account without losing sight of travel costs.
This new vehicle routing problem was introduced in this thesis via a non-linear partition and permutation model. In this model, a feasible solution is defined by a partition of the job set into subsets, representing the allocation of jobs to vehicles, together with a permutation of each subset, representing the order in which its jobs are processed. The start times of the jobs are then calculated from the order given by the permutations, taking into account that work can only be done in eight-hour shifts during the night. Based on the start times, the customer cost value of each job is computed, which equals the penalty costs paid. The cost of a schedule is then calculated as the sum of travel costs and customer costs.
To solve the VRPCC with a commercial linear programming solver, different formulations of the VRPCC as a mixed-integer linear program were developed; in these formulations, the start times become decision variables. It turned out that including customer costs leads to problems that are harder to solve than vehicle routing problems in which only travel costs are minimized.
Furthermore, several construction heuristics for the VRPCC were designed and investigated in the thesis, and two local search algorithms, first improvement and best improvement, were applied. The computational experiments showed that the solutions generated by the local search algorithms were much better than those of the construction heuristics.
The main part of this thesis was the design of a Branch-and-Bound algorithm for the VRPCC. For this purpose, new lower bounds for the customer cost part of the objective function were formulated. The computational experiments showed that a lower bound computed from the LP relaxation of a specific bin packing problem offered the best trade-off between computational effort and bound quality. For the travel cost part of the objective function, several known lower bounds from the TSP were compared.
To design a Branch-and-Bound algorithm, besides efficient lower bounds, suitable branching strategies are necessary to split the problem space into smaller subspaces. In this thesis, two branching strategies were developed that are based on the non-linear partition and permutation model in order to take advantage of the problem structure. More precisely, new branches are generated by appending or inserting a job into an incomplete schedule. Consequently, the start times can be computed directly from the jobs planned so far, and tighter lower bounds can be computed for the jobs not yet planned.
By means of computational experiments, the developed Branch-and-Bound algorithms were compared with the classical approach, i.e., solving a mixed-integer linear program formulation of the VRPCC with a commercial solver. The results showed that both Branch-and-Bound algorithms solved the small instances faster than the classical approach.
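The objective function described above can be sketched for a single vehicle. All numbers, and the one-job-per-night simplification, are illustrative and not taken from the thesis:

```python
# Minimal sketch of the VRPCC objective for one vehicle: cost = travel
# costs along the route + customer (penalty) costs, where each job j pays
# its coefficient coeff[j] for every day from failure detection (day 0
# here) until its repair is completed. Durations count nights of work.

def route_cost(route, travel, coeff, duration):
    cost, day, prev = 0.0, 0, "depot"
    for job in route:
        cost += travel[prev, job]        # travel cost to reach the job
        day += duration[job]             # nights of work until completion
        cost += coeff[job] * day         # penalty paid for `day` days
        prev = job
    return cost + travel[prev, "depot"]  # return to the depot

travel = {("depot", "a"): 3, ("a", "b"): 2, ("depot", "b"): 4,
          ("b", "a"): 2, ("a", "depot"): 3, ("b", "depot"): 4}
coeff = {"a": 10, "b": 1}                # per-day penalty coefficients
duration = {"a": 1, "b": 2}              # nights needed per job

# Serving the high-penalty job first is cheaper despite equal travel:
print(route_cost(["a", "b"], travel, coeff, duration))  # 3+2+4 + 10*1 + 1*3 = 22.0
print(route_cost(["b", "a"], travel, coeff, duration))  # 4+2+3 + 1*2 + 10*3 = 41.0
```

The example shows the tension the thesis exploits: the time-dependent customer costs reward repairing urgent failures early, even when the pure travel cost of the route is unchanged.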
Stream Processing using Grammars and Regular Expressions
In this dissertation we study regular expression based parsing and the use of
grammatical specifications for the synthesis of fast, streaming
string-processing programs.
In the first part we develop two linear-time algorithms for regular
expression based parsing with Perl-style greedy disambiguation. The first
algorithm operates in two passes in a semi-streaming fashion, using a constant
amount of working memory and an auxiliary tape storage which is written in the
first pass and consumed by the second. The second algorithm is a single-pass
and optimally streaming algorithm which outputs as much of the parse tree as is
semantically possible based on the input prefix read so far, and resorts to
buffering as many symbols as is required to resolve the next choice. Optimality
is obtained by performing a PSPACE-complete pre-analysis on the regular
expression.
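Perl-style greedy disambiguation, the policy both algorithms implement, can be observed with any backtracking regex engine. The example below uses Python's `re` module as a stand-in, not the dissertation's algorithms: alternatives are tried left to right, so the engine commits to the first alternative that lets the overall match succeed.

```python
# Greedy (Perl-style) disambiguation illustrated with Python's
# backtracking `re` engine: group 1 first tries its left alternative "a";
# group 2 then fails on "c", backtracks, and matches "bc".
import re

m = re.match(r"(a|ab)(c|bc)", "abc")
print(m.groups())  # ('a', 'bc')
```

The greedy parse ('a', 'bc') is exactly the disambiguated parse tree the two streaming algorithms must emit, but without the backtracking: the first does it in two linear-time passes, the second in a single optimally streaming pass.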
In the second part we present Kleenex, a language for expressing
high-performance streaming string processing programs as regular grammars with
embedded semantic actions, and its compilation to streaming string transducers
with worst-case linear-time performance. Its underlying theory is based on
transducer decomposition into oracle and action machines, and a finite-state
specialization of the streaming parsing algorithm presented in the first part.
In the second part we also develop a new linear-time streaming parsing
algorithm for parsing expression grammars (PEG) which generalizes the regular
grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm
reformulated using least fixed points and evaluated using an instance of the
chaotic iteration scheme by Cousot and Cousot.
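The kind of program Kleenex targets can be sketched as a plain finite-state transducer. The example below is an illustration in Python, not Kleenex syntax: a two-state machine that strips `#`-comments from a character stream in one pass, emitting each output symbol as soon as the input allows, with constant memory.

```python
# Illustrative finite-state streaming transducer (not Kleenex syntax):
# two states (in_comment or not), one pass over the input, O(1) memory,
# output emitted as soon as each input symbol is read.

def strip_comments(stream):
    in_comment = False
    for ch in stream:
        if in_comment:
            if ch == "\n":
                in_comment = False
                yield ch                # keep the newline, leave comment state
        elif ch == "#":
            in_comment = True           # switch state, emit nothing
        else:
            yield ch                    # copy ordinary characters through

print("".join(strip_comments("a=1 # note\nb=2\n")))
```

Running this on `"a=1 # note\nb=2\n"` yields `"a=1 \nb=2\n"`. Kleenex programs compile to streaming string transducers of this shape, but with semantic actions attached and worst-case linear-time guarantees established by the oracle/action decomposition described above.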