Substring filtering for low-cost linked data interfaces
Recently, Triple Pattern Fragments (TPFs) were introduced as a low-cost server-side interface for situations in which high numbers of clients need to evaluate SPARQL queries. Scalability is achieved by moving part of the query execution to the client, at the cost of elevated query times. Since the TPF interface purposely does not support complex constructs such as SPARQL filters, queries that use them need to be executed mostly on the client, resulting in long execution times. We therefore investigated the impact of adding a literal substring matching feature to the TPF interface, with the goal of improving query performance while maintaining low server cost. In this paper, we discuss the client/server setup and compare the performance of SPARQL queries on multiple implementations, including Elasticsearch and a case-insensitive FM-index. Our evaluations indicate that these improvements allow for faster query execution without significantly increasing the load on the server. Offering the substring feature on TPF servers allows users to obtain faster responses for filter-based SPARQL queries. Furthermore, substring matching can be used to support other filters such as complete regular expressions or range queries.
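The trade-off described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy triples and the function names (`match_pattern`, `client_side_filter`, `server_side_filter`) are hypothetical, and a plain linear scan stands in for the FM-index or Elasticsearch backends mentioned above. It only shows why pushing a case-insensitive substring test to the server can reduce the number of triples a client has to fetch.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy dataset; a real TPF server would page over a large store.
TRIPLES: List[Triple] = [
    ("ex:book1", "dc:title", "Linked Data Fragments"),
    ("ex:book2", "dc:title", "Query Processing"),
    ("ex:book3", "dc:title", "Fragments of a Survey"),
    ("ex:book1", "dc:creator", "ex:alice"),
]


def match_pattern(triples: List[Triple],
                  s: Optional[str] = None,
                  p: Optional[str] = None,
                  o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching a triple pattern; None acts as a variable."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) \
                and (o is None or t[2] == o):
            yield t


def client_side_filter(triples: List[Triple], p: str, needle: str):
    """Plain TPF: fetch every triple for the pattern, then filter locally."""
    fetched = list(match_pattern(triples, p=p))
    hits = [t for t in fetched if needle.lower() in t[2].lower()]
    return hits, len(fetched)  # (results, triples transferred)


def server_side_filter(triples: List[Triple], p: str, needle: str):
    """Assumed extension: the server applies the substring test itself."""
    hits = [t for t in match_pattern(triples, p=p)
            if needle.lower() in t[2].lower()]
    return hits, len(hits)  # only matching triples are transferred


client_hits, client_cost = client_side_filter(TRIPLES, "dc:title", "fragments")
server_hits, server_cost = server_side_filter(TRIPLES, "dc:title", "fragments")
```

Both strategies return the same answers; they differ only in how many triples cross the network, which is the cost the abstract attributes to client-side filter evaluation.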
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention.
Comment: 44 pages
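As a small illustration of what partitioning an array into chunks involves, the following Python sketch assumes regular (fixed-shape) chunking, which is only one of the strategies such a survey would cover. The function names `locate` and `chunks_for_range` are hypothetical and not taken from any surveyed system.

```python
from itertools import product
from typing import List, Tuple

Coord = Tuple[int, ...]


def locate(cell: Coord, chunk_shape: Coord) -> Tuple[Coord, Coord]:
    """Map a cell coordinate to (chunk index, offset inside that chunk)."""
    chunk = tuple(c // s for c, s in zip(cell, chunk_shape))
    offset = tuple(c % s for c, s in zip(cell, chunk_shape))
    return chunk, offset


def chunks_for_range(lo: Coord, hi: Coord, chunk_shape: Coord) -> List[Coord]:
    """List the chunks overlapped by the inclusive hyper-rectangle [lo, hi]."""
    per_dim = [range(l // s, h // s + 1)
               for l, h, s in zip(lo, hi, chunk_shape)]
    return list(product(*per_dim))


# A cell in a 2-D array stored in 4x4 chunks.
chunk, offset = locate((5, 7), (4, 4))
# Chunks a range query over rows 2..5 and columns 2..3 has to read.
touched = chunks_for_range((2, 2), (5, 3), (4, 4))
```

A range query only needs to read the chunks returned by `chunks_for_range`, which is why the choice of chunk shape affects how much data a subarray operation touches.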
Encoding dynamics for multiscale community detection: Markov time sweeping for the Map equation
The detection of community structure in networks is intimately related to
finding a concise description of the network in terms of its modules. This
notion has been recently exploited by the Map equation formalism (M. Rosvall
and C.T. Bergstrom, PNAS, 105(4), pp.1118--1123, 2008) through an
information-theoretic description of the process of coding inter- and
intra-community transitions of a random walker in the network at stationarity.
However, a thorough study of the relationship between the full Markov dynamics
and the coding mechanism is still lacking. We show here that the original Map
coding scheme, which is both block-averaged and one-step, neglects the internal
structure of the communities and introduces an upper scale, the `field-of-view'
limit, in the communities it can detect. As a consequence, Map is well tuned to
detect clique-like communities but can lead to undesirable overpartitioning
when communities are far from clique-like. We show that a signature of this
behavior is a large compression gap: the Map description length is far from its
ideal limit. To address this issue, we propose a simple dynamic approach that
introduces time explicitly into the Map coding through the analysis of the
weighted adjacency matrix of the time-dependent multistep transition matrix of
the Markov process. The resulting Markov time sweeping induces a dynamical
zooming across scales that can reveal (potentially multiscale) community
structure above the field-of-view limit, with the relevant partitions indicated
by a small compression gap.
Comment: 10 pages, 6 figures
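The time-dependent multistep transition matrix mentioned in this abstract can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: it assumes an undirected graph with no isolated nodes, a discrete Markov time t, and it takes the "flow" matrix to be the stationary distribution times the t-step transition matrix. The helper names are hypothetical, and the Map equation optimisation itself is not shown.

```python
from typing import List

Matrix = List[List[float]]


def transition_matrix(adj: Matrix) -> Matrix:
    """Row-normalise the adjacency matrix: P = D^{-1} A."""
    return [[w / sum(row) for w in row] for row in adj]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matrix_power(p: Matrix, t: int) -> Matrix:
    """Multistep transition matrix P^t for an integer Markov time t >= 0."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, p)
    return result


def stationary_undirected(adj: Matrix) -> List[float]:
    """For an undirected graph, pi_i is proportional to the degree of i."""
    total = sum(sum(row) for row in adj)
    return [sum(row) / total for row in adj]


def flow_matrix(adj: Matrix, t: int) -> Matrix:
    """Weighted 'adjacency' at Markov time t: diag(pi) P^t (an assumption)."""
    pt = matrix_power(transition_matrix(adj), t)
    pi = stationary_undirected(adj)
    n = len(adj)
    return [[pi[i] * pt[i][j] for j in range(n)] for i in range(n)]


# Path graph 0 - 1 - 2.
ADJ = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
P2 = matrix_power(transition_matrix(ADJ), 2)
F2 = flow_matrix(ADJ, 2)
```

Sweeping t and feeding each resulting flow matrix to a community detection routine is the kind of "zooming across scales" the abstract describes; larger t lets the walker mix over larger regions of the graph.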
Expansion-maximization-compression algorithm with spherical harmonics for single particle imaging with X-ray lasers
In 3D single particle imaging with X-ray free-electron lasers, particle
orientation is not recorded during measurement but is instead recovered as a
necessary step in the reconstruction of a 3D image from the diffraction data.
Here we use harmonic analysis on the sphere to cleanly separate the angular
and radial degrees of freedom of this problem, providing new opportunities to
efficiently use data and computational resources. We develop the
Expansion-Maximization-Compression algorithm into a shell-by-shell approach and
implement an angular bandwidth limit that can be gradually raised during the
reconstruction. We study the minimum number of patterns and minimum rotation
sampling required for a desired angular and radial resolution. These extensions
provide new avenues to improve computational efficiency and speed of
convergence, which are critically important considering the very large datasets
expected from experiment.
Optimising Unicode Regular Expression Evaluation with Previews
The jsre regular expression library was designed to provide fast matching of complex expressions over large input streams using user-selectable character encodings. An established design approach was used: a simulated non-deterministic automaton (NFA) implemented as a virtual machine, avoiding exponential cost functions in either space or time. A deterministic automaton (DFA) was chosen as a general dispatching mechanism for Unicode character classes and this also provided the opportunity to use compact DFAs in various optimization strategies. The result was the development of a regular expression Preview, which provides a summary of all the matches possible from a given point in a regular expression in a form that can be implemented as a compact DFA and can be used to further improve the performance of the standard NFA simulation algorithm. This paper formally defines a preview and describes and evaluates several optimizations using this construct. They provide significant speed improvements accrued from fast scanning of anchor positions, avoiding retesting of repeated strings in unanchored searches, and efficient searching of multiple alternate expressions, which in the case of keyword searching has a time complexity that is logarithmic in the number of words to be searched.
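The standard NFA simulation that this abstract builds on can be sketched as follows. This is a minimal Python illustration of the general state-set technique, not jsre's virtual machine or API: the NFA for `a(b|c)*d` is built by hand, the names `NFA`, `closure` and `matches` are hypothetical, and previews are not implemented. It shows why the approach avoids exponential cost: at most one set of states is tracked per input character.

```python
from typing import Dict, List, Optional, Set, Tuple

EPS = None  # label used for epsilon (empty) transitions

# Hand-built NFA for the expression a(b|c)*d: state -> [(label, target)].
NFA: Dict[int, List[Tuple[Optional[str], int]]] = {
    0: [("a", 1)],
    1: [(EPS, 2), (EPS, 4)],   # either enter the loop body or leave it
    2: [("b", 3), ("c", 3)],
    3: [(EPS, 1)],             # loop back for another repetition
    4: [("d", 5)],
    5: [],
}
START, ACCEPT = 0, 5


def closure(states: Set[int]) -> Set[int]:
    """Add every state reachable through epsilon transitions."""
    stack = list(states)
    seen = set(states)
    while stack:
        state = stack.pop()
        for label, target in NFA[state]:
            if label is EPS and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def matches(text: str) -> bool:
    """Full-match test; work per character is bounded by the NFA size."""
    current = closure({START})
    for ch in text:
        stepped = {target for state in current
                   for label, target in NFA[state] if label == ch}
        current = closure(stepped)
        if not current:        # no live states left: fail early
            return False
    return ACCEPT in current


results = [matches(s) for s in ("ad", "abcd", "a", "abx")]
```

Because the set of live states never exceeds the number of NFA states, the running time is proportional to the input length times the NFA size, regardless of how the expression nests alternation and repetition.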