181,695 research outputs found
QUASII: QUery-Aware Spatial Incremental Index.
With large-scale simulations of increasingly detailed models and improvement of data acquisition technologies, massive amounts of data are easily and quickly created and collected. Traditional systems require indexes to be built before analytic queries can be executed efficiently. Such an indexing step requires substantial computing resources and introduces a considerable and growing data-to-insight gap where scientists need to wait before they can perform any analysis. Moreover, scientists often only use a small fraction of the data - the parts containing interesting phenomena - and indexing it fully does not always pay off. In this paper we develop a novel incremental index for the exploration of spatial data. Our approach, QUASII, builds a data-oriented index as a side-effect of query execution. QUASII distributes the cost of indexing across all queries, while building the index structure only for the subset of data queried. It reduces data-to-insight time and curbs the cost of incremental indexing by gradually and partially sorting the data, while producing a data-oriented hierarchical structure at the same time. As our experiments show, QUASII reduces the data-to-insight time by up to a factor of 11.4x, while its performance converges to that of the state-of-the-art static indexes
Data Structure Lower Bounds for Document Indexing Problems
We study data structure problems related to document indexing and pattern
matching queries and our main contribution is to show that the pointer machine
model of computation can be extremely useful in proving high and unconditional
lower bounds that cannot be obtained in any other known model of computation
with the current techniques. Often our lower bounds match the known space-query
time trade-off curve and in fact for all the problems considered, there is a
very good and reasonable match between the our lower bounds and the known upper
bounds, at least for some choice of input parameters. The problems that we
consider are set intersection queries (both the reporting variant and the
semi-group counting variant), indexing a set of documents for two-pattern
queries, or forbidden- pattern queries, or queries with wild-cards, and
indexing an input set of gapped-patterns (or two-patterns) to find those
matching a document given at the query time.Comment: Full version of the conference version that appeared at ICALP 2016,
25 page
A resource-frugal probabilistic dictionary and applications in (meta)genomics
Genomic and metagenomic fields, generating huge sets of short genomic
sequences, brought their own share of high performance problems. To extract
relevant pieces of information from the huge data sets generated by current
sequencing techniques, one must rely on extremely scalable methods and
solutions. Indexing billions of objects is a task considered too expensive
while being a fundamental need in this field. In this paper we propose a
straightforward indexing structure that scales to billions of element and we
propose two direct applications in genomics and metagenomics. We show that our
proposal solves problem instances for which no other known solution scales-up.
We believe that many tools and applications could benefit from either the
fundamental data structure we provide or from the applications developed from
this structure.Comment: Submitted to PSC 201
A suite of software for processing MicroED data of extremely small protein crystals.
Electron diffraction of extremely small three-dimensional crystals (MicroED) allows for structure determination from crystals orders of magnitude smaller than those used for X-ray crystallography. MicroED patterns, which are collected in a transmission electron microscope, were initially not amenable to indexing and intensity extraction by standard software, which necessitated the development of a suite of programs for data processing. The MicroED suite was developed to accomplish the tasks of unit-cell determination, indexing, background subtraction, intensity measurement and merging, resulting in data that can be carried forward to molecular replacement and structure determination. This ad hoc solution has been modified for more general use to provide a means for processing MicroED data until the technique can be fully implemented into existing crystallographic software packages. The suite is written in Python and the source code is available under a GNU General Public License
Logical segmentation for article extraction in digitized old newspapers
Newspapers are documents made of news item and informative articles. They are
not meant to be red iteratively: the reader can pick his items in any order he
fancies. Ignoring this structural property, most digitized newspaper archives
only offer access by issue or at best by page to their content. We have built a
digitization workflow that automatically extracts newspaper articles from
images, which allows indexing and retrieval of information at the article
level. Our back-end system extracts the logical structure of the page to
produce the informative units: the articles. Each image is labelled at the
pixel level, through a machine learning based method, then the page logical
structure is constructed up from there by the detection of structuring entities
such as horizontal and vertical separators, titles and text lines. This logical
structure is stored in a METS wrapper associated to the ALTO file produced by
the system including the OCRed text. Our front-end system provides a web high
definition visualisation of images, textual indexing and retrieval facilities,
searching and reading at the article level. Articles transcriptions can be
collaboratively corrected, which as a consequence allows for better indexing.
We are currently testing our system on the archives of the Journal de Rouen,
one of France eldest local newspaper. These 250 years of publication amount to
300 000 pages of very variable image quality and layout complexity. Test year
1808 can be consulted at plair.univ-rouen.fr.Comment: ACM Document Engineering, France (2012
- …