83,706 research outputs found
On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching
We present parallel algorithms for exact and approximate pattern matching
with suffix arrays, using a CREW-PRAM with processors. Given a static text
of length , we first show how to compute the suffix array interval of a
given pattern of length in
time for . For approximate pattern matching with differences or
mismatches, we show how to compute all occurrences of a given pattern in
time, where is the size of the alphabet
and . The workhorse of our algorithms is a data structure
for merging suffix array intervals quickly: Given the suffix array intervals
for two patterns and , we present a data structure for computing the
interval of in sequential time, or in
parallel time. All our data structures are of size bits (in addition to
the suffix array)
Logical segmentation for article extraction in digitized old newspapers
Newspapers are documents made of news item and informative articles. They are
not meant to be red iteratively: the reader can pick his items in any order he
fancies. Ignoring this structural property, most digitized newspaper archives
only offer access by issue or at best by page to their content. We have built a
digitization workflow that automatically extracts newspaper articles from
images, which allows indexing and retrieval of information at the article
level. Our back-end system extracts the logical structure of the page to
produce the informative units: the articles. Each image is labelled at the
pixel level, through a machine learning based method, then the page logical
structure is constructed up from there by the detection of structuring entities
such as horizontal and vertical separators, titles and text lines. This logical
structure is stored in a METS wrapper associated to the ALTO file produced by
the system including the OCRed text. Our front-end system provides a web high
definition visualisation of images, textual indexing and retrieval facilities,
searching and reading at the article level. Articles transcriptions can be
collaboratively corrected, which as a consequence allows for better indexing.
We are currently testing our system on the archives of the Journal de Rouen,
one of France eldest local newspaper. These 250 years of publication amount to
300 000 pages of very variable image quality and layout complexity. Test year
1808 can be consulted at plair.univ-rouen.fr.Comment: ACM Document Engineering, France (2012
- …