377 research outputs found

    AT-GIS: highly parallel spatial query processing with associative transducers

    Users in many domains, including urban planning, transportation, and environmental science, want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing rely either on extensions to RDBMSs, which entail expensive loading and indexing phases when the data changes, or on distributed map/reduce frameworks running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multicore CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute, in parallel, spatial query operators on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIS provides 3× the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10× for aggregation queries.

    Parsing XML Using Parallel Traversal of Streaming Trees

    XML has been widely adopted across a wide spectrum of applications. Its parsing efficiency, however, remains a concern, and can be a bottleneck. With the current trend towards multicore CPUs, parallelization to improve performance is increasingly relevant. In many applications, the XML is streamed from the network, and thus the complete XML document is never in memory at any single moment in time. Parallel parsing of such a stream can be equated to parallel depth-first traversal of a streaming tree. Existing research on parallel tree traversal has assumed the entire tree was available in-memory, and thus cannot be directly applied. In this paper we investigate parallel, SAX-style parsing of XML via a parallel, depth-first traversal of the streaming document. We show good scalability up to about 6 cores on a Linux platform.

    Stream Processing using Grammars and Regular Expressions

    In this dissertation we study regular expression based parsing and the use of grammatical specifications for the synthesis of fast, streaming string-processing programs. In the first part we develop two linear-time algorithms for regular expression based parsing with Perl-style greedy disambiguation. The first algorithm operates in two passes in a semi-streaming fashion, using a constant amount of working memory and an auxiliary tape storage which is written in the first pass and consumed by the second. The second algorithm is a single-pass and optimally streaming algorithm which outputs as much of the parse tree as is semantically possible based on the input prefix read so far, and resorts to buffering as many symbols as is required to resolve the next choice. Optimality is obtained by performing a PSPACE-complete pre-analysis on the regular expression. In the second part we present Kleenex, a language for expressing high-performance streaming string processing programs as regular grammars with embedded semantic actions, and its compilation to streaming string transducers with worst-case linear-time performance. Its underlying theory is based on transducer decomposition into oracle and action machines, and a finite-state specialization of the streaming parsing algorithm presented in the first part. In the second part we also develop a new linear-time streaming parsing algorithm for parsing expression grammars (PEG) which generalizes the regular grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm reformulated using least fixed points and evaluated using an instance of the chaotic iteration scheme by Cousot and Cousot.

    FST Morphology for the Endangered Skolt Sami Language

    Peer reviewed

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity of and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to the language in the domain of emotion detection and I explain how some linguistic objects, most prominently multi-word expressions, defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places the MWEs on the frontiers between different levels of linguistic processing, such as lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for lexical high-quality morphosyntactic description of MWUs. It uses a graph-based approach in which the inflection of a MWU is expressed in function of the morphology of its components, and of morphosyntactic transformation patterns. Due to unification the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable. It adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation.
It can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept. Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results issued from large annotation tasks at two levels (for named entities and for coreference), parts of the National Corpus of Polish construction. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources. With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than what usually happens in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements. My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with 2 colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD.
We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates. Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions. I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, once Multiflex had been developed with Polish in mind it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages. This allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages since their morphology encodes much more syntactic information than is the case e.g. in English. In this dissertation I am also supposed to demonstrate my ability to play an active role in shaping the scientific landscape, on a local, national and international scale.
I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects. The issues addressed in this dissertation open interesting scientific perspectives, in which special emphasis is placed on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches.

    Automated NDT inspection for large and complex geometries of composite materials

    Large components with complex geometries, made of composite materials, have become very common in modern structures. To cope with future demand projections, it is necessary to overcome the current non-destructive testing (NDT) bottlenecks encountered during the inspection phase of manufacture. This thesis investigates several aspects of the introduction of automation within the inspection process of complex parts. The use of six-axis robots for product inspection and non-destructive testing systems is the central investigation of this thesis. The challenges embraced by the research include the development of a novel controlling approach for robotic manipulators and of novel path-planning strategies. The integration of robot manipulators and NDT data acquisition instruments is optimized. An effective and reliable way to encode the NDT data through the interpolated robot feedback positions is implemented. The viability of the new external control method is evaluated experimentally. The observed maximum position and orientation errors are respectively within 2mm and within 1 degree, over an operating envelope of 3m³. A new software toolbox (RoboNDT), aimed at NDT technicians, has been developed during this work. RoboNDT is intended to transform the robot path-planning problem into an easy step of the inspection process. The software incorporates the novel path-planning algorithms developed during this research and is shaped to overcome practical limitations of current OLP software. The software has been experimentally validated using scans on real high value aerospace components. RoboNDT delivers tool-path errors that are lower than the errors given by commercial off-line path-planning software. For example the variability of the standoff is within 10 mm for the tool-paths created with the commercial software and within 4.5 mm for the RoboNDT tool-paths, over a scanned area of 1.6m². 
The output of this research was used to support a 3-year industrial project, called IntACom and led by TWI on behalf of major aerospace sponsors. The result is a demonstrator system, currently in use at TWI Technology Centre, which is capable of inspecting complex geometries with high throughput. The IntACom system can scan real components 2.8 times faster than traditional 3-DoF scanners deploying phased-array inspection and 6.7 times faster than commercial gantry systems deploying traditional single-element inspection.