121 research outputs found
Locally Chain-Parsable Languages
If a context-free language enjoys the local parsability property then, no matter how the source string is segmented, each segment can be parsed independently, and an efficient parallel parsing algorithm becomes possible. The new class of locally chain-parsable languages (LCPL), included in the deterministic context-free languages, is here defined by means of the chain-driven automaton and characterized by decidable properties of grammar derivations. Such an automaton decides whether or not to reduce a factor in a way purely driven by the terminal characters, thus extending the well-known concept of Input-Driven (ID) (visibly) pushdown machines. LCPL extend and improve the practically relevant operator-precedence languages (Floyd), which are known to strictly include the ID languages, and for which a parallel-parser generator exists. Consistently with the classical results for ID, chain-compatible LCPL are closed under reversal and Boolean operations, and language inclusion is decidable.
Toward a theory of input-driven locally parsable languages
If a context-free language enjoys the local parsability property then, no matter how the source string is segmented, each segment can be parsed independently, and an efficient parallel parsing algorithm becomes possible. The new class of locally chain-parsable languages (LCPLs), included in the deterministic context-free language family, is here defined by means of the chain-driven automaton and characterized by decidable properties of grammar derivations. Such an automaton decides whether or not to reduce a substring in a way purely driven by the terminal characters, thus extending the well-known concept of input-driven (ID), alias visibly pushdown, machines. The LCPL family extends and improves the practically relevant Floyd operator-precedence (OP) languages, which are known to strictly include the ID languages, and for which a parallel-parser generator exists.
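The terminal-driven shift/reduce decisions described above can be sketched concretely. Below is a toy operator-precedence parser for an illustrative grammar (E -> E + E | E * E | n; the grammar, the precedence table, and the `op_parse` function are inventions of this sketch, not material from the paper). Every decision consults only a precedence relation between two terminal symbols, which is why such parsers need only bounded local context.

```python
# Toy operator-precedence parser. The decision to shift or reduce is driven
# purely by a precedence relation between TERMINAL symbols, so it depends
# only on bounded local context.
# Illustrative grammar: E -> E + E | E * E | n, with '*' binding tighter.

PREC = {  # '<' yields precedence, '=' equal in precedence, '>' takes precedence
    ('#', 'n'): '<', ('#', '+'): '<', ('#', '*'): '<', ('#', '#'): '=',
    ('n', '+'): '>', ('n', '*'): '>', ('n', '#'): '>',
    ('+', 'n'): '<', ('+', '+'): '>', ('+', '*'): '<', ('+', '#'): '>',
    ('*', 'n'): '<', ('*', '+'): '>', ('*', '*'): '>', ('*', '#'): '>',
}

def op_parse(tokens):
    """Shift-reduce loop; returns the handles in the order they are reduced."""
    stack, reductions = ['#'], []
    stream = list(tokens) + ['#']      # '#' delimits the input
    i = 0
    top_term = lambda s: next(x for x in reversed(s) if x != 'N')
    while True:
        t, a = top_term(stack), stream[i]
        if t == '#' and a == '#':
            return reductions
        if PREC[(t, a)] in ('<', '='):
            stack.append(a)            # shift
            i += 1
        else:                          # '>': reduce the topmost handle
            handle = [stack.pop()]
            # pop until the terminal below the handle yields precedence '<'
            while handle[-1] == 'N' or PREC[(top_term(stack), handle[-1])] != '<':
                handle.append(stack.pop())
            if stack[-1] == 'N':       # nonterminal left of the handle
                handle.append(stack.pop())
            reductions.append(''.join(reversed(handle)))
            stack.append('N')
```

On "n+n*n" the multiplication handle is reduced before the addition, exactly as the precedence relations dictate, and the same decisions would be taken on any segment of the input in isolation.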
Parallel parsing made practical
The property of local parsability allows inputs to be parsed by inspecting only a bounded-length string around the current token. This in turn enables the construction of a scalable, data-parallel parsing algorithm, which is presented in this work. Such an algorithm is easily amenable to automatic generation via a parser generator tool, which was realized and is also presented in the following. Furthermore, to complete the framework of parallel input analysis, a parallel scanner can also be combined with the parser. To prove the practicality of a parallel lexing and parsing approach, we report the results of the adaptation of JSON and Lua to a form fit for parallel parsing (i.e. an operator-precedence grammar) through simple grammar changes and scanning transformations. The approach is validated with performance figures from both high-performance and embedded multicore platforms, obtained by analyzing real-world inputs as a test bench. The results show that our approach matches or dominates the performance of production-grade LR parsers in sequential execution, and achieves significant speedups and good scaling on multi-core machines. The work concludes with a broad and critical survey of past work on parallel parsing and of future directions on the integration with semantic analysis and incremental parsing.
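The chunk-wise independence that local parsability buys can be illustrated with a deliberately simplified stand-in for parsing: bracket matching. Each chunk is analyzed on its own and reduces to a small summary that combines associatively; the thread pool, the chunk count, and the bracket "language" are all assumptions of this sketch, not the paper's algorithm.

```python
# Divide-and-combine sketch: each chunk of the input is analyzed
# independently, and the partial results combine associatively, so the
# chunks could run on separate cores.
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk):
    """Reduce a chunk to (unmatched ')', unmatched '(') counts."""
    close, open_ = 0, 0
    for c in chunk:
        if c == '(':
            open_ += 1
        elif c == ')':
            if open_:
                open_ -= 1      # matched locally inside the chunk
            else:
                close += 1      # must be matched by an earlier chunk
    return close, open_

def combine(left, right):
    """Associative merge: left-over '(' of left cancel leading ')' of right."""
    lc, lo = left
    rc, ro = right
    matched = min(lo, rc)
    return lc + rc - matched, lo + ro - matched

def balanced(text, nchunks=4):
    size = max(1, (len(text) + nchunks - 1) // nchunks)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(scan_chunk, chunks))
    total = (0, 0)
    for p in parts:
        total = combine(total, p)
    return total == (0, 0)
```

Because `combine` is associative, the answer does not depend on where the input is split, which is the essence of the segment-independence property the abstract relies on.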
On Sentence Parsing Techniques for Speech Translation
The full text is a PDF conversion of image files produced by the National Diet Library's FY2010 digitization of doctoral dissertations. Kyoto University; Doctor of Engineering (by thesis), Otsu No. 8652 (Eng. Thesis No. 2893); Shinsei||Ko||968 (University Library); UT51-94-R411. Examiners: Professor Makoto Nagao (chair), Professor Shuji Doshita, Professor Katsuo Ikeda. Qualified under Article 4, Paragraph 2 of the Degree Regulations. Doctor of Engineering, Kyoto University.
Extraction of chemical structures and reactions from the literature
The ever-increasing quantity of chemical literature necessitates
the creation of automated techniques for extracting relevant information.
This work focuses on two aspects: the conversion of chemical names to
computer readable structure representations and the extraction of chemical
reactions from text.
Chemical names are a common way of communicating chemical structure
information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an
open source, freely available algorithm for converting chemical names to
structures, was developed. OPSIN employs a regular grammar to direct
tokenisation and parsing leading to the generation of an XML parse tree.
Nomenclature operations are applied successively to the tree with many
requiring the manipulation of an in-memory connection table representation
of the structure under construction. Areas of nomenclature supported are
described with attention being drawn to difficulties that may be
encountered in name to structure conversion. Results on sets of generated
names and names extracted from patents are presented. On generated names,
recall of between 96.2% and 99.0% was achieved, with a lower bound of 97.9%
on precision; all results were comparable or superior to the
tested commercial solutions. On the patent names, OPSIN's recall was 2-10%
higher than the tested solutions when the patent names were processed as
found in the patents. The uses of OPSIN as a web service and as a tool for
identifying chemical names in text are shown to demonstrate the direct
utility of this algorithm.
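As a hedged illustration of the tokenise-then-apply-rules pipeline (far simpler than OPSIN, and not using its API), the sketch below converts unbranched alkane names to molecular formulae with a small regular grammar; the `STEMS` table and the `alkane_formula` function are inventions of this example.

```python
# Toy name-to-structure conversion: a regular grammar tokenises the name
# into a stem and a suffix, then a nomenclature rule (C_n H_{2n+2}) builds
# the "structure" -- here just a molecular formula.
import re

STEMS = {'meth': 1, 'eth': 2, 'prop': 3, 'but': 4,
         'pent': 5, 'hex': 6, 'hept': 7, 'oct': 8}

NAME = re.compile(r'^(?P<stem>' + '|'.join(STEMS) + r')(?P<suffix>ane)$')

def alkane_formula(name):
    """Molecular formula of an unbranched alkane given its systematic name."""
    m = NAME.match(name.lower())
    if not m:
        raise ValueError(f'unsupported name: {name!r}')
    n = STEMS[m.group('stem')]
    return ('C' if n == 1 else f'C{n}') + f'H{2 * n + 2}'
```

Real systematic nomenclature of course needs locants, substituents, and an in-memory connection table, as the abstract describes; the point here is only the grammar-directed tokenisation step.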
A software system for extracting chemical reactions from the text of
chemical patents was developed. The system relies on the output of
ChemicalTagger, a tool for tagging words and identifying phrases of
importance in experimental chemistry text. Improvements to this tool
required to facilitate this task are documented. The structures of chemical
entities are, where possible, determined using OPSIN in conjunction with a
dictionary of name-to-structure relationships. Extracted reactions are
atom-mapped to confirm that they are chemically consistent. 424,621
atom-mapped reactions were extracted from 65,034 organic chemistry USPTO
patents. On a sample of 100 of these extracted reactions, chemical entities
were identified with 96.4% recall and 88.9% precision. Quantities could be
associated with reagents in 98.8% of cases and with products in 64.9% of
cases, whilst the correct role was assigned to chemical entities in 91.8% of
cases. Qualitatively, the system captured the essence of the reaction in
95% of cases. This system is expected to be useful in the creation of
searchable databases of reactions from chemical patents and in
facilitating analysis of the properties of large populations of reactions.
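A minimal flavour of the consistency checking mentioned above, much weaker than true atom mapping: the sketch below merely verifies that every element occurs equally often on both sides of a reaction. The formula parser and the `element_balanced` helper are assumptions of this example, not part of the described system.

```python
# Element-balance check: a reaction is rejected unless each element occurs
# equally often among reactants and products. (True atom mapping, as in the
# abstract, additionally pairs up individual atoms across the reaction.)
import re
from collections import Counter

ATOM = re.compile(r'([A-Z][a-z]?)(\d*)')   # element symbol + optional count

def atoms(formula):
    """Element multiset of a simple molecular formula like 'C2H5OH'."""
    counts = Counter()
    for elem, num in ATOM.findall(formula):
        counts[elem] += int(num or 1)
    return counts

def element_balanced(reactants, products):
    lhs = sum((atoms(f) for f in reactants), Counter())
    rhs = sum((atoms(f) for f in products), Counter())
    return lhs == rhs
```

For instance, methane combustion written with explicit stoichiometry (CH4 + 2 O2 -> CO2 + 2 H2O) passes the check, while an unbalanced reaction fails.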
A Drop-in Replacement for LR(1) Table-Driven Parsing
This paper presents a construction method for a deterministic one-symbol look-ahead LR parser which allows non-terminals in the parser's look-ahead. This effectively relaxes the requirement of parsing the reverse of the right-most derivation of a string/sentence. This is achieved by replacing the deterministic pushdown automaton of LR parsing with a two-stack automaton. The class of grammars accepted by the two-stack parser properly contains the LR(k) grammars. Since the modification to the table-driven LR parsing process is relatively minor and mostly impacts the creation of the goto and action tables, a parser modified to adopt the two-stack process should be comparable in size and performance to LR parsers.
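For readers unfamiliar with the table-driven process being modified, here is a classic single-stack LR engine (not the paper's two-stack construction) with hand-built ACTION/GOTO tables for the tiny grammar S -> ( S ) | x; the tables, states, and `lr_parse` function are constructed for this sketch. The paper's change would mostly affect how such tables are created, leaving the driving loop largely intact.

```python
# Table-driven LR engine for the grammar  S -> ( S ) | x.
# States: 0 start, 1 accept, 2 after '(', 3 after 'x', 4 after '( S',
# 5 after '( S )'. '$' is the end marker.
ACTION = {
    (0, '('): ('shift', 2), (0, 'x'): ('shift', 3),
    (1, '$'): ('accept',),
    (2, '('): ('shift', 2), (2, 'x'): ('shift', 3),
    (3, ')'): ('reduce', 'S', 1), (3, '$'): ('reduce', 'S', 1),  # S -> x
    (4, ')'): ('shift', 5),
    (5, ')'): ('reduce', 'S', 3), (5, '$'): ('reduce', 'S', 3),  # S -> ( S )
}
GOTO = {(0, 'S'): 1, (2, 'S'): 4}

def lr_parse(tokens):
    """Return True iff tokens derive S under the tables above."""
    stack = [0]                        # state stack; symbols are implicit
    stream = list(tokens) + ['$']
    i = 0
    while True:
        act = ACTION.get((stack[-1], stream[i]))
        if act is None:
            return False               # no table entry: syntax error
        if act[0] == 'shift':
            stack.append(act[1])
            i += 1
        elif act[0] == 'reduce':
            _, lhs, rhs_len = act
            del stack[-rhs_len:]       # pop the handle's states
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True                # accept
```

The goto and action tables are exactly the data structures the abstract says the two-stack construction mainly changes.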
Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve
Can we analyze data without decompressing it? As our data keeps growing,
understanding the time complexity of problems on compressed inputs, rather than
in convenient uncompressed forms, becomes more and more relevant. Suppose we
are given a compression of size n of data that originally has size N, and we
want to solve a problem with time complexity T. The naive strategy of
"decompress-and-solve" gives time T(N), whereas "the gold standard" is time
T(n): to analyze the compression as efficiently as if the original data was
small.
We restrict our attention to data in the form of a string (text, files,
genomes, etc.) and study the most ubiquitous tasks. While the challenge might
seem to depend heavily on the specific compression scheme, most methods of
practical relevance (Lempel-Ziv family, dictionary methods, and others) can be
unified under the elegant notion of Grammar Compressions. A vast literature,
across many disciplines, established this as an influential notion for
Algorithm design.
We introduce a framework for proving (conditional) lower bounds in this
field, allowing us to assess whether decompress-and-solve can be improved, and
by how much. Our main results are:
- The O(nN sqrt(log(N/n))) bound for LCS and the O(min{N log N, nM}) bound for
Pattern Matching with Wildcards are optimal up to N^o(1) factors, under the
Strong Exponential Time Hypothesis. (Here, M denotes the uncompressed length
of the compressed pattern.)
- Decompress-and-solve is essentially optimal for Context-Free Grammar
Parsing and RNA Folding, under the k-Clique conjecture.
- We give an algorithm showing that decompress-and-solve is not optimal for
Disjointness. (Comment: Presented at FOCS'17. Full version, 63 pages.)
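A small sketch of why grammar compression admits analysis without full decompression, under the standard straight-line-program model (binary rules only; the `RULES` example and helper names are invented for illustration): once each nonterminal's expansion length is precomputed, any character of the possibly huge text is reachable by a single root-to-leaf walk, in time proportional to the grammar's depth rather than to N.

```python
# Straight-line program (SLP): each nonterminal expands to exactly two
# symbols; terminals are single characters. Random access to position i
# walks one root-to-leaf path instead of expanding the whole string.

def grammar_lengths(rules):
    """Expansion length of every nonterminal (terminals have length 1)."""
    lengths = {}
    def length(sym):
        if sym not in rules:
            return 1
        if sym not in lengths:
            left, right = rules[sym]
            lengths[sym] = length(left) + length(right)
        return lengths[sym]
    for nt in rules:
        length(nt)
    return lengths

def access(rules, lengths, sym, i):
    """Character at 0-based position i of sym's expansion."""
    while sym in rules:
        left, right = rules[sym]
        llen = lengths[left] if left in rules else 1
        if i < llen:
            sym = left                 # position lies in the left expansion
        else:
            sym, i = right, i - llen   # shift into the right expansion
    return sym

# Example grammar: S expands to "abababab" (8 characters, 3 rules).
RULES = {'S': ('A', 'A'), 'A': ('B', 'B'), 'B': ('a', 'b')}
```

For highly repetitive data the grammar can be exponentially smaller than the text, so even this simple primitive already beats decompress-and-solve for queries like random access.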
- …