
    Data-Oriented Language Processing. An Overview

    During the last few years, a new approach to language processing has started to emerge, which has become known under various labels such as "data-oriented parsing", "corpus-based interpretation", and "tree-bank grammar" (cf. van den Berg et al. 1994; Bod 1992-96; Bod et al. 1996a/b; Bonnema 1996; Charniak 1996a/b; Goodman 1996; Kaplan 1996; Rajman 1995a/b; Scha 1990-92; Sekine & Grishman 1995; Sima'an et al. 1994; Sima'an 1995-96; Tugwell 1995). This approach, which we will call "data-oriented processing" or "DOP", embodies the assumption that human language perception and production work with representations of concrete past language experiences, rather than with abstract linguistic rules. The models that instantiate this approach therefore maintain large corpora of linguistic representations of previously occurring utterances. When processing a new input utterance, analyses of this utterance are constructed by combining fragments from the corpus; the occurrence frequencies of the fragments are used to estimate which analysis is the most probable one. In this paper we give an in-depth discussion of a data-oriented processing model which employs a corpus of labelled phrase-structure trees. We then review some other models that instantiate the DOP approach. Many of these models also employ labelled phrase-structure trees, but use different criteria for extracting fragments from the corpus or employ different disambiguation strategies (Bod 1996b; Charniak 1996a/b; Goodman 1996; Rajman 1995a/b; Sekine & Grishman 1995; Sima'an 1995-96); other models use richer formalisms for their corpus annotations (van den Berg et al. 1994; Bod et al. 1996a/b; Bonnema 1996; Kaplan 1996; Tugwell 1995). Comment: 34 pages, PostScript
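
    To make the mechanism concrete, here is a minimal Python sketch of DOP-style disambiguation, assuming a toy treebank and restricting fragments to depth-1 rules (full DOP uses subtrees of arbitrary depth); it is an illustration of the idea, not any of the cited implementations.

```python
# Minimal, illustrative DOP-style scorer (toy sketch, not the cited systems).
# Fragments here are depth-1 rules only; real DOP uses subtrees of any depth.
from collections import Counter
from math import prod

# Toy treebank: trees as nested tuples (label, child, ...); leaves are strings.
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars"))),
    ("S", ("NP", "stars"), ("VP", ("V", "appeared"))),
]

def rules(tree):
    """Yield depth-1 fragments of a tree as (parent, children-labels)."""
    label, *children = tree
    kids = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, kids)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

counts = Counter(r for t in treebank for r in rules(t))
root_totals = Counter()
for (parent, _), n in counts.items():
    root_totals[parent] += n

def prob(tree):
    """Probability of an analysis: product of fragment relative frequencies."""
    return prod(counts[r] / root_totals[r[0]] for r in rules(tree))

# Disambiguation: among candidate analyses of an input, pick the most probable.
candidates = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars"))),
    ("S", ("NP", "she"), ("VP", ("V", "appeared"))),
]
best = max(candidates, key=prob)
```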

    Automata Tutor v3

    Computer science class enrollments have rapidly risen in the past decade. With current class sizes, standard approaches to grading and providing personalized feedback are no longer possible, and new techniques become both feasible and necessary. In this paper, we present the third version of Automata Tutor, a tool for helping teachers and students in large courses on automata and formal languages. The second version of Automata Tutor supported automatic grading and feedback for finite-automata constructions and has already been used by thousands of users in dozens of countries. This new version of Automata Tutor supports automated grading and feedback generation for a greatly extended variety of new problems, including problems that ask students to create regular expressions, context-free grammars, pushdown automata and Turing machines corresponding to a given description, and problems about converting between equivalent models - e.g., from regular expressions to nondeterministic finite automata. Moreover, for several problems, this new version also enables teachers and students to automatically generate new problem instances. We also present the results of a survey run on a class of 950 students, which shows very positive results about the usability and usefulness of the tool.
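
    The abstract does not describe the grading internals, but automated grading of finite-automata constructions reduces to deciding whether a student's DFA is language-equivalent to a reference DFA. A minimal sketch of one standard decision procedure, breadth-first search over the product automaton, follows; the DFA encoding is a hypothetical choice made for the example.

```python
# Sketch: DFA equivalence via product-automaton search (one standard technique;
# not necessarily Automata Tutor's implementation).
from collections import deque

def equivalent(dfa_a, dfa_b, alphabet):
    """Each DFA is (start_state, accepting_set, transition_dict), where
    transition_dict maps (state, symbol) -> state. Hypothetical encoding."""
    start_a, acc_a, delta_a = dfa_a
    start_b, acc_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if (p in acc_a) != (q in acc_b):
            return False          # reachable pair disagrees: languages differ
        for s in alphabet:
            nxt = (delta_a[(p, s)], delta_b[(q, s)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True                   # no disagreeing reachable pair exists

# Example: both DFAs accept strings over {0,1} with an even number of 1s.
a = (0, {0}, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0})
b = (0, {0}, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0})
assert equivalent(a, b, ["0", "1"])
```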

    Combining semantic and syntactic structure for language modeling

    Structured language models for speech recognition have been shown to remedy the weaknesses of n-gram models. All current structured language models are, however, limited in that they do not take into account dependencies between non-headwords. We show that non-headword dependencies contribute to a significantly improved word error rate, and that a data-oriented parsing model trained on semantically and syntactically annotated data can exploit these dependencies. This paper also contains the first DOP model trained by means of a maximum likelihood reestimation procedure, which solves some of the theoretical shortcomings of previous DOP models. Comment: 4 pages
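
    The reestimation procedure itself is not spelled out in this abstract; as a hedged illustration of what maximum likelihood reestimation of fragment weights can look like, the toy EM loop below alternates between weighting each candidate derivation of a sentence by its current probability and renormalizing expected fragment counts per root label. The sentences, derivations, and fragment names are hypothetical placeholders, not the paper's procedure.

```python
# Toy EM reestimation of fragment probabilities (illustrative only; the paper's
# actual procedure is not reproduced here). A derivation is a list of fragments;
# each fragment is (root_label, fragment_id). All data below is hypothetical.
from collections import defaultdict
from math import prod

# Each sentence is ambiguous: a list of candidate derivations.
sentences = [
    [[("S", "f1"), ("NP", "f2")], [("S", "f3")]],
    [[("S", "f1"), ("NP", "f4")], [("S", "f3")]],
]

fragments = {f for s in sentences for d in s for f in d}
p = {f: 1.0 / len(fragments) for f in fragments}   # uniform initialization

for _ in range(20):
    expected = defaultdict(float)
    for derivations in sentences:
        scores = [prod(p[f] for f in d) for d in derivations]
        z = sum(scores)
        for d, sc in zip(derivations, scores):
            for f in d:
                expected[f] += sc / z      # E-step: posterior-weighted counts
    root_tot = defaultdict(float)
    for (root, _), c in expected.items():
        root_tot[root] += c
    p = {f: c / root_tot[f[0]] for f, c in expected.items()}  # M-step
```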

    An improved parser for data-oriented lexical-functional analysis

    We present an LFG-DOP parser which uses fragments from LFG-annotated sentences to parse new sentences. Experiments with the Verbmobil and Homecentre corpora show that (1) Viterbi n-best search performs about 100 times faster than Monte Carlo search while both achieve the same accuracy; (2) the DOP hypothesis, which states that parse accuracy increases with increasing fragment size, is confirmed for LFG-DOP; (3) LFG-DOP's relative frequency estimator performs worse than a discounted frequency estimator; and (4) LFG-DOP significantly outperforms Tree-DOP if evaluated on tree structures only. Comment: 8 pages
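
    Regarding point (3), the abstract does not specify the discounting scheme, so the sketch below uses simple absolute discounting as a stand-in: each seen fragment count is reduced by a constant before normalization, reserving probability mass for unseen fragments. The counts and the discount value are hypothetical.

```python
# Sketch: relative frequency vs. absolute-discounted frequency estimation for
# fragments sharing a root label. The discount value is a hypothetical choice;
# the paper's actual discounting scheme is not specified in this abstract.
from collections import Counter

counts = Counter({"frag_a": 8, "frag_b": 1, "frag_c": 1})  # toy fragment counts
total = sum(counts.values())

relative = {f: c / total for f, c in counts.items()}       # relative frequency

d = 0.5                                    # absolute discount per seen fragment
discounted = {f: (c - d) / total for f, c in counts.items()}
reserved = d * len(counts) / total         # mass held back for unseen fragments

# Relative frequency overfits rare fragments; discounting reserves probability
# mass (here 0.15) that can be redistributed to fragments absent from the corpus.
```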

    Streaming algorithms for language recognition problems

    We study the complexity of the following problems in the streaming model.

    Membership testing for DLIN: we show that every language in DLIN can be recognised by a randomized one-pass O(log n) space algorithm with inverse polynomial one-sided error, and by a deterministic p-pass O(n/p) space algorithm. We show that these algorithms are optimal.

    Membership testing for LL(k): for languages generated by LL(k) grammars with a bound of r on the number of nonterminals at any stage in the left-most derivation, we show that membership can be tested by a randomized one-pass O(r log n) space algorithm with inverse polynomial (in n) one-sided error.

    Membership testing for DCFL: we show that randomized algorithms as efficient as the ones described above for DLIN and LL(k) (which are subclasses of DCFL) cannot exist for all of DCFL: there is a language in VPL (a subclass of DCFL) for which any randomized p-pass algorithm with error bounded by ε < 1/2 must use Ω(n/p) space.

    Degree sequence problem: we study the problem of determining, given a sequence d_1, d_2, ..., d_n and a graph G, whether the degree sequence of G is precisely d_1, d_2, ..., d_n. We give a randomized one-pass O(log n) space algorithm with inverse polynomial one-sided error probability. We show that our algorithms are optimal.

    Our randomized algorithms are based on the recent work of Magniez et al. [MMN09]; our lower bounds are obtained by considering related communication complexity problems.
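
    A primitive that underlies many such small-space randomized streaming algorithms is polynomial fingerprinting: two multisets {a_i} and {b_i} are equal exactly when the polynomials ∏(x − a_i) and ∏(x − b_i) coincide, which can be tested at a random point modulo a prime in O(log n) space with inverse polynomial one-sided error. The sketch below illustrates that general primitive, not the paper's specific algorithms.

```python
# Sketch: one-pass multiset-equality fingerprint, the kind of O(log n)-space
# randomized primitive underlying such streaming algorithms (illustrative;
# not the specific algorithms of the paper).
import random

P = (1 << 61) - 1          # a large prime; false-positive prob <= length / P

def fingerprint(stream, x):
    """Evaluate prod_{a in stream} (x - a) mod P in a single pass."""
    fp = 1
    for a in stream:
        fp = fp * (x - a) % P
    return fp

def multisets_equal(stream_a, stream_b):
    x = random.randrange(P)              # same random point for both streams
    return fingerprint(stream_a, x) == fingerprint(stream_b, x)

# Example: degree-sequence style check, comparing a claimed sequence against
# the multiset of degrees actually observed.
claimed = [2, 1, 1]
observed = [1, 2, 1]
assert multisets_equal(claimed, observed)
```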

    Discontinuity and asymmetry in phrase structure grammars
