
    Handling Syntactic Extra-Grammaticality

    This paper reviews and summarizes six different types of extra-grammatical phenomena and their corresponding recovery principles at the syntactic level, and describes some techniques used to deal with four of them completely within an Extended GLR (EGLR) parser. Partial solutions to the remaining two by the EGLR parser are also discussed. The EGLR parser has been implemented. 1 Introduction Extra-grammatical phenomena in natural languages are very common, and much effort has been devoted to dealing with them (Carbonell and Hayes, 1983; DARPA 1991, 1992). Although (Generalized) LR parsers have many merits when applied to natural language, most progress with extra-grammatical phenomena has been made with rule-based systems, in contrast to the applications of LR parsers in programming languages. In this paper some techniques are developed to extend the ability of a (G)LR parser to deal with extra-grammatical phenomena, though similar techniques can also be applied in other parsers. The extended GLR (EGLR…
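    The listing truncates the paper here, but the kind of recovery principle it mentions can be shown in miniature. The following is a hedged sketch, not the EGLR parser itself: a plain CYK recognizer over a toy grammar in Chomsky normal form, with one simple recovery principle, assigning unknown words every open-class category so that a sentence containing an out-of-vocabulary word still parses. The grammar, lexicon, and category names are illustrative assumptions.

```python
# Minimal sketch (not the paper's EGLR parser): a CYK recognizer with one
# recovery principle, handling unknown words by giving them every open-class
# preterminal. Grammar and categories are illustrative assumptions.
from collections import defaultdict

LEXICON = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "saw": {"V"}, "chased": {"V"},
}
OPEN_CLASS = {"N", "V"}          # categories an unknown word may receive

# Binary rules in Chomsky normal form: (LHS, (B, C))
RULES = {
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("S", ("NP", "VP")),
}

def preterminals(word):
    """Lexical lookup with unknown-word recovery."""
    return LEXICON.get(word, OPEN_CLASS)

def cyk_recognize(words, start="S"):
    n = len(words)
    chart = defaultdict(set)                 # (i, j) -> categories spanning words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= preterminals(w)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, (b, c) in RULES:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        chart[(i, j)].add(lhs)
    return start in chart[(0, n)]

# "wug" is out of vocabulary; open-class recovery still yields a parse.
print(cyk_recognize("the dog chased the wug".split()))  # True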

    Exploring features for identifying edited regions in disfluent sentences

    This paper describes our effort on the task of edited region identification for parsing disfluent sentences in the Switchboard corpus. We focus on exploring feature spaces and selecting good features, starting with an analysis of the distributions of the edited regions and their components in the targeted corpus. We explore new feature spaces based on a part-of-speech (POS) hierarchy and a relaxed definition of rough copy in the experiments. These steps result in a 43.98% relative error reduction in F-score over an earlier best result in edited region detection when punctuation is included in both training and testing data [Charniak and Johnson 2001], and a 20.44% relative error reduction in F-score over the latest best result when punctuation is excluded from the training and testing data [Johnson and Charniak 2004].
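    As a rough illustration of the "rough copy" notion the abstract relies on (not the paper's feature set or model), the sketch below flags candidate edited regions as word sequences that are restarted shortly afterwards, optionally separated by a filler such as "uh". The filler list, window sizes, and example utterance are assumptions.

```python
# Hedged sketch: flag candidate edited regions by looking for "rough copies",
# i.e. a word sequence that is restarted shortly afterwards, optionally
# separated by a short run of filler words. Filler list and window sizes are
# illustrative assumptions, not the paper's definitions.
FILLERS = {"uh", "um", "well", "like"}

def rough_copy_candidates(tokens, max_len=4, max_gap=2):
    """Return (start, end) spans of tokens that look like reparanda."""
    spans = []
    n = len(tokens)
    for start in range(n):
        for length in range(min(max_len, n - start), 0, -1):
            source = tokens[start:start + length]
            # allow a short gap of filler words between the copy and its repair
            for gap in range(max_gap + 1):
                rs = start + length + gap
                if rs + length > n:
                    continue
                gap_words = tokens[start + length:rs]
                if all(w in FILLERS for w in gap_words) and tokens[rs:rs + length] == source:
                    spans.append((start, start + length))
                    break
            else:
                continue
            break   # keep only the longest match starting here
    return spans

tokens = "i want to uh i want to book a flight".split()
print(rough_copy_candidates(tokens))   # [(0, 3)] -> "i want to" is the edited region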

    Computing Confidence Scores for All Sub Parse Trees

    Computing confidence scores for applications such as dialogue systems, information retrieval, and information extraction is an active research area. However, its focus has been primarily on computing word-, concept-, or utterance-level confidences. Motivated by the need of sophisticated dialogue systems for more effective dialogues, we generalize confidence annotation to all the subtrees of a parse, the first effort in this line of research. The other contribution of this work is that we incorporate novel long-distance features to address challenges in computing multi-level confidence scores. Using a Conditional Maximum Entropy (CME) classifier with all the selected features, we reach an annotation error rate of 26.0% on the Switchboard (SWBD) corpus, compared with a subtree error rate of 41.91%, a closely related benchmark with the Charniak parser from (Kahn et al., 2005).
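    A minimal sketch of the general setup (not the paper's CME implementation, features, or data): scikit-learn's LogisticRegression stands in for a maximum-entropy classifier over per-subtree features, producing a confidence that each subtree is correct. All feature names and toy examples below are assumptions.

```python
# Hedged sketch: scoring parse subtrees with a maximum-entropy-style classifier.
# scikit-learn's LogisticRegression stands in for the CME classifier in the
# abstract; the features and toy data are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def subtree_features(subtree):
    """Features for one subtree; long-distance cues would be added similarly."""
    return {
        "label": subtree["label"],              # e.g. NP, VP
        "span_len": subtree["span_len"],        # number of words covered
        "num_children": subtree["num_children"],
        "acoustic_conf": subtree["word_conf"],  # mean word-level ASR confidence
    }

# Toy training data: subtrees labeled 1 if they match the gold parse.
train = [
    ({"label": "NP", "span_len": 2, "num_children": 2, "word_conf": 0.95}, 1),
    ({"label": "VP", "span_len": 5, "num_children": 2, "word_conf": 0.60}, 0),
    ({"label": "NP", "span_len": 1, "num_children": 1, "word_conf": 0.90}, 1),
    ({"label": "S",  "span_len": 9, "num_children": 3, "word_conf": 0.55}, 0),
]

vec = DictVectorizer()
X = vec.fit_transform([subtree_features(t) for t, _ in train])
y = [label for _, label in train]
model = LogisticRegression().fit(X, y)

new = {"label": "NP", "span_len": 3, "num_children": 2, "word_conf": 0.8}
score = model.predict_proba(vec.transform([subtree_features(new)]))[0, 1]
print(f"confidence that the subtree is correct: {score:.2f}")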

    Partitioning Grammars and Composing Parsers

    GLR parsers have been criticized by various authors for their potentially large sizes. Other parsers also have individual weaknesses and strengths. Our heterogeneous parsing algorithm, based on GLR parsers, handles the size and other problems by partitioning the grammar at compile time and assembling the partitioned sub-grammars during parsing. We discuss different parsers for different grammars and present some intuitive considerations for the partitioning. 1 Introduction Various authors [9, 18, 19, 11] have criticized the size of (G)LR parsers as being too big for programming languages and natural languages. In addition, there exist context-free grammars G whose collection of LR(0) items is exponentially larger than |G|. Other parsers also have their own weaknesses and strengths. A number of attempts have been made to counter these problems [9, 12], and in this paper we present a new approach. The main idea is to look for certain partitions of a grammar. Instead of having compi…
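    The partitioning idea itself can be sketched briefly. The following is only an illustration under assumptions, not the paper's partitioning criteria or its parser-composition step: pick entry nonterminals and collect the rules reachable from each one into a sub-grammar, each of which could then be compiled into its own parsing table.

```python
# Hedged sketch of the partitioning idea only (not the paper's algorithm or
# the composition step): split a CFG into sub-grammars by choosing entry
# nonterminals and collecting the rules reachable from each one. The toy
# grammar and the choice of entry symbols are illustrative assumptions.
from collections import defaultdict

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["NP", "PP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "PP": [["P", "NP"]],
}

def reachable_rules(grammar, entry):
    """Sub-grammar containing every rule reachable from `entry`."""
    sub, todo, seen = defaultdict(list), [entry], set()
    while todo:
        nt = todo.pop()
        if nt in seen or nt not in grammar:
            continue
        seen.add(nt)
        for rhs in grammar[nt]:
            sub[nt].append(rhs)
            todo.extend(sym for sym in rhs if sym in grammar)
    return dict(sub)

# Partition: one sub-grammar per entry symbol; each could be compiled into
# its own (G)LR table and the resulting parsers composed at parse time.
for entry in ("NP", "VP"):
    print(entry, "->", reachable_rules(GRAMMAR, entry))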

    Adaptive Language Learning

    This paper first gives a brief survey of current efforts in language learning. Next, it presents a description of our learning system, an Adaptive Language Learner (AL Learner), for Context-Free Grammars (CFGs). This language learner is based on an adaptation process: the EGLR parser serves as the assimilation process and the AL Learner serves as the accommodation process. Then, an analysis of some complexity problems related to our learning method is given. Finally, we conclude the paper with a comparison of our method with others. 1 Introduction Language learning is a fascinating topic, and much recent effort has been devoted to corpus-based techniques. In section 2, we first give a brief survey of current efforts in language learning. Then, we present our learning algorithm, the Adaptive Language Learner (AL Learner), for Context-Free Grammars (CFGs). An analysis of some complexity issues related to our learning method is given in section 4. Finally, we conclude the paper with a comparison of…
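    A toy sketch of the assimilation/accommodation loop follows, under strong simplifying assumptions: the "grammar" is just a set of category-sequence patterns and the lexicon is fixed. It is not the AL Learner, but it shows how a sentence either fits the current grammar or triggers a grammar extension.

```python
# Hedged sketch of the assimilate/accommodate loop described in the abstract.
# The "grammar" here is a set of category-sequence patterns rather than a full
# CFG, and the lexicon is a fixed toy assumption; this is not the AL Learner.
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "runs": "V", "sees": "V", "quickly": "Adv"}

class AdaptiveLearner:
    def __init__(self):
        self.patterns = set()                 # learned "rules": tuples of categories

    def categorize(self, sentence):
        return tuple(LEXICON[w] for w in sentence.split())

    def learn(self, sentence):
        cats = self.categorize(sentence)
        if cats in self.patterns:             # assimilation: existing grammar covers it
            return "assimilated"
        self.patterns.add(cats)               # accommodation: extend the grammar
        return "accommodated"

learner = AdaptiveLearner()
for s in ["the dog runs", "a cat runs", "the dog sees a cat"]:
    print(s, "->", learner.learn(s))
# "a cat runs" is assimilated because "the dog runs" already added (Det, N, V).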

    A Maximum Entropy Framework that Integrates Word Dependencies and Grammatical Relations for Reading Comprehension

    Automatic reading comprehension (RC) systems can analyze a given passage and generate/extract answers in response to questions about the passage. The RC passages are often constrained in length, and the target answer sentence usually occurs only a few times. In order to generate/extract a specific precise answer, this paper proposes the integration of two types of “deep” linguistic features, namely word dependencies and grammatical relations, in a maximum entropy (ME) framework to handle the RC task. The proposed approach achieves 44.7% and 73.2% HumSent accuracy on the Remedia and ChungHwa corpora, respectively. This result is competitive with other results reported thus far.
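    As a hedged illustration of the scoring idea only, the sketch below combines a lexical-overlap feature with a dependency/grammatical-relation overlap feature in a log-linear (maximum entropy) ranker over candidate answer sentences. The dependency triples are assumed to come from some external parser, and the feature set, weights, and data are made up, not the paper's model.

```python
# Hedged sketch: a log-linear (maximum entropy) scorer that combines a word
# overlap feature with a dependency-triple overlap feature to rank candidate
# answer sentences. Features, weights, and data are illustrative assumptions.
import math

def features(question, candidate):
    return {
        "word_overlap": len(question["tokens"] & candidate["tokens"]),
        "dep_overlap": len(question["deps"] & candidate["deps"]),
    }

WEIGHTS = {"word_overlap": 0.5, "dep_overlap": 1.5}   # stand-ins for learned ME weights

def rank(question, candidates):
    """Softmax over log-linear scores, one per candidate answer sentence."""
    raw = [sum(WEIGHTS[k] * v for k, v in features(question, c).items())
           for c in candidates]
    z = sum(math.exp(r) for r in raw)
    return [math.exp(r) / z for r in raw]

question = {"tokens": {"who", "found", "the", "treasure"},
            "deps": {("found", "nsubj", "who"), ("found", "dobj", "treasure")}}
candidates = [
    {"tokens": {"anna", "found", "the", "treasure"},
     "deps": {("found", "nsubj", "anna"), ("found", "dobj", "treasure")}},
    {"tokens": {"the", "treasure", "map", "was", "old"},
     "deps": {("was", "nsubj", "map")}},
]
print(rank(question, candidates))   # the first sentence gets the higher probability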

    Hub4 language modeling using domain interpolation and data clustering

    In SRI’s language modeling experiments for the Hub4 domain, three basic approaches were pursued: interpolating multiple models estimated from Hub4 and non-Hub4 training data, adapting the language model (LM) to the focus conditions, and adapting the LM to different topic types. In the first approach, we built separate LMs for the closely transcribed Hub4 material (acoustic training transcripts) and the loosely transcribed Hub4 material (LM training data), as well as the North American Business News (NABN) and Switchboard training data, projected onto the Hub4 vocabulary. By interpolating the probabilities obtained from these models, we obtained a 20% reduction in perplexity and a 1.8% reduction in word error rate, compared to a baseline Hub4-only language model. Two adaptation approaches are also described: adapting language models to the speech styles correlated with different focus conditions, and building cluster-specific LM mixtures. These two approaches give some reduction in perplexity, but no significant reduction in word error rate. Finally, we identify the problems and future directions of our work.
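    The first approach, linear interpolation, is simple to illustrate: p(w | h) = Σᵢ λᵢ pᵢ(w | h). The sketch below is an assumption-laden miniature, not SRI's Hub4 system: two toy unigram models stand in for the component LMs, the mixture weights are fixed by hand rather than tuned on held-out data (e.g. with EM), and perplexity is computed on a tiny test string.

```python
# Hedged sketch of linear LM interpolation: p(w|h) = sum_i lambda_i * p_i(w|h).
# The component models are trivial unigram stand-ins and the mixture weights
# are made up; a real system would tune the lambdas on held-out data.
import math
from collections import Counter

def unigram_lm(text, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

VOCAB = {"markets", "fell", "sharply", "uh", "i", "think", "so"}
models = [
    unigram_lm("markets fell sharply markets fell", VOCAB),   # ~NABN-style text
    unigram_lm("uh i think so i think", VOCAB),                # ~Switchboard-style text
]
lambdas = [0.7, 0.3]                       # mixture weights (assumed, not tuned)

def interp_prob(word):
    return sum(l * m[word] for l, m in zip(lambdas, models))

def perplexity(words):
    logp = sum(math.log2(interp_prob(w)) for w in words)
    return 2 ** (-logp / len(words))

test = "markets fell sharply".split()
print(perplexity(test))                    # lower is better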