    Lexicalization and Grammar Development

    In this paper we present a fully lexicalized grammar formalism as a particularly attractive framework for the specification of natural language grammars. We discuss in detail Feature-based, Lexicalized Tree Adjoining Grammars (FB-LTAGs), a representative of the class of lexicalized grammars. We illustrate the advantages of lexicalized grammars in various contexts of natural language processing, ranging from wide-coverage grammar development to parsing and machine translation. We also present a method for compact and efficient representation of lexicalized trees.
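
    The core idea of a lexicalized grammar, that every elementary structure is anchored on a lexical item and structures are combined by a small set of operations, can be sketched in a few lines. The following Python toy (all node labels and example trees are illustrative assumptions; feature structures and adjunction are omitted for brevity) shows elementary trees combined by substitution in the TAG style.

        from dataclasses import dataclass, field

        @dataclass
        class Node:
            label: str
            children: list = field(default_factory=list)
            subst: bool = False  # open substitution site, e.g. NP↓

            def substitute(self, tree):
                """Fill the first open substitution site matching tree.label."""
                for i, child in enumerate(self.children):
                    if child.subst and child.label == tree.label:
                        self.children[i] = tree
                        return True
                    if child.substitute(tree):
                        return True
                return False

            def frontier(self):
                """Left-to-right yield of the tree."""
                if not self.children:
                    return self.label
                return " ".join(child.frontier() for child in self.children)

        # Elementary tree anchored on the verb "sleeps", with an open NP slot,
        # and an elementary tree anchored on the noun "Harry" (toy examples).
        sleeps = Node("S", [Node("NP", subst=True),
                            Node("VP", [Node("sleeps")])])
        harry = Node("NP", [Node("Harry")])

        sleeps.substitute(harry)
        print(sleeps.frontier())  # -> Harry sleeps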

    All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch

    Readability research has a long and rich tradition, but there has been too little focus on general readability prediction that does not target a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding ever more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics, ranging from easy-to-compute superficial ones to features requiring deep linguistic processing, resulting in ten feature groups. Both a regression and a classification setup are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations and optimizing the feature set with a wrapper-based genetic algorithm is a promising approach which provides considerable insight into which feature combinations contribute to overall readability prediction. Since gold-standard information is also available for the features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.
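
    As a concrete illustration of the wrapper-based genetic algorithm setup in the regression scenario, the sketch below evolves a binary mask over ten feature groups and scores each candidate by cross-validated Ridge regression. The data, group sizes, model choice, and GA parameters are synthetic assumptions for illustration, not the article's configuration.

        import random
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        N_GROUPS = 10                    # ten feature groups, as in the article
        GROUP_SIZES = [5] * N_GROUPS     # assumed sizes; the real groups differ
        X = rng.normal(size=(200, sum(GROUP_SIZES)))        # synthetic features
        y = X[:, :5].sum(axis=1) + rng.normal(0, 0.5, 200)  # synthetic scores

        # Column slice for each feature group.
        SLICES, start = [], 0
        for size in GROUP_SIZES:
            SLICES.append(slice(start, start + size))
            start += size

        def fitness(mask):
            """Cross-validated R^2 of a Ridge model on the selected groups."""
            if not any(mask):
                return float("-inf")
            cols = np.hstack([X[:, s] for s, keep in zip(SLICES, mask) if keep])
            return cross_val_score(Ridge(), cols, y, cv=5, scoring="r2").mean()

        def evolve(pop_size=20, generations=15, p_mut=0.1):
            """Simple GA: truncation selection, one-point crossover, bit-flip mutation."""
            pop = [[random.random() < 0.5 for _ in range(N_GROUPS)]
                   for _ in range(pop_size)]
            for _ in range(generations):
                parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
                children = []
                while len(parents) + len(children) < pop_size:
                    a, b = random.sample(parents, 2)
                    cut = random.randrange(1, N_GROUPS)
                    child = a[:cut] + b[cut:]
                    children.append([not g if random.random() < p_mut else g
                                     for g in child])
                pop = parents + children
            return max(pop, key=fitness)

        best = evolve()
        print("selected feature groups:", [i for i, keep in enumerate(best) if keep])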

    Measuring Syntactic Complexity in Spoken and Written Learner Language: Comparing the Incomparable?

    Spoken and written language are two modes of language use. When learners aim at higher skill levels, the expected outcome of successful second language learning is usually to become a fluent speaker and writer who can produce accurate and complex language in the target language. There is an axiomatic difference between speech and writing, but together they form the essential parts of learners’ L2 skills. The two modes have their own characteristics, and there are differences between native and nonnative language use. For instance, hesitations and pauses are not visible in the end result of the writing process, but they are characteristic of nonnative spoken language use. The present study analyzes the L2 English spoken and written productions of 18 L1 Finnish learners, with a focus on syntactic complexity. As earlier spoken language segmentation units mostly come from fluency studies, we conducted an experiment with a new unit, the U-unit, and examined how using it as the basis of spoken language segmentation affects the results. According to the analysis, written language was more complex than spoken language. However, the difference in complexity was greatest when the traditional units, T-units and AS-units, were used to segment the data. Using the U-unit revealed that spoken language may, in fact, be closer to written language in its syntactic complexity than earlier studies had suggested. Therefore, further research is needed to discover whether the differences between spoken and written learner language are primarily due to the nature of these modes or, rather, to the units and measures used in the analysis.
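
    Once the data are segmented, the complexity comparison itself reduces to simple per-unit statistics. A minimal sketch, assuming the segmentation (whether into T-units, AS-units, or U-units) has already been done and that clause counts are given, illustrates the kind of measures involved; the example units are invented.

        from statistics import mean

        # Pre-segmented learner production: (unit_tokens, clause_count) pairs.
        # The segmentation rules for T-units/AS-units/U-units are not reproduced here.
        production = [
            ("I think that it is difficult".split(), 2),
            ("because I never tried it before".split(), 1),
            ("my friend says it is easy".split(), 2),
        ]

        mean_length_of_unit = mean(len(tokens) for tokens, _ in production)
        clauses_per_unit = mean(n for _, n in production)

        print(f"mean length of unit: {mean_length_of_unit:.2f} words")
        print(f"clauses per unit:    {clauses_per_unit:.2f}")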

    Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies

    An automatic word classification system has been designed that processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a binary top-down form of word clustering which employs an average class mutual information metric. The resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags: unique n-bit numbers whose most significant bit-patterns incorporate class information. Access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of English, from the phonemic to the semantic level. The system has been compared, both directly and indirectly, with other recent word classification systems. Class-based interpolated language models have been constructed to exploit the extra information supplied by the classifications, and experiments have shown that the new models improve performance.
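
    The average class mutual information metric that drives the clustering can be computed directly from bigram counts for any candidate word-to-class assignment; the top-down clustering then searches for assignments that score well. The sketch below uses a toy corpus and a hand-picked two-class split (the full hierarchical algorithm is not reproduced) to show the metric itself.

        import math
        from collections import Counter

        corpus = "the cat sat on the mat the dog sat on the rug".split()
        classes = {w: 0 if w in {"the", "on"} else 1 for w in set(corpus)}

        def class_mutual_information(tokens, cls):
            """Average mutual information between classes of adjacent words."""
            bigrams = Counter(zip(tokens, tokens[1:]))
            n = sum(bigrams.values())
            joint, left, right = Counter(), Counter(), Counter()
            for (w1, w2), count in bigrams.items():
                c1, c2 = cls[w1], cls[w2]
                joint[(c1, c2)] += count
                left[c1] += count
                right[c2] += count
            mi = 0.0
            for (c1, c2), count in joint.items():
                # p(c1,c2) * log2( p(c1,c2) / (p(c1) * p(c2)) )
                mi += (count / n) * math.log2(count * n / (left[c1] * right[c2]))
            return mi

        print(f"average class MI: {class_mutual_information(corpus, classes):.4f} bits")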