6 research outputs found

    Extraction of Simple Sentences from Mixed Sentences for Building Korean Case Frames

    Regularity and Variation in Japanese Recipes: A Comparative Analysis of Cookbook, Online, and User-generated Sub-registers

    This paper investigates the similarities and differences between three sub-registers of Japanese recipe texts: cookbook recipes, online commercial recipes written/edited by professionals, and online user-generated recipes. Past studies on Japanese recipes do not distinguish different sub-registers, and they tend to focus on a single feature. The present study of the sub-registers examines a group of frequently appearing linguistic features and uncovers functional links between observed features and situational characteristics. The comparative perspective contributes to a more comprehensive understanding of the Japanese recipe language as well as universal and language-specific aspects of register variation. Shared traits among the three sub-registers are tied to the common topic of cooking and the central purpose of providing easy-to-follow food preparation instructions. Varied linguistic and textual features are motivated by different production circumstances, mediums, and relations among the participants. Professionally edited cookbook and online commercial recipes show a much higher uniformity in their grammatical features than unedited/self-edited user-generated recipes. Online sub-registers share a role of serving as a repository and reference center for numerous recipes and related information. Relationships among writers, readers, and other participants such as publishers and site organizers differ among all three sub-registers, resulting in some unique linguistic patterns.

    JACY - a grammar for annotating syntax, semantics and pragmatics of written and spoken Japanese for NLP application purposes

    In this text, we describe the development of a broad-coverage grammar for Japanese that has been built for and used in different application contexts. The grammar is based on work done in the Verbmobil project (Siegel 2000) on machine translation of spoken dialogues in the domain of travel planning. The second application for JACY was the automatic email response task. Grammar development was described in Oepen et al. (2002a). Third, it was applied to the task of understanding material on mobile phones available on the internet, while embedded in the project DeepThought (Callmeier et al. 2004, Uszkoreit et al. 2004). Currently, it is being used for treebanking and ontology extraction from dictionary definition sentences by the Japanese company NTT (Bond et al. 2004).

    Statistical models for case ambiguity resolution in Korean

    Japanese mimetics as prenominal modifiers: The distribution of accented and accentless mimetics

    This thesis investigates the grammatical properties and functions of Japanese mimetics when they are used as prenominal modifiers. I focus on the cases where mimetics modify nouns with physical referents. I argue that mimetic-na (M-na) should be considered neither ungrammatical nor less acceptable than other modifiers, contrary to suggestions in the previous literature. Looking at different grammatical markers combined with a mimetic, I demonstrate that M-na gives rise to a situation-descriptive reading, that mimetic-sita (M-sita) denotes a characterizing property and that mimetic-no (M-no) denotes a defining property, in Roy’s (2013) terms. The thesis includes examples in French, Russian and Spanish to illustrate these three different interpretations. As for the syntactic structures of mimetic modifiers, I demonstrate that M-na is a tensed clausal modifier, while M-sita is a tenseless attributive modifier, following Hamano (1986, 1988, 1998). More specifically, I claim that M-sita is an AP. I provide evidence showing that M-na is tensed (allowing a temporally anchored interpretation), whereas M-sita disallows tensed interpretations. There is currently no consensus about the grammatical status of M-no. Based on the distributions of mimetic and non-mimetic words presented in this thesis, I suggest that M-no can be marked by either the genitive or the copula. Each of the modifiers enters into a stacking structure when they occur together. I show that semantics associate with structural positions, and argue that mimetic modifiers appear in the order of M-na, M-sita, M-no in a hierarchical structure. This thesis sheds light on the various grammatical properties of mimetics in relation to their prosody. In broad agreement with previous research, I claim that accentless mimetics, as in M-na and M-no, denote an abstract quality, while I argue that M-sita (which involves an accented mimetic) denotes a physical concrete property. I consider the bare accented mimetics to be somewhat verb-like.

    Lexical database enrichment through semi-automated morphological analysis

    Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required. WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically. Both existing databases and existing algorithms can capture regular morphological relations, but cannot capture exceptions correctly; neither of them provides any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern associations between and attachment of suffixes and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, minimised through a variable lexical validity requirement, and overgeneration, minimised by rule reformulation and restricting monosyllabic output. Rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Multiple rules applicable to an input suffix need their precedence established. The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses. Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. Failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions.
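
The abstract above describes rules stated as character substitutions, with a lexical validity requirement, a stoplist for exceptions, and rule precedence. It does not give the thesis's actual rules or data. The following is only a minimal sketch of that general idea: the names `SuffixRule`, `RULES`, `LEXICON`, `STOPLIST` and `derive_links`, and every rule and word in them, are hypothetical illustrations and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    source: str    # suffix found on the derived word
    target: str    # suffix substituted to obtain the candidate root
    relation: str  # label for the derivational link (illustrative only)


# Hypothetical rules, listed in order of precedence (longest suffix first).
RULES = [
    SuffixRule("ization", "ize", "nominalization"),
    SuffixRule("ation", "ate", "nominalization"),
    SuffixRule("ness", "", "quality-of"),
]

# Hypothetical toy lexicon standing in for a real lexical database.
LEXICON = {
    "organize", "organization", "create", "creation",
    "kind", "kindness", "state", "station", "witness",
}

# Hypothetical stoplist of (derived, root) pairs that the rules would
# otherwise link wrongly.
STOPLIST = {("station", "state")}


def derive_links(word, rules=RULES, lexicon=LEXICON, stoplist=STOPLIST):
    """Return directed (derived, root, relation) links for one word."""
    links = []
    for rule in rules:
        if not word.endswith(rule.source):
            continue
        # Character substitution: swap one suffix for another instead of
        # merely cutting the word at a segment boundary.
        candidate = word[: len(word) - len(rule.source)] + rule.target
        if candidate not in lexicon:
            continue  # lexical validity: the root must be an attested word
        if (word, candidate) in stoplist:
            continue  # known exception, recorded after an earlier error
        links.append((word, candidate, rule.relation))
        break  # the first valid rule in precedence order wins
    return links


if __name__ == "__main__":
    for w in ["organization", "creation", "kindness", "station", "witness"]:
        print(w, derive_links(w))
```

In this toy setup, "station" matches the `ation` rule and "state" is a valid word, so only the stoplist blocks the spurious link; "witness" is rejected earlier because "wit" is absent from the toy lexicon. A real system would need a far larger lexicon and rule set than this sketch assumes.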