257 research outputs found

    Sanskrit Sandhi Splitting using seq2(seq)^2

    In Sanskrit, small words (morphemes) are combined to form compound words through a process known as Sandhi. Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Although rules governing word splitting exist in the language, it is highly challenging to identify the location of the splits in a compound word. Existing Sandhi splitting systems incorporate these pre-defined splitting rules, but they have low accuracy because the same compound word can be broken down in multiple ways that each yield syntactically correct splits. In this research, we propose a novel deep learning architecture called Double Decoder RNN (DD-RNN), which (i) predicts the location of the split(s) with 95% accuracy, and (ii) predicts the constituent words (learning the Sandhi splitting rules) with 79.5% accuracy, outperforming the state of the art by 20%. Additionally, we demonstrate the generalization capability of our deep learning model by showing competitive results on the problem of Chinese word segmentation. Comment: Accepted in EMNLP 201

    A case study in decompounding for Bengali information retrieval

    Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. No previous studies, however, exist on the effect of decomposing compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. Some unique characteristics of Bengali compounding are: i) only one constituent may be a valid word, in contrast to the stricter requirement of both being so; and ii) the first character of the right constituent can be modified by the rules of sandhi, in contrast to simple concatenation. While the standard approach to decompounding, which maximizes the total frequency of the constituents formed by candidate split positions, has proven beneficial for European languages, the experiments reported in this paper show that it does not work particularly well for Bengali IR. As a solution, we first propose a more relaxed decompounding in which a compound word can be decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by employing a co-occurrence threshold to ensure that the constituent often co-occurs with the compound word, which in this case indicates how related the constituents are to the compound. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition, improving MAP by up to 2.72% and recall by up to 1.8%.
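The contrast between the standard and the relaxed decompounding described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `min_len` and `min_cooc` parameters, the dictionary-based frequency and co-occurrence tables, and the English toy words are all assumptions made for the example, and the sandhi modification of the right constituent is not modelled.

```python
from typing import Dict, List, Optional, Tuple


def standard_split(word: str, freq: Dict[str, int],
                   min_len: int = 2) -> Optional[Tuple[str, str]]:
    """Frequency-based split: both constituents must be known words.

    Returns the split whose constituents have the highest total
    frequency, or None if no candidate position yields two known words.
    """
    best, best_score = None, 0
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = freq[left] + freq[right]
            if score > best_score:
                best, best_score = (left, right), score
    return best


def relaxed_split(word: str, freq: Dict[str, int],
                  cooc: Dict[Tuple[str, str], int],
                  min_cooc: int = 1, min_len: int = 2) -> List[str]:
    """Relaxed, selective split (hypothetical formulation).

    A constituent is kept if it is a known word AND co-occurs with the
    compound at least `min_cooc` times; the other half may be invalid.
    """
    best: List[str] = []
    best_score = 0
    for i in range(min_len, len(word) - min_len + 1):
        parts = [p for p in (word[:i], word[i:])
                 if p in freq and cooc.get((word, p), 0) >= min_cooc]
        score = sum(freq[p] for p in parts)
        if score > best_score:
            best, best_score = parts, score
    return best


# Toy data (assumed, for illustration only).
FREQ = {"sun": 50, "flower": 30, "book": 40}
COOC = {("bookish", "book"): 3}
```

With this toy data, `standard_split("bookish", FREQ)` finds no split because "ish" is not a known word, while `relaxed_split("bookish", FREQ, COOC, min_cooc=2)` still recovers the single constituent "book", since it co-occurs with the compound often enough.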

    Asymmetries of consonant sequences in perception and production: affricates vs. /s/ clusters

    This paper investigates the behavior of Greek affricates as opposed to other clusters consisting of /s/ + obstruent and obstruent + /s/ sequences. An experimental task testing the perception of /s/ clusters demonstrated a fixed order of preference for preservation: affricates over obstruent + /s/ clusters over /s/ + obstruent clusters. Subjects showed a strong tendency to break up consonantal sequences, while they retained affricates intact. This linguistic behavior is attributed to two factors: a) identity of place of articulation of the members of the examined consonantal sequences, and b) satisfaction of the scale of consonantal strength in these sequences.

    Monitoring English Sandhi Linking – A Study of Polish Listeners’ L2 Perception

    This paper presents a set of word monitoring experiments with Polish learners of English. Listeners heard short recordings of native English speech and were instructed to respond when they recognized an English target word that had been presented on a computer screen. Owing to phonological considerations, we compared reaction times to two types of vowel-initial words: those produced with glottalization, and those joined to the preceding word via sandhi linking processes. Results showed that the effects of glottalization as a boundary cue were less robust than expected. Implications of these findings for models of L2 speech are discussed. It is suggested that the prevalence of glottalization in L1 production makes listeners less sensitive to its effects as a boundary cue in L2.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis managemen