11 research outputs found

    Elastic-substitution decoding for hierarchical SMT: efficiency, richer search and double labels

    Elastic-substitution decoding (ESD), first introduced by Chiang (2010), can be important for obtaining good results when applying labels to enrich hierarchical statistical machine translation (SMT). However, an efficient implementation is essential for scalable application. We describe how to achieve this, contributing essential details that were missing in the original exposition. We compare ESD to strict matching and show its superiority for both reordering and syntactic labels. To overcome the sub-optimal performance due to the late evaluation of features marking label substitution types, we increase the diversity of the rules explored during cube pruning initialization with respect to their labels. This approach gives significant improvements over basic ESD and performs favorably compared to extending the search by increasing the cube pruning pop-limit. Finally, we look at combining multiple labels. The combination of reordering labels and target-side boundary-tags yields a significant improvement in terms of the word-order-sensitive metrics Kendall reordering score and METEOR. This confirms our intuition that the combination of reordering labels and syntactic labels can yield improvements over either label by itself, despite increased sparsity.

    Using Categorial Grammar to Label Translation Rules

    Adding syntactic labels to synchronous context-free translation rules can improve performance, but labeling with phrase structure constituents, as in GHKM (Galley et al., 2004), excludes potentially useful translation rules. SAMT (Zollmann and Venugopal, 2006) introduces heuristics to create new non-constituent labels, but these heuristics introduce many complex labels and tend to add rarely applicable rules to the translation grammar. We introduce a labeling scheme based on categorial grammar, which allows syntactic labeling of many rules with a minimal, well-motivated label set. We show that our labeling scheme performs comparably to SAMT on an Urdu–English translation task, yet the label set is an order of magnitude smaller, and translation is twice as fast.

    Integrating source-language context into log-linear models of statistical machine translation

    The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrase-based SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT) can positively influence the weighting and selection of target phrases, and thus improve translation quality. In this thesis we present novel approaches to incorporate source-language contextual modelling into state-of-the-art SMT models in order to enhance the quality of lexical selection. We investigate the effectiveness of a range of contextual features, including lexical features of neighbouring words, part-of-speech tags, supertags, sentence-similarity features, dependency information, and semantic roles. We explore a series of language pairs featuring typologically different languages, and examine the scalability of our research to larger amounts of training data. While our results are mixed across feature selections, language pairs, and learning curves, we observe that including contextual features of the source sentence in general produces improvements. The most significant improvements involve the integration of long-distance contextual features, such as dependency relations in combination with part-of-speech tags in Dutch-to-English subtitle translation, the combination of dependency parse and semantic role information in English-to-Dutch parliamentary debate translation, supertag features in English-to-Chinese translation, and the combination of supertag and lexical features in English-to-Dutch subtitle translation. Furthermore, we investigate the applicability of our lexical contextual model to another closely related NLP problem, namely machine transliteration.

    Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

    Syntax-based machine translation using dependency grammars and discriminative machine learning

    Machine translation has improved enormously since the groundbreaking introduction of statistical methods in the early 2000s, going from very domain-specific systems that still performed relatively poorly despite the painstaking crafting of thousands of ad-hoc rules, to general-purpose systems automatically trained on large collections of bilingual texts, which manage to deliver understandable translations that convey the general meaning of the original input. These approaches, however, still perform well below the level of human translators, typically failing to convey detailed meaning and register, and producing translations that, while readable, are often ungrammatical and unidiomatic. This quality gap, which is considerably larger than in most other natural language processing tasks, has been the focus of research in recent years, with the development of increasingly sophisticated models that attempt to exploit the syntactic structure of human languages, leveraging the technology of statistical parsers as well as advanced machine learning methods such as margin-based structured prediction algorithms and neural networks. The translation software itself has become more complex in order to accommodate the sophistication of these advanced models: the main translation engine (the decoder) is now often combined with a pre-processor that reorders the words of the source sentences into a target-language word order, or with a post-processor that ranks and selects a translation, according to a fine model, from a list of candidate translations generated by a coarse model. In this thesis we investigate the statistical machine translation problem from various angles, focusing on translation from non-analytic languages whose syntax is best described by fluid non-projective dependency grammars rather than the relatively strict phrase-structure grammars or projective dependency grammars that are most commonly used in the literature.
We propose a framework for modeling word reordering phenomena between language pairs as transitions on non-projective source dependency parse graphs. We quantitatively characterize reordering phenomena for the German-to-English language pair as captured by this framework, specifically investigating the incidence and effects of the non-projectivity of source syntax and the non-locality of word movement with respect to the graph structure. We evaluate several variants of hand-coded pre-ordering rules in order to assess the impact of these phenomena on translation quality. We propose a class of dependency-based source pre-ordering approaches that reorder sentences using flexible models trained with SVMs and several recurrent neural network architectures. We also propose a class of translation reranking models, both syntax-free and source dependency-based, which make use of a type of neural network known as graph echo state networks, which are highly flexible and require very few training resources, overcoming one of the main limitations of neural network models for natural language processing tasks.

    Learning the Language of Biological Sequences

    Learning the language of biological sequences is an appealing challenge for the grammatical inference research field. While some first successes have already been recorded, such as the inference of profile hidden Markov models or stochastic context-free grammars, which are now part of the classical bioinformatics toolbox, it is still a source of open and inspiring problems for grammatical inference, enabling us to confront our ideas with real, fundamental applications. As an introduction to this field, we survey here the main ideas and concepts behind the approaches developed in pattern/motif discovery and grammatical inference to successfully characterize biological sequences and their specificities.

    A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora

    Statistical and rule-based methods are complementary approaches to machine translation (MT) that have different strengths and weaknesses. Over the last few years, this complementarity has led to growing interest in hybrid systems that combine data-driven and linguistic approaches. In this paper, we address the situation in which the amount of bilingual resources available for a particular language pair is not sufficiently large to train a competitive statistical MT system, but the cost and slow development cycles of rule-based MT systems cannot be afforded either. In this context, we formalise a new method that uses scarce parallel corpora to automatically infer a set of shallow-transfer rules to be integrated into a rule-based MT system, thus avoiding the need for human experts to handcraft these rules. Our work is based on the alignment template approach to phrase-based statistical MT, but the definition of the alignment template is extended to encompass different generalisation levels. It is also greatly inspired by the work of Sánchez-Martínez and Forcada (2009), in which alignment templates were also considered for shallow-transfer rule inference. However, our approach overcomes many relevant limitations of that work, principally those related to the inability to find the correct generalisation level for the alignment templates, and to select the subset of alignment templates that ensures an adequate segmentation of the input sentences by the rules eventually obtained. Unlike previous approaches in the literature, our formalism does not require linguistic knowledge about the languages involved in the translation. Moreover, it is the first time that conflicts between rules are resolved by choosing the most appropriate ones according to a global minimisation function rather than proceeding in a pairwise greedy fashion.
Experiments conducted using five different language pairs with the free/open-source rule-based MT platform Apertium show that translation quality significantly improves when compared to the method proposed by Sánchez-Martínez and Forcada (2009), and is close to that obtained using handcrafted rules. For some language pairs, our approach is even able to outperform them. Moreover, the resulting number of rules is considerably smaller, which eases human revision and maintenance. Research funded by Universitat d’Alacant through project GRE11-20, by the Spanish Ministry of Economy and Competitiveness through projects TIN2009-14009-C02-01 and TIN2012-32615, by Generalitat Valenciana through grant ACIF/2010/174, and by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    CCG-augmented hierarchical phrase-based statistical machine translation

    Augmenting Statistical Machine Translation (SMT) systems with syntactic information aims at improving translation quality. Hierarchical Phrase-Based (HPB) SMT takes a step toward incorporating syntax in Phrase-Based (PB) SMT by modelling one aspect of language syntax, namely the hierarchical structure of phrases. Syntax Augmented Machine Translation (SAMT) further incorporates syntactic information extracted using context-free phrase structure grammar (CF-PSG) in the HPB SMT model. One of the main challenges facing CF-PSG-based augmentation approaches for SMT systems emerges from the difference in the definition of the constituent in CF-PSG and the ‘phrase’ in SMT systems, which hinders the ability of CF-PSG to express the syntactic function of many SMT phrases. Although the SAMT approach to solving this problem using ‘CCG-like’ operators to combine constituent labels improves syntactic constraint coverage, it significantly increases their sparsity, which restricts translation and negatively affects its quality. In this thesis, we address the problems of sparsity and limited coverage of syntactic constraints facing the CF-PSG-based syntax augmentation approaches for HPB SMT using Combinatory Categorial Grammar (CCG). We demonstrate that CCG’s flexible structures and rich syntactic descriptors help to extract richer, more expressive and less sparse syntactic constraints with better coverage than CF-PSG, which enables our CCG-augmented HPB system to outperform the SAMT system. We also try to soften the syntactic constraints imposed by CCG category nonterminal labels by extracting less fine-grained CCG-based labels. We demonstrate that CCG label simplification helps to significantly improve the performance of our CCG category HPB system. Finally, we identify the factors which limit the coverage of the syntactic constraints in our CCG-augmented HPB model.
We then try to tackle these factors by extending the definition of the nonterminal label to be composed of a sequence of CCG categories and augmenting the glue grammar with CCG combinatory rules. We demonstrate that our extension approaches help to significantly increase the scope of the syntactic constraints applied in our CCG-augmented HPB model and achieve significant improvements over the HPB SMT baseline.

    Congestion Control in Vehicular Ad Hoc Networks

    RÉSUMÉ Vehicular ad hoc networks (VANets) are designed to enable reliable wireless communication between high-speed mobile nodes. To improve application performance in this type of network and guarantee a safe and comfortable environment for its users, Quality of Service (QoS) must be supported in these networks. Delay and packet loss are two main QoS indicators that increase significantly because of network congestion. Indeed, network congestion saturates the channels and increases packet collisions in the channels. It must therefore be controlled to reduce packet loss and delay and to improve the performance of vehicular networks. Congestion control in VANets is a difficult task because of the specific characteristics of VANets, such as the high mobility of high-speed nodes and the high rate of topology change. Congestion control in VANets can be performed with a strategy that uses one of the following parameters: transmission rate, transmission power, prioritization and scheduling, as well as hybrid strategies. Congestion control strategies in VANets must contend with challenges such as unfair resource usage, communication overhead, high transmission delay, and inefficient bandwidth utilization. It is therefore necessary to develop new approaches to address these challenges and improve the performance of VANets. In this thesis, a closed-loop congestion control strategy is developed first.
This strategy is a dynamic, distributed congestion control method that detects congestion by measuring the channel usage level. Congestion is then controlled by adjusting the transmission range and rate, which have a considerable impact on channel saturation. Adjusting the transmission range and rate within VANets is an NP-hard problem because of the high complexity of determining appropriate values for these parameters. Considering the advantages of the Tabu search method and its adaptability to the problem, a multi-objective search method is used to find a transmission range and rate in reasonable time. Delay and jitter, the objective functions of the multi-objective Tabu algorithm, are minimized in the proposed algorithm. Next, two open-loop congestion control strategies are proposed to reduce congestion in the channels by prioritizing and scheduling messages. These strategies define the priority of each message by considering its content type (for example emergency, beacon, and service messages), the message size, and the network state (for example the velocity, direction, usefulness, distance, and validity metrics). Messages are scheduled on the basis of the defined priorities. In addition, as a second scheduling technique, a Tabu search method is employed to schedule the control and service channel queues in reasonable time. To this end, the delay and jitter of message delivery are minimized. Finally, a localized, centralized strategy is proposed that uses RSU sets fixed at intersections to detect and control congestion.
This strategy clusters all messages transferred between vehicles stopped at a traffic light using machine learning algorithms. In this strategy, a k-means algorithm is used to cluster messages according to their features (for example message size, message validity, and message type). The communication parameters, including the transmission range and rate, the contention window size, and the AIFS (Arbitration Inter-Frame Spacing) parameter, are determined for each message cluster so as to minimize delivery delay. The determined communication parameters are then sent to the vehicles by the RSUs, and the vehicles operate according to these parameters when transferring messages. The performance of the three proposed strategies was evaluated by simulating highway and urban traffic scenarios with the NS2 and SUMO simulators. Comparisons were also made between the results obtained from the proposed strategies and commonly used congestion control strategies. The results show that, with the proposed congestion control strategies, network throughput increases and the packet loss rate and delay decrease significantly compared with the other strategies. Consequently, applying the proposed methods helps improve the performance, safety, and reliability of VANets.
    ABSTRACT Vehicular Ad hoc Networks (VANets) are designed to provide reliable wireless communications between high-speed mobile nodes. In order to improve the performance of VANets’ applications and provide a safe and comfortable environment for VANets’ users, Quality of Service (QoS) should be supported in these networks.
Delay and packet loss are two main indicators of QoS that increase dramatically when congestion occurs in the network. Indeed, congestion saturates the channels and increases packet collisions in them. Therefore, congestion should be controlled to decrease packet loss and delay and to increase the performance of VANets. Congestion control in VANets is a challenging task due to the specific characteristics of VANets, such as the high mobility of high-speed nodes and the high rate of topology changes. Congestion control in VANets can be carried out using strategies that can be classified into rate-based, power-based, CSMA/CA-based, prioritizing and scheduling-based, and hybrid strategies. Congestion control strategies in VANets face challenges such as unfair resource usage, communication overhead, high transmission delay, and inefficient bandwidth utilization. Therefore, new strategies are required to cope with these challenges and improve the performance of VANets. In this dissertation, first, a closed-loop congestion control strategy is developed. This strategy is a dynamic and distributed congestion control strategy that detects congestion by measuring the channel usage level. Congestion is then controlled by tuning the transmission range and rate, which have a considerable impact on channel saturation. Tuning the transmission range and rate in VANets is an NP-hard problem due to the high complexity of determining the proper values for these parameters in vehicular networks. Considering the benefits of the Tabu search algorithm and its adaptability to the problem, a multi-objective Tabu search algorithm is used for tuning the transmission range and rate in reasonable time. In the proposed algorithm, delay and jitter are minimized as the objective functions of the multi-objective Tabu search algorithm.
Second, two open-loop congestion control strategies are proposed that prevent congestion from occurring in the channels by prioritizing and scheduling the messages. These strategies define the priority of each message by considering the content of the message (i.e. the message type, for example emergency, beacon, and service messages), the size of the message, and the state of the network (e.g. velocity, direction, usefulness, distance, and validity metrics). The messages are scheduled based on the defined priorities. In addition, as the second scheduling technique, a Tabu search algorithm is employed to schedule the control and service channel queues in reasonable time. For this purpose, the delay and jitter of message delivery are minimized. Finally, a localized and centralized strategy is proposed that uses RSUs set at intersections to detect and control congestion. This strategy clusters all the messages transferred between vehicles stopped at a red traffic light using machine learning algorithms. In this strategy, a K-means learning algorithm is used to cluster the messages based on their features (e.g. size, validity, and type of the messages). The communication parameters, including the transmission range and rate, contention window size, and Arbitration Inter-Frame Spacing (AIFS), are determined for each message cluster so as to minimize the delivery delay. Then, the determined communication parameters are sent to the vehicles by the RSUs, and the vehicles operate based on these parameters when transferring messages. The performance of the three proposed strategies was evaluated by simulating highway and urban scenarios in the NS2 and SUMO simulators. Comparisons were also made between the results obtained from the proposed strategies and commonly used congestion control strategies.
The results reveal that, using the proposed congestion control strategies, the throughput, packet loss ratio, and delay are significantly improved compared to the other strategies. Therefore, applying the proposed strategies helps improve the performance, safety, and reliability of VANets.
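
The priority-based message scheduling mentioned in the abstract above can be illustrated with a small sketch. This is not the dissertation's algorithm: the ranking of message types, the use of message size as a secondary key, and all names (`TYPE_RANK`, `PriorityScheduler`, `drain`) are assumptions made for illustration, and the network-state metrics the abstract lists (velocity, direction, usefulness, distance, validity) are left out.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical ranking of message types: a lower rank is served earlier.
TYPE_RANK = {"emergency": 0, "beacon": 1, "service": 2}


@dataclass(order=True)
class QueuedMessage:
    # Only sort_key takes part in comparisons; the payload is ignored.
    sort_key: tuple
    payload: str = field(compare=False)


class PriorityScheduler:
    """Serves messages by type rank, then by size, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves FIFO order

    def enqueue(self, msg_type, size_bytes, payload):
        # Assumed priority rule: type first, smaller messages next.
        key = (TYPE_RANK[msg_type], size_bytes, next(self._arrival))
        heapq.heappush(self._heap, QueuedMessage(key, payload))

    def dequeue(self):
        # Pops the message with the smallest key, i.e. the highest priority.
        return heapq.heappop(self._heap).payload

    def __len__(self):
        return len(self._heap)


def drain(scheduler):
    """Returns all queued payloads in the order they would be transmitted."""
    order = []
    while len(scheduler):
        order.append(scheduler.dequeue())
    return order
```

In this sketch an emergency message enqueued after several beacon and service messages is still dequeued first, which is the behaviour a priority-based scheduler is meant to provide. A fuller version would fold the network-state metrics into the key.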