11,693 research outputs found

    Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

    The necessity of using a fixed-size word vocabulary in order to control the model complexity in state-of-the-art neural machine translation (NMT) systems is an important bottleneck on performance, especially for morphologically rich languages. Conventional methods that aim to overcome this problem by using sub-word or character-level representations rely solely on statistics and disregard the linguistic properties of words, which leads to interruptions in the word structure and causes semantic and syntactic losses. In this paper, we propose a new vocabulary reduction method for NMT, which can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language. Our method is based on unsupervised morphology learning and can be, in principle, used for pre-processing any language pair. We also present an alternative word segmentation method based on supervised morphological analysis, which aids us in measuring the accuracy of our model. We evaluate our method on a Turkish-to-English NMT task where the input language is morphologically rich and agglutinative. We analyze different representation methods in terms of translation accuracy as well as the semantic and syntactic properties of the generated output. Our method obtains a significant improvement of 2.3 BLEU points over the conventional vocabulary reduction technique, showing that it can provide better accuracy in open vocabulary translation of morphologically rich languages. Comment: The 20th Annual Conference of the European Association for Machine Translation (EAMT), Research Paper, 12 pages

    AmAMorph: Finite State Morphological Analyzer for Amazighe

    This paper presents AmAMorph, a morphological analyzer for the Amazighe language using a system based on the NooJ linguistic development environment. The paper begins with the development of Amazighe lexicons with large coverage formalization. The built electronic lexicons, named ‘NAmLex’, ‘VAmLex’ and ‘PAmLex’, which stand for ‘Noun Amazighe Lexicon’, ‘Verb Amazighe Lexicon’ and ‘Particles Amazighe Lexicon’, link inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma producing over inflected forms. To our knowledge, AmAMorph is the first morphological analyzer for Amazighe. It identifies the component morphemes of the forms using large coverage morphological grammars. Along with the description of how the analyzer is implemented, this paper gives an evaluation of the analyzer.

    Plant responses to decadal scale increments in atmospheric CO2 concentration: comparing two stomatal conductance sampling methods

    There are several lines of evidence suggesting that the vast majority of C3 plants respond to elevated atmospheric CO2 by decreasing their stomatal conductance (gs). However, in the majority of CO2 enrichment studies, responses to elevated CO2 are tested between plants grown under ambient (380–420 ppm) and high (538–680 ppm) CO2 concentrations and are usually measured at single time points in a diurnal cycle. We investigated gs responses to simulated decadal increments in CO2 predicted over the next 4 decades and tested how measurements of gs may differ when two alternative sampling methods are employed (infrared gas analyzer [IRGA] vs. leaf porometer). We exposed Populus tremula, Populus tremuloides and Sambucus racemosa to four different CO2 concentrations over 126 days in experimental growth chambers at 350, 420, 490 and 560 ppm CO2, representing the years 1987, 2025, 2051, and 2070, respectively (RCP4.5 scenario). Our study demonstrated that the species respond non-linearly to increases in CO2 concentration when exposed to decadal changes in CO2. Under natural conditions, maximum operational gs is often reached in the late morning to early afternoon, with a mid-day depression around noon. However, we showed that the daily maximum gs can, in some species, shift later into the day when plants are exposed to only small increases (70 ppm) in CO2. Non-linear decreases in gs and shifting diurnal stomatal behavior under elevated CO2 could affect the long-term daily water and carbon budget of many plants in the future, and therefore alter soil–plant–atmospheric processes. Irish Research Council; Science Foundation Ireland

    A Machine learning approach to POS tagging

    We have applied inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part Of Speech Tagging). The learning process is supervised and obtains a language model oriented to resolve POS ambiguities. This model consists of a set of statistical decision trees expressing the distribution of tags and words in some relevant contexts. The acquired language models are complete enough to be directly used as sets of POS disambiguation rules, and include more complex contextual information than the simple collections of n-grams usually used in statistical taggers. We have implemented a quite simple and fast tagger that has been tested and evaluated on the Wall Street Journal (WSJ) corpus with a remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation labelling based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine learned decision trees. Simultaneously, we address the problem of tagging when only a small amount of training material is available, which is crucial in any process of constructing, from scratch, an annotated corpus. We show that quite high accuracy can be achieved with our system in this situation. Postprint (published version)

    Views from the coalface: chemo-sensors, sensor networks and the semantic sensor web

    Currently millions of sensors are being deployed in sensor networks across the world. These networks generate vast quantities of heterogeneous data across various levels of spatial and temporal granularity. Sensors range from single-point in situ sensors to remote satellite sensors which can cover the globe. The semantic sensor web in principle should allow for the unification of the web with the real world. In this position paper, we discuss the major challenges to this unification from the perspective of sensor developers (especially chemo-sensors) and of integrating sensor data in real-world deployments. These challenges include: (1) identifying the quality of the data; (2) heterogeneity of data sources and data transport methods; (3) integrating data streams from different sources and modalities (esp. contextual information); and (4) pushing intelligence to the sensor level.