Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English
The necessity of using a fixed-size word vocabulary in order to control the
model complexity in state-of-the-art neural machine translation (NMT) systems
is an important bottleneck on performance, especially for morphologically rich
languages. Conventional methods that aim to overcome this problem by using
sub-word or character-level representations solely rely on statistics and
disregard the linguistic properties of words, which leads to interruptions in
the word structure and causes semantic and syntactic losses. In this paper, we
propose a new vocabulary reduction method for NMT, which can reduce the
vocabulary of a given input corpus at any rate while also considering the
morphological properties of the language. Our method is based on unsupervised
morphology learning and can be, in principle, used for pre-processing any
language pair. We also present an alternative word segmentation method based on
supervised morphological analysis, which aids us in measuring the accuracy of
our model. We evaluate our method on a Turkish-to-English NMT task where the
input language is morphologically rich and agglutinative. We analyze different
representation methods in terms of translation accuracy as well as the semantic
and syntactic properties of the generated output. Our method obtains a
significant improvement of 2.3 BLEU points over the conventional vocabulary
reduction technique, showing that it can provide better accuracy in open
vocabulary translation of morphologically rich languages.
Comment: The 20th Annual Conference of the European Association for Machine
Translation (EAMT), Research Paper, 12 pages
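The fixed-vocabulary bottleneck described in this abstract can be illustrated with a minimal sketch. It assumes a toy Turkish word list and a hand-written suffix inventory (`SUFFIXES`), both invented for this example; the `segment` and `oov_rate` helpers are hypothetical and are not the paper's unsupervised morphology learner. The sketch only shows why splitting off suffixes lowers the out-of-vocabulary rate when the vocabulary size is capped.

```python
from collections import Counter

# Toy Turkish word list (illustrative data, not taken from the paper).
corpus = ("ev evler evlerde evlerim kitap kitaplar kitaplarda "
          "okul okullar okullarda").split()

# Hypothetical suffix inventory, longest first. A real system would learn
# segmentations instead of relying on a hand-written list.
SUFFIXES = ["lerde", "larda", "lerim", "ler", "lar"]

def segment(word):
    """Split off at most one known suffix, marking it with a leading '+'."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return [word[:-len(suf)], "+" + suf]
    return [word]

def oov_rate(tokens, vocab_size):
    """Fraction of tokens that fall outside the vocab_size most frequent types."""
    vocab = {t for t, _ in Counter(tokens).most_common(vocab_size)}
    return sum(t not in vocab for t in tokens) / len(tokens)

word_tokens = corpus
subword_tokens = [piece for w in corpus for piece in segment(w)]

# Same vocabulary budget, two different representations.
print(oov_rate(word_tokens, 5))     # word-level tokens
print(oov_rate(subword_tokens, 5))  # stem + suffix tokens
```

With the same budget of five vocabulary entries, the segmented corpus leaves fewer tokens uncovered because the stems recur across inflected forms.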
AmAMorph: Finite State Morphological Analyzer for Amazighe
This paper presents AmAMorph, a morphological analyzer for the Amazighe language using a system based on the NooJ linguistic development environment. The paper begins with the development of Amazighe lexicons with large coverage formalization. The built electronic lexicons, named "NAmLex", "VAmLex" and "PAmLex", which stand for "Noun Amazighe Lexicon", "Verb Amazighe Lexicon" and "Particles Amazighe Lexicon", link inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma producing over inflected forms. To our knowledge, AmAMorph is the first morphological analyzer for Amazighe. It identifies the component morphemes of the forms using large coverage morphological grammars. Along with the description of how the analyzer is implemented, this paper gives an evaluation of the analyzer.
Plant responses to decadal scale increments in atmospheric CO2 concentration: comparing two stomatal conductance sampling methods
There are several lines of evidence suggesting that the vast majority of C3 plants respond to elevated atmospheric CO2 by decreasing their stomatal conductance (gs). However, in the majority of CO2 enrichment studies, the responses to elevated CO2 are tested between plants grown under ambient (380–420 ppm) and high (538–680 ppm) CO2 concentrations and are usually measured at single time points in a diurnal cycle. We investigated gs responses to simulated decadal increments in CO2 predicted over the next 4 decades and tested how measurements of gs may differ when two alternative sampling methods are employed (infrared gas analyzer [IRGA] vs. leaf porometer). We exposed Populus tremula, Populus tremuloides and Sambucus racemosa to four different CO2 concentrations over 126 days in experimental growth chambers at 350, 420, 490 and 560 ppm CO2, representing the years 1987, 2025, 2051, and 2070, respectively (RCP4.5 scenario). Our study demonstrated that the species respond non-linearly to increases in CO2 concentration when exposed to decadal changes in CO2. Under natural conditions, maximum operational gs is often reached in the late morning to early afternoon, with a mid-day depression around noon. However, we showed that the daily maximum gs can, in some species, shift later into the day when plants are exposed to only small increases (70 ppm) in CO2. A non-linear decrease in gs and a shift in diurnal stomatal behavior under elevated CO2 could affect the long-term daily water and carbon budget of many plants in the future, and therefore alter soil–plant–atmospheric processes. Irish Research Council; Science Foundation Ireland
A Machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and obtains a language
model oriented to resolve POS ambiguities. This model consists
of a set of statistical decision trees expressing distribution of
tags and words in some relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a quite simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
In this direction we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated manually written constraints, etc.), and in
particular to incorporate the machine learned decision trees.
Simultaneously, we address the problem of tagging when only a
small amount of training material is available, which is crucial in any process
of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.
Postprint (published version)
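The idea of resolving POS ambiguity from tag distributions in context can be sketched as follows. This is a deliberately simplified stand-in, not the paper's method: it asks a single fixed context question (the previous tag) instead of inducing a statistical decision tree, and it decodes greedily from left to right instead of using relaxation labelling. The toy corpus and the `train` and `tag` helpers are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (invented for illustration; the paper evaluates on WSJ).
tagged = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VB"), ("daily", "RB")],
    [("a", "DT"), ("run", "NN"), ("started", "VBD")],
    [("we", "PRP"), ("run", "VB"), ("fast", "RB")],
]

def train(sentences):
    """Count how often each tag occurs for a (word, previous tag) context."""
    counts = defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag_ in sent:
            counts[(word, prev)][tag_] += 1
            prev = tag_
    return counts

def tag(words, counts, default="NN"):
    """Greedy tagging: pick the most frequent tag for each observed context."""
    prev, out = "<s>", []
    for w in words:
        dist = counts.get((w, prev))
        best = dist.most_common(1)[0][0] if dist else default
        out.append(best)
        prev = best
    return out

model = train(tagged)
print(tag(["the", "run", "ended"], model))
print(tag(["they", "run", "fast"], model))
```

The ambiguous word "run" receives a different tag depending on the preceding tag, which is the kind of contextual evidence the abstract says goes beyond simple n-gram collections once richer context questions are added.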
Views from the coalface: chemo-sensors, sensor networks and the semantic sensor web
Currently, millions of sensors are being deployed in sensor networks across the world. These networks generate vast quantities of heterogeneous data across various levels of spatial and temporal granularity. Sensors range from single-point in situ sensors to remote satellite sensors which can cover the globe. The semantic sensor web should, in principle, allow for the unification of the web with the real world. In this position paper, we discuss the major challenges to this unification from the perspective of sensor developers (especially of chemo-sensors) and of integrating sensor data in real-world deployments. These challenges include: (1) identifying the quality of the data; (2) heterogeneity of data sources and data transport methods; (3) integrating data streams from different sources and modalities (esp. contextual information); and (4) pushing intelligence to the sensor level.