Natural Language Processing at the School of Information Studies for Africa
The lack of persons trained in computational linguistic methods is a severe obstacle to making the Internet and computers accessible to people all over the world in their own languages.
The paper discusses the experiences of designing and teaching an introductory course in Natural Language Processing to graduate computer science students at Addis Ababa University, Ethiopia, in order to initiate the education of computational linguists in the Horn of Africa region.
Development of tag sets for part-of-speech tagging
This article discusses tag sets used when PoS-tagging a corpus, that is, enriching a corpus by adding a part-of-speech tag to each word. This requires a tag-set, a list of grammatical category labels; a tagging scheme, practical definitions of each tag or label, showing words and contexts where each tag applies; and a tagger, a program for assigning a tag to each word in the corpus, implementing the tag-set and tagging-scheme in a tag-assignment algorithm. We start by reviewing tag-sets developed for English corpora in section 1, since English was the first language studied by corpus linguists. Pioneering corpus linguists thought that their English corpora could be more useful research resources if each word was annotated with a Part-of-Speech label or tag. Traditional English grammars generally provide 8 basic parts of speech, derived from Latin grammar. However, most tag-set developers wanted to capture finer grammatical distinctions, leading to larger tag-sets. PoS-tagged English corpora have been used in a wide range of applications. Section 2 examines criteria used in development of English corpus Part-of-Speech tag sets: mnemonic tag names; underlying linguistic theory; classification by form or function; analysis of idiosyncratic words; categorization problems; tokenisation issues: defining what counts as a word; multi-word lexical items; target user and/or application; availability and/or adaptability of tagger software; adherence to standards; variations in genre, register, or type of language; and degree of delicacy of the tag-set. To illustrate these issues, section 3 outlines a range of examples of tag set developments for different languages, and discusses how these criteria apply. 
First we consider tag-sets for an online Part-of-Speech tagging service for English; then design of a tag-set for another language from the same broad Indo-European language family, Urdu; then for a non-Indo-European language with a highly inflexional grammar, Arabic; then for a contrasting non-Indo-European language with isolating grammar, Malay. Finally, we present some conclusions in section 4, and references in section 5.
Focus to emphasize tone analysis for prosodic generation
Emphasizing the prosody of a sentence at its focus part when producing a speaker's utterance can improve the recognition rate for hearers and reduce ambiguity. Our objective is to address this challenge by analysing the concept of foci in speech utterances and the relationship between focus, the speaker's intention and prosody. Our investigation is aimed at understanding and modelling how a speaker's utterances are influenced by the speaker's intentions. The relationship between the speaker's intentions and focus information is used to consider which parts of the sentence serve as the focus parts. We propose using the Focus to Emphasize Tone (FET) analysis, which includes: (i) generating the constraints for foci, speaker's intention and prosodic features, (ii) defining the intonation patterns, (iii) labelling a set of prosodic marks for a sentence. We also design the FET structure to support our analysis and to contain focus, speaker's intention and prosodic components. An implementation of the system is described and the evaluation results on the CMU Communicator (CMU-COM) dataset are presented.
Macro Grammars and Holistic Triggering for Efficient Semantic Parsing
To learn a semantic parser from denotations, a learning algorithm must search
over a combinatorially large space of logical forms for ones consistent with
the annotated denotations. We propose a new online learning algorithm that
searches faster as training progresses. The two key ideas are using macro
grammars to cache the abstract patterns of useful logical forms found thus far,
and holistic triggering to efficiently retrieve the most relevant patterns
based on sentence similarity. On the WikiTableQuestions dataset, we first
expand the search space of an existing model to improve the state-of-the-art
accuracy from 38.7% to 42.7%, and then use macro grammars and holistic
triggering to achieve an 11x speedup and an accuracy of 43.7%. Comment: EMNLP 201
High efficiency realization for a wide-coverage unification grammar
We give a detailed account of an algorithm for efficient tactical generation from underspecified logical-form semantics, using a wide-coverage grammar and a corpus of real-world target utterances. Some earlier claims about chart realization are critically reviewed and corrected in the light of a series of practical experiments. As well as a set of algorithmic refinements, we present two novel techniques: the integration of subsumption-based local ambiguity factoring, and a procedure to selectively unpack the generation forest according to a probability distribution given by a conditional, discriminative model.
Example-based machine translation of the Basque language
Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.
Edinburgh's Statistical Machine Translation Systems for WMT16
This paper describes the University of Edinburgh's phrase-based and syntax-based submissions to the shared translation tasks of the ACL 2016 First Conference on Machine Translation (WMT16). We submitted five phrase-based and five syntax-based systems for the news task, plus one phrase-based system for the biomedical task.