2,244 research outputs found
Statistical Function Tagging and Grammatical Relations of Myanmar Sentences
This paper describes context-free grammar (CFG) based grammatical relations
for Myanmar sentences, combined with a corpus-based function tagging system. Part
of the challenge of statistical function tagging for Myanmar sentences comes
from the fact that Myanmar has free-phrase-order and a complex morphological
system. Function tagging is a pre-processing step to show grammatical relations
of Myanmar sentences. In the task of function tagging, which tags the function
of Myanmar sentences with correct segmentation, POS (part-of-speech) tagging
and chunking information, we use Naive Bayesian theory to disambiguate the
possible function tags of a word. We apply context free grammar (CFG) to find
out the grammatical relations of the function tags. We also create a
function-annotated corpus for Myanmar and propose grammar rules for Myanmar
sentences. Experiments show that our analysis achieves a good result with
simple sentences and complex sentences.
Comment: 16 pages, 7 figures, 8 tables, AIAA-2011 (India). arXiv admin note:
text overlap with arXiv:0912.1820 by other author
Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language
This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have grown in interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publicly available. Furthermore, before developing a part-of-speech tagging system, a suitable tagset is required for that language. In this thesis, we make the following contributions to bridge this gap:
Firstly, we propose a method for diacritic restoration based on naive Bayes classifiers that act at word-level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed.
Secondly, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system in Māori, that of a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and was the result of in-depth analysis of the Māori grammar
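The word-level naive Bayes idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the class name `WordLevelRestorer`, the single feature (the preceding word), the add-one smoothing, and the toy training sentences are all assumptions made for illustration, whereas the thesis reports using a richer feature set.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. a macron, from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class WordLevelRestorer:
    """Hypothetical naive Bayes restorer; the only feature is the previous word."""

    def __init__(self):
        # stripped word -> counts of the marked forms seen for it
        self.form_counts = defaultdict(Counter)
        # marked form -> counts of the (stripped) word that preceded it
        self.context_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        for sentence in sentences:
            prev = "<s>"
            for word in sentence.split():
                base = strip_diacritics(word)
                self.form_counts[base][word] += 1
                self.context_counts[word][prev] += 1
                self.vocab.add(prev)
                prev = base

    def restore_word(self, base, prev):
        candidates = self.form_counts.get(base)
        if not candidates:
            return base  # unseen word: leave it unchanged
        total = sum(candidates.values())
        vocab_size = len(self.vocab)

        def score(form):
            # log prior of the form plus smoothed log likelihood of the context
            prior = math.log(candidates[form] / total)
            ctx = self.context_counts[form]
            likelihood = math.log((ctx[prev] + 1) / (sum(ctx.values()) + vocab_size))
            return prior + likelihood

        return max(candidates, key=score)

    def restore(self, text):
        restored, prev = [], "<s>"
        for base in text.split():
            restored.append(self.restore_word(base, prev))
            prev = base
        return " ".join(restored)


# Toy, made-up training text; real training data would be a marked corpus.
restorer = WordLevelRestorer()
restorer.train(["te kākā", "te kākā", "he kaka"])
print(restorer.restore("te kaka"))
print(restorer.restore("he kaka"))
```

Because the only resource is diacritically marked text, the same sketch applies to any language whose diacritics are Unicode combining marks, which mirrors the language-independence claim in the abstract.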
A Note on Zipf's Law, Natural Languages, and Noncoding DNA regions
In Phys. Rev. Letters (73:2, 5 Dec. 94), Mantegna et al. conclude on the
basis of Zipf rank frequency data that noncoding DNA sequence regions are more
like natural languages than coding regions. We argue on the contrary that an
empirical fit to Zipf's ``law'' cannot be used as a criterion for similarity to
natural languages. Although DNA is presumably an ``organized system of
signs'' in Mandelbrot's (1961) sense, an observation of statistical features of
the sort presented in the Mantegna et al. paper does not shed light on the
similarity between DNA's ``grammar'' and natural language grammars, just as the
observation of exact Zipf-like behavior cannot distinguish between the
underlying processes of tossing an $M$-sided die or a finite-state branching
process.
Comment: compressed uuencoded postscript file: 14 page
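The point that a purely random process can yield a Zipf-like curve can be illustrated with the familiar random-typing experiment. The sketch below is an illustration under assumptions, not the authors' construction: the function names, the choice of one die face as a word separator, and the least-squares fit on log-log axes are all hypothetical choices.

```python
import math
import random
from collections import Counter


def random_typing_words(num_faces, num_tosses, seed=0):
    """Toss a die whose faces are letters plus one space; split into 'words'."""
    if not 2 <= num_faces <= 27:
        raise ValueError("num_faces must be between 2 and 27")
    rng = random.Random(seed)
    faces = "abcdefghijklmnopqrstuvwxyz"[: num_faces - 1] + " "
    text = "".join(rng.choice(faces) for _ in range(num_tosses))
    return text.split()


def rank_frequency(words):
    """Word frequencies sorted from most to least frequent (rank 1, 2, ...)."""
    return sorted(Counter(words).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


words = random_typing_words(num_faces=5, num_tosses=200_000)
freqs = rank_frequency(words)
print("distinct words:", len(freqs))
print("log-log slope:", loglog_slope(freqs))
```

A roughly linear, negatively sloped log-log plot from this meaningless text is the kind of evidence the abstract argues cannot, on its own, establish similarity to natural language.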
Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of
dependency grammars and associated syntax-to-semantic-relationship mappings
from large text corpora is described. The suggested approach builds on the
authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well
as on a number of prior papers and approaches from the statistical language
learning literature. If successful, this approach would enable the mining of
all the information needed to power a natural language comprehension and
generation system, directly from a large, unannotated corpus.
Comment: 29 pages, 5 figures, research proposa
Recursive Neural Networks Can Learn Logical Semantics
Tree-structured recursive neural networks (TreeRNNs) for sentence meaning
have been successful for many applications, but it remains an open question
whether the fixed-length representations that they learn can support tasks as
demanding as logical deduction. We pursue this question by evaluating whether
two such models---plain TreeRNNs and tree-structured neural tensor networks
(TreeRNTNs)---can correctly learn to identify logical relationships such as
entailment and contradiction using these representations. In our first set of
experiments, we generate artificial data from a logical grammar and use it to
evaluate the models' ability to learn to handle basic relational reasoning,
recursive structures, and quantification. We then evaluate the models on the
more natural SICK challenge data. Both models perform competitively on the SICK
data and generalize well in all three experiments on simulated data, suggesting
that they can learn suitable representations for logical inference in natural
language
A Type-coherent, Expressive Representation as an Initial Step to Language Understanding
A growing interest in tasks involving language understanding by the NLP
community has led to the need for effective semantic parsing and inference.
Modern NLP systems use semantic representations that do not quite fulfill the
nuanced needs for language understanding: adequately modeling language
semantics, enabling general inferences, and being accurately recoverable. This
document describes underspecified logical forms (ULF) for Episodic Logic (EL),
which is an initial form for a semantic representation that balances these
needs. ULFs fully resolve the semantic type structure while leaving issues such
as quantifier scope, word sense, and anaphora unresolved; they provide a
starting point for further resolution into EL, and enable certain structural
inferences without further resolution. This document also presents preliminary
results of creating a hand-annotated corpus of ULFs for the purpose of training
a precise ULF parser, showing a three-person pairwise interannotator agreement
of 0.88 on confident annotations. We hypothesize that a divide-and-conquer
approach to semantic parsing starting with derivation of ULFs will lead to
semantic analyses that do justice to subtle aspects of linguistic meaning, and
will enable construction of more accurate semantic parsers.
Comment: Accepted for publication at The 13th International Conference on
Computational Semantics (IWCS 2019
Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy
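As a rough illustration of distributional similarity over syntactic contexts, the sketch below builds a context-count vector per term and compares terms by cosine. Cosine is only one possible measure and is chosen here as an assumption, since the abstract does not name the measures used. The (term, context) pairs are invented toy data, not Pro3Gres output, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_context_vectors(pairs):
    """Map each term to a Counter of the syntactic contexts it occurs in."""
    vectors = defaultdict(Counter)
    for term, context in pairs:
        vectors[term][context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_terms(term, vectors, k=3):
    """Return up to k other terms, most similar first."""
    target = vectors.get(term, Counter())
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


# Invented (term, dependency context) pairs for illustration only.
pairs = [
    ("IL-2", "subj_of:activate"), ("IL-2", "obj_of:express"),
    ("IL-4", "subj_of:activate"), ("IL-4", "obj_of:express"),
    ("T cell", "subj_of:express"), ("T cell", "obj_of:activate"),
]
vectors = build_context_vectors(pairs)
print(nearest_terms("IL-2", vectors, k=1))
```

Terms that share many contexts score close to 1 and terms with disjoint contexts score 0, which is the signal that could then be checked against a gold standard such as an ontology of semantic types.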