A Maximum-Entropy Partial Parser for Unrestricted Text
This paper describes a partial parser that assigns syntactic structures to
sequences of part-of-speech tags. The program uses the maximum entropy
parameter estimation method, which allows a flexible combination of different
knowledge sources: the hierarchical structure, parts of speech and phrasal
categories. In effect, the parser goes beyond simple bracketing and recognises
even fairly complex structures. We give accuracy figures for different
applications of the parser.
Comment: 9 pages, LaTeX
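The parser itself is not given in the abstract, but the core idea, a maximum-entropy (i.e. multinomial logistic-regression) classifier over part-of-speech context features, can be sketched as follows. The feature templates, chunk labels, and toy data below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a maximum-entropy (multinomial logistic regression) chunker
# over part-of-speech tag sequences. Feature templates and toy data are
# illustrative only; they are not taken from the paper.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pos_features(tags, i):
    """Context features around position i in a POS-tag sequence."""
    return {
        "tag": tags[i],
        "prev": tags[i - 1] if i > 0 else "<s>",
        "next": tags[i + 1] if i < len(tags) - 1 else "</s>",
    }

# Toy training data: POS-tag sequences with B/I/O chunk labels.
train = [
    (["DT", "JJ", "NN", "VBZ", "DT", "NN"],
     ["B-NP", "I-NP", "I-NP", "O", "B-NP", "I-NP"]),
    (["PRP", "VBD", "IN", "DT", "NN"],
     ["B-NP", "O", "O", "B-NP", "I-NP"]),
]

X = [pos_features(tags, i) for tags, labels in train for i in range(len(tags))]
y = [label for _, labels in train for label in labels]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression = MaxEnt
clf.fit(vec.fit_transform(X), y)

test = ["DT", "NN", "VBZ", "JJ"]
print(clf.predict(vec.transform([pos_features(test, i) for i in range(len(test))])))
```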
A Machine-Aided Approach to Intelligent Index Generation
Back-of-the-book indexing is the process of generating a list of relevant terms, sub-terms and cross-references from a corpus and providing the user with corresponding page references.
Several cognitive tasks are necessary to produce a good index, and are performed primarily by the human indexer. Indexing has become somewhat automated through computer applications, which at best generate a concordance, and exist to reduce the mundane portions of the process. However, none of these tools determines which terms to index, nor do they capture context-sensitive information about terms and their relationships. Human indexers perform these time-consuming tasks.
The challenge is to develop software that bridges the gap between computerized concordances and manual indexing. The prototype application described herein is unique in its ability to incorporate the intelligent portions of the process. Because of this, it provides a robust draft index that a human indexer can refine in a fraction of the time.
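For contrast with the "intelligent" indexing the abstract aims at, a minimal sketch of the concordance-style baseline it mentions (mapping every term to the pages on which it occurs) might look like this; the tokenization and page representation are illustrative, and term selection, the step human indexers still perform, is exactly what this baseline omits.

```python
# Concordance-style baseline: map each candidate term to the pages where it
# occurs. Deciding *which* terms belong in the index is not handled here.
from collections import defaultdict

def build_concordance(pages):
    """pages: list of page texts; output page numbers start at 1."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for token in text.lower().split():
            index[token.strip(".,;:")].add(page_no)
    return {term: sorted(nums) for term, nums in index.items()}

pages = ["Maximum entropy models combine knowledge sources.",
         "Entropy is maximised subject to feature constraints."]
print(build_concordance(pages)["entropy"])   # -> [1, 2]
```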
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, an endeavor that is invariably expensive and time-consuming. Consequently, there have been many attempts to reduce this cost through unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for morphological clustering and segmentation that is minimally supervised yet more linguistically informed than previous unsupervised approaches. This study attempts to induce, from unannotated text, clusters of words that are inflectional variants of each other; a set of inflectional suffixes, organized by part of speech, is then induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source) to another (the target). A further advantage of this approach is that it allows a reduction in the amount of training data without significant degradation in performance, making it useful in applications targeting data collected from endangered languages. In the current study, however, I use English as the source and German as the target, for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed that of the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this study attempts to integrate more learning methods than previous work. Furthermore, it attempts to learn inflectional rather than derivational morphology, a crucial distinction in linguistics.
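A toy sketch of the clustering-then-segmentation idea, stripped of the bitext alignment and part-of-speech transfer that the study actually relies on, might group word forms by a shared stem and read candidate suffixes off the residue. The prefix-length threshold and the small German vocabulary below are illustrative assumptions, not the study's method.

```python
# Toy sketch: cluster word forms sharing a sufficiently long common prefix as
# candidate inflectional variants, then read candidate suffixes off the stem.
def common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def cluster_variants(words, min_stem=3):
    """Group words by a shared stem of at least min_stem characters."""
    clusters = []
    for word in sorted(words):
        for cluster in clusters:
            if len(common_prefix(cluster[0], word)) >= min_stem:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

words = ["sagen", "sagt", "sagte", "gesagt", "haus", "hauses"]
for cluster in cluster_variants(words):
    stem = cluster[0]
    for w in cluster[1:]:
        stem = common_prefix(stem, w)
    print(stem, [w[len(stem):] or "-" for w in cluster])
# e.g. "sag" -> suffixes ["en", "t", "te"]; note that "gesagt" (a prefixed
# participle) is missed, illustrating why a purely suffix-based heuristic
# needs the additional linguistic information the study brings in.
```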
Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation
We introduce the multiresolution recurrent neural network, which extends the
sequence-to-sequence framework to model natural language generation as two
parallel discrete stochastic processes: a sequence of high-level coarse tokens,
and a sequence of natural language tokens. There are many ways to estimate or
learn the high-level coarse tokens, but we argue that a simple extraction
procedure is sufficient to capture a wealth of high-level discourse semantics.
Such a procedure allows training the multiresolution recurrent neural network by
maximizing the exact joint log-likelihood over both sequences. In contrast to
the standard log-likelihood objective w.r.t. natural language tokens (word
perplexity), optimizing the joint log-likelihood biases the model towards
modeling high-level abstractions. We apply the proposed model to the task of
dialogue response generation in two challenging domains: the Ubuntu technical
support domain, and Twitter conversations. On Ubuntu, the model outperforms
competing approaches by a substantial margin, achieving state-of-the-art
results according to both automatic evaluation metrics and a human evaluation
study. On Twitter, the model appears to generate more relevant and on-topic
responses according to automatic evaluation metrics. Finally, our experiments
demonstrate that the proposed model is more adept at overcoming the sparsity of
natural language and is better able to capture long-term structure.
Comment: 21 pages, 2 figures, 10 tables
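As a rough illustration of the joint objective described in this abstract, a two-level factorization over the coarse token sequence and the natural-language token sequence might be written as follows; the notation (coarse tokens z, word tokens w, parameters theta) is ours, and the paper's exact factorization may differ.

```latex
% Joint log-likelihood over the coarse sequence z and the word sequence w
% (notation illustrative; the paper's exact factorization may differ):
\log P_\theta(z, w)
  = \log P_\theta(z) + \log P_\theta(w \mid z)
  = \sum_{t=1}^{|z|} \log P_\theta(z_t \mid z_{<t})
  + \sum_{t=1}^{|w|} \log P_\theta(w_t \mid w_{<t}, z)
```

Maximizing this joint quantity over both sequences, rather than the word-level term \(\log P_\theta(w)\) alone (word perplexity), is what the abstract credits with biasing the model towards high-level abstractions.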
What Works Better? A Study of Classifying Requirements
Classifying requirements into functional requirements (FR) and non-functional
ones (NFR) is an important task in requirements engineering. However, automated
classification of requirements written in natural language is not
straightforward, due to the variability of natural language and the absence of
a controlled vocabulary. This paper investigates how automated classification
of requirements into FR and NFR can be improved and how well several machine
learning approaches work in this context. We contribute an approach for
preprocessing requirements that standardizes and normalizes requirements before
applying classification algorithms. Further, we report on how well several
existing machine learning methods perform for automated classification of NFRs
into sub-categories such as usability, availability, or performance. Our study
is performed on 625 requirements provided by the OpenScience tera-PROMISE
repository. We found that our preprocessing improved the performance of an
existing classification method. We further found significant differences in the
performance of approaches such as Latent Dirichlet Allocation, Biterm Topic
Modeling, and Naive Bayes for the sub-classification of NFRs.
Comment: 7 pages, the 25th IEEE International Conference on Requirements Engineering (RE'17)
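A minimal sketch of automated FR/NFR classification with a bag-of-words model is shown below. The toy requirements, labels, and pipeline are illustrative; the paper's own preprocessing (standardizing and normalizing requirements) and its comparison of Latent Dirichlet Allocation, Biterm Topic Modeling, and Naive Bayes are not reproduced here.

```python
# Illustrative FR/NFR classifier: TF-IDF features + Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

requirements = [
    "The system shall allow users to reset their password via email.",   # FR
    "The system shall respond to search queries within two seconds.",    # NFR (performance)
    "The administrator shall be able to deactivate user accounts.",      # FR
    "The user interface shall be usable without prior training.",        # NFR (usability)
]
labels = ["FR", "NFR", "FR", "NFR"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(requirements, labels)

print(clf.predict(["The system shall be available 99.9% of the time."]))
```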