Data-Oriented Language Processing. An Overview
During the last few years, a new approach to language processing has started
to emerge, which has become known under various labels such as "data-oriented
parsing", "corpus-based interpretation", and "tree-bank grammar" (cf. van den
Berg et al. 1994; Bod 1992-96; Bod et al. 1996a/b; Bonnema 1996; Charniak
1996a/b; Goodman 1996; Kaplan 1996; Rajman 1995a/b; Scha 1990-92; Sekine &
Grishman 1995; Sima'an et al. 1994; Sima'an 1995-96; Tugwell 1995). This
approach, which we will call "data-oriented processing" or "DOP", embodies the
assumption that human language perception and production work with
representations of concrete past language experiences, rather than with
abstract linguistic rules. The models that instantiate this approach therefore
maintain large corpora of linguistic representations of previously occurring
utterances. When processing a new input utterance, analyses of this utterance
are constructed by combining fragments from the corpus; the
occurrence-frequencies of the fragments are used to estimate which analysis is
the most probable one.
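To make the frequency-based estimation concrete, here is a minimal sketch of the simplest DOP instantiation: a derivation's probability is the product of the relative frequencies of the fragments it combines. The fragment inventory and counts below are invented for illustration, not taken from any actual treebank.

```python
from collections import Counter

# Toy fragment inventory: (root label, fragment) -> corpus count.
# These fragments and their counts are illustrative assumptions.
fragments = Counter({
    ("S", "(S NP VP)"): 10,
    ("NP", "(NP she)"): 4,
    ("NP", "(NP Det N)"): 6,
    ("VP", "(VP V NP)"): 5,
    ("VP", "(VP sleeps)"): 5,
})

def fragment_prob(root, frag):
    """Relative frequency of a fragment among all fragments with the same root."""
    total = sum(c for (r, _), c in fragments.items() if r == root)
    return fragments[(root, frag)] / total

def derivation_prob(derivation):
    """Probability of a derivation: the product of its fragments' relative frequencies."""
    p = 1.0
    for root, frag in derivation:
        p *= fragment_prob(root, frag)
    return p

# One derivation of "she sleeps": 1.0 * 0.4 * 0.5 = 0.2
d = [("S", "(S NP VP)"), ("NP", "(NP she)"), ("VP", "(VP sleeps)")]
print(derivation_prob(d))  # 0.2
```

Note that in a full DOP model the probability of a parse sums over all derivations that yield it; the sketch above scores a single derivation only.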
In this paper we give an in-depth discussion of a data-oriented processing
model which employs a corpus of labelled phrase-structure trees. Then we review
some other models that instantiate the DOP approach. Many of these models also
employ labelled phrase-structure trees, but use different criteria for
extracting fragments from the corpus or employ different disambiguation
strategies (Bod 1996b; Charniak 1996a/b; Goodman 1996; Rajman 1995a/b; Sekine &
Grishman 1995; Sima'an 1995-96); other models use richer formalisms for their
corpus annotations (van den Berg et al. 1994; Bod et al., 1996a/b; Bonnema
1996; Kaplan 1996; Tugwell 1995).
Comment: 34 pages, Postscript
Evaluation of the NLP Components of the OVIS2 Spoken Dialogue System
The NWO Priority Programme Language and Speech Technology is a 5-year
research programme aiming at the development of spoken language information
systems. In the Programme, two alternative natural language processing (NLP)
modules are developed in parallel: a grammar-based (conventional, rule-based)
module and a data-oriented (memory-based, stochastic, DOP) module. In order to
compare the NLP modules, a formal evaluation has been carried out three years
after the start of the Programme. This paper describes the evaluation procedure
and the evaluation results. The grammar-based component performs much better
than the data-oriented one in this comparison.
Comment: Proceedings of CLIN 9
Vector Symbolic Architectures answer Jackendoff's challenges for cognitive neuroscience
Jackendoff (2002) posed four challenges that linguistic combinatoriality and
rules of language present to theories of brain function. The essence of these
problems is the question of how to neurally instantiate the rapid construction
and transformation of the compositional structures that are typically taken to
be the domain of symbolic processing. He contended that typical connectionist
approaches fail to meet these challenges and that the dialogue between
linguistic theory and cognitive neuroscience will be relatively unproductive
until the importance of these problems is widely recognised and the challenges
answered by some technical innovation in connectionist modelling. This paper
claims that a little-known family of connectionist models (Vector Symbolic
Architectures) is able to meet Jackendoff's challenges.
Comment: This is a slightly updated version of the paper presented at the
Joint International Conference on Cognitive Science, 13-17 July 2003,
University of New South Wales, Sydney, Australia. 6 pages
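As a concrete illustration of what such architectures provide, the sketch below uses circular convolution binding from Holographic Reduced Representations (one member of the VSA family) to bind a role vector to a filler and recover the filler from the composite trace. The role and filler names, the dimensionality, and the random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024  # high dimensionality makes independent random vectors nearly orthogonal

def rand_vec():
    # Elements drawn with variance 1/n so vectors have unit expected length.
    return rng.normal(0.0, 1.0 / np.sqrt(n), n)

def bind(a, b):
    """Circular convolution: the HRR binding operation, computed via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, cue):
    """Convolve the trace with the involution of the cue to approximately recover the filler."""
    involution = np.roll(cue[::-1], 1)
    return bind(involution, trace)

# Bind the role `agent` to the filler `mary`, then query the trace with the role.
agent, mary, john = rand_vec(), rand_vec(), rand_vec()
trace = bind(agent, mary)
retrieved = unbind(trace, agent)

# Clean up by comparing the noisy result against candidate fillers.
sims = {name: float(retrieved @ v) for name, v in [("mary", mary), ("john", john)]}
print(max(sims, key=sims.get))  # recovers "mary"
```

Because binding and superposition keep composite vectors the same fixed size, structures can be built and transformed with simple vector algebra, which is the substance of the paper's response to Jackendoff.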
Robust Grammatical Analysis for Spoken Dialogue Systems
We argue that grammatical analysis is a viable alternative to concept
spotting for processing spoken input in a practical spoken dialogue system. We
discuss the structure of the grammar, and a model for robust parsing which
combines linguistic sources of information and statistical sources of
information. We discuss test results suggesting that grammatical processing
allows fast and accurate processing of spoken input.
Comment: Accepted for JNL
Habeant Corpus: they should have the body. Tools learners have the right to use
With the advent of fast, powerful, cheap and accessible computer tools, the use of corpora has exploded in the last 20 years. In the field of language learning, however, their use is mainly restricted to researchers, course writers and teachers, while the benefits to the learner are largely second-hand: rare is the teacher who allows a class direct access to corpus methodology. This paper argues that there is no reason not to trust at least advanced learners with corpus tools, and that there are significant advantages to encouraging a hands-on approach.
After outlining the rationale underpinning this approach, we describe an English course in which learners are required to apply corpus techniques to an existing corpus or one of their own devising. We then go on to describe our students' own productions, using only corpus techniques and tools used by the learners themselves, all freely available on the internet and requiring minimal training.
The Baby project: processing character patterns in textual representations of language.
This thesis describes an investigation into a proposed theory of AI. The theory postulates that a machine can be programmed to predict aspects of human behaviour by selecting and processing stored, concrete examples of previously experienced patterns of behaviour. Validity is tested in the domain of natural language. Externalisations that model the resulting theory of NLP entail fuzzy components. Fuzzy formalisms may exhibit inaccuracy and/or over-productivity. A research strategy is developed to investigate this aspect of the theory. The strategy includes two experimental hypotheses designed to test 1) whether the model can process simple language interaction, and 2) the effect of fuzzy processes on such language interaction. The experimental design requires three implementations, each with a progressively greater degree of fuzziness in its processes, named NonfuzzBabe, CorrBabe and FuzzBabe respectively. NonfuzzBabe is used to test the first hypothesis, and all three implementations are used to test the second. A system description is presented for NonfuzzBabe. Testing the first hypothesis yields results showing that NonfuzzBabe is able to process simple language interaction. A system description for CorrBabe and FuzzBabe is presented. Testing the second hypothesis yields results showing a positive correlation between the degree of fuzziness of a system's processes and improved simple language performance. FuzzBabe's ability to process more complex language interaction is then investigated, and model-intrinsic limitations are found. Research to overcome this problem is designed to illustrate the potential of externalising the theory, and is conducted less rigorously than the earlier parts of this investigation. Augmenting FuzzBabe to include fuzzy evaluation of non-pattern elements of interaction is hypothesised as a possible solution; the term FuzzyBaby was coined for the augmented implementation. Results of a pilot study designed to measure FuzzyBaby's reading comprehension are given. Little research has investigated NLP through the fuzzy processing of concrete patterns in language; consequently, it is proposed that this research contributes to the disciplines of NLP and AI in general.
An evolutionary algorithm approach to poetry generation
Institute for Communicating and Collaborative Systems
Poetry is a unique artifact of the human language faculty, with its defining feature being a
strong unity between content and form. Contrary to the opinion that the automatic generation
of poetry is a relatively easy task, we argue that it is in fact an extremely difficult task that
requires intelligence, world and linguistic knowledge, and creativity.
We propose a model of poetry generation as a state space search problem, where a goal state is
a text that satisfies the three properties of meaningfulness, grammaticality, and poeticness.
We argue that almost all existing work on poetry generation only properly addresses a subset
of these properties.
In designing a computational approach for solving this problem, we draw upon the wealth of
work in natural language generation (NLG). Although the emphasis of NLG research is on the
generation of informative texts, recent work has highlighted the need for more flexible models
which can be cast as one end of a spectrum of search sophistication, where the opposing end
is the deterministically goal-directed planning of traditional NLG. We propose satisfying the
properties of poetry through the application to NLG of evolutionary algorithms (EAs), a well-studied heuristic search method.
MCGONAGALL is our implemented instance of this approach. We use a linguistic representation
based on Lexicalized Tree Adjoining Grammar (LTAG) that we argue is appropriate for
EA-based NLG. Several genetic operators are implemented, ranging from baseline operators
based on LTAG syntactic operations to heuristic semantic goal-directed operators. Two evaluation
functions are implemented: one that measures the isomorphism between a solution's
stress pattern and a target metre using the edit distance algorithm, and one that measures the
isomorphism between a solution's propositional semantics and a target semantics using structural
similarity metrics.
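The metrical evaluation function can be sketched as follows; the stress-pattern encoding ('0' unstressed, '1' stressed) and the normalisation of the edit distance into a [0, 1] fitness score are assumptions for illustration, not the thesis's exact formulation.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over a single row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution or match
    return dp[-1]

def metre_score(stress, target):
    """Fitness in [0, 1]; 1.0 means the candidate's stress pattern matches the target metre."""
    return 1.0 - edit_distance(stress, target) / max(len(stress), len(target))

target = "01010101"  # iambic tetrameter
print(metre_score("01010101", target))  # 1.0 (perfect match)
print(metre_score("0101011", target))   # 0.875 (one edit away)
```

An EA can use such a score directly as (one component of) its fitness function when selecting candidate texts for the next generation.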
We conducted an empirical study using MCGONAGALL to test the validity of employing EAs
in solving the search problem, and to test whether our evaluation functions adequately capture
the notions of semantic and metrical faithfulness. We conclude that our use of EAs offers
an innovative approach to flexible NLG, as demonstrated by its successful application to the
poetry generation task.
Contextually-Dependent Lexical Semantics
Institute for Communicating and Collaborative Systems
This thesis is an investigation of phenomena at the interface between syntax, semantics,
and pragmatics, with the aim of arguing for a view of semantic interpretation as lexically driven
yet contextually dependent. I examine regular, generative processes which operate
over the lexicon to induce verbal sense shifts, and discuss the interaction of these processes
with the linguistic or discourse context. I concentrate on phenomena where only an interaction
between all three linguistic knowledge sources can explain the constraints on verb
use: conventionalised lexical semantic knowledge constrains productive syntactic processes,
while pragmatic reasoning is both constrained by and constrains the potential interpretations
given to certain verbs. The phenomena which are closely examined are the behaviour of
PP sentential modifiers (specifically dative and directional PPs) with respect to the lexical
semantic representation of the verb phrases they modify, resultative constructions, and logical
metonymy.
The analysis is couched in terms of a lexical semantic representation drawing on Davis
(1995), Jackendoff (1983, 1990), and Pustejovsky (1991, 1995), which aims to capture "linguistically
relevant" components of meaning. The representation is shown to have utility for
modelling the interaction between the syntactic form of an utterance and its meaning.
I introduce a formalisation of the representation within the framework of Head Driven
Phrase Structure Grammar (Pollard and Sag 1994), and rely on the model of discourse
coherence proposed by Lascarides and Asher (1992), Discourse in Commonsense Entailment.
I furthermore discuss the implications of the contextual dependency of semantic interpretation
for lexicon design and computational processing in Natural Language Understanding
systems.