5 research outputs found
Applying a language acquisition model to the problem of small-language processing (Применение модели освоения языка к решению задачи обработки малых языков)
This paper addresses the problem of building a computer model of a small language. The task is relevant for the following reasons: the need to eliminate information inequality between speakers of different languages; the need for new tools for studying poorly understood languages and for innovative approaches to language modeling in a low-resource context; and the need to support and develop small languages. At the stage of describing the problem situation, the work pursues three main objectives: to justify language modeling under resource scarcity as a distinct task within natural language processing, to review the literature on the topic, and to develop the concept of a language acquisition model that works with relatively few available resources. Computer modeling techniques using neural networks, semi-supervised learning, and reinforcement learning were used. The paper reviews the literature on modeling how a child learns the vocabulary, morphology, and grammar of a native language. Based on the current understanding of language acquisition and on existing computer models of this process, the paper proposes an architecture for a small-language processing system that is trained by modeling ontogenesis. The main components of the system and the principles of their interaction are identified. The system is built around a module based on modern dialogue language models and trained on a resource-rich language (e.g., English). During training, an intermediate layer represents utterances in an abstract form, for example in the symbols of formal semantics. The relationship between the formal representation of utterances and their translation into the target low-resource language is learned by modeling a child's acquisition of the vocabulary and grammar of the language.
One of the components stands for the non-linguistic context in which language learning takes place. This article explores the problem of modeling small languages. A detailed justification of the relevance of modeling small languages is given: the social significance of the problem is noted, and the benefits for linguistics, ethnography, ethnology, and cultural anthropology are shown. Approaches applied to large languages are noted to be ineffective when resources are scarce. A model of language learning through simulation of ontogenesis is proposed, drawing both on results from computer modeling and on data from psycholinguistics.
Introducing Meta-analysis in the Evaluation of Computational Models of Infant Language Development
Computational models of child language development can help us understand the cognitive underpinnings of the language learning process, which occurs along several linguistic levels at once (e.g., prosodic and phonological). However, in light of the replication crisis, modelers face the challenge of selecting representative and consolidated infant data. Thus, it is desirable to have evaluation methodologies that could account for robust empirical reference data, across multiple infant capabilities. Moreover, there is a need for practices that can compare developmental trajectories of infants to those of models as a function of language experience and development. The present study aims to take concrete steps to address these needs by introducing the concept of comparing models with large-scale cumulative empirical data from infants, as quantified by meta-analyses conducted across a large number of individual behavioral studies. We formalize the connection between measurable model and human behavior, and then present a conceptual framework for meta-analytic evaluation of computational models. We exemplify the meta-analytic model evaluation approach with two modeling experiments on infant-directed speech preference and native/non-native vowel discrimination. Peer reviewed.
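As a rough illustration of what comparing model behavior with meta-analytic reference data could look like, the sketch below computes a standardized effect size (Cohen's d) from a model's scores in two conditions and checks whether it falls inside a reference interval. The function names, the scores, and the interval are hypothetical placeholders chosen for illustration; they are not values or procedures taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def within_interval(effect, lower, upper):
    """True if the effect size lies inside the closed interval [lower, upper]."""
    return lower <= effect <= upper


# Hypothetical model scores for two stimulus conditions (made-up numbers).
condition_a = [0.62, 0.70, 0.58, 0.66, 0.71]
condition_b = [0.55, 0.60, 0.52, 0.61, 0.57]

# Hypothetical reference interval standing in for a meta-analytic estimate.
reference_interval = (0.2, 0.5)

model_effect = cohens_d(condition_a, condition_b)
print("model effect size:", round(model_effect, 3))
print("inside reference interval:", within_interval(model_effect, *reference_interval))
```

Expressing both the model and the infant data as effect sizes puts them on a common scale, which is what allows a comparison across many behavioral studies.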
TOWARDS THE GROUNDING OF ABSTRACT CATEGORIES IN COGNITIVE ROBOTS
The grounding of language in humanoid robots is a fundamental problem, especially
in social scenarios which involve the interaction of robots with human beings. Indeed,
natural language represents the most natural interface for humans to interact
and exchange information about concrete entities such as KNIFE and HAMMER and abstract
concepts such as MAKE and USE. This research domain is very important not
only for the advances that it can produce in the design of human-robot communication
systems, but also for the implication that it can have on cognitive science.
Abstract words are used in daily conversations among people to describe events and
situations that occur in the environment. Many scholars have suggested that the
distinction between concrete and abstract words is a continuum along which
all entities vary in their level of abstractness.
The work presented herein aimed to ground abstract concepts, similarly to concrete
ones, in perception and action systems. This made it possible to investigate how different
behavioural and cognitive capabilities can be integrated in a humanoid robot in
order to bootstrap the development of higher-order skills such as the acquisition of
abstract words. To this end, three neuro-robotics models were implemented.
The first neuro-robotics experiment consisted of training a humanoid robot to perform
a set of motor primitives (e.g. PUSH, PULL, etc.) that, when hierarchically combined,
led to the acquisition of higher-order words (e.g. ACCEPT, REJECT). The
implementation of this model, based on feed-forward artificial neural networks,
permitted the assessment of the training methodology adopted for the grounding of
language in humanoid robots.
In the second experiment, the architecture used for carrying out the first study
was reimplemented employing recurrent artificial neural networks that enabled the
temporal specification of the action primitives to be executed by the robot. This
made it possible to increase the combinations of actions that can be taught to the robot
for the generation of more complex movements.
For the third experiment, a model based on recurrent neural networks that integrated
multi-modal inputs (i.e. language, vision and proprioception) was implemented for
the grounding of abstract action words (e.g. USE, MAKE). The abstract representations
of actions ("one-hot" encoding) used in the other two experiments were replaced
with the joint values recorded from the iCub robot's sensors.
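To make the contrast between the two action representations concrete, the following minimal sketch shows a one-hot encoding of a motor primitive next to a vector of normalized joint values. It is only an illustration: the primitive list (GRASP is added here as a placeholder), the joint angles, and the joint limits are hypothetical and are not the values used in the experiments or the actual limits of the iCub robot.

```python
# Hypothetical set of motor primitives; PUSH and PULL appear in the abstract,
# GRASP is a placeholder added for illustration.
ACTIONS = ["PUSH", "PULL", "GRASP"]


def one_hot(action, vocabulary):
    """Return a vector with 1.0 at the action's index and 0.0 elsewhere."""
    if action not in vocabulary:
        raise ValueError(f"unknown action: {action}")
    return [1.0 if name == action else 0.0 for name in vocabulary]


def normalize_joints(angles_deg, limits_deg):
    """Scale raw joint angles to [0, 1] using per-joint (min, max) limits."""
    return [(angle - low) / (high - low)
            for angle, (low, high) in zip(angles_deg, limits_deg)]


# Symbolic representation: one unit per action, no information about the movement.
print(one_hot("PULL", ACTIONS))

# Sensor-based representation: one value per joint (made-up angles and limits).
print(normalize_joints([0.0, 45.0], [(-90.0, 90.0), (0.0, 90.0)]))
```

The one-hot vector only identifies which action is meant, whereas the joint-value vector carries information about how the movement is actually performed.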
Experimental results showed that motor primitives have different activation patterns
according to the action sequence in which they are embedded. Furthermore, the
performed simulations suggested that the acquisition of concepts related to abstract
action words requires the reactivation of internal representations similar to those
activated during the acquisition of the basic concepts, which are directly grounded in
perceptual and sensorimotor knowledge and contained in the hierarchical structure of
the words used to ground the abstract action words.
This study was financed by the EU project RobotDoC (235065) from the Seventh
Framework Programme (FP7), Marie Curie Actions Initial Training Network.
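Thematic structuring in English and Spanish: contrastive annotation of a bilingual corpus for linguistic and computational applications (La estructuración temática en inglés y español: anotación contrastiva de un corpus bilingüe para aplicaciones lingüísticas y computacionales)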
Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Filología, Departamento de Filología Inglesa, defended on 04-12-2015. Thematization is recognized by different linguistic schools as a fundamental phenomenon in the construction of messages and texts. Thematic position within a text privileges the elements that guide the reader in the orientation and interpretation of discourse at different levels. Thematizing a linguistic unit by placing it in the first-initial position of a clause, paragraph, or text confers upon it a special status: it signals the organizational strategy that characterizes different text types, and it acts as a variable in distinguishing registers, text types and genres. However, despite the importance of thematization for message and textual structuring, to date no linguistic studies have undertaken the task of validating its aspects comparatively, for either linguistic or computational purposes. This study therefore fills a research gap by implementing a methodology based on contrastive corpus annotation, which makes it possible to empirically validate aspects of Thematization in English and Spanish. It also seeks to develop a bilingual English-Spanish comparable corpus of newspaper texts automatically annotated with thematic features at the clausal and discourse levels. The empirically validated categories (Thematic Field and its elements: Textual Theme, Interpersonal Theme, PreHead and Head) are used to annotate a larger corpus of three newspaper genres (news reports, editorials and letters to the editor) in terms of thematic choices. This characterization reveals interesting results, such as the use of genre-specific strategies in thematic position. In addition, the thesis investigates the possibility of automating the annotation of thematic features in the bilingual corpus by developing a set of JAVA rules implemented in GATE.
It also shows the efficacy of this method in comparison with the manual annotation results...
Depto. de Estudios Ingleses: Lingüística y Literatura, Fac. de Filología