Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Lenguas Modernas, Lógica y Fª de la Ciencia y Tª de la Literatura y Literatura Comparada. Date of defense: 10-12-2021.

Spanish se constructions constitute a linguistic phenomenon that challenges
Natural Language Processing (NLP) tasks such as part-of-speech or dependency
relation tagging. There are three main reasons why se is a challenging topic for NLP:
first, the high frequency of se in Spanish; second, the nine
different syntactic constructions in which se appears, contributing information of diverse
nature depending on the context; third, the fact that se lacks gender and number features,
which gives no help in disambiguating se types. This thesis' main goal is
to improve the state-of-the-art results on automatic morphosyntactic se analysis
on the basis of two hypotheses: the grouping (GH) and the subcategorization
frame (SFH) hypotheses. This thesis proposes a new annotation scheme for se
that connects the different constructions through a transitivity gradient (Moreno
Cabrera, 2004). The new annotation scheme is applied to the SE-corpus, a
European Spanish corpus of 3,100 sentences containing the word se. The
SE-corpus belongs to the news, leisure and daily life domain of CORPES XXI
(Real Academia Española, 2018) and has been manually annotated as part
of this research work. The SE-corpus is used to train different models using
UDPipe 1.2 to test whether the new annotation scheme can be learnt by the
neural networks that underlie the dependency parser. The resulting models are
evaluated on an additional gold-standard test corpus of 100 sentences
containing the form se, also drawn from CORPES XXI.
The best model yields a LAS F-score of 86.97 points and a UAS F-score of
89.65 points. Regarding se analysis, the best model yields a LAS F-score of
82.55 points and a UAS F-score of 98.16 points. The main contributions of this
thesis are: a new annotation scheme for se adapted to Universal Dependencies'
guidelines, manual annotation guidelines for Spanish se disambiguation, the raw
and annotated versions of the SE-corpus, and the best resulting model.
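
As an illustration of the LAS and UAS figures reported above, the following minimal Python sketch shows how attachment scores can be computed over all tokens and over se tokens only. It is not the evaluation code used in the thesis. It assumes that gold and predicted analyses share the same tokenization, in which case precision, recall and F-score coincide. The function name `attachment_scores`, the example sentence and the dependency labels are hypothetical placeholders and do not reproduce the annotation scheme proposed in the thesis.

```python
from typing import List, Optional, Tuple

# A token is (form, head index, dependency label); head 0 marks the root.
Token = Tuple[str, int, str]


def attachment_scores(
    gold: List[Token],
    pred: List[Token],
    only_form: Optional[str] = None,
) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    If only_form is given, only gold tokens with that (lowercased)
    form are scored, e.g. only_form="se".
    """
    if len(gold) != len(pred):
        # Assumption of this sketch: identical tokenization on both sides.
        raise ValueError("gold and predicted token counts differ")

    pairs = [
        (g, p)
        for g, p in zip(gold, pred)
        if only_form is None or g[0].lower() == only_form
    ]
    if not pairs:
        return 0.0, 0.0

    uas_hits = sum(1 for g, p in pairs if g[1] == p[1])
    las_hits = sum(1 for g, p in pairs if g[1] == p[1] and g[2] == p[2])
    total = len(pairs)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Hypothetical example: "Se venden casas", with placeholder labels.
gold_sentence = [("Se", 2, "expl:pass"), ("venden", 0, "root"), ("casas", 2, "nsubj")]
pred_sentence = [("Se", 2, "expl:impers"), ("venden", 0, "root"), ("casas", 2, "obj")]

overall_uas, overall_las = attachment_scores(gold_sentence, pred_sentence)
se_uas, se_las = attachment_scores(gold_sentence, pred_sentence, only_form="se")
```

In this toy example every head is predicted correctly, so UAS is unaffected, while the wrong label on se lowers LAS. A gap of this kind between UAS and LAS on se tokens points to errors in the label rather than in the attachment.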