
    Workshop Proceedings of the 12th edition of the KONVENS conference

    The 2014 edition of KONVENS is, more than ever, a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies that such interaction, cooperation and integrated views can produce. This topic, at the crossroads of research traditions that deal with natural language as a container of knowledge and with methods to extract and manage linguistically represented knowledge, is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute's research topics, and it has received even more attention over the last few years.

    Advances in automatic terminology processing: methodology and applications in focus

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. The information and knowledge era in which we are living creates challenges in many fields, and terminology is no exception. The challenges include an exponential growth in the number of specialised documents in which terms are presented, and in the number of newly introduced concepts and terms, which are already beyond our (manual) capacity. A promising solution to this ‘information overload’ would be to employ automatic or semi-automatic procedures that enable individuals and/or small groups to efficiently build high-quality terminologies from their own resources, terminologies which closely reflect their individual objectives and viewpoints. Automatic terminology processing (ATP) techniques have already proved to be quite reliable and can save human time in terminology processing. However, they are not without weaknesses, one of which is that these techniques often treat terms as independent lexical units satisfying some criteria, when terms are, in fact, integral parts of a coherent system (a terminology). This observation is supported by the discussion of the notion of terms and terminology and the review of existing approaches in ATP presented in this thesis. In order to overcome the aforementioned weakness, we propose a novel ATP methodology which is able to extract a terminology as a whole. The proposed methodology is based on knowledge patterns automatically extracted from glossaries, which we consider to be valuable but overlooked resources. These automatically identified knowledge patterns are used to extract terms, their relations and their descriptions from corpora. The extracted information can facilitate the construction of a terminology as a coherent system. The study also discusses applications of ATP, and describes an experiment in which ATP is integrated into a new NLP application: multiple-choice test item generation. The successful integration of the system shows that ATP is a viable technology and should be exploited more by other NLP applications.
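
    To make the pattern-based methodology concrete, here is a minimal Python sketch of knowledge-pattern matching under stated assumptions: the two hand-written patterns ("X is a Y that Z", "X, also known as Y") and the sample sentences are illustrative stand-ins, not the thesis's actual glossary-derived patterns or corpora.

    import re

    # Hand-written stand-ins for knowledge patterns that the methodology would
    # learn automatically from glossary entries.
    PATTERNS = [
        re.compile(r"(?P<term>[A-Z][\w-]*(?: [\w-]+){0,3}) is an? "
                   r"(?P<genus>[\w-]+(?: [\w-]+){0,3}) that (?P<description>[^.]+)\."),
        re.compile(r"(?P<term>[\w-]+(?: [\w-]+){0,3}), also known as "
                   r"(?P<synonym>[\w-]+(?: [\w-]+){0,3})"),
    ]

    def extract_entries(sentences):
        """Collect candidate terms with their relations and descriptions."""
        entries = []
        for sentence in sentences:
            for pattern in PATTERNS:
                entries.extend(m.groupdict() for m in pattern.finditer(sentence))
        return entries

    sample = [
        "A lexeme is a unit of lexical meaning that underlies a set of word forms.",
        "A multiword expression, also known as an MWE, can span several tokens.",
    ]
    print(extract_entries(sample))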

    Each book its own Babel: Conceptual unity and disunity in early modern natural philosophy

    Natural philosophy changed quickly during the early modern period (1600-1800). Aristotelian philosophy was combated by Cartesian mechanicism, which was itself soon ousted by the Newtonian school. The development of new ideas within a scientific discipline is partly a matter of doing empirical research, in order to rule out positions and move the field forward. However, it is also a matter of developing new concepts and a fitting language in which to express all the new positions under investigation. This second development, however, also implies that the differences between thinkers might grow too large - the languages in which they express their philosophy can become too different for them to have a meaningful discussion. In this dissertation I investigate a few hundred texts from these three schools, using algorithms that extract the meaning of words from texts. I do this in order to see how the schools differ from each other conceptually, how the meaning of words can travel through lines of influence from author to author, and how guarding the boundaries of a school relates to guarding the language its members use.
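
    A toy Python sketch of the kind of distributional comparison involved, under stated assumptions: each school is represented by an invented mini-corpus, a word's meaning by the set of words it co-occurs with, and conceptual divergence by the (lack of) overlap between those sets. Real models of word meaning would be far richer.

    def context_set(tokens, target, window=2):
        """Words appearing within `window` tokens of `target`."""
        ctx = set()
        for i, tok in enumerate(tokens):
            if tok == target:
                ctx.update(tokens[max(0, i - window):i + window + 1])
        ctx.discard(target)
        return ctx

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    # Invented mini-corpora standing in for Cartesian and Newtonian texts.
    cartesian = "all motion of matter arises from contact between matter and matter".split()
    newtonian = "matter attracts all matter by a force acting at a distance".split()
    # Low overlap suggests the schools use 'matter' in diverging conceptual contexts.
    print(jaccard(context_set(cartesian, "matter"), context_set(newtonian, "matter")))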

    The automatic processing of multiword expressions in Irish

    It is well documented that Multiword Expressions (MWEs) pose a unique challenge to a variety of NLP tasks such as machine translation, parsing, information retrieval, and more. For low-resource languages such as Irish, these challenges can be exacerbated by the scarcity of data and a lack of research on this topic. In order to improve the handling of MWEs in various NLP tasks for Irish, this thesis addresses both the lack of resources specifically targeting MWEs in Irish, and examines how these resources can be applied to said NLP tasks. We report on the creation and analysis of a number of lexical resources as part of this PhD research. Ilfhocail, a lexicon of Irish MWEs, is created through extracting MWEs from other lexical resources such as dictionaries. A corpus annotated with verbal MWEs in Irish is created for the inclusion of Irish in the PARSEME Shared Task 1.2. Additionally, MWEs were tagged in a bilingual EN-GA corpus for inclusion in experiments in machine translation. For the purposes of annotation, a categorisation scheme for nine categories of MWEs in Irish is created, based on combining linguistic analysis of these types of constructions and cross-lingual frameworks for defining MWEs. A case study in applying MWEs to NLP tasks is undertaken, with the exploration of incorporating MWE information while training Neural Machine Translation systems. Finally, the topic of automatic identification of Irish MWEs is explored, documenting the training of a system capable of automatically identifying Irish MWEs from a variety of categories, and the challenges associated with developing such a system. This research contributes towards a greater understanding of Irish MWEs and their applications in NLP, and provides a foundation for future work in exploring other methods for the automatic discovery and identification of Irish MWEs, and in further developing the MWE resources described above.
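
    As a rough illustration of lexicon-based MWE identification, the following Python sketch performs greedy longest-match lookup of token spans against a tiny hand-made lexicon; the entries below are illustrative stand-ins, not the Ilfhocail lexicon or the thesis's trained identifier.

    # Toy lexicon: MWE token sequence -> category.
    MWE_LEXICON = {
        ("os", "comhair"): "compound preposition",  # 'in front of'
        ("i", "ndiaidh"): "compound preposition",   # 'after'
        ("cur", "síos"): "verbal MWE",              # 'description'
    }
    MAX_LEN = max(len(key) for key in MWE_LEXICON)

    def identify_mwes(tokens):
        """Return (start, end, category) for each longest-match MWE span."""
        spans, i = [], 0
        while i < len(tokens):
            for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
                key = tuple(t.lower() for t in tokens[i:i + n])
                if key in MWE_LEXICON:
                    spans.append((i, i + n, MWE_LEXICON[key]))
                    i += n - 1
                    break
            i += 1
        return spans

    print(identify_mwes("Tá an teach os comhair na scoile".split()))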

    A distributional investigation of German verbs

    This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of an event referenced by the verb). I then produce statistical descriptions of verbs according to these two distinct facets of meaning: in particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.
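
    A toy Python sketch of the evaluation setup, under stated assumptions: each verb is represented by the relative frequencies of its subcategorisation frames, and similarity between these profiles stands in for class membership. The frame counts are invented for illustration; the thesis derives such descriptions from a large German corpus.

    import math

    # Invented frame counts: ditransitive 'geben'/'schenken' vs. clause-taking 'glauben'.
    FRAME_COUNTS = {
        "geben":    {"subj+acc+dat": 70, "subj+acc": 25, "subj": 5},
        "schenken": {"subj+acc+dat": 60, "subj+acc": 35, "subj": 5},
        "glauben":  {"subj+clause": 65, "subj+dat": 25, "subj": 10},
    }

    def normalise(counts):
        total = sum(counts.values())
        return {frame: c / total for frame, c in counts.items()}

    def cosine(u, v):
        dot = sum(u[f] * v[f] for f in set(u) & set(v))
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    vectors = {verb: normalise(c) for verb, c in FRAME_COUNTS.items()}
    # The two transfer verbs should be closer to each other than to the belief verb.
    print(cosine(vectors["geben"], vectors["schenken"]))  # high
    print(cosine(vectors["geben"], vectors["glauben"]))   # low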

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)

    Translationese indicators for human translation quality estimation (based on English-to-Russian translation of mass-media texts)

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. Human translation quality estimation is a relatively new and challenging area of research, because human translation quality is notoriously more subtle and subjective than machine translation quality, which attracts far more attention and effort from the research community. At the same time, human translation is routinely assessed by education and certification institutions, as well as at translation competitions. Do the quality labels and scores generated from real-life quality judgments align well with objective properties of translations? This thesis puts this question to a test using machine learning methods. Conceptually, this research is built around the hypothesis that linguistic properties characteristic of translations, as a specific form of communication, can correlate with translation quality. This assumption is often made in translation studies but has never been put to a rigorous empirical test. Exploring translationese features in a quality estimation task can help identify quality-related trends in translational behaviour and provide data-driven insights into professionalism to improve training. Using translationese for quality estimation fits well with the concept of quality in translation studies, because it is essentially a document-level property. Linguistically motivated translationese features are also more interpretable than popular distributed representations and can explain linguistic differences between quality categories in human translation. We investigated (i) an extended set of Universal Dependencies-based morphosyntactic features, as well as two lexical feature sets capturing (ii) collocational properties of translations, and (iii) ratios of vocabulary items in various frequency bands along with entropy scores from n-gram models. To compare the performance of our feature sets in translationese classification and quality estimation tasks against other representations, the experiments were also run on tf-idf features, QuEst++ features and contextualised embeddings from a range of pre-trained language models, including the state-of-the-art multilingual solution for machine translation quality estimation. Our major focus was on document-level prediction; however, where the labels and features allowed, the experiments were extended to the sentence level. The corpus used in this research includes English-to-Russian parallel subcorpora of student and professional translations of mass-media texts, and a register-comparable corpus of non-translations in the target language. Quality labels for various subsets of student translations come from a number of real-life settings: translation competitions, graded student translations, error annotations and direct assessment. We review approaches to benchmarking quality in translation and provide a detailed description of our own annotation experiments. Of the three proposed translationese feature sets, the morphosyntactic features returned the best results on all tasks. In many settings they were second only to contextualised embeddings. At the same time, performance on the various representations was contingent on the type of quality captured by the quality labels/scores. Using the outcomes of machine learning experiments and feature analysis, we established that translationese properties of translations were not equally reflected by the various labels and scores. For example, professionalism was much less related to translationese than expected. Labels from document-level holistic assessment demonstrated maximum support for our hypothesis: lower-ranking translations clearly exhibited more translationese. They bore more traces of mechanical translational behaviour associated with following source language patterns whenever possible, which led to inflated frequencies of analytical passives, modal predicates and verbal forms, especially copula verbs and verbs in the finite form. As expected, lower-ranking translations were more repetitive and had longer, more complex sentences. Higher-ranking translations were indicative of greater skill in recognising and counteracting translationese tendencies. For document-level holistic labels as an approach to capturing quality, translationese indicators might provide a valuable contribution to an effective quality estimation pipeline. However, error-based scores, and especially scores from sentence-level direct assessment, proved to be much less correlated with translationese and fluency issues in general. This was confirmed by the relatively low regression results across all representations that had access only to the target-language side of the dataset, by feature analysis, and by the correlation between error-based scores and scores from direct assessment.
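
    To make the idea of interpretable, document-level translationese indicators concrete, here is a minimal Python sketch computing three of the properties mentioned above (frequency of analytical passives, sentence length, repetitiveness). The regular-expression heuristics are illustrative assumptions and far cruder than the thesis's Universal Dependencies-based features.

    import re

    def translationese_indicators(document):
        sentences = [s for s in re.split(r"[.!?]+\s*", document) if s]
        tokens = document.lower().split()
        # Crude passive detector: a form of 'to be' followed by an -ed form.
        passives = len(re.findall(r"\b(?:is|are|was|were|be|been|being)\s+\w+ed\b",
                                  document.lower()))
        return {
            "avg_sentence_len": len(tokens) / len(sentences),
            "passives_per_100_tokens": 100 * passives / len(tokens),
            "type_token_ratio": len(set(tokens)) / len(tokens),  # lower = more repetitive
        }

    doc = ("The decision was made by the committee. It was approved by the board. "
           "The plan was presented by the director.")
    print(translationese_indicators(doc))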

    A journey through learner language: tracking development using POS tag sequences in large-scale learner data

    This PhD study comes at a crossroads of SLA studies and corpus linguistics methodology, using a bottom-up, data-first approach to throw light on second language development. Taking POS tag n-gram sequences as a starting point, and searching the data from the outermost syntactic layer available in corpus tools, it is an investigation of grammatical development in learner language across the six proficiency levels in the 52-million-word, CEFR-benchmarked, quasi-longitudinal Cambridge Learner Corpus. It takes a mixed-methods approach: first examining the frequency and distribution of POS tag sequences by level, identifying convergence and divergence, and secondly looking qualitatively at form-meaning mappings of sequences at differing levels. It seeks to observe whether there are sequences which characterise levels and which might index the transition between levels. It investigates sequence use at a lexical and functional level and explores whether this can contribute to our understanding of how a generic repertoire of learner language develops. It aims to contribute to the theoretical debate by looking critically at how current theories of language development and description might account for learner language development. It responds to the call to look at large-scale learner data, and benefits from privileged access to such longitudinal data, while acknowledging the limitations of any corpus data and the need to triangulate across different datasets. It seeks to illustrate how L2 language use converges and diverges across proficiency levels and to investigate convergence and divergence between L1 and L2 usage.
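
    A small Python sketch of the bottom-up starting point described above, under stated assumptions: the POS-tagged sequences below are invented stand-ins for CEFR-levelled learner texts, whereas the study itself draws on the Cambridge Learner Corpus.

    from collections import Counter

    def pos_ngrams(tags, n=3):
        return zip(*(tags[i:] for i in range(n)))

    def profile(tagged_texts, n=3):
        """Relative frequency of each POS n-gram across one level's texts."""
        counts = Counter(g for tags in tagged_texts for g in pos_ngrams(tags, n))
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()}

    # Invented stand-ins for POS-tagged learner texts at two proficiency levels.
    a2_texts = [["PRON", "VERB", "NOUN", "CCONJ", "PRON", "VERB", "NOUN"]]
    c1_texts = [["PRON", "VERB", "SCONJ", "PRON", "VERB", "ADJ", "NOUN"]]

    a2_profile, c1_profile = profile(a2_texts), profile(c1_texts)
    # Sequences present at C1 but absent at A2 are candidate markers of the
    # transition between levels.
    print(set(c1_profile) - set(a2_profile))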