28 research outputs found

    Understanding the structure and meaning of Finnish texts: From corpus creation to deep language modelling

    Get PDF
    Natural Language Processing (NLP) is a cross-disciplinary field combining elements of computer science, artificial intelligence, and linguistics, with the objective of developing means for computational analysis, understanding or generation of human language. The primary aim of this thesis is to advance natural language processing in Finnish by providing more resources and investigating the most effective machine learning based practices for their use. The thesis focuses on NLP topics related to understanding the structure and meaning of written language, mainly concentrating on structural analysis (syntactic parsing) as well as exploring the semantic equivalence of statements that vary in their surface realization (paraphrase modelling). While the new resources presented in the thesis are developed for Finnish, most of the methodological contributions are language-agnostic, and the accompanying papers demonstrate the application and evaluation of these methods across multiple languages. The first set of contributions of this thesis revolve around the development of a state-of-the-art Finnish dependency parsing pipeline. Firstly, the necessary Finnish training data was converted to the Universal Dependencies scheme, integrating Finnish into this important treebank collection and establishing the foundations for Finnish UD parsing. Secondly, a novel word lemmatization method based on deep neural networks is introduced and assessed across a diverse set of over 50 languages. And finally, the overall dependency parsing pipeline is evaluated on a large number of languages, securing top ranks in two competitive shared tasks focused on multilingual dependency parsing. The overall outcome of this line of research is a parsing pipeline reaching state-of-the-art accuracy in Finnish dependency parsing, the parsing numbers obtained with the latest pre-trained language models approaching (at least near) human-level performance. The achievement of large language models in the area of dependency parsing— as well as in many other structured prediction tasks— brings up the hope of the large pre-trained language models genuinely comprehending language, rather than merely relying on simple surface cues. However, datasets designed to measure semantic comprehension in Finnish have been non-existent, or very scarce at the best. To address this limitation, and to reflect the general change of emphasis in the field towards task more semantic in nature, the second part of the thesis shifts its focus to language understanding through an exploration of paraphrase modelling. The second contribution of the thesis is the creation of a novel, large-scale, manually annotated corpus of Finnish paraphrases. A unique aspect of this corpus is that its examples have been manually extracted from two related text documents, with the objective of obtaining non-trivial paraphrase pairs valuable for training and evaluating various language understanding models on paraphrasing. We show that manual paraphrase extraction can yield a corpus featuring pairs that are both notably longer and less lexically overlapping than those produced through automated candidate selection, the current prevailing practice in paraphrase corpus construction. Another distinctive feature in the corpus is that the paraphrases are identified and distributed within their document context, allowing for richer modelling and novel tasks to be defined

    Handbook of Lexical Functional Grammar

    Get PDF
    Lexical Functional Grammar (LFG) is a nontransformational theory of linguistic structure, first developed in the 1970s by Joan Bresnan and Ronald M. Kaplan, which assumes that language is best described and modeled by parallel structures representing different facets of linguistic organization and information, related by means of functional correspondences. This volume has five parts. Part I, Overview and Introduction, provides an introduction to core syntactic concepts and representations. Part II, Grammatical Phenomena, reviews LFG work on a range of grammatical phenomena or constructions. Part III, Grammatical modules and interfaces, provides an overview of LFG work on semantics, argument structure, prosody, information structure, and morphology. Part IV, Linguistic disciplines, reviews LFG work in the disciplines of historical linguistics, learnability, psycholinguistics, and second language learning. Part V, Formal and computational issues and applications, provides an overview of computational and formal properties of the theory, implementations, and computational work on parsing, translation, grammar induction, and treebanks. Part VI, Language families and regions, reviews LFG work on languages spoken in particular geographical areas or in particular language families. The final section, Comparing LFG with other linguistic theories, discusses LFG work in relation to other theoretical approaches

    Morphological Doublets in Croatian: A multi-methodological analysis

    Get PDF
    The term morphological doubletism refers to a situation in language when there are two (or more) morphemes available for a single cell in an inflectional paradigm of a lexeme. Slavonic languages, with their rich inflectional systems, show particularly high levels of doubletism. In the present dissertation we analyse examples of doubletism in Croatian nominal paradigms. As shown by the dissertation’s subtitle, “a multi-methodological analysis”, we compare and contrast evidence obtained by various methods. First we conduct a corpus study to determine the frequency distributions of the doublet pairs in present-day Croatian. This analysis has shown that the distribution of the doublet pairs is not determined by any intra- or extra-linguistic factor, but that it is not completely random either. These distributions are later used in several additional studies, the purpose of which is to answer the question of how such forms are processed in speakers’ mental grammars. One of the analyses is a computational one, in which we try to reproduce a grammar of a Croatian speaker by using two memory based models (AM and TiMBL). The models were highly successful in producing the desired output without resorting to any rules or generalizations. We also report the results of three questionnaire studies, all of which show that native speakers are extremely sensitive to the language input they receive, in line with usage-based theories of language, as well as that mental grammars are gradient. The speakers’ ratings and production rates closely matched the proportions of the doublet pairs in the corpus. Furthermore, speakers distinguish between several levels of domination of one ending over another. When the domination of one form is weak, speakers resort to a different decision criterion, namely they look at the dominant ending of phonologically similar words

    Morphology and linguistic typology : on-line-proceedings of the Fourth Mediterranean Morphology Meeting (MMM4)21-23 September 2003

    No full text

    Optional categories in early French syntax : a developmental study of root infinitives and null arguments

    Get PDF
    LoC Class: PC2395, LoC Subject Headings: French language--Synta

    Recent Trends in Computational Intelligence

    Get PDF
    Traditional models struggle to cope with complexity, noise, and the existence of a changing environment, while Computational Intelligence (CI) offers solutions to complicated problems as well as reverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically-inspired technologies such as the intellect of swarm as part of evolutionary computation and encompassing wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the usage of CI for optimal solving of various applications proving its wide reach and relevance. Bounding of optimization methods and data mining strategies make a strong and reliable prediction tool for handling real-life applications

    Head-Driven Phrase Structure Grammar

    Get PDF
    Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism)

    Head-Driven Phrase Structure Grammar

    Get PDF
    Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism)

    Proceedings of the Conference on Natural Language Processing 2010

    Get PDF
    This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natĂĽrlicher Sprache), with a focus on semantic processing. The KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention towards addressing linguistic aspects ofmeaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledgebased and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions put their focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, or focus on semantic knowledge acquisition and exploitation with respect to collaboratively built ressources, or harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field
    corecore