    Extended finite state models of language

    not be included here because of space constraints or because the authors felt that their subsequent work had taken a direction such that they no longer consider the workshop paper fully representative of their current thinking. In particular, we call attention to the tutorial paper by Jelinek (excerpted from his forthcoming book (Jelinek 1977)), the paper by Mohri, Pereira, and Riley describing the AT&T/Bell Labs approach to language modeling using weighted transducers, and the paper by Oehrle on binding and anaphora. Even without these papers, the sheer size of the proceedings made it impossible to include the same material in this issue of JNLE, and the participants were asked to prepare shorter versions (in some cases, extended abstracts) for inclusion here. Full versions of these papers, taking into account the comments received at the workshop, will be published later this year by Cambridge University Press. In addition, a formal Call For Papers yielded several new papers for this issue

    On the formal models of natural languages (A természetes nyelvek formális modelljeiről)

    On Folding and Twisting (and whatknot): towards a characterization of workspaces in syntax

    Syntactic theory has traditionally adopted a constructivist approach, in which a set of atomic elements is manipulated by combinatory operations to yield derived, complex elements. Syntactic structure is thus seen as the result of discrete recursive combinatorics over lexical items, which are assembled into phrases, which are themselves combined to form sentences. This view is common to European and American structuralism (e.g., Benveniste, 1971; Hockett, 1958) and to different incarnations of generative grammar, transformational and non-transformational (Chomsky, 1956, 1995; Kaplan & Bresnan, 1982; Gazdar, 1982). Since at least Uriagereka (2002), some attention has been paid to the fact that syntactic operations must apply somewhere, particularly when copying and movement operations are considered. Contemporary syntactic theory has thus acknowledged, to some extent, the importance of formalizing aspects of the spaces in which elements are manipulated, but this is still a vastly underexplored area. In this paper we explore the consequences of conceptualizing syntax as a set of topological operations applying over spaces rather than over discrete elements. We argue that such a view has empirical advantages for the treatment of long-distance dependencies and cross-derivational dependencies: constraints on possible configurations emerge from the dynamics of the system. Comment: Manuscript. Do not cite without permission. Comments welcome.

    1st Hungarian Conference on Computational Linguistics (I. Magyar Számítógépes Nyelvészeti Konferencia)

    Composite pseudogrammars based on parallel language models of Serbian

    The aim of this paper is to present the advantages of using composite intelligent systems based on parallel architectures and, above all, the advantages of composite pseudogrammars based on parallel language models in the processing, generation, and evaluation of natural language, especially Serbian. First, a brief introduction to the theory of formal languages is given, distinct types of grammars are described, and an overview of work on creating their approximations is presented. The concepts of pseudogrammars and language models are described together with their historical development, with emphasis on the current state of the art and the most recent language modelling methods and language models. The issue of evaluating text quality is introduced, and various methods of semi-automatic and automatic evaluation are described. The second part of the paper describes two experiments that aimed to establish a methodology for creating composite systems for modelling the Serbian language, including ways of creating different document representations and different ways of combining the outputs of standalone natural language processing systems. In these experiments, parallel systems were successfully tested on the tasks of part-of-speech tagging and authorship attribution through mini-language modelling, where they achieved significantly better results than standalone methods. Finally, the process of training a series of generative pretrained transformers on different representations of a corpus of Serbian, and of creating composite pseudogrammars based on those models and different combining methods, is described. The developed systems were evaluated on the tasks of text quality assessment and of finding and correcting errors.
The presented results singled out the stacked trained classifier as the optimal method of combining language models into a single pseudogrammar.

    On the metatheory of linguistics

    Mathematical linguistics

