
    Building and Evaluating Open-Vocabulary Language Models

    Language models have long been a fundamental NLP tool and application. This thesis focuses on open-vocabulary language models, i.e., models that can deal with novel and unknown words at runtime. We propose both new ways to construct such models and uses of such models in cross-linguistic evaluations to answer questions of difficulty and language-specificity in modern NLP tools. We start by surveying linguistic background as well as past and present NLP approaches to tokenization and open-vocabulary language modeling (Mielke et al., 2021). Thus equipped, we establish desirable principles for such models, from both an engineering and a linguistic perspective, and hypothesize a model based on the marriage of neural language modeling and Bayesian nonparametrics to handle a truly infinite vocabulary, one boasting attractive theoretical properties and mathematical soundness but presenting practical implementation difficulties. As a compromise, we thus introduce a word-based two-level language model that retains many desirable characteristics while being highly feasible to run (Mielke and Eisner, 2019). Unlike the more dominant approaches that use characters or subword units as a single tokenization layer, it operates on words; its key feature is the ability to generate novel words both in context and in isolation. Moving on to evaluation, we ask: how do such models deal with the wide variety of languages of the world, and do they struggle with some of them? Relating this question to a more linguistic one, are some languages inherently more difficult to deal with? Using simple methods, we show that some indeed are, starting with a small pilot study that suggests typological predictors of difficulty (Cotterell et al., 2018). Thus encouraged, we design a far bigger study with more powerful methodology: a principled and highly feasible evaluation and comparison scheme based again on multi-text likelihood (Mielke et al., 2019). This larger study shows that the earlier conclusion of typological predictors is difficult to substantiate, but it also offers a new insight on the complexity of Translationese. Following that theme, we end by extending this scheme to machine translation models to answer questions that traditional evaluation metrics like BLEU cannot (Bugliarello et al., 2020).
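    A minimal sketch may help picture the two-level design: a word-level model emits tokens from a closed vocabulary plus a special UNK symbol, and whenever UNK is drawn, a character-level "speller" generates a novel word letter by letter. The Python below is a hypothetical toy in that spirit, not the model of Mielke and Eisner (2019); the vocabulary, probabilities, and function names are invented for exposition.

```python
# Toy sketch of a two-level open-vocabulary generator: a word-level
# model backs off to a character-level speller whenever it draws <unk>.
# All distributions here are hypothetical placeholders.
import random
import string

VOCAB = ["the", "cat", "sat", "<unk>", "</s>"]
WORD_PROBS = [0.35, 0.25, 0.2, 0.15, 0.05]  # toy unigram word model

def spell_novel_word(max_len=8):
    """Character level: uniform letters, geometric stopping rule."""
    chars = []
    while len(chars) < max_len:
        chars.append(random.choice(string.ascii_lowercase))
        if random.random() < 0.3:  # chance of ending the word here
            break
    return "".join(chars)

def generate_sentence():
    words = []
    while True:
        w = random.choices(VOCAB, weights=WORD_PROBS)[0]
        if w == "</s>":            # end-of-sentence symbol
            break
        if w == "<unk>":           # two-level step: defer to the speller
            w = spell_novel_word()
        words.append(w)
    return " ".join(words)

if __name__ == "__main__":
    random.seed(0)
    print(generate_sentence())
```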

    Linguistic Competence and New Empiricism in Philosophy and Science

    The topic of this dissertation is the nature of linguistic competence, the capacity to understand and produce sentences of natural language. I defend the empiricist account of linguistic competence embedded in connectionist cognitive science. This strand of cognitive science has been opposed to the traditional symbolic cognitive science, coupled with transformational-generative grammar, which was committed to nativism due to the view that human cognition, including the language capacity, should be construed in terms of symbolic representations and hardwired rules. Similarly, linguistic competence in this framework was regarded as innate, rule-governed, domain-specific, and fundamentally different from performance, i.e., the idiosyncrasies and factors governing linguistic behavior. I analyze state-of-the-art connectionist, deep learning models of natural language processing, most notably large language models, to see what they can tell us about linguistic competence. Deep learning is a statistical technique for the classification of patterns through which artificial intelligence researchers train artificial neural networks containing multiple layers on vast amounts of textual and/or visual data. I argue that these models suggest that linguistic competence should be construed as stochastic, pattern-based, and stemming from domain-general mechanisms. Moreover, I distinguish syntactic from semantic competence, and for each I show the ramifications of endorsing a connectionist research program as opposed to the traditional symbolic cognitive science and transformational-generative grammar. I provide a unifying front, consisting of usage-based theories, a construction grammar approach, and an embodied approach to cognition, to show that the more multimodal and diverse models are in terms of architectural features and training data, the stronger the case is for connectionist linguistic competence. I also propose to discard the competence vs. performance distinction as theoretically inferior, clearing the way for the novel, integrative account of linguistic competence, rooted in connectionism and empiricism, that I propose and defend in the dissertation.
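    The abstract's working definition of deep learning, multi-layer networks trained to classify patterns in large amounts of data, can be made concrete with a toy sketch. The following PyTorch snippet is purely illustrative and stands in for no specific model discussed in the dissertation; the synthetic data, layer sizes, and training loop are all invented.

```python
# Toy illustration of "deep learning as pattern classification":
# a small multi-layer network trained on synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 8)            # 256 synthetic 8-dimensional patterns
y = (X.sum(dim=1) > 0).long()      # binary labels from a simple rule

model = nn.Sequential(             # multiple layers, hence "deep"
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    loss = loss_fn(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
acc = (model(X).argmax(dim=1) == y).float().mean()
print(f"training accuracy: {acc:.2f}")
```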

    The evolution of language: Proceedings of the Joint Conference on Language Evolution (JCoLE)

    K + K = 120 : Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Approches Neuronales pour la Reconstruction de Mots Historiques (Neural Approaches for Historical Word Reconstruction)

    In historical linguistics, cognates are words that descend in a direct line from a common ancestor, called their proto-form, and are therefore representative of their respective languages' evolution through time, as well as of the relations between these languages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to better determine all manner of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences). Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neural networks, being especially good at learning latent patterns, could well learn to model. In this dissertation, we methodically study the applicability of machine-translation-inspired neural networks to historical word prediction, relying on the surface similarity of the two tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules of Romance languages, which allows us to vary task complexity and data size in a controlled environment and thereby identify whether, and under which conditions, neural networks are applicable. We then extend our work to real datasets (after updating an etymological database to gather a sufficient amount of data), study the transferability of our conclusions to real data, and then investigate the applicability of a number of data augmentation techniques to the task, in order to mitigate low-resource situations. We finally investigate our best models, multilingual neural networks, in more detail. We first confirm that, on the surface, they seem to capture language-relatedness information and phonetic similarity, confirming prior work. We then discover, by probing them, that the information they store is actually more complex: our multilingual models in fact encode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the (unseen) proto-form of the studied languages as well as or better than bilingual models trained specifically on the task. This latent information is likely the explanation for the success of multilingual methods in previous works.
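    Framing historical word prediction as machine-translation-style sequence transduction can be sketched concretely: a character-level encoder-decoder reads a cognate in one language and emits a predicted (proto-)form in another. The PyTorch sketch below is a minimal, hypothetical illustration, not the thesis's actual architecture; the toy cognate pair, hyperparameters, and names are invented.

```python
# Character-level encoder-decoder for cognate -> (proto-)form prediction,
# framed like machine translation. Illustrative toy only.
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2
chars = "abcdefghijklmnopqrstuvwxyz"
stoi = {c: i + 3 for i, c in enumerate(chars)}

def encode(word):
    return torch.tensor([SOS] + [stoi[c] for c in word] + [EOS])

class Seq2Seq(nn.Module):
    def __init__(self, vocab=len(chars) + 3, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=PAD)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        _, h = self.enc(self.emb(src))     # encode the source cognate
        y, _ = self.dec(self.emb(tgt), h)  # teacher-forced decoding
        return self.out(y)

# Toy pair: Italian "notte" descends from Latin "noctem" (accusative).
src, tgt = encode("notte").unsqueeze(0), encode("noctem").unsqueeze(0)
model = Seq2Seq()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
for step in range(50):  # overfit one pair just to show the training loop
    logits = model(src, tgt[:, :-1])
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.3f}")
```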