Search CORE

403 research outputs found

Recommended from our members

Representation Learning beyond Semantic Similarity: Character-aware and Function-specific Approaches

Author: Gerz Daniela Susanne
Publication venue: University of Cambridge
Publication date: 28/04/2020
Field of study

Representation learning is a research area within machine learning and natural language processing (NLP) concerned with building machine-understandable representations of discrete units of text. Continuous representations are at the core of modern machine learning applications, and representation learning has thereby become one of the central research areas in NLP. The induction of text representations is typically based on the distributional hypothesis, and consequently encodes general information about word similarity. Words or phrases with similar meaning obtain similar representations in a vector space constructed for this purpose. This established methodology excels for morphologically-simple languages such as English, and in data-rich settings. However, several useful lexical relations such as entailment or selectional preference, are not captured or get conflated with other relations. Another challenge is dealing with low-data regimes for morphologically-complex and under-resourced languages. In this thesis we construct novel representation learning methods that go beyond the limitations of the distributional hypothesis and investigate solutions that induce vector spaces with diverse properties. In particular, we look at how the vector space induction process influences the contained information, and how the information manifests in a number of core NLP tasks: semantic similarity, lexical entailment, selectional preference, and language modeling. We contribute novel evaluations of state-of-the-art models highlighting their current capabilities and limitations. An analysis of language modeling in 50 typologically-diverse languages demonstrates that representations can indeed pose a performance bottleneck. We introduce a novel approach to leveraging subword-level information in word representations: our solution lifts this bottleneck in low-resource scenarios. Finally, we introduce a novel paradigm of function-specific representation learning that aims to integrate fine-grained semantic relations and real-world knowledge into the word vector spaces. We hope this thesis can serve as a valuable overview on word representations, and inspire future work in modeling \textit{semantic similarity and beyond}.ERC Consolidator Grant LEXICAL (648909

Apollo (Cambridge)

Mapping the evolving landscape of child-computer interaction research: structures and processes of knowledge (re)production

Author: Mc Dermott Tiarnach
Publication venue
Publication date: 27/07/2023
Field of study

Implementing an iterative sequential mixed methods design (Quantitative → Qualitative → Quantitative) framed within a sociology of knowledge approach to discourse, this study offers an account of the structure of the field of Child-Computer Interaction (CCI), its development over time, and the practices through which researchers have (re)structured knowledge comprising the field. Thematic structure of knowledge within the field, and its evolution over time, is quantified through implementation of a Correlated Topic Model (CTM), an automated inductive content analysis method, in analysing 4,771 CCI research papers published between 2003 and 2021. Detailed understanding of practices through which researchers (re)structure knowledge within the field, including factors influencing these practices, is obtained through thematic analysis of online workshops involving prominent contributors to the field (n=7). Strategic practices utilised by researchers in negotiating tensions impeding integration of novel concepts in the field are investigated through analysis of semantic features of retrieved papers using linear and negative binomial regression models. Contributing an extensive mapping, results portray the field of CCI as a varied research landscape, comprising 48 major themes of study, which has evolved dynamically over time. Research priorities throughout the field have been subject to influence from a range of endogenous and exogenous factors which researchers actively negotiate through research and publication practices. Tacitly structuring research practices, these factors have broadly sustained a technology-driven, novelty-dominated paradigm throughout the field which has failed to substantively progress cumulative knowledge. Through strategic negotiation of persistent tensions arising as consequence of these factors, researchers have nonetheless affected structural change within the field, contributing to a shift towards a user needs-driven agenda and progression of knowledge therein. Findings demonstrate that the field of CCI is proceeding through an intermediary phase in maturation, forming an increasingly distinct disciplinary shape and identity through the cumulative structuring effect of community members’ continued negotiation of tensions

Oxford University Research Archive

Vector Semantics

Author: Kornai András
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

This open access book introduces Vector semantics, which links the formal theory of word vectors to the cognitive theory of linguistics. The computational linguists and deep learning researchers who developed word vectors have relied primarily on the ever-increasing availability of large corpora and of computers with highly parallel GPU and TPU compute engines, and their focus is with endowing computers with natural language capabilities for practical applications such as machine translation or question answering. Cognitive linguists investigate natural language from the perspective of human cognition, the relation between language and thought, and questions about conceptual universals, relying primarily on in-depth investigation of language in use. In spite of the fact that these two schools both have ‘linguistics’ in their name, so far there has been very limited communication between them, as their historical origins, data collection methods, and conceptual apparatuses are quite different. Vector semantics bridges the gap by presenting a formal theory, cast in terms of linear polytopes, that generalizes both word vectors and conceptual structures, by treating each dictionary definition as an equation, and the entire lexicon as a set of equations mutually constraining all meanings

OAPEN Library

Recommended from our members

Neurobiology of incremental speech comprehension

Author: Choi Hun Seok
Publication venue: University of Cambridge
Publication date: 25/06/2019
Field of study

Understanding spoken language requires the rapid transition from perceptual processing of the auditory input through a variety of cognitive processes involved in constructing the mental representation of the message that the speaker is intending to convey. Listeners carry out these complex processes very rapidly and accurately as they hear each word incrementally unfolding in a sentence. However, little is known about the specific spatiotemporal patterning of this wide range of incremental processing operations that underpin the dynamic transitions from the speech input to the development of a meaning interpretation of an utterance. This thesis aims to address this set of issues by investigating the spatiotemporal dynamics of brain activity as spoken sentences unfold over time in order to illuminate the neurocomputational properties of the human language processing system and determine how the representation of a spoken sentence develops incrementally as each upcoming word is heard. Using a novel application of multidimensional probabilistic modelling combined with models from computational linguistics, I developed models of a variety of computational processes associated with accessing and processing the syntactic and semantic properties of sentences and tested these models at various points as sentences unfolded over time. Since a wide range of incremental processes occur very rapidly during speech comprehension, it is crucial to keep track of the temporal dynamics of the neural computations involved. To do this, I used combined electroencephalography and magnetoencephalography (EMEG) to record neural activity with millisecond resolution and analyzed the recordings in source space using univariate and/or multivariate approaches. The results confirm the value of this combination of methods in examining the properties of incremental speech processing. My findings corroborate the predictive nature of human speech comprehension and demonstrate that the effects of early semantic constraint are not dependent on explicit syntactic knowledge

Apollo (Cambridge)

Learning Representations of Social Media Users

Author: Benton Adrian
Publication venue
Publication date: 02/12/2018
Field of study

User representations are routinely used in recommendation systems by platform developers, targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly; how does one extract a stable user representation - effective for many downstream tasks - from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance, make more accurate recommendations in a recommendation system) but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them at three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework. We treat user representations - ground truth gender and mental health features - as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.Comment: PhD thesi

arXiv.org e-Print Archive

Johns Hopkins University

JScholarship

Learning Representations of Social Media Users

Author: Benton Adrian
Publication venue
Publication date: 01/05/2018
Field of study

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Memoria Académica

Servicio de Difusión de la Creación Intelectual

Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence

Author: Karjus Andres
Publication venue
Publication date: 24/09/2023
Field of study

The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously typically allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. 16 machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film; social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks previously requiring potentially months of team effort and complex computational pipelines, can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is not intended to replace, but to augment researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical

arXiv.org e-Print Archive

Recommended from our members

Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing

Author: Majewska Olga
Publication venue: University of Cambridge
Publication date: 01/02/2021
Field of study

Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where they tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs. To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers’ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909

Apollo (Cambridge)

Unsupervised Induction of Frame-Based Linguistic Forms

Author: Ferraro Francis
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 22/05/2018
Field of study

This thesis studies the use of bulk, structured, linguistic annotations in order to perform unsupervised induction of meaning for three kinds of linguistic forms: words, sentences, and documents. The primary linguistic annotation I consider throughout this thesis are frames, which encode core linguistic, background or societal knowledge necessary to understand abstract concepts and real-world situations. I begin with an overview of linguistically-based structured meaning representation; I then analyze available large-scale natural language processing (NLP) and linguistic resources and corpora for their abilities to accommodate bulk, automatically-obtained frame annotations. I then proceed to induce meanings of the different forms, progressing from the word level, to the sentence level, and finally to the document level. I first show how to use these bulk annotations in order to better encode linguistic- and cognitive science backed semantic expectations within word forms. I then demonstrate a straightforward approach for learning large lexicalized and refined syntactic fragments, which encode and memoize commonly used phrases and linguistic constructions. Next, I consider two unsupervised models for document and discourse understanding; one is a purely generative approach that naturally accommodates layer annotations and is the first to capture and unify a complete frame hierarchy. The other conditions on limited amounts of external annotations, imputing missing values when necessary, and can more readily scale to large corpora. These discourse models help improve document understanding and type-level understanding

Johns Hopkins University

JScholarship