
    Visual and linguistic processes in deep neural networks: A cognitive perspective

    When people describe an image, complex visual and linguistic processes are at work. For instance, speakers tend to look at an object right before mentioning it, but not every time. Similarly, during a conversation, speakers can refer to an entity multiple times, using expressions that evolve in the common ground. In this thesis, I develop computational models of such visual and linguistic processes, drawing inspiration from theories and findings in cognitive science and psycholinguistics. By aiming to capture the intricate relationship between non-linguistic modalities and language within deep artificial neural networks, this work contributes to research on multimodal Natural Language Processing. The thesis consists of two parts: (1) modeling human gaze in language use (production and comprehension), and (2) modeling communication strategies in referential tasks in visually grounded dialogue. In the first part, I enhance image description generation models using eye-tracking data, evaluate the variation in human signals while describing images, and predict human reading behavior in the form of eye movements. In the second part, I build models for quantifying, generating, resolving, and adapting utterances in referential tasks situated within visual and conversational contexts. The outcomes advance our understanding of human visuo-linguistic processes by revealing the intricate strategies at play in them, and point to the importance of accounting for these strategies when developing and using multimodal models. The findings shed light on how advances in artificial intelligence can further research on crossmodal processes in humans, and vice versa.
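
    As a rough illustration of the first part (a hypothetical sketch, not the thesis's actual architecture), one simple way to inject eye-tracking data into an image description model is to reweight region-level visual features by how long each region was fixated; all names, shapes, and values below are illustrative assumptions.

```python
# Hypothetical sketch: bias image-region features by human gaze before
# they are fed to a captioning model. Shapes and names are assumptions.
import numpy as np

def gaze_weighted_features(region_features, fixation_durations):
    """Scale each region's feature vector by its normalized fixation time.

    region_features:    (n_regions, feat_dim) array of visual features
    fixation_durations: (n_regions,) total fixation time per region, in ms
    """
    weights = fixation_durations / (fixation_durations.sum() + 1e-8)
    return region_features * weights[:, None]

# Toy usage: 4 image regions with 5-dim features; region 2 drew most gaze.
features = np.ones((4, 5))
fixations = np.array([120.0, 40.0, 600.0, 90.0])
print(gaze_weighted_features(features, fixations)[:, 0])
```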

    Function of intonation in task-oriented dialogue

    This thesis addresses the question of how intonation functions in conversation. It examines the intonation and discourse function of single-word utterances in spontaneous and read-aloud task-oriented dialogue (the HCRC Map Task Corpus, containing Scottish English; see Anderson et al., 1991). To avoid the pitfalls of previous studies, in which comparisons of intonation and discourse structure tend to lack balance and focus more heavily on one analysis at the expense of the other, it employs two independently developed analyses: the Conversational Games Analysis (as introduced in Kowtko, Isard and Doherty, 1992) and a simple target-level representation of intonation. Correlations between categories of intonation and of discourse function in spontaneous dialogue suggest that intonation reflects the function of an utterance. Contrary to what one might expect from the literature, these categories are in some cases categories of exclusion rather than inclusion. Similar patterns result from the study of read-aloud dialogue: discourse function and intonation categories show a measure of correlation. One difference that does appear between speech modes is that, for many discourse functions, intonation categories shift toward tunes ending low in the speaker's pitch range (e.g. a falling tune) in the read-aloud version. This result is in accord with other contemporary studies (e.g. Blaauw, 1995). The difference between spontaneous and read results suggests that read-aloud dialogue, even dialogue based on scripts that include hesitations and false starts, is not a substitute for eliciting the intonation strategies found in spontaneous dialogue.
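
    As a minimal sketch of the kind of category-level correlation analysis described above: one standard way to test whether intonation categories and discourse-function categories are associated is a chi-square test on a contingency table. The move labels below follow Conversational Games Analysis, but all counts are invented for illustration.

```python
# Minimal sketch: test for association between intonation tunes and
# discourse moves with a chi-square test. All counts are invented.
from scipy.stats import chi2_contingency

tunes = ["rise", "fall", "level"]                  # rows (hypothetical)
moves = ["INSTRUCT", "CHECK", "ACKNOWLEDGE"]       # columns (hypothetical)
observed = [
    [12, 34,  5],   # rise
    [40,  8, 22],   # fall
    [ 6,  9, 30],   # level
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}")
```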

    An integrated theory of language production and comprehension

    Currently, production and comprehension are regarded as quite distinct in accounts of language processing. In rejecting this dichotomy, we instead assert that producing and understanding are interwoven, and that this interweaving is what enables people to predict themselves and each other. We start by noting that production and comprehension are forms of action and action perception, respectively. We then consider the evidence for interweaving in action, action perception, and joint action, and explain such evidence in terms of prediction. Specifically, we assume that actors construct forward models of their actions before they execute those actions, and that perceivers of others' actions covertly imitate those actions, then construct forward models of them. We use these accounts of action, action perception, and joint action to develop accounts of production, comprehension, and interactive language. Importantly, these accounts incorporate well-defined levels of linguistic representation (such as semantics, syntax, and phonology). We show (a) how speakers and comprehenders use covert imitation and forward modeling to make predictions at these levels of representation, (b) how they interweave production and comprehension processes, and (c) how they use these predictions to monitor upcoming utterances. We show how these accounts explain a range of behavioral and neuroscientific data on language processing, and we discuss some of the implications of our proposal.
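
    As a toy caricature of the forward-model idea (an illustrative sketch under strong simplifying assumptions, not the authors' actual proposal), a producer can predict the sensory consequence of a planned action and monitor the mismatch with the actual outcome:

```python
# Toy sketch of forward modeling: predict the outcome of a planned action,
# then use the prediction error as a monitoring signal. Everything here
# (the mappings, the noise model) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def forward_model(command):
    # Predicted sensory consequence of the planned action; the fixed bias
    # stands in for a learned, imperfect mapping.
    return command + 0.1

def execute(command):
    # Actual outcome is perturbed by motor/articulation noise.
    return command + rng.normal(0.0, 0.05, size=command.shape)

command = np.array([0.8, 0.2, 0.5])       # e.g., phonological targets
predicted = forward_model(command)        # available before execution
actual = execute(command)
error = np.abs(predicted - actual)        # monitoring signal
print("prediction error:", error.round(3))
```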

    Floor Holder Detection and End of Speaker Turn Prediction in Meetings

    We propose a novel, fully automatic framework for detecting which meeting participant currently holds the conversational floor and for predicting when the current speaker turn will finish. Two sets of experiments were conducted on a large collection of multiparty conversations, the AMI meeting corpus. Unsupervised speaker turn detection was performed by post-processing the speaker diarization and speech activity detection outputs. A supervised end-of-speaker-turn prediction framework, based on Dynamic Bayesian Networks and automatically extracted multimodal features (related to prosody, overlapping speech, and visual motion), was also investigated. These novel approaches achieved good floor holder detection rates (a 13.2% Floor Error Rate) and state-of-the-art end-of-speaker-turn prediction performance.
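
    As a hedged sketch of one step the abstract mentions, speaker turns can be derived by post-processing diarization and speech-activity output, merging time-adjacent segments from the same speaker; the segment format and gap threshold below are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch: merge (start, end, speaker) segments from a diarization /
# speech-activity pipeline into speaker turns. Format and threshold are
# illustrative assumptions.
def segments_to_turns(segments, max_gap=0.5):
    """Merge consecutive same-speaker segments separated by short pauses."""
    turns = []
    for start, end, spk in sorted(segments):
        if turns and turns[-1][2] == spk and start - turns[-1][1] <= max_gap:
            turns[-1] = (turns[-1][0], end, spk)   # extend the current turn
        else:
            turns.append((start, end, spk))
    return turns

# Toy diarization output: (start_sec, end_sec, speaker_id)
segs = [(0.0, 1.2, "A"), (1.4, 2.0, "A"), (2.1, 3.5, "B"), (3.6, 4.0, "B")]
print(segments_to_turns(segs))
# -> [(0.0, 2.0, 'A'), (2.1, 4.0, 'B')]
```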