76 research outputs found

    Embodied Processing at Six Linguistic Granularity Levels: A Consensus Paper

    Language processing is influenced by sensorimotor experiences. Here, we review behavioral evidence for embodied and grounded influences in language processing across six linguistic levels of granularity. We examine (a) sub-word features, discussing grounded influences on iconicity (systematic associations between word form and meaning); (b) words, discussing boundary conditions and generalizations for the simulation of color, sensory modality, and spatial position; (c) sentences, discussing boundary conditions and applications of action direction simulation; (d) texts, discussing how the teaching of simulation can improve comprehension in beginning readers; (e) conversations, discussing how multi-modal cues improve turn taking and alignment; and (f) text corpora, discussing how distributional semantic models can reveal how grounded and embodied knowledge is encoded in texts. These approaches are converging on a convincing account of the psychology of language, but at the same time, there are important criticisms of the embodied approach and of specific experimental paradigms. The surest way forward requires the adoption of a wide array of scientific methods. By providing complementary evidence, a combination of multiple methods on various levels of granularity can help us gain a more complete understanding of the role of embodiment and grounding in language processing.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Modelling multimodal language processing


    Attention Mechanism for Adaptive Feature Modelling

    This thesis presents groundbreaking contributions in machine learning by exploring and advancing attention mechanisms within deep learning frameworks. We introduce innovative models and techniques that significantly enhance feature recognition and analysis in two key application areas: computer vision recognition and time series modeling. Our primary contributions include the development of a dual attention mechanism for crowd counting and the integration of supervised and unsupervised learning techniques for semi-supervised learning. Furthermore, we propose a novel Dynamic Unary Convolution in Transformer (DUCT) model for generalized visual recognition tasks, and investigate the efficacy of attention mechanisms in human activity recognition using time series data from wearable sensors based on the semi-supervised setting. The capacity of humans to selectively focus on specific elements within complex scenes has long inspired machine learning research. Attention mechanisms, which dynamically modify weights to emphasize different input elements, are central to replicating this human perceptual ability in deep learning. These mechanisms have proven crucial in achieving significant advancements across various tasks. In this thesis, we first provide a comprehensive review of the existing literature on attention mechanisms. We then introduce a dual attention mechanism for crowd counting, which employs both second-order and first-order attention to enhance spatial information processing and feature distinction. Additionally, we explore the convergence of supervised and unsupervised learning, focusing on a novel semi-supervised method that synergizes labeled and unlabeled data through an attention-driven recurrent unit and dual loss functions. This method aims to refine crowd counting in practical transportation scenarios. Moreover, our research extends to a hybrid attention model for broader visual recognition challenges. 
By merging convolutional and transformer layers, this model adeptly handles multi-level features, where the DUCT modules play a pivotal role. We rigorously evaluate DUCT's performance across critical computer vision tasks. Finally, recognizing the significance of time series data in domains like health surveillance, we apply our proposed attention mechanism to human activity recognition, analyzing correlations between various daily activities to enhance the adaptability of deep learning frameworks to temporal dynamics

    Computer analysis of children's non-native English speech for language learning and assessment

    Children's ASR appears to be more challenging than adults', and it is even more difficult for non-native children's speech. This research investigates different techniques to compensate for the effects of non-native and child speech on the performance of ASR systems. The study mainly utilises hybrid DNN-HMM systems with conventional DNNs, LSTMs and more advanced TDNN models. This work uses the CALL-ST corpus and TLT-school corpus to study children's non-native English speech. Initially, data augmentation was explored on the CALL-ST corpus to address the lack of data, using the AMI corpus and PF-STAR German corpus. Feature selection, acoustic model adaptation and selection were also investigated on CALL-ST. More aspects of the ASR system, including pronunciation modelling, acoustic modelling, language modelling and system fusion, were explored on the TLT-school corpus, as this corpus contains more data. Then, the relationships between the CALL-ST and TLT-school corpora were studied and utilised to improve ASR performance. The other part of the present work is text processing for non-native children's English speech. We focused on providing accept/reject feedback to learners based on the text generated by the ASR system from learners' spoken responses. A rule-based system and a machine learning-based system were proposed for making the judgement, and several aspects of the systems were evaluated. The influence of the ASR system on the text processing system was also explored.

    Integrating Gestures

    Gestures convey information about culture, discourse, thought, intentionality, emotion, intersubjectivity, cognition, and first and second language acquisition. Additionally, they are used by non-human primates to communicate with their peers and with humans. Consequently, the modern field of gesture studies has attracted researchers from a number of different disciplines such as anthropology, cognitive science, communication, neuroscience, psycholinguistics, primatology, psychology, robotics, sociology and semiotics. This volume presents an overview of the depth and breadth of current research in gesture. Its focus is on the interdisciplinary nature of gesture. The chapters included in the volume are divided into six themes: the nature and functions of gesture, first language development and gesture, second language effects on gesture, gesture in the classroom and in problem solving, gesture aspects of discourse and interaction, and gestural analysis of music and dance

    Semantic radical consistency and character transparency effects in Chinese: an ERP study

    BACKGROUND: This event-related potential (ERP) study aims to investigate the representation and temporal dynamics of Chinese orthography-to-semantics mappings by simultaneously manipulating character transparency and semantic radical consistency. Character components, referred to as radicals, make up the building blocks used dur...

    Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?

    Video surrogates can help people quickly make sense of the content of a video before downloading or seeking more detailed information. Visual and audio features of a video are primary information carriers and might become important components of video retrieval and video sense-making. In the past decades, most research and development efforts on video surrogates have focused on visual features of the video, and comparatively little work has been done on audio surrogates and examining their pros and cons in aiding users' retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, where more than one modality is employed for consuming the surrogates, for example, the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates, and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks to answer two main research questions: (1) How do automatically-generated multi-modal surrogates compare to manually-generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and video sense-making? Forty-eight participants took part in the study, in which the surrogates were measured on the time participants spent on experiencing the surrogates, the time participants spent on doing the tasks, participants' performance accuracy on the tasks, participants' confidence in their task responses, and participants' subjective ratings on the surrogates. On average, the uncoordinated surrogates were more helpful than the coordinated ones, but the manually-generated surrogates were only more helpful than the automatically-generated ones in terms of task completion time. 
Participants' subjective ratings were more favorable for the coordinated surrogate C2 (Magic A + V) and the uncoordinated surrogate U1 (Magic A + Storyboard V) with respect to usefulness, usability, enjoyment, and engagement. The post-session questionnaire comments demonstrated participants' preference for the coordinated surrogates, but the comments also revealed the value of having uncoordinated sensory channels