6,851 research outputs found

    Methodological considerations concerning manual annotation of musical audio in function of algorithm development

    Get PDF
    In research on musical audio-mining, annotated music databases are needed which allow the development of computational tools that extract from the musical audiostream the kind of high-level content that users can deal with in Music Information Retrieval (MIR) contexts. The notion of musical content, and therefore the notion of annotation, is ill-defined, however, both in the syntactic and semantic sense. As a consequence, annotation has been approached from a variety of perspectives (but mainly linguistic-symbolic oriented), and a general methodology is lacking. This paper is a step towards the definition of a general framework for manual annotation of musical audio in function of a computational approach to musical audio-mining that is based on algorithms that learn from annotated data. 1

    Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development

    Full text link
    Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development for each of several areas, it proposes a common framework for future tool development, data annotation and resource sharing based upon annotation graphs and servers.Comment: 8 pages, 6 figure

    Interactive Pattern Recognition applied to Natural Language Processing

    Full text link
    This thesis is about Pattern Recognition. In the last decades, huge efforts have been made to develop automatic systems able to rival human capabilities in this field. Although these systems achieve high productivity rates, they are not precise enough in most situations. Humans, on the contrary, are very accurate but comparatively quite slower. This poses an interesting question: the possibility of benefiting from both worlds by constructing cooperative systems. This thesis presents diverse contributions to this kind of collaborative approach. The point is to improve the Pattern Recognition systems by properly introducing a human operator into the system. We call this Interactive Pattern Recognition (IPR). Firstly, a general proposal for IPR will be stated. The aim is to develop a framework to easily derive new applications in this area. Some interesting IPR issues are also introduced. Multi-modality or adaptive learning are examples of extensions that can naturally fit into IPR. In the second place, we will focus on a specific application. A novel method to obtain high quality speech transcriptions (CAST, Computer Assisted Speech Transcription). We will start by proposing a CAST formalization and, next, we will cope with different implementation alternatives. Practical issues, as the system response time, will be also taken into account, in order to allow for a practical implementation of CAST. Word graphs and probabilistic error correcting parsing are tools that will be used to reach an alternative formulation that allows for the use of CAST in a real scenario. Afterwards, a special application within the general IPR framework will be discussed. This is intended to test the IPR capabilities in an extreme environment, where no input pattern is available and the system only has access to the user actions to produce a hypothesis. Specifically, we will focus here on providing assistance in the problem of text generation.Rodríguez Ruiz, L. (2010). Interactive Pattern Recognition applied to Natural Language Processing [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8479Palanci

    Image speech combination for interactive computer assisted transcription of handwritten documents

    Full text link
    [EN] Handwritten document transcription aims to obtain the contents of a document to provide efficient information access to, among other, digitised historical documents. The increasing number of historical documents published by libraries and archives makes this an important task. In this context, the use of image processing and understanding techniques in conjunction with assistive technologies reduces the time and human effort required for obtaining the final perfect transcription. The assistive transcription system proposes a hypothesis, usually derived from a recognition process of the handwritten text image. Then, the professional transcriber feedback can be used to obtain an improved hypothesis and speed-up the final transcription. In this framework, a speech signal corresponding to the dictation of the handwritten text can be used as an additional source of information. This multimodal approach, that combines the image of the handwritten text with the speech of the dictation of its contents, could make better the hypotheses (initial and improved) offered to the transcriber. In this paper we study the feasibility of a multimodal interactive transcription system for an assistive paradigm known as Computer Assisted Transcription of Text Images. Different techniques are tested for obtaining the multimodal combination in this framework. The use of the proposed multimodal approach reveals a significant reduction of transcription effort with some multimodal combination techniques, allowing for a faster transcription process.Work partially supported by projects READ-674943 (European Union's H2020), SmartWays-RTC-2014-1466-4 (MINECO, Spain), and CoMUN-HaT-TIN2015-70924-C2-1-R (MINECO/FEDER), and by Generalitat Valenciana (GVA), Spain under reference PROMETEOII/2014/030.Granell, E.; Romero, V.; Martínez-Hinarejos, C. (2019). Image speech combination for interactive computer assisted transcription of handwritten documents. Computer Vision and Image Understanding. 180:74-83. https://doi.org/10.1016/j.cviu.2019.01.009S748318

    Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

    Full text link
    Recent advancements in surgical computer vision applications have been driven by fully-supervised methods, primarily using only visual data. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively show the representation capability of the learned joint latent space, we introduce several vision-and-language tasks for surgery, such as text-based video retrieval, temporal activity grounding, and video captioning, as benchmarks for evaluation. We further demonstrate that without using any labeled ground truth, our approach can be employed for traditional vision-only surgical downstream tasks, such as surgical tool, phase, and triplet recognition. The code will be made available at https://github.com/CAMMA-public/SurgVL

    Audio-visual modelling in a clinical setting

    Get PDF
    Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals—usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focusing position of sonographer’s eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset show that the proposed novel representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects including a better understanding of obstetric imaging, training new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow

    Audio-visual modelling in a clinical setting

    Get PDF
    Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals—usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focusing position of sonographer’s eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset show that the proposed novel representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects including a better understanding of obstetric imaging, training new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow

    How affects can perturbe the automatic speech recognition of domotic interactions

    No full text
    International audienceIn Smart Home, the vocal home automation orders, for comfort purposes, or assistive devoted, have been pointed as the more relevant interaction for ambient assisted living. Even if the orders are very strictly formulated, when they are daily used (directed to the smart home, or to a robot mediator), they become often pronounced with various affects. In this paper we have evaluated how some state of the art ASR systems shut down with expressive orders, acted or spontaneous, and how the ASR training with neutral and/or acted and/or spontaneous expressive commands corpus can greatly modify the ASR performances

    DMRN+16: Digital Music Research Network One-day Workshop 2021

    Get PDF
    DMRN+16: Digital Music Research Network One-day Workshop 2021 Queen Mary University of London Tuesday 21st December 2021 Keynote speakers Keynote 1. Prof. Sophie Scott -Director, Institute of Cognitive Neuroscience, UCL. Title: "Sound on the brain - insights from functional neuroimaging and neuroanatomy" Abstract In this talk I will use functional imaging and models of primate neuroanatomy to explore how sound is processed in the human brain. I will demonstrate that sound is represented cortically in different parallel streams. I will expand this to show how this can impact on the concept of auditory perception, which arguably incorporates multiple kinds of distinct perceptual processes. I will address the roles that subcortical processes play in this, and also the contributions from hemispheric asymmetries. Keynote 2: Prof. Gus Xia - Assistant Professor at NYU Shanghai Title: "Learning interpretable music representations: from human stupidity to artificial intelligence" Abstract Gus has been leading the Music X Lab in developing intelligent systems that help people better compose and learn music. In this talk, he will show us the importance of music representation for both humans and machines, and how to learn better music representations via the design of inductive bias. Once we got interpretable music representations, the potential applications are limitless

    VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting

    Full text link
    The performance of the keyword spotting (KWS) system based on audio modality, commonly measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships over multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the exclusively learned representations of different modalities, instead of exploring the modal relationships during each respective modeling. In this paper, we propose a novel visual modality enhanced end-to-end KWS framework (VE-KWS), which fuses audio and visual modalities from two aspects. The first one is utilizing the speaker location information obtained from the lip region in videos to assist the training of multi-channel audio beamformer. By involving the beamformer as an audio enhancement module, the acoustic distortions, caused by the far field or noisy environments, could be significantly suppressed. The other one is conducting cross-attention between different modalities to capture the inter-modal relationships and help the representation learning of each modality. Experiments on the MSIP challenge corpus show that our proposed model achieves 2.79% false rejection rate and 2.95% false alarm rate on the Eval set, resulting in a new SOTA performance compared with the top-ranking systems in the ICASSP2022 MISP challenge.Comment: 5 pages. Accepted at ICASSP202
    • …
    corecore