Methodological considerations concerning manual annotation of musical audio in function of algorithm development
In research on musical audio-mining, annotated music databases are needed to develop computational tools that extract from the musical audio stream the kind of high-level content that users deal with in Music Information Retrieval (MIR) contexts. The notion of musical content, and therefore the notion of annotation, is ill-defined, however, both in the syntactic and the semantic sense. As a consequence, annotation has been approached from a variety of perspectives (mainly linguistic-symbolic oriented), and a general methodology is lacking. This paper is a step towards the definition of a general framework for the manual annotation of musical audio in support of a computational approach to musical audio-mining based on algorithms that learn from annotated data.
Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development
Annotation graphs and annotation servers offer infrastructure to support the
analysis of human language resources in the form of time-series data such as
text, audio and video. This paper outlines areas of common need among empirical
linguists and computational linguists. After reviewing examples of data and
tools used or under development for each of several areas, it proposes a common
framework for future tool development, data annotation and resource sharing
based upon annotation graphs and servers.
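To make the data model concrete, here is a minimal sketch of an annotation graph as an in-memory structure: nodes are anchors into the timeline (possibly with unknown offsets) and labeled arcs between nodes carry annotations on named tiers. The class and field names are illustrative, not taken from the paper or any particular toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Node:
    """An anchor into the timeline; the offset may be unknown (None)."""
    id: int
    offset: Optional[float] = None  # seconds into the audio/video

@dataclass
class Arc:
    """A labeled annotation spanning two anchors, e.g. a word or phone."""
    src: Node
    dst: Node
    tier: str    # annotation layer, e.g. "word", "phone", "speaker"
    label: str

@dataclass
class AnnotationGraph:
    nodes: List[Node] = field(default_factory=list)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, offset: Optional[float] = None) -> Node:
        node = Node(id=len(self.nodes), offset=offset)
        self.nodes.append(node)
        return node

    def annotate(self, src: Node, dst: Node, tier: str, label: str) -> Arc:
        arc = Arc(src, dst, tier, label)
        self.arcs.append(arc)
        return arc

# Usage: two word annotations over a short stretch of audio.
g = AnnotationGraph()
n0, n1, n2 = g.add_node(0.00), g.add_node(0.42), g.add_node(0.85)
g.annotate(n0, n1, "word", "hello")
g.annotate(n1, n2, "word", "world")
```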
Interactive Pattern Recognition applied to Natural Language Processing
This thesis is about Pattern Recognition. In recent decades, huge efforts have been made to develop automatic systems able to rival human capabilities in this field. Although these systems achieve high productivity rates, they are not precise enough in most situations. Humans, on the contrary, are very accurate but comparatively much slower. This raises an interesting question: the possibility of benefiting from both worlds by constructing cooperative systems.

This thesis presents diverse contributions to this kind of collaborative approach. The goal is to improve Pattern Recognition systems by properly introducing a human operator into the system. We call this Interactive Pattern Recognition (IPR).

Firstly, a general proposal for IPR is stated. The aim is to develop a framework from which new applications in this area can easily be derived. Some interesting IPR issues are also introduced; multi-modality and adaptive learning are examples of extensions that fit naturally into IPR.

Secondly, we focus on a specific application: a novel method to obtain high-quality speech transcriptions, Computer Assisted Speech Transcription (CAST). We start by proposing a CAST formalization and then address different implementation alternatives. Practical issues, such as system response time, are also taken into account in order to allow for a practical implementation of CAST. Word graphs and probabilistic error-correcting parsing are used to reach an alternative formulation that allows CAST to be used in a real scenario.

Afterwards, a special application within the general IPR framework is discussed, intended to test IPR capabilities in an extreme environment where no input pattern is available and the system only has access to the user's actions to produce a hypothesis. Specifically, we focus here on providing assistance in the problem of text generation.

Rodríguez Ruiz, L. (2010). Interactive Pattern Recognition applied to Natural Language Processing [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8479
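As a rough illustration of the interactive transcription protocol behind CAST, the following sketch iterates between system decoding and user correction: the recognizer completes a validated prefix, the user fixes the first error, and the corrected prefix constrains the next decoding. The toy `decode_with_prefix` recognizer and the fixed reference standing in for the user are invented for illustration; the thesis uses word graphs and error-correcting parsing instead.

```python
# Toy stand-in: the "recognizer" just completes the validated prefix with its
# best guess, here drawn from a fixed word list for illustration.
def decode_with_prefix(prefix, n_words):
    guesses = ["the", "quick", "brown", "fox", "jumps"]
    return list(prefix) + guesses[len(prefix):n_words]

def first_error(hypothesis, reference):
    """Return the position of the first wrong word and its correction, if any."""
    for i, (h, r) in enumerate(zip(hypothesis, reference)):
        if h != r:
            return i, r
    return None, None

def interactive_transcription(reference):
    """Iterate: decode given the validated prefix, let the user fix the first error."""
    prefix, corrections = [], 0
    while True:
        hyp = decode_with_prefix(prefix, n_words=len(reference))
        pos, word = first_error(hyp, reference)
        if pos is None:                      # no error left: transcription accepted
            return hyp, corrections
        prefix = hyp[:pos] + [word]          # validated prefix grows by one correction
        corrections += 1

# The reference plays the role of the human transcriber in this toy run.
print(interactive_transcription(["the", "quick", "red", "fox", "jumps"]))
```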
Image speech combination for interactive computer assisted transcription of handwritten documents
Handwritten document transcription aims to obtain the contents of a document in order to provide efficient information access to, among others, digitised historical documents. The increasing number of historical documents published by libraries and archives makes this an important task. In this context, the use of image processing and understanding techniques in conjunction with assistive technologies reduces the time and human effort required to obtain the final perfect transcription. The assistive transcription system proposes a hypothesis, usually derived from a recognition process of the handwritten text image. Then, the professional transcriber's feedback can be used to obtain an improved hypothesis and speed up the final transcription. In this framework, a speech signal corresponding to the dictation of the handwritten text can be used as an additional source of information. This multimodal approach, which combines the image of the handwritten text with the speech of the dictation of its contents, can improve the hypotheses (initial and improved) offered to the transcriber. In this paper we study the feasibility of a multimodal interactive transcription system for an assistive paradigm known as Computer Assisted Transcription of Text Images. Different techniques are tested for obtaining the multimodal combination in this framework. The use of the proposed multimodal approach reveals a significant reduction of transcription effort with some multimodal combination techniques, allowing for a faster transcription process.

Work partially supported by projects READ-674943 (European Union's H2020), SmartWays-RTC-2014-1466-4 (MINECO, Spain), and CoMUN-HaT-TIN2015-70924-C2-1-R (MINECO/FEDER), and by Generalitat Valenciana (GVA), Spain under reference PROMETEOII/2014/030.

Granell, E.; Romero, V.; Martínez-Hinarejos, C. (2019). Image speech combination for interactive computer assisted transcription of handwritten documents. Computer Vision and Image Understanding, 180:74-83. https://doi.org/10.1016/j.cviu.2019.01.009
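The paper evaluates several multimodal combination techniques; one simple late-fusion baseline in this spirit is a log-linear combination of the hypothesis scores produced by the handwriting recognizer and the speech recognizer. In the sketch below, the function name, interpolation weight, and toy scores are all invented for illustration; it simply picks the hypothesis maximizing the interpolated log-probability.

```python
import math

def combine_nbest(htr_nbest, asr_nbest, alpha=0.5):
    """Late fusion of two n-best lists (hypothesis -> log-probability):
    log-linearly interpolate scores, backing off to a small floor when a
    hypothesis only appears in one list."""
    floor = -20.0  # log-prob assigned to hypotheses missing from one modality
    combined = {}
    for hyp in set(htr_nbest) | set(asr_nbest):
        combined[hyp] = (alpha * htr_nbest.get(hyp, floor)
                         + (1 - alpha) * asr_nbest.get(hyp, floor))
    return max(combined, key=combined.get)

# Toy scores: the image recognizer slightly prefers a wrong reading,
# but the dictation evidence flips the decision.
htr = {"the old mill": math.log(0.4), "the old mile": math.log(0.45)}
asr = {"the old mill": math.log(0.7), "the old mile": math.log(0.1)}
print(combine_nbest(htr, asr, alpha=0.5))  # -> "the old mill"
```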
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Recent advancements in surgical computer vision applications have been driven
by fully-supervised methods, primarily using only visual data. These methods
rely on manually annotated surgical videos to predict a fixed set of object
categories, limiting their generalizability to unseen surgical procedures and
downstream tasks. In this work, we put forward the idea that the surgical video
lectures available through open surgical e-learning platforms can provide
effective supervisory signals for multi-modal representation learning without
relying on manual annotations. We address the surgery-specific linguistic
challenges present in surgical video lectures by employing multiple
complementary automatic speech recognition systems to generate text
transcriptions. We then present a novel method, SurgVLP - Surgical Vision
Language Pre-training, for multi-modal representation learning. SurgVLP
constructs a new contrastive learning objective to align video clip embeddings
with the corresponding multiple text embeddings by bringing them together
within a joint latent space. To effectively show the representation capability
of the learned joint latent space, we introduce several vision-and-language
tasks for surgery, such as text-based video retrieval, temporal activity
grounding, and video captioning, as benchmarks for evaluation. We further
demonstrate that without using any labeled ground truth, our approach can be
employed for traditional vision-only surgical downstream tasks, such as
surgical tool, phase, and triplet recognition. The code will be made available
at https://github.com/CAMMA-public/SurgVL
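A minimal PyTorch sketch of the kind of video-text contrastive objective described above, assuming each clip comes with K text embeddings (e.g. one per ASR system) that are averaged into a single target; this is one plausible reading of aligning clips with multiple text embeddings, not necessarily the exact SurgVLP loss.

```python
import torch
import torch.nn.functional as F

def multi_text_contrastive_loss(video_emb, text_embs, temperature=0.07):
    """InfoNCE-style alignment of each video clip with its text transcriptions.

    video_emb: (B, D) clip embeddings.
    text_embs: (B, K, D) K text embeddings per clip (e.g. one per ASR system).
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_embs.mean(dim=1), dim=-1)        # fuse the K text views
    logits = v @ t.T / temperature                         # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)     # diagonal = positive pairs
    # Symmetric cross-entropy: video -> text and text -> video.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage: a batch of 8 clips, each paired with 2 ASR transcriptions.
loss = multi_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 2, 256))
print(loss.item())
```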
Audio-visual modelling in a clinical setting
Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals—usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focusing position of the sonographer's eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset shows that the proposed novel representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects, including a better understanding of obstetric imaging, training new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow.
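One common self-supervised objective in this audio-visual family is correspondence prediction: decide whether an audio snippet and a video clip come from the same moment of the recording, so no expert annotations are needed. The sketch below uses placeholder encoders and toy tensors; the architecture and loss in the paper may differ.

```python
import torch
import torch.nn as nn

class AVCorrespondence(nn.Module):
    """Predict whether an audio snippet and a video clip are a matching pair."""

    def __init__(self, dim=128):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        self.classifier = nn.Linear(2 * dim, 1)   # same-moment vs. mismatched pair

    def forward(self, video, audio):
        z = torch.cat([self.video_enc(video), self.audio_enc(audio)], dim=-1)
        return self.classifier(z).squeeze(-1)     # logit: positive = matching pair

# Toy training step: half the pairs are matched, half are deliberately mismatched.
model = AVCorrespondence()
video, audio = torch.randn(16, 3, 32, 32), torch.randn(16, 1, 64)
labels = torch.cat([torch.ones(8), torch.zeros(8)])
audio = torch.cat([audio[:8], audio[8:].roll(1, dims=0)])  # corrupt the negatives
loss = nn.functional.binary_cross_entropy_with_logits(model(video, audio), labels)
loss.backward()
```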
How affects can perturb the automatic speech recognition of domotic interactions
In smart homes, vocal home automation commands, whether for comfort or for assistive purposes, have been identified as the most relevant interaction modality for ambient assisted living. Even when such commands are very strictly formulated, in daily use (whether directed to the smart home itself or to a robot mediator) they are often pronounced with various affects. In this paper we evaluate how some state-of-the-art ASR systems break down on expressive commands, whether acted or spontaneous, and how training the ASR on corpora of neutral and/or acted and/or spontaneous expressive commands can greatly modify ASR performance.
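For reference, the kind of per-condition evaluation described above can be expressed as word error rate computed separately on neutral, acted, and spontaneous commands. The sketch below uses a standard edit-distance WER and invented command/hypothesis pairs; it does not reflect the paper's corpus or results.

```python
def wer(reference, hypothesis):
    """Word error rate via edit distance (substitutions + insertions + deletions)."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(r), 1)

# Hypothetical (reference, ASR hypothesis) pairs per affect condition.
conditions = {
    "neutral":     [("turn on the light", "turn on the light")],
    "acted":       [("turn on the light", "turn on the light please")],
    "spontaneous": [("turn on the light", "turn the light")],
}
for name, pairs in conditions.items():
    print(name, sum(wer(r, h) for r, h in pairs) / len(pairs))
```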
DMRN+16: Digital Music Research Network One-day Workshop 2021
DMRN+16: Digital Music Research Network One-day Workshop 2021, Queen Mary University of London, Tuesday 21st December 2021.

Keynote 1: Prof. Sophie Scott, Director, Institute of Cognitive Neuroscience, UCL. Title: "Sound on the brain - insights from functional neuroimaging and neuroanatomy". Abstract: In this talk I will use functional imaging and models of primate neuroanatomy to explore how sound is processed in the human brain. I will demonstrate that sound is represented cortically in different parallel streams. I will expand this to show how this can affect the concept of auditory perception, which arguably incorporates multiple kinds of distinct perceptual processes. I will address the roles that subcortical processes play in this, as well as the contributions from hemispheric asymmetries.

Keynote 2: Prof. Gus Xia, Assistant Professor at NYU Shanghai. Title: "Learning interpretable music representations: from human stupidity to artificial intelligence". Abstract: Gus has been leading the Music X Lab in developing intelligent systems that help people better compose and learn music. In this talk, he will show the importance of music representation for both humans and machines, and how to learn better music representations via the design of inductive bias. Once we have interpretable music representations, the potential applications are limitless.
VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting
The performance of keyword spotting (KWS) systems based on the audio modality alone,
commonly measured in false alarms and false rejects, degrades significantly
under far-field and noisy conditions. Therefore, audio-visual keyword
spotting, which leverages complementary relationships over multiple modalities,
has recently gained much attention. However, current studies mainly focus on
combining the exclusively learned representations of different modalities,
instead of exploring the modal relationships during each respective modeling.
In this paper, we propose a novel visual modality enhanced end-to-end KWS
framework (VE-KWS), which fuses audio and visual modalities from two aspects.
The first one is utilizing the speaker location information obtained from the
lip region in videos to assist the training of a multi-channel audio beamformer.
By involving the beamformer as an audio enhancement module, the acoustic
distortions, caused by the far field or noisy environments, could be
significantly suppressed. The other one is conducting cross-attention between
different modalities to capture the inter-modal relationships and help the
representation learning of each modality. Experiments on the MISP challenge
corpus show that our proposed model achieves a 2.79% false rejection rate and a
2.95% false alarm rate on the Eval set, resulting in new state-of-the-art performance
compared with the top-ranking systems in the ICASSP 2022 MISP challenge.
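To illustrate the cross-attention fusion described in the abstract, here is a minimal PyTorch sketch in which audio frames attend over lip-region visual features and vice versa, with residual connections; the dimensions, number of heads, and module layout are assumptions, not the VE-KWS configuration.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Bidirectional cross-attention between acoustic and lip-region features."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):
        # audio: (B, Ta, D) acoustic features; video: (B, Tv, D) lip-region features.
        a_enh, _ = self.audio_to_video(query=audio, key=video, value=video)
        v_enh, _ = self.video_to_audio(query=video, key=audio, value=audio)
        return audio + a_enh, video + v_enh   # residual fusion of the two streams

# Toy usage: a 1-second clip with 100 audio frames and 25 video frames.
fusion = CrossModalAttention()
audio, video = torch.randn(2, 100, 256), torch.randn(2, 25, 256)
a, v = fusion(audio, video)
print(a.shape, v.shape)   # torch.Size([2, 100, 256]) torch.Size([2, 25, 256])
```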